
test: testloop stub for resharding v3 #12156

Merged (7 commits, Oct 1, 2024)
Conversation

@Longarithm (Member) commented Sep 26, 2024

Goal

Write a stub test for the resharding v3 switch. For this, I need the chain to switch between shard layouts.
To achieve that, I switch to EpochConfigStore as much as I can, which means skipping use_production_config, overrides like AllEpochConfig::config_max_kickout_stake, and EpochConfig generation from GenesisConfig. This is a big step towards #11265.

The most visible changes are:
TestLoop now generates genesis_and_epoch_config_store instead of just genesis. Later we should have a separate EpochConfigStoreBuilder which may accept data shared between genesis and epoch configs, e.g. the validator set. For now this is done to minimise changes.

EpochManager::new_arc_handle is how the epoch manager is constructed on production nodes. Its logic is changed as follows:

  • If chain_id is mainnet or testnet, only EpochConfigStore::for_chain_id is used to get epoch configs.
  • If chain_id.starts_with("test-chain-"), we use only EpochConfig::from(genesis_config) (see below!).
  • Otherwise, we use only Genesis::test_epoch_config. It doesn't use any genesis data and stays in this crate for now for convenience. This is for simple tests in a single module.
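As a rough sketch, the new dispatch can be modelled like this. The enum and function names are illustrative stand-ins, not the real nearcore API; only the chain-id conditions come from this PR:

```rust
// Illustrative sketch of the chain-id dispatch in `EpochManager::new_arc_handle`.
// `EpochConfigSource` and `choose_epoch_config_source` are hypothetical names.

#[derive(Debug, PartialEq)]
enum EpochConfigSource {
    /// Real per-chain configs via `EpochConfigStore::for_chain_id`.
    ForChainId,
    /// Derived from genesis via `EpochConfig::from(genesis_config)`.
    FromGenesis,
    /// A single default test config via `Genesis::test_epoch_config`.
    TestEpochConfig,
}

fn choose_epoch_config_source(chain_id: &str) -> EpochConfigSource {
    match chain_id {
        "mainnet" | "testnet" => EpochConfigSource::ForChainId,
        id if id.starts_with("test-chain-") => EpochConfigSource::FromGenesis,
        _ => EpochConfigSource::TestEpochConfig,
    }
}

fn main() {
    assert_eq!(choose_epoch_config_source("mainnet"), EpochConfigSource::ForChainId);
    assert_eq!(choose_epoch_config_source("test-chain-xyz"), EpochConfigSource::FromGenesis);
    assert_eq!(choose_epoch_config_source("localnet"), EpochConfigSource::TestEpochConfig);
}
```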

Achievements

  • test_fix_min_stake_ratio tests exactly what we want: we take EpochConfigStore::for_chain_id("mainnet") and check that it allows a small validator to be included after the protocol upgrade.
  • In test_resharding_v3 we define the old and new shard layouts and test the switch explicitly, without hidden overrides.
  • Usage of hacky overrides is reduced. For example, EpochManager::new_from_genesis_config_with_test_overrides is removed.
  • If we want to launch a forknet with a custom epoch config, the behaviour will be more straightforward. For example, one can copy the latest epoch config from mainnet into the mocknet/ folder and add a new condition to for_chain_id for the custom mocknet chain name.

Failures

Nayduck often configures the epoch config through genesis, e.g. by setting block_producer_kickout_threshold to 0. Changing that configuration would be much more work, so I add a hack: if the chain_id starts with test-chain- (the name nayduck uses), the epoch config is derived from genesis. Many old integration tests use this chain id as well.
However, the improvement here is that we now generate only one epoch config, without any overrides.

epoch_length is sometimes taken from ChainGenesis, not from EpochConfig. To be safe, I set the epoch length in both the genesis and the epoch configs.
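Sketched with stand-in structs (not the real nearcore types), the safety measure amounts to writing the same value into both places:

```rust
// Stand-in types; the real nearcore `GenesisConfig` and `EpochConfig` have
// many more fields. This only illustrates keeping `epoch_length` in sync.

struct GenesisConfig { epoch_length: u64 }
struct EpochConfig { epoch_length: u64 }

fn set_epoch_length(
    genesis: &mut GenesisConfig,
    epoch_configs: &mut [EpochConfig],
    epoch_length: u64,
) {
    // Write the same value everywhere it might be read from.
    genesis.epoch_length = epoch_length;
    for config in epoch_configs.iter_mut() {
        config.epoch_length = epoch_length;
    }
}

fn main() {
    let mut genesis = GenesisConfig { epoch_length: 0 };
    let mut configs = vec![EpochConfig { epoch_length: 0 }, EpochConfig { epoch_length: 0 }];
    set_epoch_length(&mut genesis, &mut configs, 10);
    assert!(configs.iter().all(|c| c.epoch_length == genesis.epoch_length));
}
```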

This still lacks testing on a live node. Using it on canary or forknet could be insightful.

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 37 lines in your changes missing coverage. Please review.

Project coverage is 71.60%. Comparing base (359564c) to head (ec0f588).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
core/chain-configs/src/test_genesis.rs 75.90% 17 Missing and 3 partials ⚠️
chain/epoch-manager/src/lib.rs 86.20% 8 Missing ⚠️
core/primitives/src/shard_layout.rs 0.00% 6 Missing ⚠️
chain/client/src/sync/state.rs 80.00% 1 Missing ⚠️
chain/client/src/test_utils/test_env_builder.rs 97.50% 0 Missing and 1 partial ⚠️
core/primitives-core/src/version.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12156      +/-   ##
==========================================
- Coverage   71.61%   71.60%   -0.02%     
==========================================
  Files         824      824              
  Lines      165348   165507     +159     
  Branches   165348   165507     +159     
==========================================
+ Hits       118412   118507      +95     
- Misses      41803    41870      +67     
+ Partials     5133     5130       -3     
Flag Coverage Δ
backward-compatibility 0.17% <0.00%> (-0.01%) ⬇️
db-migration 0.17% <0.00%> (-0.01%) ⬇️
genesis-check 1.25% <0.00%> (-0.01%) ⬇️
integration-tests 38.73% <87.08%> (+<0.01%) ⬆️
linux 71.39% <88.88%> (-0.01%) ⬇️
linux-nightly 71.18% <88.88%> (-0.01%) ⬇️
macos 54.13% <56.12%> (+0.01%) ⬆️
pytests 1.57% <30.70%> (+0.05%) ⬆️
sanity-checks 1.38% <30.70%> (+0.05%) ⬆️
unittests 65.34% <56.12%> (+<0.01%) ⬆️
upgradability 0.21% <0.00%> (-0.01%) ⬇️


// I *think* this is not relevant anymore, since we download
// already the next epoch's state.
// let shards_to_split = self.get_shards_to_split(sync_hash, &state_sync_info, &me)?;
let shards_to_split = HashMap::new();
Contributor:
can shards_to_split be removed at all from this function then?

Contributor:

It's quite amusing how many people are touching the state sync code simultaneously 🤯
I'm down for deleting all of it and re-adding whatever is necessary on top of Robin's state sync rewrite.

cc @marcelo-gonzalez WDYT?

Contributor:

Is it fine if we keep this change for now and remove it as part of a future PR?

Member Author:

That's what I plan to do, yes. This line is not triggered now anyway.

)
};

let epoch_config_store = EpochConfigStore::test(BTreeMap::from_iter(vec![(
Contributor:

nice!

}

let epoch_config = if chain_id.starts_with("test-chain-") {
// We still do this for localnet as nayduck depends on it.
Contributor:

mocknet would also depend on this (for now the configs for mocknet are commented out; note also that mocknet does protocol transitions, so it needs to support different epoch configs).

Member Author:

This particular block of code won't impact mainnet, because its chain name doesn't start with test-chain-, right?

For mocknet, I have two ideas:

  1. Change the constructor every time we introduce a new mocknet, always deriving the base config from the latest mainnet config.
  2. Add a new folder under epoch_configs/ for each separate mocknet instance. This shouldn't take much space in the repo, because we only need to support the latest mainnet config plus a couple of protocol transitions on top of it.

Contributor:

I would vote for option 1: use the latest mainnet config as the base and update it based on the specific mocknet setup (e.g. num seats). Otherwise we will keep updating the mocknet configs in epoch_configs, which should be based on mainnet with a fixed set of overrides anyway.

config_store: Some(epoch_config_store),
chain_id: chain_id.to_string(),
epoch_length,
// The fields below must be DEPRECATED.
Contributor:

Cool. Can you add a TODO referring to the issue id, so we do not forget it? (Same for the other todos and deprecations.)

self.config_store.as_ref().unwrap().get_config(protocol_version).as_ref().clone();
// TODO(#11265): epoch length is overridden in many tests so we
// need to support it here. Consider removing `epoch_length` from
// EpochConfig.
Contributor:

I don't get why epoch_length is special here. It is already overridden through test_epoch_config, so it will be in the configs returned by the EpochConfigStore, right? Why do we need to keep a local field for epoch_length?

Member Author:

Many tests use a setup pattern like TestEnvBuilder::new().epoch_length(5)....
This invalidates the epoch lengths given in the default EpochConfigStores, so the override must happen somewhere. We could pass the epoch length to new or remove it from EpochConfig altogether, but that means more line changes, which I'm trying to minimise.
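A simplified model of the situation, with stand-in types: the store lookup mirrors the get_config(protocol_version) call from the quoted snippet, while the struct layout and the helper function are hypothetical.

```rust
use std::collections::BTreeMap;

// Stand-in types: configs in the store carry their own epoch length, and a
// test override like `TestEnvBuilder::new().epoch_length(5)` must win over it.

#[derive(Clone)]
struct EpochConfig { epoch_length: u64 }

struct EpochConfigStore { configs: BTreeMap<u32, EpochConfig> }

impl EpochConfigStore {
    /// Return the newest config whose protocol version is <= the requested one.
    fn get_config(&self, protocol_version: u32) -> &EpochConfig {
        self.configs.range(..=protocol_version).next_back().unwrap().1
    }
}

/// Hypothetical helper: fetch the stored config, then apply the test override.
fn config_for(
    store: &EpochConfigStore,
    protocol_version: u32,
    epoch_length_override: Option<u64>,
) -> EpochConfig {
    let mut config = store.get_config(protocol_version).clone();
    if let Some(epoch_length) = epoch_length_override {
        config.epoch_length = epoch_length; // the test's value wins
    }
    config
}

fn main() {
    let store = EpochConfigStore {
        configs: BTreeMap::from([(0, EpochConfig { epoch_length: 43200 })]),
    };
    assert_eq!(config_for(&store, 1, Some(5)).epoch_length, 5);
    assert_eq!(config_for(&store, 1, None).epoch_length, 43200);
}
```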

Contributor:

Got it. We can address epoch length later; it is not a blocker for this PR from my side.

@@ -43,7 +44,7 @@ use super::utils::network::{chunk_endorsement_dropper, partial_encoded_chunks_dr

pub(crate) struct TestLoopBuilder {
test_loop: TestLoopV2,
genesis: Option<Genesis>,
genesis_and_epoch_config_store: Option<(Genesis, EpochConfigStore)>,
Contributor:

Can we keep these fields separate, even though the same function generates them?

Member Author:

Again, that's more lines of code... I can do that once the general idea is approved by more reviewers.

@@ -97,8 +98,11 @@ impl TestLoopBuilder {
}

/// Set the genesis configuration for the test loop.
pub(crate) fn genesis(mut self, genesis: Genesis) -> Self {
self.genesis = Some(genesis);
pub(crate) fn genesis_and_epoch_config_store(
Contributor:

same, can we set these separately?

@wacban (Contributor) left a comment:

LGTM (with primary focus on the test itself)


Comment on lines +98 to +100
/// The fields below are DEPRECATED.
/// Epoch config must be controlled by `config_store` only.
/// TODO(#11265): remove these fields.
Contributor:

nice (deprecating use_production_config)

@@ -79,7 +79,7 @@ pub struct ShardLayoutV1 {
/// Each shard contains a range of accounts from one boundary account to
/// another - or the smallest or largest account possible. The total
/// number of shards is equal to the number of boundary accounts plus 1.
boundary_accounts: Vec<AccountId>,
pub boundary_accounts: Vec<AccountId>,
Contributor:

Why do you need this public access? Same for version.

let accounts =
(0..8).map(|i| format!("account{}", i).parse().unwrap()).collect::<Vec<AccountId>>();
let clients = accounts.iter().cloned().collect_vec();
let block_and_chunk_producers = (0..8).map(|idx| accounts[idx].as_str()).collect_vec();
Contributor:

It would be nice to also set up some chunk-validator-only nodes.

let base_shard_layout = base_epoch_config.shard_layout.clone();
let base_num_shards = base_shard_layout.shard_ids().count() as ShardId;
let mut epoch_config = base_epoch_config.clone();
let ShardLayout::V1(ShardLayoutV1 { mut boundary_accounts, version, .. }) =
Contributor:

I would be tempted to make it use V2

(0..base_num_shards - 1).map(|i| vec![i]).collect();
shards_split_map.push(vec![base_num_shards - 1, base_num_shards]);
boundary_accounts.push(AccountId::try_from("x.near".to_string()).unwrap());
epoch_config.shard_layout = ShardLayout::v1(boundary_accounts, Some(shards_split_map), version);
Contributor:

It's pretty cool that you setup the shard layout programmatically.
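The split-map construction from the quoted diff can be reproduced standalone. Here plain strings stand in for AccountId and the real ShardLayout::v1 call is omitted; derive_split_layout is a hypothetical helper name.

```rust
// Every old shard maps to itself in the split map, except the last one,
// which splits into the last two new shards (as in the quoted diff).

fn derive_split_layout(
    base_num_shards: u64,
    mut boundary_accounts: Vec<String>,
    new_boundary: &str,
) -> (Vec<String>, Vec<Vec<u64>>) {
    let mut shards_split_map: Vec<Vec<u64>> =
        (0..base_num_shards - 1).map(|i| vec![i]).collect();
    // The last old shard becomes the last two new shards.
    shards_split_map.push(vec![base_num_shards - 1, base_num_shards]);
    // One extra boundary account yields one extra shard.
    boundary_accounts.push(new_boundary.to_string());
    (boundary_accounts, shards_split_map)
}

fn main() {
    let (boundaries, split_map) =
        derive_split_layout(4, vec!["a".into(), "b".into(), "c".into()], "x.near");
    assert_eq!(split_map, vec![vec![0], vec![1], vec![2], vec![3, 4]]);
    assert_eq!(boundaries.len(), 4); // 4 boundaries -> 5 shards
}
```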

client.epoch_manager.get_epoch_height_from_prev_block(&tip.prev_block_hash).unwrap();
assert!(epoch_height < 5);
let epoch_config = client.epoch_manager.get_epoch_config(&tip.epoch_id).unwrap();
return epoch_config.shard_layout.shard_ids().count() == base_num_shards as usize + 1;
Contributor:

Just curious, why count? I see len more often, is it not available here? Is it because it's an iterator?

Member Author:

Yeah, because it's an iterator.

@tayfunelmas (Contributor) left a comment:

My only remaining comment is about splitting genesis_and_epoch_config_store, but that could be addressed later. This PR already achieves a lot. Thanks. LGTM.

/// After uncommenting panics with
/// StorageInconsistentState("Failed to find root node ... in memtrie")
#[test]
// #[ignore]
Contributor:

nit: can remove this "ignore"

Member Author:

Oh, I need to add that comment back; I couldn't make the test pass.

@Longarithm Longarithm added this pull request to the merge queue Oct 1, 2024
Merged via the queue into near:master with commit 810e820 Oct 1, 2024
29 of 30 checks passed
@Longarithm Longarithm deleted the testloop-rv3 branch October 1, 2024 18:12