
test: testloop stub for resharding v3 #12156

Merged (7 commits, Oct 1, 2024)
Conversation

@Longarithm (Member) commented Sep 26, 2024

Goal

Write a stub test for the resharding v3 switch. For this, I need the chain to switch between shard layouts.
To achieve that, I switch to EpochConfigStore as much as I can, which means skipping use_production_config, overrides like AllEpochConfig::config_max_kickout_stake, and EpochConfig generation from GenesisConfig. This is a big step towards #11265.

The most visible changes are:
TestLoop now generates genesis_and_epoch_config_store instead of just genesis. Later we should have a separate EpochConfigStoreBuilder which may accept data shared between genesis and epoch configs, e.g. the validator set. For now this is done to minimise changes.

EpochManager::new_arc_handle is how the epoch manager is constructed on production nodes. Its logic is changed as follows:

  • If chain_id is mainnet or testnet, only EpochConfigStore::for_chain_id is used to get epoch configs.
  • If chain_id.starts_with("test-chain-"), we use only EpochConfig::from(genesis_config) (see below!).
  • Otherwise, we use only Genesis::test_epoch_config. It doesn't use any genesis data and stays in this crate for now for convenience. This is for simple tests in a single module.
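As a rough sketch, the new dispatch can be modelled like this. The enum and function names are illustrative stand-ins, not the real nearcore API; only the chain-id conditions come from this PR:

```rust
// Illustrative sketch of the chain-id dispatch in `EpochManager::new_arc_handle`.
// `EpochConfigSource` and `choose_epoch_config_source` are hypothetical names.

#[derive(Debug, PartialEq)]
enum EpochConfigSource {
    /// Real per-chain configs via `EpochConfigStore::for_chain_id`.
    ForChainId,
    /// Derived from genesis via `EpochConfig::from(genesis_config)`.
    FromGenesis,
    /// A single default test config via `Genesis::test_epoch_config`.
    TestEpochConfig,
}

fn choose_epoch_config_source(chain_id: &str) -> EpochConfigSource {
    match chain_id {
        "mainnet" | "testnet" => EpochConfigSource::ForChainId,
        id if id.starts_with("test-chain-") => EpochConfigSource::FromGenesis,
        _ => EpochConfigSource::TestEpochConfig,
    }
}

fn main() {
    assert_eq!(choose_epoch_config_source("mainnet"), EpochConfigSource::ForChainId);
    assert_eq!(choose_epoch_config_source("test-chain-xyz"), EpochConfigSource::FromGenesis);
    assert_eq!(choose_epoch_config_source("localnet"), EpochConfigSource::TestEpochConfig);
}
```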

Achievements

  • test_fix_min_stake_ratio tests exactly what we want: we take EpochConfigStore::for_chain_id("mainnet") and check that it allows a small validator to be included after the protocol upgrade.
  • In test_resharding_v3 we define the old and new shard layouts and test the switch explicitly, without hidden overrides.
  • Usage of hacky overrides is reduced. For example, EpochManager::new_from_genesis_config_with_test_overrides is removed.
  • If we want to launch a forknet with a custom epoch config, the behaviour will be more straightforward. For example, one can copy the latest epoch config from mainnet into the mocknet/ folder and add a new condition to for_chain_id for the custom mocknet chain name.

Failures

Nayduck often configures the epoch config through genesis, e.g. by setting block_producer_kickout_threshold to 0. Changing that configuration would be much more work, so I add a hack: if the chain_id starts with test-chain- (the name nayduck uses), the epoch config is derived from genesis. Many old integration tests use this chain id as well.
However, the improvement here is that we now generate only one epoch config, without any overrides.

epoch_length is sometimes taken from ChainGenesis, not from EpochConfig. To be safe, I set the epoch length in both the genesis and the epoch configs.
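Sketched with stand-in structs (not the real nearcore types), the safety measure amounts to writing the same value into both places:

```rust
// Stand-in types; the real nearcore `GenesisConfig` and `EpochConfig` have
// many more fields. This only illustrates keeping `epoch_length` in sync.

struct GenesisConfig { epoch_length: u64 }
struct EpochConfig { epoch_length: u64 }

fn set_epoch_length(
    genesis: &mut GenesisConfig,
    epoch_configs: &mut [EpochConfig],
    epoch_length: u64,
) {
    // Write the same value everywhere it might be read from.
    genesis.epoch_length = epoch_length;
    for config in epoch_configs.iter_mut() {
        config.epoch_length = epoch_length;
    }
}

fn main() {
    let mut genesis = GenesisConfig { epoch_length: 0 };
    let mut configs = vec![EpochConfig { epoch_length: 0 }, EpochConfig { epoch_length: 0 }];
    set_epoch_length(&mut genesis, &mut configs, 10);
    assert!(configs.iter().all(|c| c.epoch_length == genesis.epoch_length));
}
```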

This still lacks testing on a live node. Using it on canary or forknet could be insightful.

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 37 lines in your changes missing coverage. Please review.

Project coverage is 71.60%. Comparing base (359564c) to head (ec0f588).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
core/chain-configs/src/test_genesis.rs 75.90% 17 Missing and 3 partials ⚠️
chain/epoch-manager/src/lib.rs 86.20% 8 Missing ⚠️
core/primitives/src/shard_layout.rs 0.00% 6 Missing ⚠️
chain/client/src/sync/state.rs 80.00% 1 Missing ⚠️
chain/client/src/test_utils/test_env_builder.rs 97.50% 0 Missing and 1 partial ⚠️
core/primitives-core/src/version.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12156      +/-   ##
==========================================
- Coverage   71.61%   71.60%   -0.02%     
==========================================
  Files         824      824              
  Lines      165348   165507     +159     
  Branches   165348   165507     +159     
==========================================
+ Hits       118412   118507      +95     
- Misses      41803    41870      +67     
+ Partials     5133     5130       -3     
Flag Coverage Δ
backward-compatibility 0.17% <0.00%> (-0.01%) ⬇️
db-migration 0.17% <0.00%> (-0.01%) ⬇️
genesis-check 1.25% <0.00%> (-0.01%) ⬇️
integration-tests 38.73% <87.08%> (+<0.01%) ⬆️
linux 71.39% <88.88%> (-0.01%) ⬇️
linux-nightly 71.18% <88.88%> (-0.01%) ⬇️
macos 54.13% <56.12%> (+0.01%) ⬆️
pytests 1.57% <30.70%> (+0.05%) ⬆️
sanity-checks 1.38% <30.70%> (+0.05%) ⬆️
unittests 65.34% <56.12%> (+<0.01%) ⬆️
upgradability 0.21% <0.00%> (-0.01%) ⬇️


// I *think* this is not relevant anymore, since we download
// already the next epoch's state.
// let shards_to_split = self.get_shards_to_split(sync_hash, &state_sync_info, &me)?;
let shards_to_split = HashMap::new();
Contributor:
can shards_to_split be removed at all from this function then?

Contributor:

It's quite amusing how many people are touching the state sync code simultaneously 🤯
I'm down for deleting all of it and re-adding whatever is necessary on top of Robin's state sync rewrite.

cc @marcelo-gonzalez WDYT?

Contributor:

Is it fine if we keep this change for now and remove it as part of a future PR?

Member Author:

That's what I plan to do, yes. This line is not triggered now anyway.

)
};

let epoch_config_store = EpochConfigStore::test(BTreeMap::from_iter(vec![(
Contributor:

nice!

}

let epoch_config = if chain_id.starts_with("test-chain-") {
// We still do this for localnet as nayduck depends on it.
Contributor:

mocknet would also depend on this (for now the configs for mocknet are commented out; note also that mocknet does protocol transitions, so it needs to support different epoch configs).

Member Author:

This particular block of code won't impact mainnet, because its chain name doesn't start with test-chain-, right?

For mocknet, I have two ideas:

  1. Change the constructor every time we introduce a new mocknet, always deriving the base config from the latest mainnet config.
  2. Add a new folder under epoch_configs/ for each separate mocknet instance. This shouldn't take much space in the repo, because we only need to support the latest mainnet config plus a couple of protocol transitions on top of it.

Contributor:

I would vote for option 1: use the latest mainnet config as the base and update it based on the specific mocknet setup (e.g. num seats). Otherwise we will keep updating the mocknet configs in epoch_configs, which should be based on mainnet with a fixed set of overrides anyway.

config_store: Some(epoch_config_store),
chain_id: chain_id.to_string(),
epoch_length,
// The fields below must be DEPRECATED.
Contributor:

Cool. Can you add a TODO referring to the issue id, so we do not forget it? (Same for the other todos and deprecations.)

self.config_store.as_ref().unwrap().get_config(protocol_version).as_ref().clone();
// TODO(#11265): epoch length is overridden in many tests so we
// need to support it here. Consider removing `epoch_length` from
// EpochConfig.
Contributor:

I don't get why epoch_length is special here. It is already overridden through test_epoch_config, so it will be in the configs returned by the EpochConfigStore, right? Why do we need to keep a local field for epoch_length?

Member Author:

Many tests use a setup pattern like TestEnvBuilder::new().epoch_length(5)....
This invalidates the epoch lengths given in the default EpochConfigStores, so the override must happen somewhere. We could pass the epoch length to new or remove it from EpochConfig altogether, but that means more line changes, which I'm trying to minimise.
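A simplified model of the situation, with stand-in types: the store lookup mirrors the get_config(protocol_version) call from the quoted snippet, while the struct layout and the helper function are hypothetical.

```rust
use std::collections::BTreeMap;

// Stand-in types: configs in the store carry their own epoch length, and a
// test override like `TestEnvBuilder::new().epoch_length(5)` must win over it.

#[derive(Clone)]
struct EpochConfig { epoch_length: u64 }

struct EpochConfigStore { configs: BTreeMap<u32, EpochConfig> }

impl EpochConfigStore {
    /// Return the newest config whose protocol version is <= the requested one.
    fn get_config(&self, protocol_version: u32) -> &EpochConfig {
        self.configs.range(..=protocol_version).next_back().unwrap().1
    }
}

/// Hypothetical helper: fetch the stored config, then apply the test override.
fn config_for(
    store: &EpochConfigStore,
    protocol_version: u32,
    epoch_length_override: Option<u64>,
) -> EpochConfig {
    let mut config = store.get_config(protocol_version).clone();
    if let Some(epoch_length) = epoch_length_override {
        config.epoch_length = epoch_length; // the test's value wins
    }
    config
}

fn main() {
    let store = EpochConfigStore {
        configs: BTreeMap::from([(0, EpochConfig { epoch_length: 43200 })]),
    };
    assert_eq!(config_for(&store, 1, Some(5)).epoch_length, 5);
    assert_eq!(config_for(&store, 1, None).epoch_length, 43200);
}
```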

Contributor:

Got it. We can address epoch length later; it is not a blocker for this PR from my side.

@@ -43,7 +44,7 @@ use super::utils::network::{chunk_endorsement_dropper, partial_encoded_chunks_dr

pub(crate) struct TestLoopBuilder {
test_loop: TestLoopV2,
genesis: Option<Genesis>,
genesis_and_epoch_config_store: Option<(Genesis, EpochConfigStore)>,
Contributor:

Can we keep these fields separate, even though the same function generates them?

Member Author:

Again, that's more lines of code... I can do that once the general idea is approved by more reviewers.

@@ -97,8 +98,11 @@ impl TestLoopBuilder {
}

/// Set the genesis configuration for the test loop.
pub(crate) fn genesis(mut self, genesis: Genesis) -> Self {
self.genesis = Some(genesis);
pub(crate) fn genesis_and_epoch_config_store(
Contributor:

same, can we set these separately?

@wacban (Contributor) left a comment:

LGTM (with primary focus on the test itself)


Comment on lines +98 to +100
/// The fields below are DEPRECATED.
/// Epoch config must be controlled by `config_store` only.
/// TODO(#11265): remove these fields.
Contributor:

nice (deprecating use_production_config)

@@ -79,7 +79,7 @@ pub struct ShardLayoutV1 {
/// Each shard contains a range of accounts from one boundary account to
/// another - or the smallest or largest account possible. The total
/// number of shards is equal to the number of boundary accounts plus 1.
boundary_accounts: Vec<AccountId>,
pub boundary_accounts: Vec<AccountId>,
Contributor:

Why do you need this public access? Same for version.

let accounts =
(0..8).map(|i| format!("account{}", i).parse().unwrap()).collect::<Vec<AccountId>>();
let clients = accounts.iter().cloned().collect_vec();
let block_and_chunk_producers = (0..8).map(|idx| accounts[idx].as_str()).collect_vec();
Contributor:

It would be nice to also set up some chunk-validator-only nodes.

let base_shard_layout = base_epoch_config.shard_layout.clone();
let base_num_shards = base_shard_layout.shard_ids().count() as ShardId;
let mut epoch_config = base_epoch_config.clone();
let ShardLayout::V1(ShardLayoutV1 { mut boundary_accounts, version, .. }) =
Contributor:

I would be tempted to make it use V2

(0..base_num_shards - 1).map(|i| vec![i]).collect();
shards_split_map.push(vec![base_num_shards - 1, base_num_shards]);
boundary_accounts.push(AccountId::try_from("x.near".to_string()).unwrap());
epoch_config.shard_layout = ShardLayout::v1(boundary_accounts, Some(shards_split_map), version);
Contributor:

It's pretty cool that you setup the shard layout programmatically.
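The split-map construction from the quoted diff can be reproduced standalone. Here plain strings stand in for AccountId and the real ShardLayout::v1 call is omitted; derive_split_layout is a hypothetical helper name.

```rust
// Every old shard maps to itself in the split map, except the last one,
// which splits into the last two new shards (as in the quoted diff).

fn derive_split_layout(
    base_num_shards: u64,
    mut boundary_accounts: Vec<String>,
    new_boundary: &str,
) -> (Vec<String>, Vec<Vec<u64>>) {
    let mut shards_split_map: Vec<Vec<u64>> =
        (0..base_num_shards - 1).map(|i| vec![i]).collect();
    // The last old shard becomes the last two new shards.
    shards_split_map.push(vec![base_num_shards - 1, base_num_shards]);
    // One extra boundary account yields one extra shard.
    boundary_accounts.push(new_boundary.to_string());
    (boundary_accounts, shards_split_map)
}

fn main() {
    let (boundaries, split_map) =
        derive_split_layout(4, vec!["a".into(), "b".into(), "c".into()], "x.near");
    assert_eq!(split_map, vec![vec![0], vec![1], vec![2], vec![3, 4]]);
    assert_eq!(boundaries.len(), 4); // 4 boundaries -> 5 shards
}
```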

client.epoch_manager.get_epoch_height_from_prev_block(&tip.prev_block_hash).unwrap();
assert!(epoch_height < 5);
let epoch_config = client.epoch_manager.get_epoch_config(&tip.epoch_id).unwrap();
return epoch_config.shard_layout.shard_ids().count() == base_num_shards as usize + 1;
Contributor:

Just curious, why count? I see len more often, is it not available here? Is it because it's an iterator?

Member Author:

Yeah, because it's an iterator.

@tayfunelmas (Contributor) left a comment:

My only remaining comment is about splitting genesis_and_epoch_config_store, but that could be addressed later. This PR already achieves a lot. Thanks. LGTM.

/// After uncommenting panics with
/// StorageInconsistentState("Failed to find root node ... in memtrie")
#[test]
// #[ignore]
Contributor:

nit: can remove this "ignore"

Member Author:

Oh, I need to add that comment back; I couldn't make the test pass.

@Longarithm Longarithm added this pull request to the merge queue Oct 1, 2024
Merged via the queue into near:master with commit 810e820 Oct 1, 2024
29 of 30 checks passed
@Longarithm Longarithm deleted the testloop-rv3 branch October 1, 2024 18:12