Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: undo-block tool #8681

Merged
merged 1 commit into from
Apr 20, 2023
Merged

feat: undo-block tool #8681

merged 1 commit into from
Apr 20, 2023

Conversation

ppca
Copy link
Contributor

@ppca ppca commented Mar 6, 2023

undo-block tool to revert the current head of the chain to its previous block. Typically used when the state of current chain head is messed up, e.g. running a wrong binary like in the case of protocol upgrade or gas price param changes.

To use the tool:

  1. stop the node
  2. run command ./target/release/neard --home {path_to_config_directory} undo-block
  3. restart node

Assuming only the head of the chain was messed up, not a long fork, then after using the tool, the node should be able to catch up and produce block (if they are block producers).

We intentionally forbid users from reverting beyond final block so that nothing is messed up for flat storage and finality is not broken.

Design doc for the tool

@ppca ppca force-pushed the xiangyi/ND-322 branch 4 times, most recently from bc89a27 to 4ebf500 Compare March 23, 2023 19:41
let state_header_key = StateHeaderKey(shard_id, block_hash).try_to_vec()?;
self.gc_col(DBCol::StateHeaders, &state_header_key);
}
// delete flat storage columns: FlatStateChanges and FlatStateDeltaMetadata
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pugachAG @Longarithm flat state column changes are here, please help review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I would hide multi-line code behind feature like that:
#[cfg(feature = "protocol_feature_flat_state")]
{
    line1
    line2
    ...
}
  1. Though get_simple_nightshade_layout works currently, it is technically not correct, because we may use other shard layouts once we do more reshardings. We fix shard uids and layouts for an epoch, so we need to take smth like block.epoch_id and ask epoch manager for it (or runtime adapter, if our chain refactoring is not finished yet). EpochManagerAdapter::shard_id_to_uid do the work. And I would name it just shard_uid because it is not specific for flat storage but for the whole chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding epoch manager, will let runtime = NightshadeRuntime::from_config(home_dir, store.clone(), &config); give the correct runtime?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we don't need anything else. home_dir is used to read genesis, store - at least for reading state data, config... for tweaking execution parameters, caching, etc.

@ppca ppca changed the title [DRAFT]Xiangyi/nd 322 feat: undo-block tool Mar 24, 2023
@ppca ppca marked this pull request as ready for review March 24, 2023 19:20
@ppca ppca requested a review from a team as a code owner March 24, 2023 19:20
chain/chain/src/store.rs Outdated Show resolved Hide resolved
chain/chain/src/store.rs Outdated Show resolved Hide resolved
chain/chain/src/store.rs Outdated Show resolved Hide resolved
neard/src/cli.rs Outdated Show resolved Hide resolved
neard/src/cli.rs Show resolved Hide resolved
tools/undo-block/src/cli.rs Outdated Show resolved Hide resolved
tools/undo-block/src/cli.rs Outdated Show resolved Hide resolved
tools/undo-block/src/cli.rs Show resolved Hide resolved
Copy link
Contributor

@nikurt nikurt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the flat storage part isn't working.
The rest looks good to me.

tools/undo-block/src/lib.rs Outdated Show resolved Hide resolved
tools/undo-block/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Member

@Longarithm Longarithm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! The code makes sense and flat storage removals are clear, I'm not sure what doesn't work for it.
Left some comments. I think it also deserves some test. Because it could be hard to implement, we can test the following scenario:

  • create chain
  • process some blocks
  • undo last block H
  • create new chain instance, check that chain metadata (tip?) is correct
  • process some other block on height H, check that it works

chain/chain/src/store.rs Outdated Show resolved Hide resolved
tools/undo-block/src/lib.rs Outdated Show resolved Hide resolved
tools/undo-block/src/lib.rs Outdated Show resolved Hide resolved
let state_header_key = StateHeaderKey(shard_id, block_hash).try_to_vec()?;
self.gc_col(DBCol::StateHeaders, &state_header_key);
}
// delete flat storage columns: FlatStateChanges and FlatStateDeltaMetadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I would hide multi-line code behind feature like that:
#[cfg(feature = "protocol_feature_flat_state")]
{
    line1
    line2
    ...
}
  1. Though get_simple_nightshade_layout works currently, it is technically not correct, because we may use other shard layouts once we do more reshardings. We fix shard uids and layouts for an epoch, so we need to take smth like block.epoch_id and ask epoch manager for it (or runtime adapter, if our chain refactoring is not finished yet). EpochManagerAdapter::shard_id_to_uid do the work. And I would name it just shard_uid because it is not specific for flat storage but for the whole chain.

chain/chain/src/store.rs Outdated Show resolved Hide resolved
@ppca
Copy link
Contributor Author

ppca commented Mar 29, 2023

@Longarithm
we can test the following scenario:

create chain
process some blocks
undo last block H
create new chain instance, check that chain metadata (tip?) is correct
process some other block on height H, check that it works


do you mean testing using the tool for the above scenarios?
what does create chain mean here?
I've tested the following in localnet: keep sending transactions to localnet, stop one node, undo-block, restart. The node is able to catch up and produce blocks. I've tested stopping in the middle of an epoch, at the first block of an epoch and at the last block of an epoch.
I've also tested the scenario of starting localnet and sending transaction, one node restarts with wrong gas cost, which creates an invalid state transition, and the node keeps doing header sync but not able to body sync. Then stop node, undo-block, restart with correct binary, and the node is able to catch up and produce blocks. I've tested stopping in the middle of an epoch, at the first block of an epoch and at the last block of an epoch.
Do you mean some other scenarios to test?

@Longarithm
Copy link
Member

I mean integration test for CI, sorry for confusion. The logic is not trivial, and it could be great to have undo block scenarios covered, so that we don't break it by chain/runtime/other changes. Especially if we want to use this code during outages.

Though integration tests are hard to write. I also don't see simple guidance for it, perhaps that's a good action item for core team. Let me give some introduction:

Integration tests live in integration-tests package. Common place for testing infra for processing blocks is integration-tests/src/tests/client/process_blocks.rs. The simplest setup would be

let mut genesis = Genesis::test(vec!["test0".parse().unwrap(), "test1".parse().unwrap()], 1);
let chain_genesis = ChainGenesis::new(&genesis);
let store = create_test_store();
let runtimes: Vec<Arc<dyn RuntimeWithEpochManagerAdapter>> =
        vec![nearcore::NightshadeRuntime::test(Path::new("../../../.."), store.clone(), genesis)];
let mut env = TestEnv::builder(chain_genesis).runtime_adapters(runtimes).build()

This one creates TestEnv with a single client and all structs necessary to test block production/processing logic.
First, we can create several blocks on chain:

let mut blocks = vec![];
for i in 1..5 {
    let block = env.clients[0].produce_block(i).unwrap().unwrap();
    blocks.push(block.clone());
    env.process_block(0, block, Provenance::PRODUCED);
}

Then, extract store update and undo block:

let mut chain_store_update = env.clients[0].chain.store().store_update();
chain_store_update.clear_block_data_undo_block(*blocks.last().unwrap().hash());

Then drop existing env, create new one using old store and try process new blocks on it. You can test your scenarios in that framework as well!

let runtimes: Vec<Arc<dyn RuntimeWithEpochManagerAdapter>> =
        vec![nearcore::NightshadeRuntime::test(Path::new("../../../.."), store, genesis)];
let mut env = TestEnv::builder(chain_genesis).runtime_adapters(runtimes).build()
for i in 3..7 {
    let block = env.clients[0].produce_block(i).unwrap().unwrap();
    env.process_block(0, block, Provenance::PRODUCED);
}

Example of test which kinda does it is test_flat_storage_creation_sanity we wrote recently.
Does it make sense?

@ppca
Copy link
Contributor Author

ppca commented Mar 29, 2023

I mean integration test for CI, sorry for confusion. The logic is not trivial, and it could be great to have undo block scenarios covered, so that we don't break it by chain/runtime/other changes. Especially if we want to use this code during outages.

Makes sense to add a test for this in CI. I'll work on adding some for this tool.

@ppca ppca requested a review from Longarithm March 30, 2023 17:50
@ppca ppca force-pushed the xiangyi/ND-322 branch 2 times, most recently from 15465c1 to 31147be Compare April 5, 2023 17:55
Copy link
Member

@Longarithm Longarithm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
Left some more comments but approving to unblock.
The complexity of change makes me think that our current store codebase is poor :( Looking forward to further improvements, perhaps in scope of #8153.

chain/chain/src/store.rs Outdated Show resolved Hide resolved
Comment on lines +1967 to +1972
store_update.delete(DBCol::BlockHeader, key);
self.chain_store.headers.pop(key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment: It sucks that gc_col doesn't work here and assumes that it is called only during GC... Later we can consider renaming gc_col to remove_key - it doesn't remove column as one may think.

chain/chain/src/store.rs Show resolved Hide resolved
chain/chain/src/store.rs Outdated Show resolved Hide resolved
integration-tests/src/tests/client/undo_block.rs Outdated Show resolved Hide resolved
integration-tests/src/tests/client/undo_block.rs Outdated Show resolved Hide resolved
integration-tests/src/tests/client/undo_block.rs Outdated Show resolved Hide resolved
@ppca ppca force-pushed the xiangyi/ND-322 branch 2 times, most recently from 6c4d276 to dab6b5c Compare April 6, 2023 15:49
@bowenwang1996
Copy link
Collaborator

@ppca what prevents us from merging this PR?

@ppca
Copy link
Contributor Author

ppca commented Apr 20, 2023

@ppca what prevents us from merging this PR?

Was testing this on a testnet validator. The test looks good this morning. Will merge it today.

@ppca ppca merged commit 1054810 into near:master Apr 20, 2023
near-bulldozer bot pushed a commit that referenced this pull request Apr 24, 2023
There was a minor conflict while merging #8681 and #8761. Removing cfg(feature) because feature is stabilized now.
nikurt pushed a commit that referenced this pull request Apr 25, 2023
nikurt pushed a commit that referenced this pull request Apr 25, 2023
There was a minor conflict while merging #8681 and #8761. Removing cfg(feature) because feature is stabilized now.
nikurt pushed a commit that referenced this pull request Apr 28, 2023
nikurt pushed a commit that referenced this pull request Apr 28, 2023
There was a minor conflict while merging #8681 and #8761. Removing cfg(feature) because feature is stabilized now.
@ppca ppca deleted the xiangyi/ND-322 branch July 31, 2023 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants