
feat: strict flat storage update to fix GC issue #11599

Merged · 14 commits into near:master · Jun 18, 2024

Conversation

@Longarithm (Member) commented Jun 17, 2024

Solution to #11583.

The current logic for updating a shard's flat storage doesn't work for memtrie loading in a rare case. If the shard has no active validators and receives no transactions or receipts, the block at the flat storage head gets GC-d, and the subsequent attempt to read its state root for an assertion naturally panics.

This happens because we have a non-strict mode, which itself exists to make `StateSnapshot` work.

But essentially it is enough not to move the flat storage head past `epoch_last_block.chunks(shard_id).prev_block_hash()`. The flat storage state past this block corresponds exactly to the state we are syncing; see also #11600. So this is exactly the new flat head candidate we compute and pass to `update_flat_head`. Passing tests show that non-strict mode is not needed.
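As a rough illustration, the candidate computation amounts to the sketch below. The helper names and signatures here are assumptions, not the exact nearcore API:

// Sketch only: `Chain`, `get_block_header`, `get_block` and the error type
// are stand-ins; treat every signature as an assumption.
fn flat_head_candidate(
    chain: &Chain,
    block_hash: &CryptoHash,
    shard_id: ShardId,
) -> Result<CryptoHash, Error> {
    // The flat head may only move up to the last *final* block...
    let header = chain.get_block_header(block_hash)?;
    let final_block = chain.get_block(header.last_final_block())?;
    // ...and within it, only up to the prev block of the last chunk for
    // our shard. Past that point the flat state is exactly the state
    // being synced, so it must not be GC-d.
    let chunk_header = &final_block.chunks()[shard_id as usize];
    Ok(*chunk_header.prev_block_hash())
}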

After that, a GC problem can only remain if there are no chunks for the shard, or no finality, across the stored epochs, and we already assume during development that this doesn't happen.

Nayduck will be at https://nayduck.nearone.org/#/run/149

Practical example

One edge case where the state snapshot still works is when the client has just processed the second block in an epoch. The last final block is then no earlier than the last block of the previous epoch, so the new flat head is no earlier than the `prev_block_hash` of the last chunk for our shard in that block. The state snapshot therefore still works.

In the old implementation, this was guaranteed because, while we passed the last final block, we took two steps back over non-empty state transitions. The first jump is guaranteed to skip the last block, because it may contain validator updates; the second jump is guaranteed to skip the last chunk. So the guarantees are the same.
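For comparison, the old rule looked roughly like the following sketch; the `prev_block_with_new_chunk` helper is assumed for illustration, not a real nearcore function:

// Sketch of the previous, equivalent guarantee: from the last final block,
// step back twice over non-empty state transitions. The first step skips
// the last block of the epoch (it may contain validator updates); the
// second skips the last chunk itself.
fn old_flat_head_candidate(
    chain: &Chain,
    last_final_hash: CryptoHash,
    shard_id: ShardId,
) -> Result<CryptoHash, Error> {
    let mut hash = last_final_hash;
    for _ in 0..2 {
        // Assumed helper: latest ancestor block in which `shard_id` had a
        // new chunk, i.e. a non-empty state transition.
        hash = chain.prev_block_with_new_chunk(&hash, shard_id)?;
    }
    Ok(hash)
}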

test_load_memtrie_after_empty_chunks

  • Add GCActor to the TestLoop. It clears blocks in the background and doesn't need external control.
  • Ensure that shard 0 has no validators and only empty chunks for a long time.
  • Unload the memtrie for shard 0 and load it back. I checked that in non-strict mode, as before the fix, this panics.
  • Additionally, check that if the last 2 chunks of an epoch are always missing and we always move the flat head to the latest known final block, then snapshotting always fails, so accounting for the latest chunk is actually needed! (A hedged sketch of the test setup follows this list.)
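As a rough outline, the setup might look as follows; the builder and helper methods below are assumptions standing in for the actual TestLoop API, not real nearcore calls:

// Hypothetical outline only; every builder and helper method is assumed.
#[test]
fn test_load_memtrie_after_empty_chunks() {
    let mut env = TestLoopBuilder::new()
        .with_gc_actor()              // GC runs in the background, unprompted
        .shard_without_validators(0)  // shard 0: no validators, empty chunks
        .build();

    // Run long enough that blocks around the old flat head get GC-d.
    env.run_for_epochs(5);

    // With strict flat storage updates this round-trip succeeds; in the
    // old non-strict mode the reload panicked on a GC-d flat head block.
    env.unload_memtrie(0);
    env.load_memtrie(0);
}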

@Longarithm changed the title from "draft: strict flat storage update" to "feat: strict flat storage update to fix GC issue" on Jun 17, 2024
@Longarithm Longarithm added the A-storage Area: storage and databases label Jun 17, 2024
@Longarithm Longarithm marked this pull request as ready for review June 17, 2024 22:02
@Longarithm Longarithm requested a review from a team as a code owner June 17, 2024 22:02
codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 80.80808% with 19 lines in your changes missing coverage. Please review.

Project coverage is 71.50%. Comparing base (b02f273) to head (8e7895e).

Files Patch % Lines
chain/chain/src/chain.rs 83.33% 1 Missing and 7 partials ⚠️
chain/chain/src/chain_update.rs 0.00% 7 Missing ⚠️
core/store/src/flat/manager.rs 66.66% 1 Missing ⚠️
tools/flat-storage/src/commands.rs 0.00% 1 Missing ⚠️
tools/fork-network/src/cli.rs 0.00% 1 Missing ⚠️
tools/state-viewer/src/apply_chain_range.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11599      +/-   ##
==========================================
+ Coverage   71.46%   71.50%   +0.04%     
==========================================
  Files         788      788              
  Lines      160697   160754      +57     
  Branches   160697   160754      +57     
==========================================
+ Hits       114836   114945     +109     
+ Misses      40851    40796      -55     
- Partials     5010     5013       +3     
Flag Coverage Δ
backward-compatibility 0.23% <0.00%> (-0.01%) ⬇️
db-migration 0.23% <0.00%> (-0.01%) ⬇️
genesis-check 1.36% <0.00%> (-0.01%) ⬇️
integration-tests 37.71% <64.64%> (+<0.01%) ⬆️
linux 68.91% <80.80%> (+0.03%) ⬆️
linux-nightly 70.93% <80.80%> (-0.01%) ⬇️
macos 52.52% <73.17%> (+1.59%) ⬆️
pytests 1.59% <0.00%> (-0.01%) ⬇️
sanity-checks 1.39% <0.00%> (-0.01%) ⬇️
unittests 66.17% <73.17%> (+0.03%) ⬆️
upgradability 0.28% <0.00%> (-0.01%) ⬇️


@shreyan-gupta (Contributor) left a comment


I don't completely understand this :(
Approving for unblocking.

@@ -365,7 +366,7 @@ impl FlatStorage {
     // new_head
     //
     // The segment [new_head, block_hash] contains two blocks with flat state changes.
-    pub fn update_flat_head(
+    pub fn update_flat_head_impl(
Contributor:
nit: consider removing pub qualifier?
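One way to act on this nit is sketched below; this shows the visibility change only, and the exact parameters of the impl function are an assumption:

impl FlatStorage {
    // The public entry point keeps the old name and API...
    pub fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError> {
        self.update_flat_head_impl(new_head)
    }

    // ...while the implementation drops its `pub` qualifier and stays
    // private to the module.
    fn update_flat_head_impl(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError> {
        // ... move the flat head and apply deltas up to `new_head` ...
        unimplemented!()
    }
}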

@bowenwang1996 (Collaborator) left a comment


Could we add a test that spins up a node with memtrie and an empty shard, runs it for 5 epochs, stops it, and then restarts it to ensure that it works?

@wacban (Contributor) left a comment


LGTM

Comment on lines +2009 to +2010
// If shard layout was changed, the update is impossible so we skip
// getting candidate.
Contributor:

nice

fn garbage_collect_memtrie_roots(&self, block: &Block, shard_uid: ShardUId) {
/// Gets new flat storage head candidate for given `shard_id` and newly
/// processed `block`.
/// It will be `block.last_final_block().chunk(shard_id).prev_block_hash()`
Contributor:

very nice, this is neat

core/store/src/flat/manager.rs (review thread resolved)
core/store/src/flat/manager.rs (outdated review thread, resolved)
};
// Test hook: deliberately skip producing the last two chunks of each epoch.
if blocks_until_end_of_epoch <= 2 {
    info!(target: "client", shard_id, next_height, blocks_until_end_of_epoch, "SKIP!!!!!!!!!");
    return Err(Error::ChunkProducer("SKIP!!!!!!!!!".to_string()));
}
Contributor:

lol

@Longarithm (Member, Author) replied: [image]

@Longarithm (Member, Author) commented:

@bowenwang1996 please take a brief look at the test; I'm going to merge it soon.

@Longarithm Longarithm added this pull request to the merge queue Jun 18, 2024
Merged via the queue into near:master with commit c020ee5 Jun 18, 2024
30 checks passed
@Longarithm Longarithm deleted the fs-strict branch June 18, 2024 20:03
shreyan-gupta pushed a commit that referenced this pull request Jun 18, 2024
Labels: A-storage (Area: storage and databases)