feat: strict flat storage update to fix GC issue #11599
Conversation
Codecov Report

Additional details and impacted files:

```text
@@           Coverage Diff            @@
##           master   #11599     +/-  ##
=========================================
+ Coverage   71.46%   71.50%    +0.04%
=========================================
  Files         788      788
  Lines      160697   160754       +57
  Branches   160697   160754       +57
=========================================
+ Hits       114836   114945      +109
+ Misses      40851    40796       -55
- Partials     5010     5013        +3
```

Flags with carried forward coverage won't be shown.
I don't completely understand this :(
Approving for unblocking.
core/store/src/flat/storage.rs (outdated)

```diff
@@ -365,7 +366,7 @@ impl FlatStorage {
     // new_head
     //
     // The segment [new_head, block_hash] contains two blocks with flat state changes.
-    pub fn update_flat_head(
+    pub fn update_flat_head_impl(
```
nit: consider removing the `pub` qualifier?
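The rename above (from `update_flat_head` to `update_flat_head_impl`) suggests a thin-wrapper pattern: a public entry point that always runs strictly, delegating to an `_impl` that still accepts a mode flag. A minimal self-contained sketch of that pattern — all types, fields, and signatures here are simplified stand-ins, not nearcore's actual API:

```rust
// Hypothetical mode flag; nearcore's real strict/non-strict handling differs.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FlatHeadUpdateMode {
    Strict,
    NonStrict,
}

// Toy stand-in for the real FlatStorage; the head is just a height here.
struct FlatStorage {
    head: u64,
}

impl FlatStorage {
    // Public entry point: always strict after this change (sketch).
    fn update_flat_head(&mut self, new_head: u64) -> Result<(), String> {
        self.update_flat_head_impl(new_head, FlatHeadUpdateMode::Strict)
    }

    fn update_flat_head_impl(
        &mut self,
        new_head: u64,
        mode: FlatHeadUpdateMode,
    ) -> Result<(), String> {
        if new_head < self.head {
            return match mode {
                // Strict mode: moving the head backwards is an error.
                FlatHeadUpdateMode::Strict => {
                    Err("cannot move flat head backwards".to_string())
                }
                // Non-strict mode: the invalid update is silently ignored.
                FlatHeadUpdateMode::NonStrict => Ok(()),
            };
        }
        self.head = new_head;
        Ok(())
    }
}
```

The design point of the wrapper is that external callers can no longer opt out of strictness; only internal call sites that explicitly need the lenient behavior can reach the `_impl`.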
Could we add a test that spins up a node with memtrie and an empty shard, runs for 5 epochs, stops it, and then restarts it to ensure that it works?
LGTM
```rust
// If shard layout was changed, the update is impossible so we skip
// getting candidate.
```
nice
```rust
fn garbage_collect_memtrie_roots(&self, block: &Block, shard_uid: ShardUId) {
```

```rust
/// Gets new flat storage head candidate for given `shard_id` and newly
/// processed `block`.
/// It will be `block.last_final_block().chunk(shard_id).prev_block_hash()`
```
very nice, this is neat
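The doc comment quoted above pins down the candidate formula: the prev block hash of the last final block's chunk for the shard. As a self-contained illustration, here is a toy model of that lookup; `ToyBlock`, `ToyChunk`, and the `u64` hash/shard types are invented stand-ins for nearcore's real types, not its API:

```rust
use std::collections::HashMap;

type BlockHash = u64;
type ShardId = u64;

// Hypothetical simplified chunk: only the field the candidate formula needs.
struct ToyChunk {
    prev_block_hash: BlockHash,
}

// Hypothetical simplified block: its last final block and per-shard chunks.
// A shard missing from `chunks` models an absent chunk for that shard.
struct ToyBlock {
    last_final_block: BlockHash,
    chunks: HashMap<ShardId, ToyChunk>,
}

// New flat head candidate: take the last final block of the newly processed
// block and return prev_block_hash of that final block's chunk for the shard.
// Returns None when the chunk is absent (e.g. shard layout changed), in which
// case the update is skipped.
fn get_new_flat_head(
    blocks: &HashMap<BlockHash, ToyBlock>,
    processed: BlockHash,
    shard_id: ShardId,
) -> Option<BlockHash> {
    let final_hash = blocks.get(&processed)?.last_final_block;
    let final_block = blocks.get(&final_hash)?;
    let chunk = final_block.chunks.get(&shard_id)?;
    Some(chunk.prev_block_hash)
}
```

For example, processing a block whose last final block is block 2, where block 2's chunk for shard 0 has `prev_block_hash` pointing at block 1, yields block 1 as the candidate head.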
chain/client/src/client.rs (outdated)

```rust
};
if blocks_until_end_of_epoch <= 2 {
    info!(target: "client", shard_id, next_height, blocks_until_end_of_epoch, "SKIP!!!!!!!!!");
    return Err(Error::ChunkProducer("SKIP!!!!!!!!!".to_string()));
```
lol
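The `SKIP!!!!!!!!!` hack above forces chunks to go missing near the end of an epoch so the test exercises the GC edge case. A toy version of just the skip condition, under the simplifying assumption of fixed-length epochs starting at height 0 (real epoch boundaries are not computed this way, and the helper name is hypothetical):

```rust
// Toy predicate: skip chunk production for the last two heights of an epoch.
// Assumes epochs are exactly `epoch_length` blocks, aligned to height 0.
fn should_skip_chunk(next_height: u64, epoch_length: u64) -> bool {
    // Heights remaining until the epoch boundary, in the toy model.
    let blocks_until_end_of_epoch = epoch_length - (next_height % epoch_length);
    blocks_until_end_of_epoch <= 2
}
```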
@bowenwang1996 please take a brief look at the test; I'm going to merge it soon.
Solution to #11583.

The current logic to update flat storage for a shard doesn't work for memtrie loading in some rare cases. If a shard doesn't contain active validators and didn't get any transactions or receipts, the block for the flat storage head will get GC-d, and an attempt to read the state root to assert against it will naturally panic.

This happens because we have a non-strict mode, which itself is used to make `StateSnapshot` work. But essentially it is enough to **not** move the flat storage head past `epoch_last_block.chunks(shard_id).prev_block_hash()`. The flat storage state **past** this block exactly corresponds to the state we are syncing, see also #11600. So this is exactly the new flat head candidate we compute and pass to `update_flat_head`. Passing tests show that non-strict mode is not needed.

After that, we will only have a GC problem if there are no chunks for the shard or no finality in the stored epochs, which matches the assumption we make during development anyway.

Nayduck will be at https://nayduck.nearone.org/#/run/149

Practical example

One edge case where the state snapshot still works is when the client has just processed the **second** block in an epoch. Then the last final block is no earlier than the last block of the previous epoch, so the new flat head is no earlier than `prev_block_hash` of the last chunk for our shard in that epoch. Then the state snapshot still works.

For the old implementation, this was guaranteed because, while we passed the last final block, we made _two steps back by non-empty state transitions_. The first jump guarantees skipping the last block **because it may contain validator updates**; the second jump guarantees skipping the last chunk. So the guarantees are the same.

Testing

* Add GCActor to the TestLoop. It clears blocks in the background and doesn't need external control.
* Ensure that shard 0 doesn't have validators and has empty chunks for a long time.
* Unload the memtrie for shard 0 and load it back. I checked that in non-strict mode, as before the fix, it panics.
* Additionally, check that if the 2 chunks at the end of an epoch are always missing, and we always move the flat head to the final known block, then snapshotting always fails - so accounting for the latest chunk is actually needed!
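The missing-chunks check above can be illustrated with a toy model: when the trailing chunks of an epoch are missing, the flat head must stop at the predecessor of the last block that actually carried a chunk, which can be strictly earlier than the last final block. A hypothetical sketch, where slice indices stand in for block heights within one epoch:

```rust
// Toy rule: the flat head may move at most to the block *before* the last
// block carrying a new chunk for the shard. `chunk_present[i]` records
// whether block i in the epoch had a chunk. Returns None when no valid
// head exists in this window.
fn max_allowed_flat_head(chunk_present: &[bool]) -> Option<usize> {
    // Index of the last block with a chunk for the shard.
    let last_chunk = chunk_present.iter().rposition(|&p| p)?;
    // The head may reach only its predecessor.
    last_chunk.checked_sub(1)
}
```

With the last two chunks of a 5-block epoch missing (`[true, true, true, false, false]`), the allowed head is block 1, even though the final known block is block 4; moving the head to block 4 is exactly the over-eager update that makes snapshotting fail in the check above.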
New test: `test_load_memtrie_after_empty_chunks`.