
feat: strict flat storage update to fix GC issue #11599

Merged · 14 commits into near:master · Jun 18, 2024

Conversation

@Longarithm (Member) commented Jun 17, 2024

Solution to #11583.

The current logic for updating a shard's flat storage doesn't work for memtrie loading in a rare case. If the shard has no active validators and receives no transactions or receipts, the block at the flat storage head gets GC-d, and the subsequent attempt to read its state root for an assertion naturally panics.

This happens because we have a non-strict mode, which itself exists to make `StateSnapshot` work.

But essentially it is enough not to move the flat storage head past `epoch_last_block.chunks(shard_id).prev_block_hash()`. The flat storage state past this block corresponds exactly to the state we are syncing; see also #11600. So this is exactly the new flat head candidate we compute and pass to `update_flat_head`. Passing tests show that non-strict mode is not needed.
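As a rough illustration, the candidate computation amounts to the sketch below. The helper names and signatures here are assumptions, not the exact nearcore API:

// Sketch only: `Chain`, `get_block_header`, `get_block` and the error type
// are stand-ins; treat every signature as an assumption.
fn flat_head_candidate(
    chain: &Chain,
    block_hash: &CryptoHash,
    shard_id: ShardId,
) -> Result<CryptoHash, Error> {
    // The flat head may only move up to the last *final* block...
    let header = chain.get_block_header(block_hash)?;
    let final_block = chain.get_block(header.last_final_block())?;
    // ...and within it, only up to the prev block of the last chunk for
    // our shard. Past that point the flat state is exactly the state
    // being synced, so it must not be GC-d.
    let chunk_header = &final_block.chunks()[shard_id as usize];
    Ok(*chunk_header.prev_block_hash())
}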

After that, a GC problem can only remain if there are no chunks for the shard, or no finality, across the stored epochs, and we already assume during development that this doesn't happen.

Nayduck will be at https://nayduck.nearone.org/#/run/149

Practical example

One edge case where the state snapshot still works is when the client has just processed the second block in an epoch. The last final block is then no earlier than the last block of the previous epoch, so the new flat head is no earlier than the `prev_block_hash` of the last chunk for our shard in that block. The state snapshot therefore still works.

In the old implementation, this was guaranteed because, while we passed the last final block, we took two steps back over non-empty state transitions. The first jump is guaranteed to skip the last block, because it may contain validator updates; the second jump is guaranteed to skip the last chunk. So the guarantees are the same.
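For comparison, the old rule looked roughly like the following sketch; the `prev_block_with_new_chunk` helper is assumed for illustration, not a real nearcore function:

// Sketch of the previous, equivalent guarantee: from the last final block,
// step back twice over non-empty state transitions. The first step skips
// the last block of the epoch (it may contain validator updates); the
// second skips the last chunk itself.
fn old_flat_head_candidate(
    chain: &Chain,
    last_final_hash: CryptoHash,
    shard_id: ShardId,
) -> Result<CryptoHash, Error> {
    let mut hash = last_final_hash;
    for _ in 0..2 {
        // Assumed helper: latest ancestor block in which `shard_id` had a
        // new chunk, i.e. a non-empty state transition.
        hash = chain.prev_block_with_new_chunk(&hash, shard_id)?;
    }
    Ok(hash)
}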

test_load_memtrie_after_empty_chunks

  • Add GCActor to the TestLoop. It clears blocks in the background and doesn't need external control.
  • Ensure that shard 0 has no validators and only empty chunks for a long time.
  • Unload the memtrie for shard 0 and load it back. I checked that in non-strict mode, as before the fix, this panics.
  • Additionally, check that if the last 2 chunks of an epoch are always missing and we always move the flat head to the latest known final block, then snapshotting always fails, so accounting for the latest chunk is actually needed! (A hedged sketch of the test setup follows this list.)
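As a rough outline, the setup might look as follows; the builder and helper methods below are assumptions standing in for the actual TestLoop API, not real nearcore calls:

// Hypothetical outline only; every builder and helper method is assumed.
#[test]
fn test_load_memtrie_after_empty_chunks() {
    let mut env = TestLoopBuilder::new()
        .with_gc_actor()              // GC runs in the background, unprompted
        .shard_without_validators(0)  // shard 0: no validators, empty chunks
        .build();

    // Run long enough that blocks around the old flat head get GC-d.
    env.run_for_epochs(5);

    // With strict flat storage updates this round-trip succeeds; in the
    // old non-strict mode the reload panicked on a GC-d flat head block.
    env.unload_memtrie(0);
    env.load_memtrie(0);
}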

@Longarithm changed the title from "draft: strict flat storage update" to "feat: strict flat storage update to fix GC issue" on Jun 17, 2024
@Longarithm Longarithm added the A-storage Area: storage and databases label Jun 17, 2024
@Longarithm Longarithm marked this pull request as ready for review June 17, 2024 22:02
@Longarithm Longarithm requested a review from a team as a code owner June 17, 2024 22:02
codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 80.80808% with 19 lines in your changes missing coverage. Please review.

Project coverage is 71.50%. Comparing base (b02f273) to head (8e7895e).

Files Patch % Lines
chain/chain/src/chain.rs 83.33% 1 Missing and 7 partials ⚠️
chain/chain/src/chain_update.rs 0.00% 7 Missing ⚠️
core/store/src/flat/manager.rs 66.66% 1 Missing ⚠️
tools/flat-storage/src/commands.rs 0.00% 1 Missing ⚠️
tools/fork-network/src/cli.rs 0.00% 1 Missing ⚠️
tools/state-viewer/src/apply_chain_range.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11599      +/-   ##
==========================================
+ Coverage   71.46%   71.50%   +0.04%     
==========================================
  Files         788      788              
  Lines      160697   160754      +57     
  Branches   160697   160754      +57     
==========================================
+ Hits       114836   114945     +109     
+ Misses      40851    40796      -55     
- Partials     5010     5013       +3     
Flag Coverage Δ
backward-compatibility 0.23% <0.00%> (-0.01%) ⬇️
db-migration 0.23% <0.00%> (-0.01%) ⬇️
genesis-check 1.36% <0.00%> (-0.01%) ⬇️
integration-tests 37.71% <64.64%> (+<0.01%) ⬆️
linux 68.91% <80.80%> (+0.03%) ⬆️
linux-nightly 70.93% <80.80%> (-0.01%) ⬇️
macos 52.52% <73.17%> (+1.59%) ⬆️
pytests 1.59% <0.00%> (-0.01%) ⬇️
sanity-checks 1.39% <0.00%> (-0.01%) ⬇️
unittests 66.17% <73.17%> (+0.03%) ⬆️
upgradability 0.28% <0.00%> (-0.01%) ⬇️


@shreyan-gupta (Contributor) left a comment


I don't completely understand this :(
Approving for unblocking.

@@ -365,7 +366,7 @@ impl FlatStorage {
     // new_head
     //
     // The segment [new_head, block_hash] contains two blocks with flat state changes.
-    pub fn update_flat_head(
+    pub fn update_flat_head_impl(
Contributor:
nit: consider removing pub qualifier?
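One way to act on this nit is sketched below; this shows the visibility change only, and the exact parameters of the impl function are an assumption:

impl FlatStorage {
    // The public entry point keeps the old name and API...
    pub fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError> {
        self.update_flat_head_impl(new_head)
    }

    // ...while the implementation drops its `pub` qualifier and stays
    // private to the module.
    fn update_flat_head_impl(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError> {
        // ... move the flat head and apply deltas up to `new_head` ...
        unimplemented!()
    }
}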

@bowenwang1996 (Collaborator) left a comment


Could we add a test that spins up a node with memtrie and an empty shard, runs it for 5 epochs, stops it, and then restarts it to ensure that it works?

@wacban (Contributor) left a comment


LGTM

Comment on lines +2009 to +2010
// If shard layout was changed, the update is impossible so we skip
// getting candidate.
Contributor:

nice

fn garbage_collect_memtrie_roots(&self, block: &Block, shard_uid: ShardUId) {
/// Gets new flat storage head candidate for given `shard_id` and newly
/// processed `block`.
/// It will be `block.last_final_block().chunk(shard_id).prev_block_hash()`
Contributor:

very nice, this is neat

core/store/src/flat/manager.rs (review thread resolved)
core/store/src/flat/manager.rs (outdated review thread, resolved)
};
// Test hook: deliberately skip producing the last two chunks of each epoch.
if blocks_until_end_of_epoch <= 2 {
    info!(target: "client", shard_id, next_height, blocks_until_end_of_epoch, "SKIP!!!!!!!!!");
    return Err(Error::ChunkProducer("SKIP!!!!!!!!!".to_string()));
}
Contributor:

lol

@Longarithm (Member, Author) replied: [image]

@Longarithm (Member, Author) commented:

@bowenwang1996 please take a brief look at the test; I'm going to merge it soon.

@Longarithm Longarithm added this pull request to the merge queue Jun 18, 2024
Merged via the queue into near:master with commit c020ee5 Jun 18, 2024
30 checks passed
@Longarithm Longarithm deleted the fs-strict branch June 18, 2024 20:03
shreyan-gupta pushed a commit that referenced this pull request Jun 18, 2024
Labels: A-storage (Area: storage and databases)