fix gc for sharding upgrade #5040

mzhangmzz · 2021-10-19T17:34:13Z

resolve #4710

Also added a python test for gc with sharding upgrade

mzhangmzz · 2021-10-20T21:13:24Z

chain/chain/src/store_validator/validate.rs

-            for item in trie_iterator {
-                unwrap_or_err!(item, "Can't find ShardChunk {:?} in Trie", chunk_header);
-            }
+    // 2) Chunk Extra with `block_hash` and `shard_uid` should be available and match with the new root


I had to change this function for two reasons:

1. During the epoch in which we split shards, TrieChanges for the shards of the next epoch are stored. These trie changes do not have a corresponding ShardChunk.

2. The block may not have a chunk with chunk.shard_id() == shard_id. In the first block of the new epoch after the sharding upgrade, if some chunks are missing, the block includes the chunk from the last block instead, and that chunk has a shard id from the old shard layout. For example, if the chain is changing from 1 shard to 4 shards and the first block in the new epoch does not include any new chunks, then all chunks in the block will have shard_id 0, because they are copied from the last block. Therefore, I changed this function to get chunks by block.chunks.get(shard_id) and to check the chunk content only if the chunk is new.
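The second point can be illustrated with a small Python model. This is only a sketch, not the Rust code in `validate.rs`: the `ChunkHeader` and `Block` classes and both helper functions are hypothetical, and it assumes that a chunk counts as "new" when its `height_included` equals the block height and that the block carries one chunk slot per shard of the new layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkHeader:
    shard_id: int         # shard id recorded inside the chunk header
    height_included: int  # height of the block that first included the chunk


@dataclass
class Block:
    height: int
    chunks: List[ChunkHeader]  # one slot per shard, addressed by position


def find_chunk_by_shard_id(block: Block, shard_id: int) -> Optional[ChunkHeader]:
    """Old approach: search for a chunk whose header carries `shard_id`."""
    for chunk in block.chunks:
        if chunk.shard_id == shard_id:
            return chunk
    return None


def positions_to_validate(block: Block, num_shards: int) -> List[int]:
    """New approach: look chunks up by position and keep only the new ones.

    Assumption: a chunk is "new" when it was included at this block's height.
    """
    positions = []
    for shard_id in range(num_shards):
        if shard_id >= len(block.chunks):
            continue  # mirrors `block.chunks.get(shard_id)` returning nothing
        if block.chunks[shard_id].height_included == block.height:
            positions.append(shard_id)
    return positions
```

With a block at height 10 whose four slots all hold the shard-0 chunk copied from height 9, `find_chunk_by_shard_id(block, 2)` finds nothing, while `positions_to_validate(block, 4)` returns an empty list, so no chunk content is checked for that block.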

bowenwang1996

The test is a bit confusing to me. It does not seem to test that anything is actually garbage collected. We should probably check that after the state split is done, data for new shards can also be garbage collected

mzhangmzz · 2021-10-21T13:57:38Z

> The test is a bit confusing to me. It does not seem to test that anything is actually garbage collected. We should probably check that after the state split is done, data for new shards can also be garbage collected

Good point. I copied the code from an existing test, garbage_collection.py, without thinking about it much. I thought the checks relied only on the get_status rpc calls, which trigger storage validation when test_features are enabled. I'll add the check you suggested.


matklad · 2021-10-21T14:07:44Z

core/primitives/src/shard_layout.rs

@@ -158,6 +158,11 @@ impl ShardLayout {
            Self::V1(v1) => (v1.fixed_shards.len() + v1.boundary_accounts.len() + 1) as NumShards,
        }
    }
+
+    #[inline]


nitpick: I don't think this inline helps here. For the other functions here, #[inline] is needed so that the compiler can inline getters across crates and completely replace function calls with just loads at offsets. Here, we are allocating a Vec, so there is going to be non-trivial, non-inlined logic anyway, and inline doesn't make sense.

Once you have spare time, consider looking at https://matklad.github.io/2021/07/09/inline-in-rust.html -- #[inline] semantics are subtle in Rust and are useful to know.


mzhangmzz · 2021-10-21T18:26:34Z

pytest/tests/sanity/garbage_collection_sharding_upgrade.py

+# all old data should be GCed
+blocks_count = 0
+for height in range(1, 60):
+    block0 = nodes[0].json_rpc('block', [height], timeout=15)


@bowenwang1996 I'm only checking blocks here, because store_validator checks that if a block is gc'ed, all information related to the block, including ChunkExtra and TrieChanges, is also gc'ed.
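A self-contained sketch of this block-only check follows. It is not the actual test file: `FakeNode` is a hypothetical stand-in for the test nodes, `count_available_blocks` is a hypothetical helper, and it assumes that the RPC response for a garbage-collected block carries an `error` entry instead of a `result`.

```python
class FakeNode:
    """Hypothetical stand-in for a test node; only mimics the `block` RPC."""

    def __init__(self, stored_heights):
        self.stored_heights = set(stored_heights)

    def json_rpc(self, method, params, timeout=2):
        height = params[0]
        if method == 'block' and height in self.stored_heights:
            return {'jsonrpc': '2.0', 'result': {'header': {'height': height}}}
        # Assumed shape of the response for a block that is no longer stored.
        return {'jsonrpc': '2.0', 'error': {'message': 'block not found'}}


def count_available_blocks(node, heights):
    """Count the heights for which the node still returns a block."""
    blocks_count = 0
    for height in heights:
        response = node.json_rpc('block', [height], timeout=15)
        if 'result' in response:
            blocks_count += 1
    return blocks_count
```

For a node that has kept only heights 60 and above, `count_available_blocks(node, range(1, 60))` is 0, which is the condition "all old data should be GCed" expresses.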

bowenwang1996 · 2021-10-22T00:57:01Z

@mzhangmzz CI failed :(

In #5040 we introduced a change that could try to access already garbage collected information in the call to `get_shard_uids_to_gc`. More specifically, in `get_next_epoch_id_from_prev_block` we would try to access the block info for the first block of the epoch, which is presumably already garbage collected at this point. The reason we did not catch it in tests is that the epoch manager has a cache of size 1024 and we do not clean up the cache properly during garbage collection, since `EpochManager` is not part of `Chain`.

Fixes #5074

Test plan
----------
`cargo test -p integration-tests --features no_cache test_gc_long_epoch`
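The way a cache can hide a read of garbage-collected data can be shown with a toy model. Everything here is hypothetical (`BlockInfoReader`, `garbage_collect`, and a plain dict standing in for the store), and the cache is unbounded for simplicity rather than limited to 1024 entries; it is a sketch of the failure mode, not of `EpochManager`.

```python
class BlockInfoReader:
    """Hypothetical reader with an optional cache in front of the store."""

    def __init__(self, store, use_cache=True):
        self.store = store
        self.use_cache = use_cache
        self.cache = {}

    def get_block_info(self, block_hash):
        if self.use_cache and block_hash in self.cache:
            return self.cache[block_hash]  # served without touching the store
        info = self.store[block_hash]      # raises KeyError once the entry is GC'd
        if self.use_cache:
            self.cache[block_hash] = info
        return info


def garbage_collect(store, block_hash):
    """Remove the entry from the store only; the reader's cache is not cleaned."""
    store.pop(block_hash, None)
```

If both a cached and an uncached reader load an entry and the entry is then garbage collected, the cached reader keeps returning it while the uncached reader raises `KeyError`. Running the test with caching disabled, as the `no_cache` feature in the test plan suggests, exposes the access that the cache would otherwise mask.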


mzhangmzz requested a review from bowenwang1996 as a code owner October 19, 2021 17:34

mzhangmzz marked this pull request as draft October 19, 2021 17:51

Min Zhang added 2 commits October 20, 2021 16:41

fix gc for sharding_upgrade

cf653e4

add a test for sharding upgrade

33f4d63

mzhangmzz force-pushed the gc_sharding branch from 72d29eb to 33f4d63 Compare October 20, 2021 21:06

mzhangmzz changed the title ~~temp hack for gc for sharding upgrade~~ fix gc for sharding upgrade Oct 20, 2021

mzhangmzz requested a review from EgorKulikov October 20, 2021 21:07


mzhangmzz marked this pull request as ready for review October 20, 2021 21:14

mzhangmzz requested review from frol and matklad as code owners October 20, 2021 21:14

bowenwang1996 reviewed Oct 20, 2021


fix CI

c08b0bb

matklad reviewed Oct 21, 2021

EgorKulikov reviewed Oct 21, 2021

chain/chain/src/store.rs (outdated)

Min Zhang and others added 3 commits October 21, 2021 13:07

address comments and fix CI

ece7321

Merge branch 'master' into gc_sharding

1e6df9b

modify test to run after sharding upgrade and check gc runs properly

a9d13a6


mzhangmzz requested review from bowenwang1996, EgorKulikov and matklad October 21, 2021 18:26

Merge branch 'master' into gc_sharding

e0855b8

bowenwang1996 approved these changes Oct 22, 2021


remove unused tracing debug

9e20250

matklad approved these changes Oct 22, 2021


Min Zhang added 2 commits October 22, 2021 10:53

fix CI

4b099d6

fix CI

9bf4f59

mzhangmzz added 2 commits October 22, 2021 11:09

Merge branch 'master' into gc_sharding

8ca4252

Merge branch 'master' into gc_sharding

5257244

mzhangmzz added the S-automerge label Oct 22, 2021

Min Zhang and others added 3 commits October 22, 2021 12:00

fix CI

cfd8101

fix python test format

f29f75f

Merge refs/heads/master into gc_sharding

20f2184

near-bulldozer bot merged commit e2bf006 into master Oct 22, 2021

near-bulldozer bot deleted the gc_sharding branch October 22, 2021 16:31

This was referenced Oct 25, 2021

Canary node is broken due to #5040 #5074

Closed

fix(chain): garbage collection should not cause node to crash #5081

Merged

bowenwang1996 pushed a commit that referenced this pull request Oct 26, 2021

fix gc for sharding upgrade (#5040)

07d87cb

resolve #4710 Also added a python test for gc with sharding upgrade
