Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix gc for sharding upgrade #5040

Merged
merged 15 commits into from
Oct 22, 2021
Merged

fix gc for sharding upgrade #5040

merged 15 commits into from
Oct 22, 2021

Conversation

mzhangmzz
Copy link
Contributor

@mzhangmzz mzhangmzz commented Oct 19, 2021

resolve #4710

Also added a python test for gc with sharding upgrade

@mzhangmzz mzhangmzz marked this pull request as draft October 19, 2021 17:51
@mzhangmzz mzhangmzz changed the title temp hack for gc for sharding upgrade fix gc for sharding upgrade Oct 20, 2021
for item in trie_iterator {
unwrap_or_err!(item, "Can't find ShardChunk {:?} in Trie", chunk_header);
}
// 2) Chunk Extra with `block_hash` and `shard_uid` should be available and match with the new root
Copy link
Contributor Author

@mzhangmzz mzhangmzz Oct 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to change this function for two reasons:

  1. During the epoch when we split shards, TrieChanges for shards in the next epoch will be stored. These trie changes do not have corresponding ShardChunk.
  2. The block may not have a chunk that has chunk.shard_id() == shard_id. This is because for the first block in the new epoch after sharding is upgraded, if some chunks are missing, the block will include chunk from the last block, which has the shard id in the old shard layout. For example, if the chain is changing from 1 shard to 4 shards, and the first block in the new epoch does not include any new chunks, then all chunks in the block will have shard_id 0, because they are copied from the last block. Therefore, I changed this function to get chunks by block.chunks.get(shard_id) and only check the chunk content if the chunk is new

@mzhangmzz mzhangmzz marked this pull request as ready for review October 20, 2021 21:14
Copy link
Collaborator

@bowenwang1996 bowenwang1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is a bit confusing to me. It does not seem to test that anything is actually garbage collected. We should probably check that after the state split is done, data for new shards can also be garbage collected

@mzhangmzz
Copy link
Contributor Author

mzhangmzz commented Oct 21, 2021

The test is a bit confusing to me. It does not seem to test that anything is actually garbage collected. We should probably check that after the state split is done, data for new shards can also be garbage collected

Good point, I just copied the code from an existing test garbage_collection.py and didn't think too much. I thought the checks just rely on the get_status rpc calls, which triggers storage validation when test_features are enabled. I'll add your suggestion.

chain/chain/src/store.rs Show resolved Hide resolved
@@ -158,6 +158,11 @@ impl ShardLayout {
Self::V1(v1) => (v1.fixed_shards.len() + v1.boundary_accounts.len() + 1) as NumShards,
}
}

#[inline]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: don't think this inline helps here. For other functions here, #[inline] is needed so that compiler can inline getters across the crates and completely replace function calls with just loads at offsets. Here, we are allocating a Vec anyway, so there's going to be non-trivial non-inlined logic anyway, so inline doesn't make sense.

Once you have spare time, consider looking at https://matklad.github.io/2021/07/09/inline-in-rust.html -- #[inline] semantics is subtle in Rust and is useful to know.

# all old data should be GCed
blocks_count = 0
for height in range(1, 60):
block0 = nodes[0].json_rpc('block', [height], timeout=15)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bowenwang1996 I'm just checking block here, because store_validator checks if a block is gc'ed, all information related to the block, including ChunkExtra, TrieChanges are also gc'ed

@bowenwang1996
Copy link
Collaborator

@mzhangmzz CI failed :(

@near-bulldozer near-bulldozer bot merged commit e2bf006 into master Oct 22, 2021
@near-bulldozer near-bulldozer bot deleted the gc_sharding branch October 22, 2021 16:31
bowenwang1996 added a commit that referenced this pull request Oct 25, 2021
In #5040 we introduced a change that could try to access already garbage collected information in the call to `get_shard_uids_to_gc`. More specifically, in `get_next_epoch_id_from_prev_block` we would try to access the block info for the first block of the epoch, which is presumably already garbage collected at this point. The reason why we did not catch it in tests is because epoch manager has a cache of size 1024 and we do not clean up the cache properly during garbage collection since `EpochManager` is not part of `Chain`.

Fixes #5074 

Test plan
----------
`cargo test -p integration-tests --features no_cache test_gc_long_epoch`
bowenwang1996 pushed a commit that referenced this pull request Oct 26, 2021
resolve #4710

Also added a python test for gc with sharding upgrade
bowenwang1996 added a commit that referenced this pull request Oct 26, 2021
In #5040 we introduced a change that could try to access already garbage collected information in the call to `get_shard_uids_to_gc`. More specifically, in `get_next_epoch_id_from_prev_block` we would try to access the block info for the first block of the epoch, which is presumably already garbage collected at this point. The reason why we did not catch it in tests is because epoch manager has a cache of size 1024 and we do not clean up the cache properly during garbage collection since `EpochManager` is not part of `Chain`.

Fixes #5074 

Test plan
----------
`cargo test -p integration-tests --features no_cache test_gc_long_epoch`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass shard_layout to garbage collection
4 participants