refactor: reduce places where we assume shard ids are contiguous #10230

akhi3030 · 2023-11-21T16:19:55Z

There are many places in the repo where we assume that the valid shard ids are in the range [0, num_shards). This PR is an attempt to improve the current state of affairs.

ShardLayout gains a new method, fn shard_ids(), which still relies on the above assumption.
All instances of that assumption are replaced with calls to this function, so that the assumption is centralised in a single place.

Future work:

If we have a function that returns a list of shard ids, we do not need fn num_shards(), since the count can be derived from shard_ids(). It should be removed for consistency.
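A minimal sketch of the approach described above, under stated assumptions: `ShardLayout` here is a hypothetical, simplified stand-in with a single `num_shards` field (not nearcore's actual type), and `ShardId` is assumed to be a `u64` alias. The contiguity assumption lives only in `shard_ids()`, and the shard count is derived from it, as the future-work note suggests.

```rust
/// Assumed alias for illustration.
pub type ShardId = u64;

/// Hypothetical simplified stand-in for `ShardLayout`.
pub struct ShardLayout {
    num_shards: u64,
}

impl ShardLayout {
    pub fn new(num_shards: u64) -> Self {
        Self { num_shards }
    }

    /// The single place that still assumes shard ids are `0..num_shards`.
    pub fn shard_ids(&self) -> impl Iterator<Item = ShardId> {
        0..self.num_shards
    }

    /// Shard count derived from `shard_ids()` instead of stored separately.
    pub fn num_shards(&self) -> u64 {
        self.shard_ids().count() as u64
    }
}
```

Callers then write `for shard_id in layout.shard_ids()` instead of `for shard_id in 0..num_shards`, so only `shard_ids()` has to change once ids stop being contiguous.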

codecov · 2023-11-21T16:31:25Z

Codecov Report

Attention: 56 lines in your changes are missing coverage. Please review.

Comparison is base (2318c3a) 71.80% compared to head (c3270a8) 71.82%.
Report is 1 commit behind head on master.

Files	Patch %	Lines
chain/client/src/client_actor.rs	0.00%	16 Missing ⚠️
tools/amend-genesis/src/lib.rs	0.00%	6 Missing and 1 partial ⚠️
tools/state-viewer/src/epoch_info.rs	0.00%	7 Missing ⚠️
chain/client/src/debug.rs	0.00%	6 Missing ⚠️
chain/chain/src/chain.rs	33.33%	0 Missing and 4 partials ⚠️
chain/epoch-manager/src/validator_selection.rs	78.94%	4 Missing ⚠️
chain/chunks/src/lib.rs	70.00%	0 Missing and 3 partials ⚠️
tools/state-viewer/src/state_changes.rs	0.00%	2 Missing ⚠️
chain/chain/src/flat_storage_creator.rs	50.00%	0 Missing and 1 partial ⚠️
chain/chain/src/test_utils.rs	0.00%	1 Missing ⚠️
... and 5 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10230      +/-   ##
==========================================
+ Coverage   71.80%   71.82%   +0.01%     
==========================================
  Files         707      707              
  Lines      141904   141944      +40     
  Branches   141904   141944      +40     
==========================================
+ Hits       101896   101953      +57     
+ Misses      35290    35272      -18     
- Partials     4718     4719       +1

Flag	Coverage Δ
backward-compatibility	`0.08% <0.00%> (-0.01%)`	⬇️
db-migration	`0.08% <0.00%> (-0.01%)`	⬇️
genesis-check	`1.23% <0.00%> (-0.01%)`	⬇️
integration-tests	`36.16% <35.25%> (+<0.01%)`	⬆️
linux	`71.71% <59.42%> (+<0.01%)`	⬆️
linux-nightly	`71.59% <59.71%> (+0.03%)`	⬆️
macos	`55.71% <47.82%> (+1.80%)`	⬆️
pytests	`1.46% <0.00%> (-0.01%)`	⬇️
sanity-checks	`1.26% <0.00%> (-0.01%)`	⬇️
unittests	`68.16% <50.35%> (+0.01%)`	⬆️
upgradability	`0.13% <0.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown.


nagisa · 2023-11-22T12:39:04Z

chain/chain/src/test_utils/kv_runtime.rs

@@ -417,6 +417,10 @@ impl EpochManagerAdapter for MockEpochManager {
        Ok(self.num_shards)
    }

+    fn shard_ids(&self, _epoch_id: &EpochId) -> Result<Vec<ShardId>, EpochError> {


Allocating here runs a risk of making some hot loops that check for shard_ids quite a bit more expensive. I would advocate for a SmallVec or perhaps returning an impl Iterator<Item=ShardId> (insert the result wrapper where you think it is more appropriate.)

perhaps returning an impl Iterator<Item=ShardId> (insert the result wrapper where you think it is more appropriate.)

This is a trait method, so AFAIU it cannot return an iterator. Is there some way of doing that in Rust now?

Allocating here runs a risk of making some hot loops that check for shard_ids quite a bit more expensive. I would advocate for a SmallVec

Are you worried about the performance for the test code or the production code as well? I am asking because your comment is on the test code.

We could choose to return a SmallVec, or alternatively we could add a separate function, fn is_valid_shard_id(), to handle those use cases.

I am also wondering if we might be doing some premature optimisation. Do you have specific checks in mind that will call this in a hot loop?
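The `fn is_valid_shard_id()` alternative mentioned above might look roughly like this. The trait `ShardIdSource` and the struct `CountedShards` are hypothetical simplifications (the real `EpochManagerAdapter` methods take an `EpochId` and return `Result`s). The default method is written in terms of `shard_ids()`, while an implementor can override it with a check that avoids building a `Vec`.

```rust
/// Assumed alias for illustration.
pub type ShardId = u64;

/// Hypothetical, heavily simplified stand-in for an epoch-manager trait.
pub trait ShardIdSource {
    /// Allocates a fresh `Vec` on every call.
    fn shard_ids(&self) -> Vec<ShardId>;

    /// Default membership check in terms of `shard_ids()` (still allocates).
    fn is_valid_shard_id(&self, shard_id: ShardId) -> bool {
        self.shard_ids().contains(&shard_id)
    }
}

/// Hypothetical implementor that only stores a shard count.
pub struct CountedShards {
    pub num_shards: u64,
}

impl ShardIdSource for CountedShards {
    fn shard_ids(&self) -> Vec<ShardId> {
        (0..self.num_shards).collect()
    }

    /// Override that answers membership without allocating.
    fn is_valid_shard_id(&self, shard_id: ShardId) -> bool {
        shard_id < self.num_shards
    }
}
```

The override still encodes the contiguity assumption, but only inside the one implementor that owns it, so hot-path callers never allocate just to validate an id.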

Are you worried about the performance for the test code or the production code as well? I am asking because your comment is on the test code.

My concern is largely with the production code. For example, chain/chain/src/chain.rs calls this function frequently, and I don’t have a good intuition for how frequently some of the functions there get called (especially the more nebulous ones like has_all_receipts). Anyway, I just grepped for the first definition and wrote my comment there.

This is a trait method, so AFAIU it cannot return an iterator. Is there some way of doing that in Rust now?

Yeah, traits make impl Iterator harder. It is still possible with associated types, like so:

trait Foo { type ShardIdIterator = ...; fn shard_ids(&self) -> Self::ShardIdIterator { ... } }

but unfortunately that requires an unstable feature if you can’t name the type being returned easily.
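When the returned type can be named, the associated-type approach works on stable Rust. A minimal sketch, assuming hypothetical names (`ShardIds`, `Contiguous`, `Explicit`): a generic associated type (stable since Rust 1.65) lets an implementor return an iterator that borrows from `self`, so neither implementation allocates. One caveat: a trait with a generic associated type cannot be used as a `dyn` trait object. Return-position `impl Trait` in traits was later stabilized in Rust 1.75, with the same `dyn` limitation.

```rust
use std::iter::Copied;
use std::ops::Range;
use std::slice;

/// Assumed alias for illustration.
pub type ShardId = u64;

/// Hypothetical trait: each implementor names its own iterator type.
pub trait ShardIds {
    type ShardIdIter<'a>: Iterator<Item = ShardId>
    where
        Self: 'a;

    fn shard_ids(&self) -> Self::ShardIdIter<'_>;
}

/// Contiguous layout: the iterator type is simply `Range<ShardId>`.
pub struct Contiguous {
    pub num_shards: u64,
}

impl ShardIds for Contiguous {
    type ShardIdIter<'a> = Range<ShardId> where Self: 'a;

    fn shard_ids(&self) -> Self::ShardIdIter<'_> {
        0..self.num_shards
    }
}

/// Layout with an explicit (possibly non-contiguous) list of ids,
/// iterated by borrowing from `self` instead of cloning.
pub struct Explicit {
    pub ids: Vec<ShardId>,
}

impl ShardIds for Explicit {
    type ShardIdIter<'a> = Copied<slice::Iter<'a, ShardId>> where Self: 'a;

    fn shard_ids(&self) -> Self::ShardIdIter<'_> {
        self.ids.iter().copied()
    }
}

/// Generic caller: works with any implementor, no allocation involved.
pub fn count_shards<T: ShardIds>(layout: &T) -> usize {
    layout.shard_ids().count()
}
```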

I am also wondering if we might be doing some premature optimisation. Do you have specific checks in mind that will call this in a hot loop?

A fair point. It is unfortunate we do not have a good automated performance evaluation tool, and it is also true we’ve landed other major changes to performance sensitive areas in the past without too much ceremony.

but unfortunately that requires an unstable feature if you can’t name the type being returned easily.

I am excited to see this feature land in the near future.

A fair point. It is unfortunate we do not have a good automated performance evaluation tool, and it is also true we’ve landed other major changes to performance sensitive areas in the past without too much ceremony.

I am leaning towards not worrying too much about the potential performance impact then. I believe we do some performance testing before deploying to mainnet, so hopefully any issues would be caught there.

nagisa · 2023-11-22T14:03:57Z

chain/epoch-manager/src/validator_selection.rs

@@ -64,7 +64,7 @@ pub fn proposals_to_epoch_info(
                &mut chunk_producer_proposals,
                max_cp_selected,
                min_stake_ratio,
-                num_shards,
+                shard_ids.len() as NumShards,


Should still use epoch_config.shard_layout.num_shards() here I feel, or remove that method altogether.

If this PR is merged, I plan on removing the num_shards() function altogether. I wanted to do it piecemeal, since doing it in a single PR made the PR quite big.


wacban

fantastic, thanks!

wacban · 2023-11-23T10:27:11Z

chain/chunks/src/client.rs

@@ -242,8 +240,8 @@ mod tests {

        tracing::info!("checking the pool after resharding");
        {
-            let num_shards = new_shard_layout.num_shards();
-            for shard_id in 0..num_shards {
+            let shard_ids: Vec<_> = new_shard_layout.shard_ids().collect();


Nit: collect_vec from itertools is an alternative to the explicit type annotation. It's just a preference, up to you.

wacban · 2023-11-23T10:29:09Z

chain/chunks/src/lib.rs

@@ -1273,7 +1276,7 @@ impl ShardsManager {
            };
        }

-        if header.shard_id() >= self.epoch_manager.num_shards(&epoch_id)? {
+        if !self.epoch_manager.shard_ids(&epoch_id)?.contains(&header.shard_id()) {


This is a bit slower than before, and it hints that the shards should actually be stored as a set rather than a vector. Totally not for this PR, just general thoughts for the future.

Yes, I noticed the same thing. In general, I wonder how often we can expect the number of shards to change and whether the information could be cached. Still, we should do some performance analysis before introducing such complexity.

In theory the shard layout could change every epoch or every two epochs - quite a long time to begin with. In practice it's likely to change much less frequently. The reason it cannot change more often is that it's tied to the protocol version, which can only change at an epoch boundary.
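A sketch of the set-plus-cache idea discussed above, assuming hypothetical names throughout (`ShardIdCache`, an `EpochId` aliased to `u64`, and a caller-supplied `load` closure standing in for the epoch manager). Since the layout can only change at an epoch boundary, one cached `BTreeSet` per epoch is enough, and membership becomes a logarithmic lookup instead of a linear `Vec::contains`. As noted in the thread, this is only worth adding if profiling shows the check matters.

```rust
use std::collections::{BTreeSet, HashMap};

/// Assumed aliases for illustration.
pub type ShardId = u64;
pub type EpochId = u64;

/// Hypothetical per-epoch cache of shard-id sets.
#[derive(Default)]
pub struct ShardIdCache {
    by_epoch: HashMap<EpochId, BTreeSet<ShardId>>,
}

impl ShardIdCache {
    /// Returns the cached set for `epoch_id`, calling `load` only on a miss.
    pub fn shard_ids<F>(&mut self, epoch_id: EpochId, load: F) -> &BTreeSet<ShardId>
    where
        F: FnOnce() -> Vec<ShardId>,
    {
        self.by_epoch
            .entry(epoch_id)
            .or_insert_with(|| load().into_iter().collect())
    }

    /// Set membership check instead of scanning a `Vec`.
    pub fn is_valid(
        &mut self,
        epoch_id: EpochId,
        shard_id: ShardId,
        load: impl FnOnce() -> Vec<ShardId>,
    ) -> bool {
        self.shard_ids(epoch_id, load).contains(&shard_id)
    }
}
```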

wacban · 2023-11-23T10:30:46Z

chain/client/src/client_actor.rs

-                                )
-                            })
-                            .collect();
+                    let shards_to_sync: Vec<_> = self


it's prettier now :)

wacban · 2023-11-23T10:36:34Z

core/primitives/src/shard_layout.rs

@@ -194,7 +194,7 @@ impl ShardLayout {
    /// Returns error if `shard_id` is an invalid shard id in the current layout
    /// Panics if `self` has no parent shard layout
    pub fn get_parent_shard_id(&self, shard_id: ShardId) -> Result<ShardId, ShardLayoutError> {
-        if shard_id > self.num_shards() {
+        if !self.shard_ids().any(|id| id == shard_id) {


nit: contains

shard_ids() is returning an iterator here, not a vector. AFAICT, there is no contains function on iterators.
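A small sketch of the point above: `Iterator` has no `contains` method, so `any` is the idiomatic membership test, while `Range` and slices do have an inherent `contains`. The two free functions below are hypothetical stand-ins for the check in `get_parent_shard_id`. Note also that the removed condition `shard_id > self.num_shards()` lets `shard_id == num_shards` through even though that id lies outside `[0, num_shards)`; the membership-based check rejects it.

```rust
/// Assumed alias for illustration.
pub type ShardId = u64;

/// Mirrors the removed check; it misses `shard_id == num_shards`.
pub fn old_is_invalid(shard_id: ShardId, num_shards: u64) -> bool {
    shard_id > num_shards
}

/// Mirrors the new check: `Iterator::any` stands in for the missing `contains`.
pub fn new_is_invalid(shard_id: ShardId, mut shard_ids: impl Iterator<Item = ShardId>) -> bool {
    !shard_ids.any(|id| id == shard_id)
}
```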

wacban · 2023-11-23T10:37:46Z

genesis-tools/genesis-csv-to-json/src/csv_to_json_configs.rs

@@ -38,7 +38,7 @@ pub fn csv_to_json_configs(home: &Path, chain_id: String, tracked_shards: Vec<Sh
    // Verify that key files exist.
    assert!(home.join(NODE_KEY_FILE).as_path().exists(), "Node key file should exist");

-    if tracked_shards.iter().any(|shard_id| *shard_id >= NUM_SHARDS) {
+    if tracked_shards.iter().any(|&shard_id| shard_id >= NUM_SHARDS) {


this should check if every tracked shard_id is in the shard_ids vector

Hmm... I wasn't sure what to do here. We have only defined NUM_SHARDS here and not a vector of shard ids. I think in a separate change, we can have a global vector of shard ids instead of NUM_SHARDS, and that will clean things up more. WDYT?

I just realized this is outside of regular node operation. Yeah, it's hard to tell what the number of shards is without access to the chain, the epoch and the current shard layout. I'm totally fine leaving it as is. Do you know where this genesis tool is used?

No idea what the purpose of this tool is sadly.
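A sketch of the check suggested above, assuming a hypothetical `SHARD_IDS` list that replaces the tool's `NUM_SHARDS` constant (as proposed in the reply); the helper name `unknown_tracked_shards` is also hypothetical. Returning the offending ids rather than a bare boolean lets the caller's error message name them.

```rust
/// Assumed alias for illustration.
pub type ShardId = u64;

/// Hypothetical replacement for `NUM_SHARDS`: an explicit list of shard ids,
/// which does not have to be contiguous.
pub const SHARD_IDS: &[ShardId] = &[0, 1, 2, 3];

/// Returns every tracked shard id that is not in `shard_ids`.
pub fn unknown_tracked_shards(tracked_shards: &[ShardId], shard_ids: &[ShardId]) -> Vec<ShardId> {
    tracked_shards
        .iter()
        .copied()
        .filter(|id| !shard_ids.contains(id))
        .collect()
}
```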

wacban · 2023-11-23T10:39:11Z

nearcore/src/runtime/mod.rs

            let mut all_proposals = vec![];
            let mut all_receipts = vec![];
-            for i in 0..num_shards {
+            for shard_id in shard_ids {


akhi3030

Thanks for the feedback, folks. I will merge this as is and look into additional work in separate PRs.


…on (#10238) Instead of defining the number of shards and assuming they are contiguous, define a list of shard ids. As future work, we could try putting holes in the shard id list and see what tests / code breaks. Also see #10230 (comment).
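To illustrate why holes in the shard id list would break code that loops over `0..num_shards`, here is a hypothetical helper (not part of nearcore) that compares the ids a contiguous range would produce with the real list.

```rust
/// Assumed alias for illustration.
pub type ShardId = u64;

/// Returns (ids that `0..len` yields but do not exist,
///          real ids that `0..len` never visits).
pub fn contiguity_mismatch(shard_ids: &[ShardId]) -> (Vec<ShardId>, Vec<ShardId>) {
    // What code assuming contiguous ids would iterate over.
    let assumed: Vec<ShardId> = (0..shard_ids.len() as u64).collect();
    let spurious: Vec<ShardId> = assumed
        .iter()
        .copied()
        .filter(|id| !shard_ids.contains(id))
        .collect();
    let missed: Vec<ShardId> = shard_ids
        .iter()
        .copied()
        .filter(|id| !assumed.contains(id))
        .collect();
    (spurious, missed)
}
```

For example, with shard ids `[0, 1, 3]` the range `0..3` yields the nonexistent id `2` and never visits shard `3`.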


) See #10230; #10242; etc. for prior work. I am slowly trying to remove all uses of the `num_shards()` function and improving the code quality in the process. Changes made in this PR:
- Use better names for variables to improve readability.
- Before, `shard_receipts` was a vector, but it is cleaner for it to be a hashmap instead.
- Clarify that `account_id_to_shard_id_map` is just a simple cache, and use more idiomatic Rust when accessing it.
CC: @wacban
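A sketch of the `shard_receipts` change described above: receipts keyed by shard id in a `HashMap` instead of indexed by position in a `Vec`. The `Receipt` type and `group_receipts_by_shard` are hypothetical illustrations, not the code from that PR, and reporting a receipt for an unknown shard as an error is an assumption about the desired behaviour.

```rust
use std::collections::HashMap;

/// Assumed alias for illustration.
pub type ShardId = u64;

/// Hypothetical receipt type.
#[derive(Debug, Clone, PartialEq)]
pub struct Receipt {
    pub receiver_shard: ShardId,
    pub payload: String,
}

/// Groups receipts by destination shard. Every id in `shard_ids` gets an
/// entry (possibly empty), and no position-to-id mapping is assumed.
/// Returns `Err(shard_id)` for a receipt addressed to an unknown shard.
pub fn group_receipts_by_shard(
    shard_ids: &[ShardId],
    receipts: Vec<Receipt>,
) -> Result<HashMap<ShardId, Vec<Receipt>>, ShardId> {
    let mut shard_receipts: HashMap<ShardId, Vec<Receipt>> =
        shard_ids.iter().map(|&id| (id, Vec::new())).collect();
    for receipt in receipts {
        match shard_receipts.get_mut(&receipt.receiver_shard) {
            Some(bucket) => bucket.push(receipt),
            None => return Err(receipt.receiver_shard),
        }
    }
    Ok(shard_receipts)
}
```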

WIP

1ca831f

akhi3030 requested review from wacban and shreyan-gupta November 21, 2023 16:20

akhi3030 marked this pull request as ready for review November 21, 2023 16:20

akhi3030 requested a review from a team as a code owner November 21, 2023 16:20

fix compiler error

c3270a8

nagisa reviewed Nov 22, 2023


nagisa approved these changes Nov 23, 2023


wacban approved these changes Nov 23, 2023


akhi3030 mentioned this pull request Nov 23, 2023

Get rid of num_shards() #10237

Closed

akhi3030 commented Nov 23, 2023


akhi3030 added this pull request to the merge queue Nov 23, 2023

akhi3030 mentioned this pull request Nov 23, 2023

refactor: replace another instance of continuguous shard ids assumption #10238

Merged

Merged via the queue into near:master with commit 431ae41 Nov 23, 2023
16 of 17 checks passed

akhi3030 deleted the iter-shard-ids branch November 23, 2023 16:52

This was referenced Nov 23, 2023

refactor: remove another instance of assuming contiguous shard ids #10242

Merged

refactor: remove yet another instance of assuming contiguous shard ids #10243

Merged

github-merge-queue bot pushed a commit that referenced this pull request Nov 24, 2023

refactor: remove yet another instance of assuming contiguous shard ids (

193734e

#10243) See #10230 and #10242 for prior works. CC: @wacban.

github-merge-queue bot pushed a commit that referenced this pull request Nov 24, 2023

refactor: remove another instance of assuming contiguous shard ids (#…

6050839

…10242) See #10230 for prior work. CC: @wacban.

akhi3030 mentioned this pull request Nov 24, 2023

refactor: stop using num_shards() in build_receipts_hashes() #10249

Merged
