
NEPs on how to split states of shards for resharding #241

Merged: 14 commits merged into master from split-states on Sep 14, 2022

Conversation

@mzhangmzz (Contributor)

No description provided.

@mzhangmzz (Contributor, Author)

Most of the spec is done. I still need to fill in some analysis sections, such as Rationale and alternatives, Unresolved questions, etc.

@bowenwang1996 (Collaborator) left a comment

Looks good. One small thing: we prefer to have one sentence per line.

@Longarithm (Member) left a comment

Thank you, it is highly readable!

> …is not a reason to accept the current or a future NEP. Such notes should be
> in the section on motivation or rationale in this or subsequent NEPs.
> The section merely provides additional information.
>
> None
Collaborator:

Well, we've discussed possibly switching to on-chain state updates once we introduce challenges...

@MaksymZavershynskyi (Contributor)

Thank you for the thorough NEP and for following https://gov.near.org/t/quality-control-for-protocol-api-changes/1941#the-format-4, especially for talking about drawbacks and alternatives! Can we also add a list of affected projects (maybe @bowenwang1996 can help since he has a lot of context) and a pre-mortem (imagine the most likely scenario in which this design change could lead to a failure)? Thanks!

Min Zhang added 2 commits July 27, 2021 21:22
…ersion. Addressed some comments but not all. Still need to finish the Future section
@frol (Collaborator) commented Jul 29, 2021

I am sorry for jumping into the discussion late, but reviewing near/nearcore#4554, I feel that from the end-user point of view they should care about some "shard id" (it does not matter much whether it is unique within an epoch or across epochs). Here are the questions I suggest we ask ourselves:

  • Why do we believe the outside world should care about the shard ord rather than a unique shard id?
  • If the outside world should care about the ord, why should they care that it is called "ord" rather than "id"?

If we want some unique shard id, we may just introduce ShardUniqueId (ShardUId) instead of going through this massive renaming effort.

@mzhangmzz (Contributor, Author) commented Jul 29, 2021

> I am sorry for jumping into the discussion late, but reviewing near/nearcore#4554, I feel that from the end-user point of view they should care about some "shard id" […]

@frol thanks for the detailed comment! Regarding your questions:

> Why do we believe the outside world should care about the shard ord rather than a unique shard id?

That's a great question. I think the unique shard id is a concept that is closely tied to how we choose to represent shards internally in the state, so it may be subject to change in the future if the implementation changes. By hiding it from the protocol level and the outside world, such a change would not trigger a protocol-level upgrade in the future.

> If the outside world should care about the ord, why should they care that it is called "ord" rather than "id"?

You are right on this. Initially I didn't think renaming would be such an invasive change, and I wanted the naming to highlight the difference between "ord" and "id". If you think the renaming creates a lot of trouble on the API side, we can definitely skip it and just create a new name for the new id, as you suggested.

@bowenwang1996 (Collaborator) left a comment

One question that occurred to me today: what happens to nonvalidator nodes that specify the shards they track? It seems that, according to what is proposed here, those nodes will start tracking all the shards that the original shard split into. The side effect is that what the node actually tracks will no longer match what is specified in config.json, and if a node stops and restarts after the split is done, it will stop tracking the shards whose ids are not in tracked_shards. This could be confusing for people who need to run nonvalidator nodes, and we should at least document this peculiar behavior. It would certainly be better if we could make this experience smoother for nonvalidator nodes as well.

@mzhangmzz (Contributor, Author)

> One question that occurred to me today: what happens to nonvalidator nodes that specify the shards they track? […]

That's a good point. If we want to make the process smoother for nonvalidator nodes, then when specifying shards to track we can have them supply the epoch_id along with the shard_id; at startup, ShardTracker can then figure out which shards in the current layout those shards have become.

@bowenwang1996 (Collaborator)

> That's a good point. If we want to make the process smoother for nonvalidator nodes, then when specifying shards to track we can have them supply the epoch_id along with the shard_id […]

Well, the main problem is this: let's say someone wants to track all shards. Today they would specify tracked_shards: [0] to do that. But after the split of shards is done, if they stop and restart their node, they will stop tracking every shard other than shard 0, which is suboptimal. To address this, I think we have the following options:

  • Allow a node to specify in config.json that it wants to track all shards, something like tracked_shards: all.
  • When a node stops, overwrite config.json with the shards it actually tracks. This might not be ideal if a node only wants to track one shard, even if that shard is split into two.

@mzhangmzz (Contributor, Author)

> Well, the main problem is this: let's say someone wants to track all shards. Today they would specify tracked_shards: [0] to do that. But after the split of shards is done, if they stop and restart their node, they will stop tracking every shard other than shard 0 […]

Yeah, I understand what you mean. I think the issue, again, is that the meaning of a shard id changes when resharding happens. One way to solve that is to be more specific than just providing the shard_id: the config file would include the epoch_id (or protocol version) along with the shard id. For example, say a node starts from genesis tracking shard 0. Then the config file would have tracked_shards: [0], tracked_shards_epoch: genesis_epoch_id. Later, when the node restarts, ShardTracker knows that this shard 0 is not the current shard 0 but the genesis shard 0, and it can derive which shards it should track now from the EpochInfos stored in the database.
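To make the idea concrete, here is a minimal sketch of how such a config could be resolved at startup. The field tracked_shards_epoch and the EpochManager helper get_current_shards_for are hypothetical names used only for illustration; they are not existing nearcore APIs.

```rust
use std::collections::HashSet;

type ShardId = u64;
type EpochId = [u8; 32];

// Hypothetical config fields pinning the tracked shards to a specific epoch's layout.
struct TrackingConfig {
    tracked_shards: Vec<ShardId>,  // e.g. [0]
    tracked_shards_epoch: EpochId, // e.g. the genesis epoch id
}

trait EpochManager {
    /// Returns the shards in the current layout that descend from `shard_id`
    /// as it existed in `epoch_id` (derived from the stored EpochInfos).
    fn get_current_shards_for(&self, epoch_id: &EpochId, shard_id: ShardId) -> Vec<ShardId>;
}

// At startup, ShardTracker maps each pinned shard forward through all reshardings.
fn resolve_tracked_shards(cfg: &TrackingConfig, em: &dyn EpochManager) -> HashSet<ShardId> {
    cfg.tracked_shards
        .iter()
        .flat_map(|&s| em.get_current_shards_for(&cfg.tracked_shards_epoch, s))
        .collect()
}
```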

@mzhangmzz (Contributor, Author) commented Jul 29, 2021

As I am implementing the change regarding ShardId and ShardOrd, I find that having two different ids referring to shards is quite confusing and could be error prone. We need to be extremely careful to use the correct type whenever we use ShardId or ShardOrd, and since ShardId occurs in so many places in the code, it is painstaking to get this exactly right. It may also introduce problems in the future if we are not careful and mix the two up; since they are the same type (u64), it is quite easy to do so. Thus, I am having second thoughts on whether we should keep them separate or just use ShardId (or ShardUId, as @frol calls it) everywhere.

As I mentioned, the downside of moving away from ShardOrd and using a unique number (whether we call it ShardId or ShardUId) to refer to a shard is that the unique number could be an implementation detail that is subject to change in the future. Thus, we may want to think about how we would approach resharding once challenges are enabled, as well as dynamic resharding, to see whether we want to switch the protocol-level code to also use ShardId instead of ShardOrd.

@bowenwang1996 @abacabadabacaba what are your thoughts?

@bowenwang1996 (Collaborator)

@mzhangmzz using the same type does not really prevent us from mixing them up, does it? It seems that we need some internal representation that is different anyway, so introducing a strictly different type may be helpful (this could be purely an implementation detail).
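For illustration, a minimal sketch of the newtype idea being discussed: wrapping the raw integers in distinct types makes the compiler reject any place where one kind of shard identifier is passed where the other is expected. The field layout below is illustrative, not the actual nearcore definition.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ShardOrd(u64); // position of a shard within one epoch's layout

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ShardUId {
    version: u32,  // bumped on every resharding
    shard_id: u32, // unique within that version
}

fn load_trie_state(shard_uid: ShardUId) {
    // look up the trie keyed by the unique id
    let _ = shard_uid;
}

fn example(ord: ShardOrd, uid: ShardUId) {
    let _ = ord;
    load_trie_state(uid);
    // load_trie_state(ord); // does not compile: expected `ShardUId`, found `ShardOrd`
}
```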

@frol (Collaborator) commented Aug 3, 2021

> Today they would specify tracked_shards: [0]

I suggest we take a step back and ask ourselves: why would anyone care about shard number X, and how would they know that they want to track exactly shard number X? Maybe they care about a set of specific accounts, and we should track based on a list of account ids instead of the shard ord/id/uid?

@mzhangmzz (Contributor, Author)

> I suggest we take a step back and ask ourselves: why would anyone care about shard number X […] Maybe they care about a set of specific accounts, and we should track based on a list of account ids instead of the shard ord/id/uid?

I think that's a very good point. Maybe the outside world should not have information about the internal shard layout at all, since we want the sharding implementation to be as transient as possible. We can provide two options: tracking all shards, or tracking the shards derived from a set of accounts.

@bowenwang1996 (Collaborator)

> Maybe they care about a set of specific accounts, and we should track based on a list of account ids instead of the shard ord/id/uid?

It seems that there are the following use cases:

  • Tracking all shards.
  • Tracking a set of accounts.

Today we already support tracking a set of accounts through tracked_accounts. We just need to make sure that this continues to work after having multiple shards. In addition, we need to make sure that people who want to track all shards have a way to do that.
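As a side note, a short sketch of why account-based tracking composes well with resharding: the tracked accounts are re-mapped to shards through whatever layout the current epoch uses, so the tracked shard set updates itself after a split. The account_id_to_shard_id name mirrors the function mentioned in this thread; the signature and ShardLayout contents here are illustrative.

```rust
use std::collections::HashSet;

type AccountId = String;
type ShardId = u64;

struct ShardLayout {
    // boundary accounts, version, etc. (elided)
}

// Maps an account to the shard that owns it under the given layout (stubbed here).
fn account_id_to_shard_id(account_id: &AccountId, layout: &ShardLayout) -> ShardId {
    let _ = (account_id, layout);
    unimplemented!()
}

// Recomputed each epoch, so a split shard is picked up automatically.
fn shards_for_tracked_accounts(
    tracked_accounts: &[AccountId],
    current_layout: &ShardLayout,
) -> HashSet<ShardId> {
    tracked_accounts
        .iter()
        .map(|a| account_id_to_shard_id(a, current_layout))
        .collect()
}
```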

mzhangmzz added a commit to near/nearcore that referenced this pull request Aug 14, 2021
This changeset implements ShardLayout and ShardConfig in near/NEPs#241
mzhangmzz added a commit to near/nearcore that referenced this pull request Aug 23, 2021
This PR implements ShardUId as specified in near/NEPs#241 so that trie states are indexed by `shard_uid` instead of `shard_id`.

The main change is to the interface around trie storage, which now uses `shard_uid` instead of `shard_id`. The database columns ColState and ColTrieChanges are now indexed by `shard_uid` instead of `shard_id`.

Changes that this PR does not include:

  • changing account_id_to_shard_id to account for a changing shard layout
  • gc

Test Plan:
The main source of errors is places where shard_id and shard_uid are mixed up, or where the wrong shard_version is used for shard_uid, which may result in a "DB not found" error. To test for the first type, we change a wide range of existing unit and integration tests to use a shard layout with shard_version = 1, so that shard_id and shard_uid cannot be mixed up.

Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com>
Comment on lines +285 to +286
> In `apply_chunks`, after processing each chunk, the state changes in `apply_results` are sorted into changes to new shards.
> At the end, we apply these changes to the new shards.
Collaborator:

Ideally this should also be done in a different thread, but for this specific upgrade I think it is fine
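A minimal sketch of the step quoted in this excerpt, assuming the account-to-new-shard mapping is supplied by the new shard layout; type names and signatures are illustrative rather than the actual nearcore code.

```rust
use std::collections::HashMap;

type AccountId = String;
type ShardId = u64;

// One state change produced while applying a chunk of the old shard.
struct StateChange {
    account_id: AccountId,
    // trie key/value payload elided
}

/// Bucket the state changes from one `apply_results` entry by the new shard
/// that will own the affected account after the split.
fn split_state_changes(
    changes: Vec<StateChange>,
    account_to_new_shard: impl Fn(&AccountId) -> ShardId,
) -> HashMap<ShardId, Vec<StateChange>> {
    let mut by_new_shard: HashMap<ShardId, Vec<StateChange>> = HashMap::new();
    for change in changes {
        let shard = account_to_new_shard(&change.account_id);
        by_new_shard.entry(shard).or_default().push(change);
    }
    by_new_shard
}
```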

@bowenwang1996 (Collaborator) left a comment

Thanks for the update!

Comment on lines +212 to +216
> ### Database storage
> The following database columns are stored with `ShardId` as part of the database key, it will be replaced by `ShardUId`
> - ColState
> - ColChunkExtra
> - ColTrieChanges
Collaborator:

Nice!

> #### `TrieCachingStorage`
> Trie storage will be contruct database key from `ShardUId` and hash of the trie node.
> Trie storage will contruct database key from `ShardUId` and hash of the trie node.
Collaborator:

Suggested change:
- Trie storage will contruct database key from `ShardUId` and hash of the trie node.
+ Trie storage will construct database key from `ShardUId` and hash of the trie node.
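A minimal sketch of the key construction described in this excerpt, assuming the ShardUId is serialized as its version plus shard id and prefixed to the 32-byte trie node hash; the exact byte encoding is illustrative, not the nearcore column layout.

```rust
#[derive(Clone, Copy)]
struct ShardUId {
    version: u32,
    shard_id: u32,
}

type CryptoHash = [u8; 32];

// ColState key: ShardUId bytes followed by the trie node hash.
fn trie_node_db_key(shard_uid: ShardUId, node_hash: &CryptoHash) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + node_hash.len());
    key.extend_from_slice(&shard_uid.version.to_le_bytes());
    key.extend_from_slice(&shard_uid.shard_id.to_le_bytes());
    key.extend_from_slice(node_hash);
    key
}
```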

near-bulldozer bot pushed a commit to near/nearcore that referenced this pull request Oct 13, 2021
Move SimpleNightshade from nightly to stable.

This feature enables the transition from one shard to four shards. For more details, see https://near.org/blog/near-launches-simple-nightshade-the-first-step-towards-a-sharded-blockchain/ and near/NEPs#241
@frol added the WG-protocol label (Protocol Standards Work Group should be accountable) on Sep 5, 2022
@frol (Collaborator) commented Sep 5, 2022

@mzhangmzz @bowenwang1996 Is it ready to be merged?

@bowenwang1996 (Collaborator)

This was implemented more than a year ago, so let's merge the pull request. @mzhangmzz could you resolve conflicts and merge the PR? Thanks!

@mzhangmzz requested a review from a team as a code owner on September 14, 2022 15:36
@mzhangmzz merged commit 9e8603d into master on Sep 14, 2022
@mzhangmzz deleted the split-states branch on September 14, 2022 15:58