
[stateless_validation] Missing ChunkExtra on load memtrie on startup #11135

Open
staffik opened this issue Apr 22, 2024 · 10 comments · Fixed by #11153
Labels
A-stateless-validation Area: stateless validation C-bug Category: This is a bug

Comments


staffik commented Apr 22, 2024

Error message after restarting a stateless validation node (shard shuffling enabled). After restart, it attempts to load the memtrie on startup:

```
2024-04-22T21:15:58.252570Z  INFO memtrie: Loading trie to memory for shard s0.v2...
2024-04-22T21:15:58.252573Z DEBUG memtrie: Loading base trie from flat state... shard_uid=s0.v2
thread 'main' panicked at chain/client/src/client_actor.rs:222:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block FJR59St3DjDVR4xvdUa8Mhf5JSoBATAqr2gkYaTBdGHR in shard s0.v2")))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: near_client::client_actor::start_client
   4: nearcore::start_with_config_and_synchronization
   5: neard::cli::RunCmd::run::{{closure}}
   6: tokio::task::local::LocalSet::run_until::{{closure}}
   7: neard::cli::NeardCmd::parse_and_run
   8: neard::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

I think the solution would be to modify load_memtries_on_startup() so that it can take a state_root, as load_mem_trie_on_catchup() does:

```rust
pub fn load_mem_trie_on_catchup(
```

Example usage: https://github.com/near/nearcore/pull/10820/files#diff-ef9c6aaa80a330e446c5365f42be9bff37ba4f898cf519dadd7e17545783c77cR2787
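As a sketch of what that could look like (all types and names below are simplified stand-ins for illustration, not the real nearcore API): the loader accepts an optional state root and only falls back to the ChunkExtra lookup when the caller does not supply one.

```rust
// Illustrative sketch only: ShardUId, StateRoot, and the ChunkExtra column
// are stubbed out here with simplified stand-ins.
use std::collections::HashMap;

type StateRoot = [u8; 32];
type ShardUId = String;

/// Simplified stand-in for the ChunkExtra column: shard -> post state root.
struct ChunkExtraStore {
    roots: HashMap<ShardUId, StateRoot>,
}

#[derive(Debug)]
enum StorageError {
    MissingChunkExtra(ShardUId),
}

/// Hypothetical shape of the proposed fix: accept an explicit state root
/// (as load_mem_trie_on_catchup does) and only fall back to ChunkExtra
/// when none is supplied.
fn load_mem_trie(
    store: &ChunkExtraStore,
    shard_uid: &ShardUId,
    state_root: Option<StateRoot>,
) -> Result<StateRoot, StorageError> {
    let root = match state_root {
        // Caller already knows the root: no ChunkExtra dependency.
        Some(root) => root,
        // Startup path today: derive the root from ChunkExtra, which can fail.
        None => *store
            .roots
            .get(shard_uid)
            .ok_or_else(|| StorageError::MissingChunkExtra(shard_uid.clone()))?,
    };
    // ... here the real code would walk flat state and build the memtrie ...
    Ok(root)
}
```

With an explicit root the lookup (and hence the panic path) is skipped entirely.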

staffik added the C-bug (Category: This is a bug) and A-stateless-validation (Area: stateless validation) labels on Apr 22, 2024

staffik commented Apr 22, 2024

So there is a dependency on ChunkExtra, which for some reason is not available there. But we only need it to get the state_root, which can be obtained without ChunkExtra.

@robin-near

Hmm, so the startup loading logic is a bit different from the catchup loading logic: during catchup we have just created flat storage from a downloaded trie, whereas during startup the flat storage may be in any arbitrary state. The flat head is somewhere, and on top of the flat head there is some set of deltas representing different forks we may still be choosing among in the future. When loading the memtrie, we start with the flat head and then, for each delta, we also construct a new memtrie root to represent the difference. So we cannot just take a different state root for the flat head, as that may not be consistent with the state that the flat state represents; and we cannot just take some other state root that does not correspond to the flat head, because then we may be missing the state root for some fork that we end up building on.

So for example, suppose we have blocks A, B, C, D where B.parent == A, C.parent == A, D.parent == C. The flat head may be at A. We would need to load four state roots corresponding to the post state roots of A, B, C, and D, because technically we may continue building from any of these blocks. The memtrie would contain four roots, and if we apply a chunk on top of B for example, we would query the memtrie root corresponding to B.
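The fork example above can be modeled with a toy sketch (the block names, the map-based "state", and the hash-based "root" are stand-ins for illustration, not nearcore's trie): starting from the flat head's state, each delta applied on top of its parent's state yields one more memtrie root.

```rust
// Toy model of the fork scenario: the "state" is a plain map and a "root"
// is just a hash of its contents.
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

type State = BTreeMap<String, String>;
type Delta = Vec<(String, String)>;

fn state_root(state: &State) -> u64 {
    let mut h = DefaultHasher::new();
    state.hash(&mut h);
    h.finish()
}

/// Build one memtrie root per block: the flat head's root, plus one root for
/// each delta applied on top of its parent's state (deltas may form forks).
fn load_memtrie_roots(
    flat_head_state: &State,
    // (block, parent, delta) in topological order; parent "A" is the flat head.
    deltas: &[(&str, &str, Delta)],
) -> HashMap<String, u64> {
    let mut states: HashMap<String, State> = HashMap::new();
    states.insert("A".to_string(), flat_head_state.clone());
    for (block, parent, delta) in deltas {
        let mut s = states[*parent].clone();
        for (k, v) in delta {
            s.insert(k.clone(), v.clone());
        }
        states.insert(block.to_string(), s);
    }
    states.iter().map(|(b, s)| (b.clone(), state_root(s))).collect()
}
```

With deltas for B and C on top of A and for D on top of C, this yields four roots, matching the four post state roots described above.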

ChunkExtra is the place where we obtain the state root, because the flat state encodes the state corresponding to the post state root of the flat head, which is stored in the ChunkExtra for the flat head. Similarly, for each flat state delta, the delta describes the state transition whose result corresponds to the post state root of the block that the flat state delta is intended for.

As for why ChunkExtra is missing, that's still to be investigated.

@robin-near

Ah ok, I think the bug is here. It's a problem that I deferred during the implementation of memtries that I honestly just forgot about.

```rust
let tip = chain_store.head()?;
```

When we load the memtries, we need to determine which shards' memtries should actually be loaded. At the time, I simply took the tip of the blockchain because, well, we tracked all shards anyway, so the set would only change upon resharding. But now it also changes from epoch to epoch due to single shard tracking.

So the bug was triggered as follows. The node's tip is at height (presumably) 114912712, the first block of a new epoch, in which it is a chunk producer for shard 2. Loading memtries has to start from the flat head at height 114912710, but that is in the previous epoch, where the node was not tracking shard 2. So the ChunkExtra for shard 2 at height 114912710 didn't exist.
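A toy illustration of that mismatch (the heights and shard numbers come from the report above, but the tracking function is hypothetical): selecting shards from the tip's epoch picks shard 2, which the node did not track at the flat head.

```rust
// Toy illustration: shard tracking changes per epoch, so the shard set at
// the chain tip can differ from the shard set at the flat head, which may
// still sit in the previous epoch.
use std::collections::HashSet;

/// Hypothetical epoch boundary: heights >= this are in the new epoch.
const EPOCH_START: u64 = 114912712;

fn tracked_shards(height: u64) -> HashSet<u64> {
    if height >= EPOCH_START {
        // New epoch: the node is a chunk producer for shard 2.
        [2].into_iter().collect()
    } else {
        // Previous epoch: the node tracked shard 0 only.
        [0].into_iter().collect()
    }
}

/// The buggy selection: it consults the tip, even though memtrie loading
/// must start from the flat head (where shard 2 has no ChunkExtra).
fn shards_to_load_buggy(tip_height: u64) -> HashSet<u64> {
    tracked_shards(tip_height)
}
```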

So then, I have some questions:

  • Isn't state sync supposed to catch up on shard 2, and therefore the ChunkExtra for shard 2 should exist in the previous epoch? Does this mean state sync failed during the previous epoch?
  • Loading the memtrie based on the tip is not correct. What we want is to load memtries for all shards we may need them for. If we're at an epoch boundary, technically we may still need the memtries from the previous epoch, because it's not guaranteed that this new epoch is the one we're going to go into (we may have a fork that results in a different epoch, albeit one with the same memtrie requirements). If we arrived at the epoch while the node was up, we would have two memtries loaded: one from the previous epoch that was not unloaded yet, and one loaded by state sync during catchup earlier. So, when starting up, we need to restore the exact same state. What exactly should we do here to match that same state? This seems tricky... @staffik what do you think?
  • Perhaps if we solve the above problem, the state sync problem would also be resolved, because we should probably be trying state sync again before attempting to load the memtries for the supposed-to-be-caught-up shard.


staffik commented Apr 23, 2024

> Isn't state sync supposed to catch up on shard 2, and therefore the ChunkExtra for shard 2 should exist in the previous epoch? Does this mean state sync failed during the previous epoch?

The issue happened in the middle of the epoch, so the memtrie had already been loaded (in the previous epoch, on catchup); state sync worked fine in the previous epoch. Loading the memtrie on catchup does not require ChunkExtra, so it might not have been available in the previous epoch, yet things worked.


staffik commented Apr 23, 2024

> Loading the memtrie based on the tip is not correct. What we want is to load memtries for all shards we may need them for. If we're at an epoch boundary, technically we may still need the memtries from the previous epoch, because it's not guaranteed that this new epoch is the one we're going to go into (we may have a fork that results in a different epoch, albeit one with the same memtrie requirements). If we arrived at the epoch while the node was up, we would have two memtries loaded: one from the previous epoch that was not unloaded yet, and one loaded by state sync during catchup earlier. So, when starting up, we need to restore the exact same state. What exactly should we do here to match that same state? This seems tricky... @staffik what do you think?

We need the flat state to construct a memtrie, and we want a memtrie for each flat state root.
So if we just maintain a bijection between flat state roots and memtries, we should be fine?
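That invariant could be sketched as follows (all names hypothetical): the set of loaded memtrie roots should equal the set of state roots derivable from flat storage, i.e. the flat head's root plus one root per delta.

```rust
// Minimal sketch of the bijection invariant between flat storage and the
// loaded memtries; the types are stand-ins, not nearcore's.
use std::collections::HashSet;

type StateRoot = u64;

struct FlatStorage {
    head_root: StateRoot,
    delta_roots: Vec<StateRoot>,
}

struct MemTries {
    roots: HashSet<StateRoot>,
}

/// The invariant: loaded memtrie roots == {flat head root} ∪ {delta roots}.
fn bijection_holds(flat: &FlatStorage, mem: &MemTries) -> bool {
    let mut expected: HashSet<StateRoot> = flat.delta_roots.iter().copied().collect();
    expected.insert(flat.head_root);
    expected == mem.roots
}
```

On startup, rebuilding memtries until this check passes would restore exactly the state the node had before the restart.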

> we should probably be trying state sync again before attempting to load the memtries for the supposed-to-be-caught-up shard

We would be good as long as the flat state is correct. If we need state sync again because of the memtrie, would that mean some flat state is missing and needs to be state synced too?


staffik commented Apr 23, 2024

@robin-near Do you think tracing would be useful in identifying why ChunkExtra was missing? I am currently not using it because it had many merge conflicts with current master: #10843

@robin-near

> @robin-near Do you think tracing would be useful in identifying why ChunkExtra was missing? I am currently not using it because it had many merge conflicts with current master: #10843

Ah, I forgot to update on this; only robin-near@b85ed54 is needed now for tracing.

@robin-near

Does catchup not write ChunkExtra? Maybe that's where my confusion is.


staffik commented Apr 23, 2024

Ah, yes. After loading the memtrie, we write the ChunkExtras in a loop here:

```rust
self.chain_store_update.save_chunk_extra(block_header.hash(), &shard_uid, new_chunk_extra);
```

bowenwang1996 added a commit to bowenwang1996/nearcore that referenced this issue Apr 25, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 25, 2024
@Longarithm fixed #11135 by forcing the flat storage head to move after state sync. The pytest `single_shard_tracking` exposes this issue.

Instructions to run the test
```
cargo build -p neard --features test_features,statelessnet_protocol
python3 pytest/tests/sanity/single_shard_tracking.py
```

---------

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
@telezhnaya

It's reproducible again: near/stakewars-iv#139

telezhnaya reopened this on Jun 17, 2024