
[stateless_validation] Missing ChunkExtra on load memtrie on startup #11135

Open
staffik opened this issue Apr 22, 2024 · 10 comments · Fixed by #11153
Labels
A-stateless-validation Area: stateless validation C-bug Category: This is a bug

Comments


staffik commented Apr 22, 2024

Error message after restarting a stateless validation node (shard shuffling enabled). After restart, it attempts to load the memtrie on startup:

```
2024-04-22T21:15:58.252570Z  INFO memtrie: Loading trie to memory for shard s0.v2...
2024-04-22T21:15:58.252573Z DEBUG memtrie: Loading base trie from flat state... shard_uid=s0.v2
thread 'main' panicked at chain/client/src/client_actor.rs:222:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block FJR59St3DjDVR4xvdUa8Mhf5JSoBATAqr2gkYaTBdGHR in shard s0.v2")))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: near_client::client_actor::start_client
   4: nearcore::start_with_config_and_synchronization
   5: neard::cli::RunCmd::run::{{closure}}
   6: tokio::task::local::LocalSet::run_until::{{closure}}
   7: neard::cli::NeardCmd::parse_and_run
   8: neard::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

I think the solution would be to modify load_memtries_on_startup() so that it can take a state_root, as load_mem_trie_on_catchup() does:

```rust
pub fn load_mem_trie_on_catchup(
```

Example usage: https://github.com/near/nearcore/pull/10820/files#diff-ef9c6aaa80a330e446c5365f42be9bff37ba4f898cf519dadd7e17545783c77cR2787
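As a sketch of what that could look like (all types and names below are simplified stand-ins for illustration, not the real nearcore API): the loader accepts an optional state root and only falls back to the ChunkExtra lookup when the caller does not supply one.

```rust
// Illustrative sketch only: ShardUId, StateRoot, and the ChunkExtra column
// are stubbed out here with simplified stand-ins.
use std::collections::HashMap;

type StateRoot = [u8; 32];
type ShardUId = String;

/// Simplified stand-in for the ChunkExtra column: shard -> post state root.
struct ChunkExtraStore {
    roots: HashMap<ShardUId, StateRoot>,
}

#[derive(Debug)]
enum StorageError {
    MissingChunkExtra(ShardUId),
}

/// Hypothetical shape of the proposed fix: accept an explicit state root
/// (as load_mem_trie_on_catchup does) and only fall back to ChunkExtra
/// when none is supplied.
fn load_mem_trie(
    store: &ChunkExtraStore,
    shard_uid: &ShardUId,
    state_root: Option<StateRoot>,
) -> Result<StateRoot, StorageError> {
    let root = match state_root {
        // Caller already knows the root: no ChunkExtra dependency.
        Some(root) => root,
        // Startup path today: derive the root from ChunkExtra, which can fail.
        None => *store
            .roots
            .get(shard_uid)
            .ok_or_else(|| StorageError::MissingChunkExtra(shard_uid.clone()))?,
    };
    // ... here the real code would walk flat state and build the memtrie ...
    Ok(root)
}
```

With an explicit root the lookup (and hence the panic path) is skipped entirely.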

staffik added the C-bug (Category: This is a bug) and A-stateless-validation (Area: stateless validation) labels on Apr 22, 2024

staffik commented Apr 22, 2024

So there is a dependency on ChunkExtra, which for some reason is not available there. But we only need it to get the state_root, which can be obtained without ChunkExtra.

@robin-near

Hmm, so the startup loading logic is a bit different from the catchup loading logic: during catchup we have just created flat storage from a downloaded trie, whereas during startup the flat storage may be in any arbitrary state. The flat head is somewhere, and on top of the flat head there is some set of deltas representing different forks we may still be choosing among in the future. When loading the memtrie, we start with the flat head and then, for each delta, we also construct a new memtrie root to represent the difference. So we cannot just take a different state root for the flat head, as that may not be consistent with the state that the flat state represents; and we cannot just take some other state root that does not correspond to the flat head, because then we may be missing the state root for some fork that we end up building on.

So for example, suppose we have blocks A, B, C, D where B.parent == A, C.parent == A, D.parent == C. The flat head may be at A. We would need to load four state roots corresponding to the post state roots of A, B, C, and D, because technically we may continue building from any of these blocks. The memtrie would contain four roots, and if we apply a chunk on top of B for example, we would query the memtrie root corresponding to B.
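The fork example above can be modeled with a toy sketch (the block names, the map-based "state", and the hash-based "root" are stand-ins for illustration, not nearcore's trie): starting from the flat head's state, each delta applied on top of its parent's state yields one more memtrie root.

```rust
// Toy model of the fork scenario: the "state" is a plain map and a "root"
// is just a hash of its contents.
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

type State = BTreeMap<String, String>;
type Delta = Vec<(String, String)>;

fn state_root(state: &State) -> u64 {
    let mut h = DefaultHasher::new();
    state.hash(&mut h);
    h.finish()
}

/// Build one memtrie root per block: the flat head's root, plus one root for
/// each delta applied on top of its parent's state (deltas may form forks).
fn load_memtrie_roots(
    flat_head_state: &State,
    // (block, parent, delta) in topological order; parent "A" is the flat head.
    deltas: &[(&str, &str, Delta)],
) -> HashMap<String, u64> {
    let mut states: HashMap<String, State> = HashMap::new();
    states.insert("A".to_string(), flat_head_state.clone());
    for (block, parent, delta) in deltas {
        let mut s = states[*parent].clone();
        for (k, v) in delta {
            s.insert(k.clone(), v.clone());
        }
        states.insert(block.to_string(), s);
    }
    states.iter().map(|(b, s)| (b.clone(), state_root(s))).collect()
}
```

With deltas for B and C on top of A and for D on top of C, this yields four roots, matching the four post state roots described above.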

ChunkExtra is the place where we obtain the state root, because the flat state encodes the state corresponding to the post state root of the flat head, which is stored in the ChunkExtra for the flat head. Similarly, for each flat state delta, the delta describes the state transition whose result corresponds to the post state root of the block that the flat state delta is intended for.

As for why ChunkExtra is missing, that's still to be investigated.

@robin-near

Ah ok, I think the bug is here. It's a problem that I deferred during the implementation of memtries that I honestly just forgot about.

```rust
let tip = chain_store.head()?;
```

When we load the memtries, we need to determine which shards' memtries should actually be loaded. At the time, I simply took the tip of the blockchain because, well, we tracked all shards anyway, so the set would only change upon resharding. But now it also changes from epoch to epoch due to single shard tracking.

So the bug was triggered as follows. The node's tip is at height (presumably) 114912712, the first block of a new epoch, in which it is a chunk producer for shard 2. Loading memtries has to start from the flat head at height 114912710, but that is in the previous epoch, where the node was not tracking shard 2. So the ChunkExtra for shard 2 at height 114912710 didn't exist.
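A toy illustration of that mismatch (the heights and shard numbers come from the report above, but the tracking function is hypothetical): selecting shards from the tip's epoch picks shard 2, which the node did not track at the flat head.

```rust
// Toy illustration: shard tracking changes per epoch, so the shard set at
// the chain tip can differ from the shard set at the flat head, which may
// still sit in the previous epoch.
use std::collections::HashSet;

/// Hypothetical epoch boundary: heights >= this are in the new epoch.
const EPOCH_START: u64 = 114912712;

fn tracked_shards(height: u64) -> HashSet<u64> {
    if height >= EPOCH_START {
        // New epoch: the node is a chunk producer for shard 2.
        [2].into_iter().collect()
    } else {
        // Previous epoch: the node tracked shard 0 only.
        [0].into_iter().collect()
    }
}

/// The buggy selection: it consults the tip, even though memtrie loading
/// must start from the flat head (where shard 2 has no ChunkExtra).
fn shards_to_load_buggy(tip_height: u64) -> HashSet<u64> {
    tracked_shards(tip_height)
}
```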

So then, I have some questions:

  • Isn't state sync supposed to catch up on shard 2, and therefore the ChunkExtra for shard 2 should exist in the previous epoch? Does this mean state sync failed during the previous epoch?
  • Loading the memtrie based on the tip is not correct. What we want is to load memtries for all shards we may need them for. If we're at an epoch boundary, technically we may still need the memtries from the previous epoch, because it's not guaranteed that this new epoch is the one we're going to go into (we may have a fork that results in a different epoch, albeit one with the same memtrie requirements). If we arrived at the epoch while the node was up, we would have two memtries loaded: one from the previous epoch that was not unloaded yet, and one loaded by state sync during catchup earlier. So, when starting up, we need to restore the exact same state. What exactly should we do here to match that same state? This seems tricky... @staffik what do you think?
  • Perhaps if we solve the above problem, the state sync problem would also be resolved, because we should probably be trying state sync again before attempting to load the memtries for the supposed-to-be-caught-up shard.


staffik commented Apr 23, 2024

> Isn't state sync supposed to catch up on shard 2, and therefore the ChunkExtra for shard 2 should exist in the previous epoch? Does this mean state sync failed during the previous epoch?

The issue happened in the middle of the epoch, so the memtrie had already been loaded (in the previous epoch, on catchup); state sync worked fine in the previous epoch. Loading the memtrie on catchup does not require ChunkExtra, so it might not have been available in the previous epoch, yet things worked.


staffik commented Apr 23, 2024

> Loading the memtrie based on the tip is not correct. What we want is to load memtries for all shards we may need them for. If we're at an epoch boundary, technically we may still need the memtries from the previous epoch, because it's not guaranteed that this new epoch is the one we're going to go into (we may have a fork that results in a different epoch, albeit one with the same memtrie requirements). If we arrived at the epoch while the node was up, we would have two memtries loaded: one from the previous epoch that was not unloaded yet, and one loaded by state sync during catchup earlier. So, when starting up, we need to restore the exact same state. What exactly should we do here to match that same state? This seems tricky... @staffik what do you think?

We need the flat state to construct a memtrie, and we want a memtrie for each flat state root.
So if we just maintain a bijection between flat state roots and memtries, we should be fine?
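That invariant could be sketched as follows (all names hypothetical): the set of loaded memtrie roots should equal the set of state roots derivable from flat storage, i.e. the flat head's root plus one root per delta.

```rust
// Minimal sketch of the bijection invariant between flat storage and the
// loaded memtries; the types are stand-ins, not nearcore's.
use std::collections::HashSet;

type StateRoot = u64;

struct FlatStorage {
    head_root: StateRoot,
    delta_roots: Vec<StateRoot>,
}

struct MemTries {
    roots: HashSet<StateRoot>,
}

/// The invariant: loaded memtrie roots == {flat head root} ∪ {delta roots}.
fn bijection_holds(flat: &FlatStorage, mem: &MemTries) -> bool {
    let mut expected: HashSet<StateRoot> = flat.delta_roots.iter().copied().collect();
    expected.insert(flat.head_root);
    expected == mem.roots
}
```

On startup, rebuilding memtries until this check passes would restore exactly the state the node had before the restart.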

> we should probably be trying state sync again before attempting to load the memtries for the supposed-to-be-caught-up shard

We would be good as long as the flat state is correct. If we need state sync again because of the memtrie, would that mean some flat state is missing and needs to be state synced too?


staffik commented Apr 23, 2024

@robin-near Do you think tracing would be useful in identifying why ChunkExtra was missing? I am currently not using it because it had many merge conflicts with current master: #10843

@robin-near

> @robin-near Do you think tracing would be useful in identifying why ChunkExtra was missing? I am currently not using it because it had many merge conflicts with current master: #10843

Ah, I forgot to update on this; only robin-near@b85ed54 is needed now for tracing.

@robin-near

Does catchup not write ChunkExtra? Maybe that's where my confusion is.


staffik commented Apr 23, 2024

Ah, yes. After loading the memtrie, we write the ChunkExtras in a loop here:

```rust
self.chain_store_update.save_chunk_extra(block_header.hash(), &shard_uid, new_chunk_extra);
```

bowenwang1996 added a commit to bowenwang1996/nearcore that referenced this issue Apr 25, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 25, 2024
@Longarithm fixed #11135 by forcing the flat storage head to move after state sync. The pytest `single_shard_tracking` exposes this issue.

Instructions to run the test
```
cargo build -p neard --features test_features,statelessnet_protocol
python3 pytest/tests/sanity/single_shard_tracking.py
```

---------

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
@telezhnaya

It's reproducible again: near/stakewars-iv#139

telezhnaya reopened this on Jun 17, 2024