RoutingTable V2: Distance Vector Routing #9187

Merged
17 commits merged into near:master on Jul 18, 2023

Conversation

@saketh-are (Collaborator) commented Jun 13, 2023

Suggested Review Path

  1. Browse the (relatively small) changes outside of the chain/network/src/routing folder to understand the external surface of the new RoutingTableV2 component.
  2. Check out the architecture diagram and event flows documented below.
  3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
  4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
  5. Return to the EdgeCache and review its implementation.
  6. Revisit the call-sites outside of the routing folder.

Architecture

(Architecture diagram: https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

Event Flows

  • Network Topology Changes
    • Three kinds: a peer connected, a peer disconnected, or a PeerMessage carrying a new DistanceVector was received
    • These are triggered by the PeerActor and flow into the PeerManagerActor, then into the demux
    • The demux sends batches of updates (at most once per second) to the RoutingTableV2
    • The RoutingTable processes the entire batch, expires any outdated routes (those relying on too-old edges), then generates an updated RoutingTableView and local DistanceVector
    • If the local DistanceVector changes, it is then broadcast to all peers
  • Handle RoutedMessage
    • Received by the PeerActor, which calls into PeerManagerActor for routing decisions
    • Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
    • Select a "next hop" from the RoutingTableView and forward the message (a sketch of this flow follows the list)
  • Handle response to a RoutedMessage
    • Received by the PeerActor, which calls into PeerManagerActor for routing decisions
    • Fetch the "previous hop" recorded for the original message from the RouteBackCache and relay the response back through it
  • Connection started
    • When two nodes A and B connect, each spawns a PeerActor managing the connection
    • A sends a partially signed edge, which B then signs to produce a complete signed edge
    • B adds the signed edge to its local routing table, triggering re-computation of routes
    • B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
  • Connection stopped
    • Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
    • Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
    • A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
    • If B is still running, it will go through the same steps described for A
    • If B is not running, the other nodes connected to it will process a disconnection (just like A)
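A minimal sketch of the RoutedMessage flows above, using simplified stand-in types (the real RoutingTableView and RouteBackCache in nearcore have different signatures; this only illustrates the previous-hop / next-hop bookkeeping):

```rust
use std::collections::HashMap;

type PeerId = String;
type MessageId = u64;

// Stand-in for the routing table view: destination -> known next hops.
struct RoutingTableView {
    next_hops: HashMap<PeerId, Vec<PeerId>>,
}

// Stand-in for the route-back cache: message id -> peer we received it from.
struct RouteBackCache {
    previous_hop: HashMap<MessageId, PeerId>,
}

// Handle a RoutedMessage: remember where it came from, pick a next hop.
fn handle_routed_message(
    view: &RoutingTableView,
    cache: &mut RouteBackCache,
    msg_id: MessageId,
    received_from: PeerId,
    target: &PeerId,
) -> Option<PeerId> {
    // Record the "previous hop" so a response can be relayed back later.
    cache.previous_hop.insert(msg_id, received_from);
    // Select a "next hop" advertised for the target and forward to it.
    view.next_hops.get(target).and_then(|hops| hops.first().cloned())
}

// Handle a response: relay it back towards the recorded previous hop.
fn handle_response(cache: &mut RouteBackCache, msg_id: MessageId) -> Option<PeerId> {
    cache.previous_hop.remove(&msg_id)
}
```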

Configurable Parameters

To be finalized after further testing in larger topologies:

  • Minimum interval between routing table reconstructions: 1 second
  • Time after which edges are considered expired: 30 minutes
  • How often to refresh the nonces on edges: 10 minutes
  • How often to check consistency of the routing table's local edges with the connection pool: every 1 minute
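
A sketch of how these parameters might be grouped into a config struct; the struct and field names are illustrative, not the actual nearcore configuration:

```rust
use std::time::Duration;

// Hypothetical grouping of the tunables listed above.
pub struct RoutingTableV2Config {
    /// Minimum interval between routing table reconstructions.
    pub routing_table_update_rate_limit: Duration,
    /// Age after which an edge is considered expired and its routes are dropped.
    pub edge_expiration_time: Duration,
    /// How often to refresh the nonces on local edges.
    pub edge_nonce_refresh_interval: Duration,
    /// How often to check the routing table's local edges against the connection pool.
    pub fix_local_edges_interval: Duration,
}

impl Default for RoutingTableV2Config {
    fn default() -> Self {
        Self {
            routing_table_update_rate_limit: Duration::from_secs(1),
            edge_expiration_time: Duration::from_secs(30 * 60),
            edge_nonce_refresh_interval: Duration::from_secs(10 * 60),
            fix_local_edges_interval: Duration::from_secs(60),
        }
    }
}
```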

Resources

  • Design document: https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg
  • Zulip thread with further design discussion: https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing

Future Extensions

  • Set up metrics we want to collect
  • Implement a debug-ui view showing contents of the V2 routing table
  • Implement pruning of non-validator leaves
  • Add handling of unreliable peers
  • Deprecate the old RoutingTable
  • Deprecate negative/tombstone edges

@saketh-are saketh-are requested review from wacban and a user June 13, 2023 19:24
@saketh-are saketh-are requested a review from a team as a code owner June 13, 2023 19:24
@wacban (Contributor) left a comment

For now just step one of Suggested Review Path. So far so good ;)

Review comment threads on: chain/network/src/network_protocol/edge.rs, chain/network/src/network_protocol/mod.rs, chain/network/src/peer/peer_actor.rs, chain/network/src/peer_manager/network_state/mod.rs, chain/network/src/peer_manager/peer_manager_actor.rs, chain/network/src/peer_manager/network_state/routing.rs, chain/network/src/routing/graph_v2/mod.rs, and chain/network/src/routing/edge_cache/mod.rs
@saketh-are saketh-are requested a review from wacban June 26, 2023 17:12
@wacban (Contributor) left a comment

not quite there yet but making some progress, tbc

Review comment threads on: chain/network/src/network_protocol/mod.rs, chain/network/src/network_protocol/network.proto, chain/network/src/routing/edge_cache/mod.rs, and chain/network/src/routing/graph_v2/mod.rs
@saketh-are saketh-are requested a review from wacban June 28, 2023 15:41
@wacban (Contributor) commented Jul 10, 2023

I still don't think I comprehend all of it, but it looks good and I want to unblock you.
As a final request, can you update the docs in network.md or create a new one?

///
/// For each node in the tree, `first_step` indicates the root's neighbor on the path
/// from the root to the node. The root of the tree, as well as any nodes outside
/// the tree, have a first_step of -1.
Contributor:

Option is generally better than magic values.
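For illustration, a minimal contrast between the sentinel encoding in the quoted doc comment and the Option-based alternative the reviewer suggests (names here are hypothetical, not the PR's actual code):

```rust
// Sentinel encoding: callers must know that -1 means "root or outside the tree".
fn first_step_sentinel(first_step: &[i64], node: usize) -> Option<usize> {
    let step = first_step[node];
    if step < 0 { None } else { Some(step as usize) }
}

// Option encoding: the "no first step" case is explicit in the type,
// so it cannot be silently misused as an index.
fn first_step_option(first_step: &[Option<usize>], node: usize) -> Option<usize> {
    first_step[node]
}
```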

// If the spanning tree doesn't already include the direct edge, add it
let mut spanning_tree = distance_vector.edges.clone();
if tree_edge.is_none() {
    debug_assert!(advertised_distances[local_node_id] == -1);
Contributor:

I think if this condition fails you also want to return false. The debug assert will not panic in a production build.
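
A self-contained sketch of the reviewer's point (hypothetical helper, not the PR's code): debug_assert! compiles to nothing in release builds, so an invariant that a remote peer can violate should be an explicit check that rejects the DistanceVector:

```rust
// Returns false (rejecting the DistanceVector) when the advertised distance to
// the local node is inconsistent with the absence of a direct tree edge.
fn validate_root_distance(has_direct_tree_edge: bool, advertised_distance_to_local_node: i64) -> bool {
    if !has_direct_tree_edge && advertised_distance_to_local_node != -1 {
        // In a release build a debug_assert! here would be a no-op,
        // so reject explicitly instead.
        return false;
    }
    true
}
```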

if edge.edge_type() != EdgeState::Removed {
    let (peer0, peer1) = edge.key().clone();
    // V2 routing protocol doesn't broadcast tombstones; don't bother to sign them
    *edge = Edge::make_fake_edge(peer0, peer1, edge.nonce() + 1);
Contributor:

Is the fake edge guaranteed to be Removed? The assert below is a bit scary.

Collaborator (Author):

Edges work in a funny way; the edge type is determined by the parity of the nonce:

pub fn edge_type(&self) -> EdgeState {
    if self.nonce() % 2 == 1 {
        EdgeState::Active
    } else {
        EdgeState::Removed
    }
}

Here, we take an edge which is not of type Removed and add 1 to its nonce, producing a Removed edge.

When we deprecate the V1 graph we will get rid of tombstone edges entirely and can refactor this.
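
A self-contained sketch of the parity rule being described, with a toy Edge type (not the real near_network::Edge):

```rust
#[derive(Debug, PartialEq)]
enum EdgeState {
    Active,
    Removed,
}

struct Edge {
    nonce: u64,
}

impl Edge {
    // Same parity rule as quoted above: odd nonce = Active, even nonce = Removed.
    fn edge_type(&self) -> EdgeState {
        if self.nonce % 2 == 1 { EdgeState::Active } else { EdgeState::Removed }
    }
}

#[test]
fn incrementing_an_active_nonce_always_yields_a_removed_edge() {
    let active = Edge { nonce: 7 };
    assert_eq!(active.edge_type(), EdgeState::Active);
    // Adding 1 flips the parity, so the resulting "fake" edge is Removed.
    let tombstone = Edge { nonce: active.nonce + 1 };
    assert_eq!(tombstone.edge_type(), EdgeState::Removed);
}
```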

Contributor:

Yeah, again, magic values are a bad idea. Glad to hear this is going away; it would be much safer and cleaner to encode the state directly as a bool field in the Edge.


/// Computes and returns "next hops" for all reachable destinations in the network.
/// Accepts a set of "unreliable peers" to avoid routing through.
/// TODO: Actually avoid the unreliable peers
Contributor:

Is this planned for this PR?

Collaborator (Author):

Not for this PR; it is not clear from available documentation why we define unreliable peers as we do (based on height of their chain) and why we should not route through them. It warrants further investigation, and possibly we will get rid of this concept.
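
For illustration only, here is one way the TODO could eventually be addressed: a breadth-first next-hop computation over the peer graph that skips peers marked unreliable (for simplicity it also drops them as destinations). This is a hypothetical sketch, not the nearcore implementation:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type PeerId = u32;

// destination -> first hop on a shortest path from `local`, avoiding `unreliable` peers.
fn compute_next_hops(
    adjacency: &HashMap<PeerId, Vec<PeerId>>,
    local: PeerId,
    unreliable: &HashSet<PeerId>,
) -> HashMap<PeerId, PeerId> {
    let mut next_hop: HashMap<PeerId, PeerId> = HashMap::new();
    let mut queue: VecDeque<PeerId> = VecDeque::new();

    // Seed the BFS with reliable direct neighbors; each is its own first hop.
    for &neighbor in adjacency.get(&local).into_iter().flatten() {
        if !unreliable.contains(&neighbor) && !next_hop.contains_key(&neighbor) {
            next_hop.insert(neighbor, neighbor);
            queue.push_back(neighbor);
        }
    }
    // Propagate outward; every newly reached peer inherits the first hop
    // of the node it was discovered through.
    while let Some(node) = queue.pop_front() {
        let via = next_hop[&node];
        for &peer in adjacency.get(&node).into_iter().flatten() {
            if peer != local && !unreliable.contains(&peer) && !next_hop.contains_key(&peer) {
                next_hop.insert(peer, via);
                queue.push_back(peer);
            }
        }
    }
    next_hop
}
```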

@saketh-are (Collaborator, Author) commented Jul 12, 2023

Thanks @wacban. Regarding network.md: it has not been maintained for a long time. Almost every part of that page is outdated at this point, and it pretty much needs to be rewritten in its entirety.

I agree with the goal of publishing an updated network.md, but do you think we can leave it for a separate PR considering the amount of changes needed there?

Until then, I think this PR and the linked design doc should suffice as documentation on this project.

@wacban (Contributor) left a comment

LGTM
Please mention this change in the CHANGELOG.md.

Am I correct that once this PR is merged it's going to start building up the graph v2 immediately after it reaches production? Nothing wrong with that if that is your rollout plan but would be good to notify the release owner about this and have metrics and dashboards ready.

Re: network.md - sure, I'm fine with redoing it separately. You could consider adding just one sentence there saying that networking is being reworked, but it's not that important.

@saketh-are (Collaborator, Author):

Yep, this PR already enables the shadow computation of Graph V2.
I have some work ready on debug dashboards; will send a PR immediately after this one.
I agree we can also aim to sneak in some metrics ahead of the release.

@near-bulldozer near-bulldozer bot merged commit cc1b2d5 into near:master Jul 18, 2023
1 check passed
nikurt pushed a commit that referenced this pull request Jul 20, 2023
@robin-near (Contributor)

Wow this looks amazing!

@saketh-are Would you mind taking a look at this Nayduck failure? https://nayduck.near.org/#/test/495092 It mentions DistanceVector so I wonder if it's related. Hopefully it's a simple fix!

thread 'actix-rt|system:0|arbiter:10' panicked at 'DistanceVector is not supported in Borsh encoding', chain/network/src/network_protocol/borsh_conv.rs:181:17
stack backtrace:
2023-07-20T13:14:43.045338Z DEBUG handle_block_production: client: Cannot produce any block: not enough approvals beyond 41
   0: rust_begin_unwind
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
   2: near_network::network_protocol::borsh_conv::<impl core::convert::From<&near_network::network_protocol::PeerMessage> for near_network::network_protocol::borsh_::PeerMessage>::from
   3: near_network::network_protocol::PeerMessage::serialize
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
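
The panic in that backtrace comes from the conversion to the legacy Borsh message encoding, which has no DistanceVector variant. A toy illustration of the failure mode (stand-in types, not the actual borsh_conv.rs code); a fix would presumably avoid sending this variant over Borsh-encoded connections, or handle it gracefully instead of panicking:

```rust
// Stand-ins for the protobuf-era and Borsh-era message enums.
enum PeerMessage {
    DistanceVector,
    Other,
}

enum BorshPeerMessage {
    Other,
}

impl From<&PeerMessage> for BorshPeerMessage {
    fn from(msg: &PeerMessage) -> Self {
        match msg {
            // The legacy encoding has nowhere to put this variant, so the
            // conversion panics; this is what the Nayduck test tripped over.
            PeerMessage::DistanceVector => {
                panic!("DistanceVector is not supported in Borsh encoding")
            }
            PeerMessage::Other => BorshPeerMessage::Other,
        }
    }
}
```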

nikurt pushed a commit that referenced this pull request Jul 24, 2023
nikurt pushed a commit that referenced this pull request Jul 24, 2023
near-bulldozer bot added a commit that referenced this pull request Jul 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup, and download, and I'm open to a better suggestion on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from a shard layout version to the immediate next one.

The fix is to check which protocol version the binary supports and, depending on it, reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)


* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

I recommend future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem. So this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithtmetic` lint  with a more general `clippy::arithmentic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more that a few times per block, then if should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
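
A minimal sketch of the "Network Topology Changes" flow above (simplified stand-in types, not the actual `chain/network` code; signed edges, nonce refresh, and edge expiry are omitted): a batch of peer updates is ingested, shortest routes are recomputed, and the local DistanceVector is rebuilt and returned for broadcast only if it changed.

```rust
use std::collections::HashMap;

type PeerId = u64; // stand-in for the real PeerId type

/// Hop counts advertised by a directly connected peer for every node it can reach.
#[derive(Clone, Default, PartialEq)]
struct DistanceVector {
    distances: HashMap<PeerId, u32>,
}

struct RoutingTableV2 {
    local_id: PeerId,
    /// Latest DistanceVector received from each directly connected peer.
    direct_peers: HashMap<PeerId, DistanceVector>,
    /// The last DistanceVector we broadcast for ourselves.
    local_vector: DistanceVector,
}

impl RoutingTableV2 {
    /// Ingest a batch of topology updates (as delivered by the demux) and recompute routes.
    /// Returns Some(new local DistanceVector) when it changed and should be broadcast.
    fn process_batch(
        &mut self,
        batch: Vec<(PeerId, Option<DistanceVector>)>,
    ) -> Option<DistanceVector> {
        for (peer, update) in batch {
            match update {
                // Peer connected or sent a fresher DistanceVector.
                Some(dv) => {
                    self.direct_peers.insert(peer, dv);
                }
                // Peer disconnected; routes through it must be dropped.
                None => {
                    self.direct_peers.remove(&peer);
                }
            }
        }

        // The RoutingTableView analogue: destination -> (distance, next hop).
        let mut next_hops: HashMap<PeerId, (u32, PeerId)> = HashMap::new();
        for (&peer, dv) in &self.direct_peers {
            next_hops.entry(peer).or_insert((1, peer));
            for (&dest, &dist) in &dv.distances {
                if dest == self.local_id {
                    continue;
                }
                let candidate = (dist.saturating_add(1), peer);
                next_hops
                    .entry(dest)
                    .and_modify(|best| {
                        if candidate.0 < best.0 {
                            *best = candidate;
                        }
                    })
                    .or_insert(candidate);
            }
        }

        // Our own DistanceVector advertises the best distance we know to each destination.
        let mut local = DistanceVector::default();
        local.distances.insert(self.local_id, 0);
        for (&dest, &(dist, _next_hop)) in &next_hops {
            local.distances.insert(dest, dist);
        }

        if local != self.local_vector {
            self.local_vector = local.clone();
            Some(local) // the caller broadcasts this to all connected peers
        } else {
            None
        }
    }
}

fn main() {
    let mut rt = RoutingTableV2 {
        local_id: 0,
        direct_peers: HashMap::new(),
        local_vector: DistanceVector::default(),
    };
    // Peer 1 connects and advertises a route to peer 2 at distance 1.
    let dv = DistanceVector { distances: HashMap::from([(1, 0), (2, 1)]) };
    let broadcast = rt.process_batch(vec![(1, Some(dv))]);
    assert!(broadcast.is_some()); // our vector changed, so we would broadcast it
}
```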

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.
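
As a sketch of the shape of this knob (the field names below are assumptions for illustration, not necessarily the exact ones added by this change; check the nearcore config reference), the external-storage part of the state sync config gains a separate, lower concurrency limit used only during catchup:

```rust
// Hypothetical field names for illustration only.
#[derive(Debug)]
struct ExternalStorageSyncConfig {
    /// Concurrent part downloads while state-syncing towards the chain head.
    num_concurrent_requests: u32,
    /// Smaller limit used while catching up, so block validation is not starved.
    num_concurrent_requests_during_catchup: u32,
}

fn main() {
    let cfg = ExternalStorageSyncConfig {
        num_concurrent_requests: 25,
        num_concurrent_requests_during_catchup: 5,
    };
    println!("{cfg:?}");
}
```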

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
near-bulldozer bot pushed a commit that referenced this pull request Jul 26, 2023
In #9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after #9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
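
As a rough illustration of what "borsh support" means here (simplified stand-in types, not the actual `network_protocol` definitions), the new variant just needs to round-trip through the borsh wire format that the proxy-based tests rely on:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
struct AdvertisedRoute {
    destination: [u8; 32], // placeholder for a peer id / public key
    distance: u32,
}

#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
struct DistanceVector {
    root: [u8; 32],
    routes: Vec<AdvertisedRoute>,
}

fn main() -> std::io::Result<()> {
    let dv = DistanceVector {
        root: [0u8; 32],
        routes: vec![AdvertisedRoute { destination: [1u8; 32], distance: 2 }],
    };

    // Encode to the borsh wire format and decode it back, which is what the
    // borsh-encoded proxy connections require for every PeerMessage variant.
    let mut bytes = Vec::new();
    dv.serialize(&mut bytes)?;
    let decoded = DistanceVector::try_from_slice(&bytes)?;
    assert_eq!(dv, decoded);
    Ok(())
}
```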
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
nikurt added a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 
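
The boundary-account mechanics are easy to see in miniature (simplified stand-in types and made-up boundary names, not the real `ShardLayout` API or the actual mainnet boundaries): N sorted boundary accounts split the account-id space into N + 1 shards, so adding one boundary to the 4-shard layout yields 5 shards.

```rust
/// Simplified stand-in for ShardLayout::V1: a version plus sorted boundary accounts.
/// (The real V1 layout also carries a shards_split_map used during resharding, omitted here.)
#[derive(Debug)]
struct SimpleShardLayout {
    version: u32,
    boundary_accounts: Vec<String>,
}

impl SimpleShardLayout {
    fn num_shards(&self) -> usize {
        self.boundary_accounts.len() + 1
    }

    /// An account belongs to the first shard whose boundary it sorts before.
    fn account_to_shard(&self, account_id: &str) -> usize {
        self.boundary_accounts
            .iter()
            .position(|boundary| account_id < boundary.as_str())
            .unwrap_or(self.boundary_accounts.len())
    }
}

fn main() {
    let v1 = SimpleShardLayout {
        version: 1,
        boundary_accounts: vec!["bbb.near".into(), "kkk.near".into(), "ppp.near".into()],
    };
    // SimpleNightshadeV2 (sketched): same structure, bumped version, one extra boundary account.
    let v2 = SimpleShardLayout {
        version: 2,
        boundary_accounts: vec![
            "bbb.near".into(),
            "kkk.near".into(),
            "ppp.near".into(),
            "ttt.near".into(),
        ],
    };
    assert_eq!((v1.num_shards(), v2.num_shards()), (4, 5));
    assert_eq!(v2.account_to_shard("mmm.near"), 2);
}
```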

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba
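
For context, a minimal example of what the new lint flags and how it is typically addressed (a scoped allow or explicit checked arithmetic); this is generic Rust, not nearcore code:

```rust
#![deny(clippy::arithmetic_side_effects)]

fn add_balances(a: u128, b: u128) -> u128 {
    // A plain `a + b` would trip the lint; checked_add makes the overflow behaviour explicit.
    a.checked_add(b).expect("balance overflow")
}

// A scoped allow is the other common remedy, e.g. around unchecked
// curve25519-dalek style operations whose intended semantics are unclear.
#[allow(clippy::arithmetic_side_effects)]
fn unchecked_demo(a: u64, b: u64) -> u64 {
    a + b
}

fn main() {
    println!("{}", add_balances(1, 2));
    println!("{}", unchecked_demo(3, 4));
}
```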

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.
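
A minimal sketch of the general technique (using the `rocksdb` and `prometheus` crates directly; the property and metric names below are just examples, not the ones added to nearcore):

```rust
use prometheus::{IntGauge, Registry};
use rocksdb::{Options, DB};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/rocksdb-metrics-demo")?;

    let registry = Registry::new();
    let gauge = IntGauge::new(
        "rocksdb_block_cache_usage_bytes",
        "Memory used by the RocksDB block cache",
    )?;
    registry.register(Box::new(gauge.clone()))?;

    // RocksDB exposes integer-valued internal properties by name; copy one into the gauge.
    if let Some(value) = db.property_int_value("rocksdb.block-cache-usage")? {
        gauge.set(value as i64);
    }
    Ok(())
}
```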

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.
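
The nearcore test formatter mentioned above is custom, but the same "seconds since start" effect can be sketched with tracing-subscriber's built-in uptime timer:

```rust
use tracing_subscriber::fmt::time::uptime;

fn main() {
    tracing_subscriber::fmt()
        .with_timer(uptime()) // prints e.g. "1.075s" instead of a full RFC 3339 timestamp
        .with_max_level(tracing::Level::DEBUG)
        .init();

    // Structured fields end up in the log line much like the examples above.
    tracing::debug!(block_height = 23, shard_id = 1, "process_state_update");
}
```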

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`
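
A tiny illustration of the simplification (stand-in enum and borsh encoding for concreteness; not the actual storage code): the `Option` wrapper adds a tag byte and an unwrap at every read site while carrying no information, since a missing column entry already means "no progress recorded".

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Stand-in for the real StateSyncDumpProgress enum.
#[allow(dead_code)]
#[derive(BorshSerialize, BorshDeserialize, Debug)]
enum StateSyncDumpProgress {
    AllDumped { epoch_height: u64 },
    InProgress { parts_dumped: u64 },
}

fn main() -> std::io::Result<()> {
    let progress = StateSyncDumpProgress::InProgress { parts_dumped: 42 };

    // After the change: the value is stored directly.
    let mut direct = Vec::new();
    progress.serialize(&mut direct)?;

    // Before the change: the same value wrapped in Some(...) before serialization.
    let mut wrapped = Vec::new();
    Some(progress).serialize(&mut wrapped)?;

    // The wrapper costs one tag byte per entry and an extra unwrap per read.
    assert_eq!(wrapped.len(), direct.len() + 1);
    Ok(())
}
```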

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediately next one.

The fix is to check which protocol version the binary supports and, depending on that, reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommend that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see near#9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
In near#9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after near#9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
nikurt pushed a commit that referenced this pull request Jul 28, 2023
### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
nikurt pushed a commit that referenced this pull request Aug 24, 2023
### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
nikurt added a commit that referenced this pull request Aug 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to a better suggestion for how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?), so I simply allowed the lint for now, but somebody should definitely take a look at it in the future. cc @abacabadabacaba
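
For readers hitting the same lint, a small generic illustration of the two usual ways to satisfy it (this is not the actual change made around `curve25519-dalek`):

```rust
/// Preferred: make the overflow behaviour explicit, so the lint has nothing to flag.
fn add_checked(a: u64, b: u64) -> Option<u64> {
    a.checked_add(b)
}

/// Fallback: where the intended semantics are not yet understood,
/// allow the lint locally and leave a note to revisit it.
#[allow(clippy::arithmetic_side_effects)]
fn add_unchecked(a: u64, b: u64) -> u64 {
    a + b
}
```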

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.
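
Spelled out as code (illustrative only):

```rust
/// The combined estimation used to be the max of the read- and write-side
/// touched-trie-node estimations; with reads no longer charged, only the
/// write-side number remains relevant.
fn combined_ttn_cost(read_ttn_cost: u64, write_ttn_cost: u64) -> u64 {
    read_ttn_cost.max(write_ttn_cost)
}
```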

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as Prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac, where multiprocessing.Process uses spawn rather than fork; our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next. 

The fix is to check which protocol version the binary supports and, depending on that, reshard either from V0 to V1 or from V1 to V2.
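
A minimal sketch of that decision; the constant below is a placeholder, not the real protocol version number:

```rust
/// Placeholder for the protocol version that enables SimpleNightshadeV2.
const SIMPLE_NIGHTSHADE_V2_PROTOCOL_VERSION: u32 = 100;

/// Decide which resharding the test should exercise, based on the protocol
/// version supported by the binary under test.
fn resharding_shard_layout_versions(binary_protocol_version: u32) -> (u32, u32) {
    if binary_protocol_version >= SIMPLE_NIGHTSHADE_V2_PROTOCOL_VERSION {
        (1, 2) // nightly binary: reshard from shard layout V1 to V2
    } else {
        (0, 1) // stable binary: reshard from shard layout V0 to V1
    }
}
```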

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool for evaluating State read performance as part of the `neard database` CLI. For more details on the approach, see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

Recommend that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt pushed a commit that referenced this pull request Aug 24, 2023
In #9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after #9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.
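
A minimal sketch of what that support amounts to; the enum and its payloads are simplified placeholders rather than the real `PeerMessage` definitions:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Simplified stand-in for the borsh-encoded wire enum: once the DistanceVector
/// variant exists here (and in the to/from conversions), borsh-encoded
/// connections can carry the new message instead of rejecting it.
#[derive(BorshSerialize, BorshDeserialize)]
enum PeerMessageBorsh {
    RoutedMessage(Vec<u8>),
    DistanceVector(Vec<u8>),
}
```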

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
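
As a rough illustration of the batched route recomputation described in the event flows above, here is a minimal sketch of a distance-vector pass (hypothetical `PeerId`/map types and a simplified hop-count metric; not the actual RoutingTableV2/EdgeCache code):

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real PeerId type.
type PeerId = u64;
/// Distances advertised by a peer: target -> hop count from that peer.
type DistanceVector = HashMap<PeerId, u32>;

/// Recompute our own routes from the latest DistanceVector received from
/// each directly connected peer. Returns target -> (next hop, distance).
fn recompute_routes(
    direct_peers: &HashMap<PeerId, DistanceVector>,
) -> HashMap<PeerId, (PeerId, u32)> {
    let mut routes: HashMap<PeerId, (PeerId, u32)> = HashMap::new();
    for (&neighbor, vector) in direct_peers {
        // The neighbor itself is reachable in one hop; everything it
        // advertises is reachable through it in (advertised distance + 1).
        let candidates = std::iter::once((&neighbor, &0)).chain(vector.iter());
        for (&target, &distance) in candidates {
            let through_neighbor = distance + 1;
            let better = match routes.get(&target) {
                Some(&(_, best)) => through_neighbor < best,
                None => true,
            };
            if better {
                routes.insert(target, (neighbor, through_neighbor));
            }
        }
    }
    routes
}

fn main() {
    // Peer 1 is directly connected and advertises a one-hop route to peer 3.
    let mut direct_peers = HashMap::new();
    direct_peers.insert(1, HashMap::from([(3, 1)]));
    let routes = recompute_routes(&direct_peers);
    // Peer 3 is reached via next hop 1 at distance 2.
    assert_eq!(routes.get(&3), Some(&(1, 2)));
}
```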

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommends that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require a newer python for locust tests, which would also let us use `match` (see near#9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on GCP machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled, which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced a new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2`, and added it to nightly.

Refactored AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding SimpleNightshadeV2 behind the rust feature; I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with a bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I have yet to fully understand what exactly happened and whether it's any good, as well as to add some proper tests. I'll do that in separate PRs.
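
For illustration only, reusing the V1 structure with a bumped version and one extra boundary account could look roughly like the sketch below; the struct, its fields, the helper, and the placeholder account names are simplified stand-ins, not the actual nearcore ShardLayout API:

```rust
/// Simplified, hypothetical stand-in for the ShardLayout::V1 payload.
#[derive(Clone, Debug)]
struct ShardLayoutV1 {
    /// Accounts marking shard boundaries; N boundaries give N + 1 shards.
    boundary_accounts: Vec<String>,
    /// For each parent shard, the shards it splits into during resharding.
    /// Only meaningful relative to the immediately preceding layout version.
    shards_split_map: Option<Vec<Vec<u64>>>,
    /// Bumped so that the new layout compares as different from the old one.
    version: u32,
}

/// Build the next layout from the V1 layout by adding one boundary account
/// and bumping the version.
fn simple_nightshade_v2(v1: &ShardLayoutV1, extra_boundary: &str) -> ShardLayoutV1 {
    let mut boundary_accounts = v1.boundary_accounts.clone();
    boundary_accounts.push(extra_boundary.to_string());
    boundary_accounts.sort();
    ShardLayoutV1 {
        boundary_accounts,
        // The split map would be rebuilt to describe this split,
        // not copied from the previous resharding.
        shards_split_map: None,
        version: v1.version + 1,
    }
}

fn main() {
    // Placeholder boundary accounts, not the real mainnet ones.
    let v1 = ShardLayoutV1 {
        boundary_accounts: vec!["aaa.near".into(), "mmm.near".into(), "yyy.near".into()],
        shards_split_map: Some(vec![vec![0, 1, 2, 3]]),
        version: 1,
    };
    let v2 = simple_nightshade_v2(&v1, "ppp.near");
    assert_eq!(v2.version, 2);
    assert_eq!(v2.boundary_accounts.len(), 4);
    println!("{v2:?}");
}
```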

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could mean download). I'm not sure I fully appreciate the difference between state sync, catchup, and download, and I'm open to better suggestions on how to rename those.
- In LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowing `stable_hash` to be unused. It's only unused on macOS, so we need to keep it, but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?), so I simply allowed the lint for now, but somebody should definitely take a look at it in the future. cc @abacabadabacaba
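
For context, here is a generic illustration (not code from this repository) of the kind of arithmetic the lint flags and two ways to address it, one of which is the local allow used here for now:

```rust
// When `clippy::arithmetic_side_effects` is enabled (e.g. denied in the
// workspace lints), plain integer arithmetic like this gets flagged because
// it panics on overflow in debug builds and wraps in release builds:
fn total(a: u64, b: u64) -> u64 {
    a + b
}

// One option is an explicit overflow policy, e.g. saturating arithmetic:
fn total_saturating(a: u64, b: u64) -> u64 {
    a.saturating_add(b)
}

// Another is to allow the lint locally until the intended behaviour
// (wrapping? checked? panic?) is decided:
#[allow(clippy::arithmetic_side_effects)]
fn total_unchecked(a: u64, b: u64) -> u64 {
    a + b
}

fn main() {
    assert_eq!(total(2, 3), 5);
    assert_eq!(total_saturating(u64::MAX, 1), u64::MAX);
    assert_eq!(total_unchecked(2, 3), 5);
}
```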

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S in the next epoch E, then it downloads that shard's state during epoch E-1 and applies chunks in order. To apply chunks correctly, in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit on a single line, even on a quite wide monitor. This is an attempt to improve that.
- Removed a few variables in tracing spans that were redundant, i.e. already included in the parent span.
- Removed the `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints the full date, time, and nanoseconds. (A rough sketch of such a formatter is included at the end of this note.)
- Mini refactor of sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the log spam since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

I can be convinced otherwise on any of those, so please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a side note, I quite like tracing spans, but we may be overdoing it a bit.
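
For reference, a compact elapsed-time formatter along these lines can be written against tracing-subscriber's `FormatTime` trait; this is a rough sketch assuming the `tracing` and `tracing-subscriber` crates, not the exact formatter added in this PR:

```rust
use std::fmt::{self, Write as _};
use std::time::Instant;
use tracing_subscriber::fmt::format::Writer;
use tracing_subscriber::fmt::time::FormatTime;

/// Prints seconds and milliseconds elapsed since the formatter was created,
/// instead of the default full date/time with nanoseconds.
struct Uptime {
    start: Instant,
}

impl FormatTime for Uptime {
    fn format_time(&self, w: &mut Writer<'_>) -> fmt::Result {
        let elapsed = self.start.elapsed();
        write!(w, "{}.{:03}s", elapsed.as_secs(), elapsed.subsec_millis())
    }
}

fn main() {
    tracing_subscriber::fmt()
        .with_timer(Uptime { start: Instant::now() })
        .init();
    tracing::info!("log lines now start with a compact elapsed time");
}
```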

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
In near#9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after near#9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
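
For reference, here is a minimal sketch of the kind of borsh round-trip this enables, using hypothetical pared-down types rather than the real `PeerMessage`/`DistanceVector` definitions:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Hypothetical, simplified stand-in for an advertised route entry.
#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
struct AdvertisedPeerDistance {
    peer_id: u64,
    distance: u32,
}

/// Hypothetical, pared-down stand-in for the PeerMessage enum.
#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
enum PeerMessage {
    // ... existing variants would live here ...
    /// The new variant also needs a borsh representation so that
    /// borsh-encoded (proxy-based) test connections can carry it.
    DistanceVector { distances: Vec<AdvertisedPeerDistance> },
}

fn main() -> std::io::Result<()> {
    let msg = PeerMessage::DistanceVector {
        distances: vec![AdvertisedPeerDistance { peer_id: 1, distance: 2 }],
    };
    // Round-trip through borsh.
    let mut bytes = Vec::new();
    msg.serialize(&mut bytes)?;
    let decoded = PeerMessage::try_from_slice(&bytes)?;
    assert_eq!(msg, decoded);
    Ok(())
}
```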