RoutingTable V2: Distance Vector Routing #9187

Merged
17 commits merged into near:master on Jul 18, 2023

Conversation

@saketh-are (Collaborator) commented Jun 13, 2023

Suggested Review Path

  1. Browse the (relatively small) changes outside of the chain/network/src/routing folder to understand the external surface of the new RoutingTableV2 component.
  2. Check out the architecture diagram and event flows documented below.
  3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
  4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
  5. Return to the EdgeCache and review its implementation.
  6. Revisit the call-sites outside of the routing folder.

Architecture

(Architecture diagram: https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

Event Flows

  • Network Topology Changes
    • Three kinds: a peer connected, a peer disconnected, or a PeerMessage carrying a new DistanceVector was received
    • These are triggered by the PeerActor and flow into the PeerManagerActor, then into the demux
    • The demux sends batches of updates (at most once per second) to the RoutingTableV2
    • The RoutingTable processes the entire batch, expires any outdated routes (those relying on too-old edges), then generates an updated RoutingTableView and local DistanceVector
    • If the local DistanceVector changes, it is then broadcast to all peers
  • Handle RoutedMessage
    • Received by the PeerActor, which calls into PeerManagerActor for routing decisions
    • Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
    • Select a "next hop" from the RoutingTableView and forward the message (a sketch of this flow follows the list)
  • Handle response to a RoutedMessage
    • Received by the PeerActor, which calls into PeerManagerActor for routing decisions
    • Fetch the "previous hop" recorded for the original message from the RouteBackCache and relay the response back through it
  • Connection started
    • When two nodes A and B connect, each spawns a PeerActor managing the connection
    • A sends a partially signed edge, which B then signs to produce a complete signed edge
    • B adds the signed edge to its local routing table, triggering re-computation of routes
    • B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
  • Connection stopped
    • Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
    • Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
    • A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
    • If B is still running, it will go through the same steps described for A
    • If B is not running, the other nodes connected to it will process a disconnection (just like A)
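A minimal sketch of the RoutedMessage flows above, using simplified stand-in types (the real RoutingTableView and RouteBackCache in nearcore have different signatures; this only illustrates the previous-hop / next-hop bookkeeping):

```rust
use std::collections::HashMap;

type PeerId = String;
type MessageId = u64;

// Stand-in for the routing table view: destination -> known next hops.
struct RoutingTableView {
    next_hops: HashMap<PeerId, Vec<PeerId>>,
}

// Stand-in for the route-back cache: message id -> peer we received it from.
struct RouteBackCache {
    previous_hop: HashMap<MessageId, PeerId>,
}

// Handle a RoutedMessage: remember where it came from, pick a next hop.
fn handle_routed_message(
    view: &RoutingTableView,
    cache: &mut RouteBackCache,
    msg_id: MessageId,
    received_from: PeerId,
    target: &PeerId,
) -> Option<PeerId> {
    // Record the "previous hop" so a response can be relayed back later.
    cache.previous_hop.insert(msg_id, received_from);
    // Select a "next hop" advertised for the target and forward to it.
    view.next_hops.get(target).and_then(|hops| hops.first().cloned())
}

// Handle a response: relay it back towards the recorded previous hop.
fn handle_response(cache: &mut RouteBackCache, msg_id: MessageId) -> Option<PeerId> {
    cache.previous_hop.remove(&msg_id)
}
```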

Configurable Parameters

To be finalized after further testing in larger topologies:

  • Minimum interval between routing table reconstructions: 1 second
  • Time after which edges are considered expired: 30 minutes
  • How often to refresh the nonces on edges: 10 minutes
  • How often to check consistency of the routing table's local edges with the connection pool: every 1 minute
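
A sketch of how these parameters might be grouped into a config struct; the struct and field names are illustrative, not the actual nearcore configuration:

```rust
use std::time::Duration;

// Hypothetical grouping of the tunables listed above.
pub struct RoutingTableV2Config {
    /// Minimum interval between routing table reconstructions.
    pub routing_table_update_rate_limit: Duration,
    /// Age after which an edge is considered expired and its routes are dropped.
    pub edge_expiration_time: Duration,
    /// How often to refresh the nonces on local edges.
    pub edge_nonce_refresh_interval: Duration,
    /// How often to check the routing table's local edges against the connection pool.
    pub fix_local_edges_interval: Duration,
}

impl Default for RoutingTableV2Config {
    fn default() -> Self {
        Self {
            routing_table_update_rate_limit: Duration::from_secs(1),
            edge_expiration_time: Duration::from_secs(30 * 60),
            edge_nonce_refresh_interval: Duration::from_secs(10 * 60),
            fix_local_edges_interval: Duration::from_secs(60),
        }
    }
}
```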

Resources

  • Design document: https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg
  • Zulip thread with further design discussion: https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing

Future Extensions

  • Set up metrics we want to collect
  • Implement a debug-ui view showing contents of the V2 routing table
  • Implement pruning of non-validator leaves
  • Add handling of unreliable peers
  • Deprecate the old RoutingTable
  • Deprecate negative/tombstone edges

@saketh-are saketh-are requested review from wacban and a user June 13, 2023 19:24
@saketh-are saketh-are requested a review from a team as a code owner June 13, 2023 19:24
@wacban (Contributor) left a comment

For now just step one of Suggested Review Path. So far so good ;)

Review comment threads on: chain/network/src/network_protocol/edge.rs, chain/network/src/network_protocol/mod.rs, chain/network/src/peer/peer_actor.rs, chain/network/src/peer_manager/network_state/mod.rs, chain/network/src/peer_manager/peer_manager_actor.rs, chain/network/src/peer_manager/network_state/routing.rs, chain/network/src/routing/graph_v2/mod.rs, and chain/network/src/routing/edge_cache/mod.rs
@saketh-are saketh-are requested a review from wacban June 26, 2023 17:12
@wacban (Contributor) left a comment

not quite there yet but making some progress, tbc

Review comment threads on: chain/network/src/network_protocol/mod.rs, chain/network/src/network_protocol/network.proto, chain/network/src/routing/edge_cache/mod.rs, and chain/network/src/routing/graph_v2/mod.rs
@saketh-are saketh-are requested a review from wacban June 28, 2023 15:41
@wacban (Contributor) commented Jul 10, 2023

I still don't think I comprehend all of it, but it looks good and I want to unblock you.
As a final request, can you update the docs in network.md or create a new one?

///
/// For each node in the tree, `first_step` indicates the root's neighbor on the path
/// from the root to the node. The root of the tree, as well as any nodes outside
/// the tree, have a first_step of -1.
Contributor:

Option is generally better than magic values.
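For illustration, a minimal contrast between the sentinel encoding in the quoted doc comment and the Option-based alternative the reviewer suggests (names here are hypothetical, not the PR's actual code):

```rust
// Sentinel encoding: callers must know that -1 means "root or outside the tree".
fn first_step_sentinel(first_step: &[i64], node: usize) -> Option<usize> {
    let step = first_step[node];
    if step < 0 { None } else { Some(step as usize) }
}

// Option encoding: the "no first step" case is explicit in the type,
// so it cannot be silently misused as an index.
fn first_step_option(first_step: &[Option<usize>], node: usize) -> Option<usize> {
    first_step[node]
}
```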

// If the spanning tree doesn't already include the direct edge, add it
let mut spanning_tree = distance_vector.edges.clone();
if tree_edge.is_none() {
    debug_assert!(advertised_distances[local_node_id] == -1);
Contributor:

I think if this condition fails you also want to return false. The debug assert will not panic in a production build.
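
A self-contained sketch of the reviewer's point (hypothetical helper, not the PR's code): debug_assert! compiles to nothing in release builds, so an invariant that a remote peer can violate should be an explicit check that rejects the DistanceVector:

```rust
// Returns false (rejecting the DistanceVector) when the advertised distance to
// the local node is inconsistent with the absence of a direct tree edge.
fn validate_root_distance(has_direct_tree_edge: bool, advertised_distance_to_local_node: i64) -> bool {
    if !has_direct_tree_edge && advertised_distance_to_local_node != -1 {
        // In a release build a debug_assert! here would be a no-op,
        // so reject explicitly instead.
        return false;
    }
    true
}
```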

if edge.edge_type() != EdgeState::Removed {
    let (peer0, peer1) = edge.key().clone();
    // V2 routing protocol doesn't broadcast tombstones; don't bother to sign them
    *edge = Edge::make_fake_edge(peer0, peer1, edge.nonce() + 1);
Contributor:

Is the fake edge guaranteed to be Removed? The assert below is a bit scary.

Collaborator (Author):

Edges work in a funny way; the edge type is determined by the parity of the nonce:

pub fn edge_type(&self) -> EdgeState {
    if self.nonce() % 2 == 1 {
        EdgeState::Active
    } else {
        EdgeState::Removed
    }
}

Here, we take an edge which is not of type Removed and add 1 to its nonce, producing a Removed edge.

When we deprecate the V1 graph we will get rid of tombstone edges entirely and can refactor this.
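
A self-contained sketch of the parity rule being described, with a toy Edge type (not the real near_network::Edge):

```rust
#[derive(Debug, PartialEq)]
enum EdgeState {
    Active,
    Removed,
}

struct Edge {
    nonce: u64,
}

impl Edge {
    // Same parity rule as quoted above: odd nonce = Active, even nonce = Removed.
    fn edge_type(&self) -> EdgeState {
        if self.nonce % 2 == 1 { EdgeState::Active } else { EdgeState::Removed }
    }
}

#[test]
fn incrementing_an_active_nonce_always_yields_a_removed_edge() {
    let active = Edge { nonce: 7 };
    assert_eq!(active.edge_type(), EdgeState::Active);
    // Adding 1 flips the parity, so the resulting "fake" edge is Removed.
    let tombstone = Edge { nonce: active.nonce + 1 };
    assert_eq!(tombstone.edge_type(), EdgeState::Removed);
}
```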

Contributor:

Yeah, again, magic values are a bad idea. Glad to hear this is going away; it would be much safer and cleaner to encode the state directly as a bool field in the Edge.


/// Computes and returns "next hops" for all reachable destinations in the network.
/// Accepts a set of "unreliable peers" to avoid routing through.
/// TODO: Actually avoid the unreliable peers
Contributor:

Is this planned for this PR?

Collaborator (Author):

Not for this PR; it is not clear from available documentation why we define unreliable peers as we do (based on height of their chain) and why we should not route through them. It warrants further investigation, and possibly we will get rid of this concept.
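
For illustration only, here is one way the TODO could eventually be addressed: a breadth-first next-hop computation over the peer graph that skips peers marked unreliable (for simplicity it also drops them as destinations). This is a hypothetical sketch, not the nearcore implementation:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type PeerId = u32;

// destination -> first hop on a shortest path from `local`, avoiding `unreliable` peers.
fn compute_next_hops(
    adjacency: &HashMap<PeerId, Vec<PeerId>>,
    local: PeerId,
    unreliable: &HashSet<PeerId>,
) -> HashMap<PeerId, PeerId> {
    let mut next_hop: HashMap<PeerId, PeerId> = HashMap::new();
    let mut queue: VecDeque<PeerId> = VecDeque::new();

    // Seed the BFS with reliable direct neighbors; each is its own first hop.
    for &neighbor in adjacency.get(&local).into_iter().flatten() {
        if !unreliable.contains(&neighbor) && !next_hop.contains_key(&neighbor) {
            next_hop.insert(neighbor, neighbor);
            queue.push_back(neighbor);
        }
    }
    // Propagate outward; every newly reached peer inherits the first hop
    // of the node it was discovered through.
    while let Some(node) = queue.pop_front() {
        let via = next_hop[&node];
        for &peer in adjacency.get(&node).into_iter().flatten() {
            if peer != local && !unreliable.contains(&peer) && !next_hop.contains_key(&peer) {
                next_hop.insert(peer, via);
                queue.push_back(peer);
            }
        }
    }
    next_hop
}
```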

@saketh-are (Collaborator, Author) commented Jul 12, 2023

Thanks @wacban. Regarding network.md: it has not been maintained for a long time. Almost every part of that page is outdated at this point, and it pretty much needs to be rewritten in its entirety.

I agree with the goal of publishing an updated network.md, but do you think we can leave it for a separate PR considering the amount of changes needed there?

Until then, I think this PR and the linked design doc should suffice as documentation on this project.

@wacban (Contributor) left a comment

LGTM
Please mention this change in the CHANGELOG.md.

Am I correct that once this PR is merged it's going to start building up the graph v2 immediately after it reaches production? Nothing wrong with that if that is your rollout plan but would be good to notify the release owner about this and have metrics and dashboards ready.

Re: network.md - sure, I'm fine with redoing it separately. You could consider adding just one sentence there saying that networking is being reworked, but it's not that important.

@saketh-are (Collaborator, Author):

Yep, this PR already enables the shadow computation of Graph V2.
I have some work ready on debug dashboards; will send a PR immediately after this one.
I agree we can also aim to sneak in some metrics ahead of the release.

@near-bulldozer near-bulldozer bot merged commit cc1b2d5 into near:master Jul 18, 2023
1 check passed
nikurt pushed a commit that referenced this pull request Jul 20, 2023
@robin-near (Contributor)

Wow this looks amazing!

@saketh-are Would you mind taking a look at this Nayduck failure? https://nayduck.near.org/#/test/495092 It mentions DistanceVector so I wonder if it's related. Hopefully it's a simple fix!

thread 'actix-rt|system:0|arbiter:10' panicked at 'DistanceVector is not supported in Borsh encoding', chain/network/src/network_protocol/borsh_conv.rs:181:17
stack backtrace:
2023-07-20T13:14:43.045338Z DEBUG handle_block_production: client: Cannot produce any block: not enough approvals beyond 41
   0: rust_begin_unwind
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
   2: near_network::network_protocol::borsh_conv::<impl core::convert::From<&near_network::network_protocol::PeerMessage> for near_network::network_protocol::borsh_::PeerMessage>::from
   3: near_network::network_protocol::PeerMessage::serialize
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
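
The panic in that backtrace comes from the conversion to the legacy Borsh message encoding, which has no DistanceVector variant. A toy illustration of the failure mode (stand-in types, not the actual borsh_conv.rs code); a fix would presumably avoid sending this variant over Borsh-encoded connections, or handle it gracefully instead of panicking:

```rust
// Stand-ins for the protobuf-era and Borsh-era message enums.
enum PeerMessage {
    DistanceVector,
    Other,
}

enum BorshPeerMessage {
    Other,
}

impl From<&PeerMessage> for BorshPeerMessage {
    fn from(msg: &PeerMessage) -> Self {
        match msg {
            // The legacy encoding has nowhere to put this variant, so the
            // conversion panics; this is what the Nayduck test tripped over.
            PeerMessage::DistanceVector => {
                panic!("DistanceVector is not supported in Borsh encoding")
            }
            PeerMessage::Other => BorshPeerMessage::Other,
        }
    }
}
```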

nikurt pushed a commit that referenced this pull request Jul 24, 2023
nikurt pushed a commit that referenced this pull request Jul 24, 2023
near-bulldozer bot added a commit that referenced this pull request Jul 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup, and download, and I'm open to a better suggestion on how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows to drop a dependency on `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from a shard layout version to the immediate next one.

The fix is to check which protocol version the binary supports and, depending on it, reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)


* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

I recommend future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem. So this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and 'to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing. (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithtmetic` lint  with a more general `clippy::arithmentic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more that a few times per block, then if should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)
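
A minimal sketch of the "Network Topology Changes" flow above (simplified stand-in types, not the actual `chain/network` code; signed edges, nonce refresh, and edge expiry are omitted): a batch of peer updates is ingested, shortest routes are recomputed, and the local DistanceVector is rebuilt and returned for broadcast only if it changed.

```rust
use std::collections::HashMap;

type PeerId = u64; // stand-in for the real PeerId type

/// Hop counts advertised by a directly connected peer for every node it can reach.
#[derive(Clone, Default, PartialEq)]
struct DistanceVector {
    distances: HashMap<PeerId, u32>,
}

struct RoutingTableV2 {
    local_id: PeerId,
    /// Latest DistanceVector received from each directly connected peer.
    direct_peers: HashMap<PeerId, DistanceVector>,
    /// The last DistanceVector we broadcast for ourselves.
    local_vector: DistanceVector,
}

impl RoutingTableV2 {
    /// Ingest a batch of topology updates (as delivered by the demux) and recompute routes.
    /// Returns Some(new local DistanceVector) when it changed and should be broadcast.
    fn process_batch(
        &mut self,
        batch: Vec<(PeerId, Option<DistanceVector>)>,
    ) -> Option<DistanceVector> {
        for (peer, update) in batch {
            match update {
                // Peer connected or sent a fresher DistanceVector.
                Some(dv) => {
                    self.direct_peers.insert(peer, dv);
                }
                // Peer disconnected; routes through it must be dropped.
                None => {
                    self.direct_peers.remove(&peer);
                }
            }
        }

        // The RoutingTableView analogue: destination -> (distance, next hop).
        let mut next_hops: HashMap<PeerId, (u32, PeerId)> = HashMap::new();
        for (&peer, dv) in &self.direct_peers {
            next_hops.entry(peer).or_insert((1, peer));
            for (&dest, &dist) in &dv.distances {
                if dest == self.local_id {
                    continue;
                }
                let candidate = (dist.saturating_add(1), peer);
                next_hops
                    .entry(dest)
                    .and_modify(|best| {
                        if candidate.0 < best.0 {
                            *best = candidate;
                        }
                    })
                    .or_insert(candidate);
            }
        }

        // Our own DistanceVector advertises the best distance we know to each destination.
        let mut local = DistanceVector::default();
        local.distances.insert(self.local_id, 0);
        for (&dest, &(dist, _next_hop)) in &next_hops {
            local.distances.insert(dest, dist);
        }

        if local != self.local_vector {
            self.local_vector = local.clone();
            Some(local) // the caller broadcasts this to all connected peers
        } else {
            None
        }
    }
}

fn main() {
    let mut rt = RoutingTableV2 {
        local_id: 0,
        direct_peers: HashMap::new(),
        local_vector: DistanceVector::default(),
    };
    // Peer 1 connects and advertises a route to peer 2 at distance 1.
    let dv = DistanceVector { distances: HashMap::from([(1, 0), (2, 1)]) };
    let broadcast = rt.process_batch(vec![(1, Some(dv))]);
    assert!(broadcast.is_some()); // our vector changed, so we would broadcast it
}
```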

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.
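
As a sketch of the shape of this knob (the field names below are assumptions for illustration, not necessarily the exact ones added by this change; check the nearcore config reference), the external-storage part of the state sync config gains a separate, lower concurrency limit used only during catchup:

```rust
// Hypothetical field names for illustration only.
#[derive(Debug)]
struct ExternalStorageSyncConfig {
    /// Concurrent part downloads while state-syncing towards the chain head.
    num_concurrent_requests: u32,
    /// Smaller limit used while catching up, so block validation is not starved.
    num_concurrent_requests_during_catchup: u32,
}

fn main() {
    let cfg = ExternalStorageSyncConfig {
        num_concurrent_requests: 25,
        num_concurrent_requests_during_catchup: 5,
    };
    println!("{cfg:?}");
}
```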

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
near-bulldozer bot pushed a commit that referenced this pull request Jul 26, 2023
In #9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after #9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
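
As a rough illustration of what "borsh support" means here (simplified stand-in types, not the actual `network_protocol` definitions), the new variant just needs to round-trip through the borsh wire format that the proxy-based tests rely on:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
struct AdvertisedRoute {
    destination: [u8; 32], // placeholder for a peer id / public key
    distance: u32,
}

#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
struct DistanceVector {
    root: [u8; 32],
    routes: Vec<AdvertisedRoute>,
}

fn main() -> std::io::Result<()> {
    let dv = DistanceVector {
        root: [0u8; 32],
        routes: vec![AdvertisedRoute { destination: [1u8; 32], distance: 2 }],
    };

    // Encode to the borsh wire format and decode it back, which is what the
    // borsh-encoded proxy connections require for every PeerMessage variant.
    let mut bytes = Vec::new();
    dv.serialize(&mut bytes)?;
    let decoded = DistanceVector::try_from_slice(&bytes)?;
    assert_eq!(dv, decoded);
    Ok(())
}
```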
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
nikurt added a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 
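
The boundary-account mechanics are easy to see in miniature (simplified stand-in types and made-up boundary names, not the real `ShardLayout` API or the actual mainnet boundaries): N sorted boundary accounts split the account-id space into N + 1 shards, so adding one boundary to the 4-shard layout yields 5 shards.

```rust
/// Simplified stand-in for ShardLayout::V1: a version plus sorted boundary accounts.
/// (The real V1 layout also carries a shards_split_map used during resharding, omitted here.)
#[derive(Debug)]
struct SimpleShardLayout {
    version: u32,
    boundary_accounts: Vec<String>,
}

impl SimpleShardLayout {
    fn num_shards(&self) -> usize {
        self.boundary_accounts.len() + 1
    }

    /// An account belongs to the first shard whose boundary it sorts before.
    fn account_to_shard(&self, account_id: &str) -> usize {
        self.boundary_accounts
            .iter()
            .position(|boundary| account_id < boundary.as_str())
            .unwrap_or(self.boundary_accounts.len())
    }
}

fn main() {
    let v1 = SimpleShardLayout {
        version: 1,
        boundary_accounts: vec!["bbb.near".into(), "kkk.near".into(), "ppp.near".into()],
    };
    // SimpleNightshadeV2 (sketched): same structure, bumped version, one extra boundary account.
    let v2 = SimpleShardLayout {
        version: 2,
        boundary_accounts: vec![
            "bbb.near".into(),
            "kkk.near".into(),
            "ppp.near".into(),
            "ttt.near".into(),
        ],
    };
    assert_eq!((v1.num_shards(), v2.num_shards()), (4, 5));
    assert_eq!(v2.account_to_shard("mmm.near"), 2);
}
```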

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (near#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba
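
For context, a minimal example of what the new lint flags and how it is typically addressed (a scoped allow or explicit checked arithmetic); this is generic Rust, not nearcore code:

```rust
#![deny(clippy::arithmetic_side_effects)]

fn add_balances(a: u128, b: u128) -> u128 {
    // A plain `a + b` would trip the lint; checked_add makes the overflow behaviour explicit.
    a.checked_add(b).expect("balance overflow")
}

// A scoped allow is the other common remedy, e.g. around unchecked
// curve25519-dalek style operations whose intended semantics are unclear.
#[allow(clippy::arithmetic_side_effects)]
fn unchecked_demo(a: u64, b: u64) -> u64 {
    a + b
}

fn main() {
    println!("{}", add_balances(1, 2));
    println!("{}", unchecked_demo(3, 4));
}
```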

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (near#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (near#9279)

This exposes more RocksDB properties as prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.
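
A minimal sketch of the general technique (using the `rocksdb` and `prometheus` crates directly; the property and metric names below are just examples, not the ones added to nearcore):

```rust
use prometheus::{IntGauge, Registry};
use rocksdb::{Options, DB};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/rocksdb-metrics-demo")?;

    let registry = Registry::new();
    let gauge = IntGauge::new(
        "rocksdb_block_cache_usage_bytes",
        "Memory used by the RocksDB block cache",
    )?;
    registry.register(Box::new(gauge.clone()))?;

    // RocksDB exposes integer-valued internal properties by name; copy one into the gauge.
    if let Some(value) = db.property_int_value("rocksdb.block-cache-usage")? {
        gauge.set(value as i64);
    }
    Ok(())
}
```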

* chain: remove deprecated near_peer_message_received_total metric (near#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.
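
The nearcore test formatter mentioned above is custom, but the same "seconds since start" effect can be sketched with tracing-subscriber's built-in uptime timer:

```rust
use tracing_subscriber::fmt::time::uptime;

fn main() {
    tracing_subscriber::fmt()
        .with_timer(uptime()) // prints e.g. "1.075s" instead of a full RFC 3339 timestamp
        .with_max_level(tracing::Level::DEBUG)
        .init();

    // Structured fields end up in the log line much like the examples above.
    tracing::debug!(block_height = 23, shard_id = 1, "process_state_update");
}
```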

* nearcore: remove old deprecation notice about network.external_address (near#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (near#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (near#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`
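
A tiny illustration of the simplification (stand-in enum and borsh encoding for concreteness; not the actual storage code): the `Option` wrapper adds a tag byte and an unwrap at every read site while carrying no information, since a missing column entry already means "no progress recorded".

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Stand-in for the real StateSyncDumpProgress enum.
#[allow(dead_code)]
#[derive(BorshSerialize, BorshDeserialize, Debug)]
enum StateSyncDumpProgress {
    AllDumped { epoch_height: u64 },
    InProgress { parts_dumped: u64 },
}

fn main() -> std::io::Result<()> {
    let progress = StateSyncDumpProgress::InProgress { parts_dumped: 42 };

    // After the change: the value is stored directly.
    let mut direct = Vec::new();
    progress.serialize(&mut direct)?;

    // Before the change: the same value wrapped in Some(...) before serialization.
    let mut wrapped = Vec::new();
    Some(progress).serialize(&mut wrapped)?;

    // The wrapper costs one tag byte per entry and an extra unwrap per read.
    assert_eq!(wrapped.len(), direct.len() + 1);
    Ok(())
}
```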

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (near#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac because on Mac, multiprocessing.Process uses spawn, not fork, and our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (near#9320)

In near#9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediately next one.

The fix is to check which protocol version the binary supports and, depending on that, reshard from V0 to V1 or from V1 to V2.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommend that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see near#9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using a `perf-state` tool

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly.

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs.

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for but either way it's very confusing (it could be download). I'm not sure I fully appreciate the difference between state sync, catchup and download and I'm open for a better suggestion how to rename those.
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened down lints and replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about `curve25519-dalek` crate which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?) so I simply allowed the lint for now, but somebody should definitely take a look at it in the future cc @abacabadabacaba

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change? 

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237. No fix is available yet.

* feat: add database tool subcommand for State read perf testing (near#9276)

This PR adds a tool used to evaluate State read performance as part of `neard database` CLI. For more details on the approach see [the Methodology section](near#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (near#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Jul 26, 2023
In near#9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after near#9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
nikurt pushed a commit that referenced this pull request Jul 28, 2023
### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
nikurt pushed a commit that referenced this pull request Aug 24, 2023
### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

### Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
nikurt added a commit that referenced this pull request Aug 24, 2023
* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (#9274)

Introduced new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2` and added it to nightly. 

Refactored the AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it. 

Note that I'm only hiding the SimpleNightshadeV2 behind the rust feature, I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature. 

I tested it manually and some sort of resharding did happen. I'm yet to fully appreciate what exactly happened and if it's any good, as well as add some proper tests. I'll do that in separate PRs. 

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page 
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could be "download"). I'm not sure I fully appreciate the difference between state sync, catchup and download, and I'm open to a better suggestion for how to rename those. 
- In the LocalnetCmd I added logic to generate default LogConfig - to get rid of a pesky log message about this config missing when starting neard. 
- In docs, renamed `SyncJobActor` to `SyncJobsActor` which is the correct name. 
- Allowing the `stable_hash` to be unused. It's only unused on macOS so we need to keep it but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build be also part of PR-buildkite? 
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (#9299)

* near-vm-runner: move protocol-sensitive error schemas to near-primitives (#9295)

This allows us to drop a dependency on the `near-account-id` and `near-rpc-error-macro` crates and brings us ever-so-slightly closer to having a contract runtime suitable for limited replayability.

But more importantly this also solves a long-term pain point in the contract runtime where we never really felt too confident modifying errors that are output from the contract runtime due to our fears about it possibly affecting the protocol output. Now that the schemas are outside of `nearcore/runtime` there's also a neat rule of thumb: anything goes inside `nearcore/runtime` (as far as errors are concerned.)

* rust: 1.70.0 -> 1.71.0 (#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened-down lints and the replacement of the `clippy::integer_arithmetic` lint with the more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?), so I simply allowed the lint for now, but somebody should definitely take a look at it in the future. cc @abacabadabacaba
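
For readers hitting the same lint, a small generic illustration of the two usual ways to satisfy it (this is not the actual change made around `curve25519-dalek`):

```rust
/// Preferred: make the overflow behaviour explicit, so the lint has nothing to flag.
fn add_checked(a: u64, b: u64) -> Option<u64> {
    a.checked_add(b)
}

/// Fallback: where the intended semantics are not yet understood,
/// allow the lint locally and leave a note to revisit it.
#[allow(clippy::arithmetic_side_effects)]
fn add_unchecked(a: u64, b: u64) -> u64 {
    a + b
}
```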

* fix(state-sync): Always use flat storage when catching up (#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S next epoch E, then it downloads its state (E-1) and applies chunks in order. To apply chunks correctly in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* chore(estimator): remove TTN read estimation (#9307)

Since we have flat storage for reads, we no longer charge for touched trie nodes (TTN) on reads.
Remove the gas estimation for it.

More specifically, we used to estimate the TTN cost as `max(read_ttn, write_ttn)` and therefore had 3 numbers reported (read, write, combined).
Now we only need a single number reported.
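
Spelled out as code (illustrative only):

```rust
/// The combined estimation used to be the max of the read- and write-side
/// touched-trie-node estimations; with reads no longer charged, only the
/// write-side number remains relevant.
fn combined_ttn_cost(read_ttn_cost: u64, write_ttn_cost: u64) -> u64 {
    read_ttn_cost.max(write_ttn_cost)
}
```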

The removed code (read TTN estimation) also didn't work anymore, as it didn't actually touch any trie nodes, and hence an assertion was triggered.

```
thread 'main' panicked at 'assertion failed: nodes_touched_delta as usize >= 2 * final_key_len - 10', runtime/runtime-params-estimator/src/trie.rs:118:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:114:5
   3: runtime_params_estimator::touching_trie_node_read
   4: runtime_params_estimator::touching_trie_node
   5: runtime_params_estimator::run_estimation
   6: runtime_params_estimator::main
```

We "fix" it by removing the code.

* feat: expose more RocksDB properties (#9279)

This exposes more RocksDB properties as Prometheus metrics to enable better observability around RocksDB internals: [grafana dashboard](https://nearinc.grafana.net/d/e6676bfd-2eca-46f4-91eb-02cb1714e058/rocksdb-internals).
In particular this enables us to track total RocksDB memory usage, which is useful to look at when making RocksDB configuration changes or troubleshooting increased neard memory consumption. See [the dashboard](https://nearinc.grafana.net/d/f0afab7d-1333-4234-9161-598911f64328/rocksdb-ram-usage) for more details.

* chain: remove deprecated near_peer_message_received_total metric (#9312)

The metric has been deprecated since 1.30.  Users should use
near_peer_message_received_by_type_total instead.

* refactor: improvements to logging (#9309)

There are plenty of log lines that don't fit in a single line, even on a quite wide monitor. This is an attempt to improve that. 
- Removed a few variables in tracing spans that were redundant - already included in parent span.
- Removed `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints full date, time, and nanoseconds. 
- Mini refactor of the sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the spam log since I can't parse it. 
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

For any of those I can be convinced otherwise, please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a sidenote, I quite like tracing spans but we may be overdoing it a bit.

* nearcore: remove old deprecation notice about network.external_address (#9315)

Users have had enough time to update their config files to no longer
specify network.external_address.  The comment dictates the warning
should be removed by the end of 2022 which was half a year ago.

* fix(state-sync): Test showing that state sync can't always generate state parts (#9294)

Extracted a test from #9237 . No fix is available yet.

* fix(locust): wait for base on_locust_init() to finish before other init fns (#9313)

the base on_locust_init() function sets
`environment.master_funding_account`, and other init functions expect
it to be set when they're run. When that isn't the case, you can get
this sort of error:

```
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/locust/event.py", line 40, in fire
    handler(**kwargs)
  File "/home/ubuntu/nearcore/pytest/tests/loadtest/locust/common/social.py", line 261, in on_locust_init
    funding_account = environment.master_funding_account
AttributeError: 'Environment' object has no attribute 'master_funding_account
```

This error can even happen in the master, before the workers have been
started, and it might be related to this issue (which has been closed
due to inactivity):
locustio/locust#1730. That bug mentions that
`User`s get started before on_locust_init() runs, but maybe for similar
reasons, we can't guarantee the order in which each on_locust_init()
function will run.  This doesn't seem to happen every time, and it
hasn't really been triggered on MacOS, only on Linux. But this makes
it kind of a blocker for setting this test up on cloud VMs (where this
bug has been observed)

* fix(state-sync): Simplify storage format of state sync dump progress (#9289)

No reason why `StateSyncDumpProgress` had to be stored as `Some(x)` instead of simply `x`

* Fix proxy-based nayduck tests so that they can run on non-unix systems. (#9314)

Before this, running proxy-based nayduck tests (such as proxy_simple.py) fails on Mac, where multiprocessing.Process uses spawn rather than fork; our tests were written in a way that was unfriendly to spawn:

1. the entry point was not protected by `if __name__ == '__main__':`, causing spawned processes to re-execute the main module's code;
2. shared memory was not properly passed to the child process - we relied on referencing the same global variable which only worked with the fork implementation.

This PR fixes these. Also, re-enable two tests which are now fixed.

* fix: fixed nayduck test state_sync_fail.py for nightly build (#9320)

In #9274 I introduced simple nightshade V2 layout and added it to the nightly build. This broke the nayduck test state_sync_fail.py. Here is the fix for it. 

The test performs resharding and then checks some postconditions. It broke because it attempted to reshard from the V0 shard layout to the V2 shard layout. This doesn't work because ShardLayout contains a shard split map that only makes sense when resharding from one shard layout version to the immediate next. 

The fix is to check which protocol version the binary supports and, depending on that, reshard either from V0 to V1 or from V1 to V2.
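
A minimal sketch of that decision; the constant below is a placeholder, not the real protocol version number:

```rust
/// Placeholder for the protocol version that enables SimpleNightshadeV2.
const SIMPLE_NIGHTSHADE_V2_PROTOCOL_VERSION: u32 = 100;

/// Decide which resharding the test should exercise, based on the protocol
/// version supported by the binary under test.
fn resharding_shard_layout_versions(binary_protocol_version: u32) -> (u32, u32) {
    if binary_protocol_version >= SIMPLE_NIGHTSHADE_V2_PROTOCOL_VERSION {
        (1, 2) // nightly binary: reshard from shard layout V1 to V2
    } else {
        (0, 1) // stable binary: reshard from shard layout V0 to V1
    }
}
```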

* feat: add database tool subcommand for State read perf testing (#9276)

This PR adds a tool for evaluating State read performance as part of the `neard database` CLI. For more details on the approach, see [the Methodology section](#9235).
Also includes some minor refactoring around database tool.

<details>
  <summary>Example executions</summary>

```
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf --help
Run performance test for State column reads

Usage: neard database state-perf [OPTIONS]

Options:
  -s, --samples <SAMPLES>
          Number of requsts to use for the performance evaluation. Increasing this value results in more precise measurements, but longer test execution [default: 10000]
  -w, --warmup-samples <WARMUP_SAMPLES>
          Number of requests to use for database warmup. Those requests will be excluded from the measurements [default: 1000]
  -h, --help
          Print help
ubuntu@pugachag-mainnet:~/nearcore$ ./target/quick-release/neard database state-perf
2023-07-12T10:21:15.258765Z  INFO neard: version="trunk" build="44a09bf39" latest_protocol=62
2023-07-12T10:21:15.292835Z  INFO db: Opened a new RocksDB instance. num_instances=1
Start State perf test
Generate 11000 requests to State
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished requests generation
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 11000/11000
Finished State perf test
overall | avg observed_latency: 1.45039ms, block_read_time: 1.196571ms, samples with merge: 1596 (15.96%)
block_read_count: 0, samples: 7 (0.07%): | avg observed_latency: 36.126µs, block_read_time: 0ns, samples with merge: 4 (57.14%)
block_read_count: 1, samples: 4613 (46.13%): | avg observed_latency: 886.908µs, block_read_time: 790.738µs, samples with merge: 36 (0.78%)
block_read_count: 2, samples: 1962 (19.62%): | avg observed_latency: 1.383988ms, block_read_time: 1.221933ms, samples with merge: 904 (46.08%)
block_read_count: 3, samples: 1375 (13.75%): | avg observed_latency: 1.526996ms, block_read_time: 1.271185ms, samples with merge: 363 (26.40%)
block_read_count: 4, samples: 1361 (13.61%): | avg observed_latency: 1.575212ms, block_read_time: 1.207766ms, samples with merge: 148 (10.87%)
block_read_count: 5, samples: 221 (2.21%): | avg observed_latency: 2.080291ms, block_read_time: 1.660845ms, samples with merge: 89 (40.27%)
block_read_count: 6, samples: 382 (3.82%): | avg observed_latency: 6.281688ms, block_read_time: 4.545931ms, samples with merge: 28 (7.33%)
block_read_count: 7, samples: 41 (0.41%): | avg observed_latency: 6.709164ms, block_read_time: 4.897512ms, samples with merge: 14 (34.15%)
block_read_count: 8, samples: 13 (0.13%): | avg observed_latency: 6.569955ms, block_read_time: 4.73201ms, samples with merge: 7 (53.85%)
block_read_count: 9, samples: 3 (0.03%): | avg observed_latency: 7.457121ms, block_read_time: 5.517267ms, samples with merge: 2 (66.67%)
block_read_count: 10, samples: 22 (0.22%): | avg observed_latency: 9.602637ms, block_read_time: 6.658604ms, samples with merge: 1 (4.55%)

2023-07-12T10:21:46.995873Z  INFO db: Closed a RocksDB instance. num_instances=0
```
</details>

* RoutingTable V2: Distance Vector Routing (#9187)

### Suggested Review Path
1. Browse the (relatively small) changes outside of the `chain/network/src/routing` folder to understand the external surface of the new RoutingTableV2 component.
2. Check out the architecture diagram and event flows documented below.
3. Read the documentation for the EdgeCache component and understand the 3 purposes it serves. The primary role of this component is to support efficient implementation of the routing protocol.
4. Review the RoutingTableV2 component and understand how DistanceVectors are ingested and created. This is the core  of the new routing protocol.
5. Return to the EdgeCache and review its implementation.
6. Revisit the call-sites outside of the routing folder.

###  Architecture
![image](https://github-production-user-asset-6210df.s3.amazonaws.com/3241341/244770041-ee661c90-667c-4db7-b8ac-678c90e75830.png)

### Event Flows
- Network Topology Changes
  - Three Kinds: Peer Connected, Peer Disconnected, received a PeerMessage with new DistanceVector
  - These are triggered by PeerActor and flow into PeerManagerActor then into the demux
  - Demux sends batches of updates (up to every 1 second) to the RoutingTableV2
  - RoutingTable processes entire batch, expires any outdated routes (relying on too-old edges), then generates updated RoutingTableView and local DistanceVector
  - If the local DistanceVector changes, it is then broadcast to all peers
- Handle RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Record the "previous hop" (the peer from which we received this message) in the RouteBackCache
  - Select a "next hop" from the RoutingTableView and forward the message
- Handle response to a RoutedMessage
  - Received by the PeerActor, which calls into PeerManagerActor for routing decisions
  - Fetch the "previous hop" from the RouteBackCache and relay the response back to the originating peer for the original message
- Connection started
  - When two nodes A and B connect, each spawns a PeerActor managing the connection
  - A sends a partially signed edge, which B then signs to produce a complete signed edge
  - B adds the signed edge to its local routing table, triggering re-computation of routes
  - B broadcasts its updated DistanceVector, which provides A (and other nodes) with the signed edge
- Connection stopped
  - Node A loses connection to some node B (either B stopped running, or the specific connection was broken)
  - Node A executes fix_local_edges and notices the lost connection, triggering re-computation of routes
  - A broadcasts its updated DistanceVector, informing other nodes of the latest routes it has
  - If B is still running, it will go through the same steps described for A
  - If B is not running, the other nodes connected to it will process a disconnection (just like A)

### Configurable Parameters
To be finalized after further testing in larger topologies:
- Minimum interval between routing table reconstruction: 1 second
- Time after which edges are considered expired: 30 minutes
- How often to refresh the nonces on edges: 10 minutes
- How often to check consistency of routing table's local edges with the connection pool: every 1 minute

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leafs
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges

* fix: use logging instead of print statements (#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (#9250)

Recommend that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like #9121.

* refactor(loadtest): backwards compatible type hints (#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require newer python for locust tests, also using `match` (see #9125) but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on gcp machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (#9298)

This update brings a lot of new changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt pushed a commit that referenced this pull request Aug 24, 2023
In #9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after #9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.
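
A minimal sketch of what that support amounts to; the enum and its payloads are simplified placeholders rather than the real `PeerMessage` definitions:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Simplified stand-in for the borsh-encoded wire enum: once the DistanceVector
/// variant exists here (and in the to/from conversions), borsh-encoded
/// connections can carry the new message instead of rejecting it.
#[derive(BorshSerialize, BorshDeserialize)]
enum PeerMessageBorsh {
    RoutedMessage(Vec<u8>),
    DistanceVector(Vec<u8>),
}
```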

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023

### Resources
- [Design document](https://docs.google.com/document/d/192NdoknskSLavttwOZk40TSYvx2R1if4xNZ51sCNFkI/edit#heading=h.j4e0bgwl42pg)
- [Zulip thread](https://near.zulipchat.com/#narrow/stream/297663-pagoda.2Fnetwork/topic/Updated.20thoughts.20on.20TIER2.20routing) with further design discussion

#### Future Extensions
- [ ] Set up metrics we want to collect
- [ ] Implement a debug-ui view showing contents of the V2 routing table
- [ ] Implement pruning of non-validator leaves
- [ ] Add handling of unreliable peers
- [ ] Deprecate the old RoutingTable
- [ ] Deprecate negative/tombstone edges
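
As a rough illustration of the batched route recomputation described in the event flows above, here is a minimal sketch of a distance-vector pass (hypothetical `PeerId`/map types and a simplified hop-count metric; not the actual RoutingTableV2/EdgeCache code):

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real PeerId type.
type PeerId = u64;
/// Distances advertised by a peer: target -> hop count from that peer.
type DistanceVector = HashMap<PeerId, u32>;

/// Recompute our own routes from the latest DistanceVector received from
/// each directly connected peer. Returns target -> (next hop, distance).
fn recompute_routes(
    direct_peers: &HashMap<PeerId, DistanceVector>,
) -> HashMap<PeerId, (PeerId, u32)> {
    let mut routes: HashMap<PeerId, (PeerId, u32)> = HashMap::new();
    for (&neighbor, vector) in direct_peers {
        // The neighbor itself is reachable in one hop; everything it
        // advertises is reachable through it in (advertised distance + 1).
        let candidates = std::iter::once((&neighbor, &0)).chain(vector.iter());
        for (&target, &distance) in candidates {
            let through_neighbor = distance + 1;
            let better = match routes.get(&target) {
                Some(&(_, best)) => through_neighbor < best,
                None => true,
            };
            if better {
                routes.insert(target, (neighbor, through_neighbor));
            }
        }
    }
    routes
}

fn main() {
    // Peer 1 is directly connected and advertises a one-hop route to peer 3.
    let mut direct_peers = HashMap::new();
    direct_peers.insert(1, HashMap::from([(3, 1)]));
    let routes = recompute_routes(&direct_peers);
    // Peer 3 is reached via next hop 1 at distance 2.
    assert_eq!(routes.get(&3), Some(&(1, 2)));
}
```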

* fix: use logging instead of print statements (near#9277)

@frol I went through the related code and found this is the only required edit, as we already set up logging services in nearcore.

* refactor: todo to remove flat storage creation parameters (near#9250)

Recommends that future readers stop considering these parameters, because the heavy flat storage migration has already happened on all nodes in the ecosystem, so this case shouldn't complicate work like near#9121.

* refactor(loadtest): backwards compatible type hints (near#9323)

`list[...]` in type hints only works for python 3.9 and up.
For older python versions, we should use `typing.List[...]`.

I first thought we should require a newer python for locust tests, which would also let us use `match` (see near#9125), but it seems we are somewhat dependent on older Ubuntu versions for now. At least I've been checking out code on GCP machines created by terraform templates and needed to patch the type hints to get the code running without installing a new python version.

This PR makes the code fully backward compatible again by simply using the `typing` module which is available since python 3.5.

* feat(state-sync): Add config for number of downloads during catchup (near#9318)

We can limit the impact of state sync during catchup by turning this number down. This way validation of blocks will not be hindered while the node downloads the state.

* chore: Update RocksDB to 0.21 (near#9298)

This update brings a lot of changes:
- Update to RocksDB 8.1.1
- `io_uring` enabled, which can be tested
- Added `load_latest` to open RocksDB with the latest options file
- and other fixes

No degradation was seen using the `perf-state` tool.

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fix(db-tool): Tool to run DB migrations

* fmt

* fmt

* fix(db-tool): Tool to run DB migrations

* feat: simple nightshade v2 - shard layout with 5 shards (near#9274)

Introduced a new protocol version called SimpleNightshadeV2, guarded it behind the rust feature `protocol_feature_simple_nightshade_v2`, and added it to nightly.

Refactored AllEpochConfig::for_protocol_version a bit and added the SimpleNightshadeV2 shard layout to it.

Note that I'm only hiding SimpleNightshadeV2 behind the rust feature; I'm not planning on adding it everywhere. I'm reusing the same ShardLayout::V1 structure, just with a bumped version and an extra boundary account. This should allow for smooth development since we won't need to guard all of the new code behind the new rust feature.

I tested it manually and some sort of resharding did happen. I have yet to fully understand what exactly happened and whether it's any good, as well as to add some proper tests. I'll do that in separate PRs.
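
For illustration only, reusing the V1 structure with a bumped version and one extra boundary account could look roughly like the sketch below; the struct, its fields, the helper, and the placeholder account names are simplified stand-ins, not the actual nearcore ShardLayout API:

```rust
/// Simplified, hypothetical stand-in for the ShardLayout::V1 payload.
#[derive(Clone, Debug)]
struct ShardLayoutV1 {
    /// Accounts marking shard boundaries; N boundaries give N + 1 shards.
    boundary_accounts: Vec<String>,
    /// For each parent shard, the shards it splits into during resharding.
    /// Only meaningful relative to the immediately preceding layout version.
    shards_split_map: Option<Vec<Vec<u64>>>,
    /// Bumped so that the new layout compares as different from the old one.
    version: u32,
}

/// Build the next layout from the V1 layout by adding one boundary account
/// and bumping the version.
fn simple_nightshade_v2(v1: &ShardLayoutV1, extra_boundary: &str) -> ShardLayoutV1 {
    let mut boundary_accounts = v1.boundary_accounts.clone();
    boundary_accounts.push(extra_boundary.to_string());
    boundary_accounts.sort();
    ShardLayoutV1 {
        boundary_accounts,
        // The split map would be rebuilt to describe this split,
        // not copied from the previous resharding.
        shards_split_map: None,
        version: v1.version + 1,
    }
}

fn main() {
    // Placeholder boundary accounts, not the real mainnet ones.
    let v1 = ShardLayoutV1 {
        boundary_accounts: vec!["aaa.near".into(), "mmm.near".into(), "yyy.near".into()],
        shards_split_map: Some(vec![vec![0, 1, 2, 3]]),
        version: 1,
    };
    let v2 = simple_nightshade_v2(&v1, "ppp.near");
    assert_eq!(v2.version, 2);
    assert_eq!(v2.boundary_accounts.len(), 4);
    println!("{v2:?}");
}
```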

test repro instructions:
```
- get the current layout in json by running the print_shard_layout_all test and put it in $SHARD_LAYOUT
- generate localnet setup with 4 shards and 1 validator
- in the genesis file overwrite:
  - .epoch_length=10
  - .use_production_config=true
  - .shard_layout=$SHARD_LAYOUT
- build neard with nightly not enabled
- run neard for at least one epoch
- build neard with nightly enabled
- run neard
- watch resharding happening (only enabled debug logs for "catchup" target)
- see new shard layout in the debug page
```
![Screenshot 2023-07-11 at 15 34 36](https://github.com/near/nearcore/assets/1555986/5b83d645-4fdf-4994-a215-a500c0c0092f)

resharding logs: https://gist.github.com/wacban/7b3a8c74c80f99003c71b92bea44539f

* refactor: small refactorings and improvements (near#9296)

- Renamed a lot of "dl_info" and "to_dl" to "state_sync_info". I'm too afraid to ask what "dl" stands for, but either way it's very confusing (it could mean download). I'm not sure I fully appreciate the difference between state sync, catchup, and download, and I'm open to better suggestions on how to rename those.
- In LocalnetCmd I added logic to generate a default LogConfig, to get rid of a pesky log message about this config missing when starting neard.
- In docs, renamed `SyncJobActor` to `SyncJobsActor`, which is the correct name.
- Allowing `stable_hash` to be unused. It's only unused on macOS, so we need to keep it, but let's not generate a warning. All of the failed builds (red cross) below are due to this. cc @andrei-near shall we add some automation to notify us when builds are failing? Should this build also be part of PR-buildkite?
![Screenshot 2023-07-13 at 15 03 36](https://github.com/near/nearcore/assets/1555986/3adf18bf-6adc-4bf3-9996-55dc2ac8ad68)

* refactor: refactoring and commenting some resharding code (near#9299)

* rust: 1.70.0 -> 1.71.0 (near#9302)

Announcement: https://blog.rust-lang.org/2023/07/13/Rust-1.71.0.html

Notable breakages for us involve tightened lints and the replacement of the `clippy::integer_arithmetic` lint with a more general `clippy::arithmetic_side_effects` lint.

The latter was particularly angry about the `curve25519-dalek` crate, which only exposes unchecked arithmetic operations. I had no clue what the expected behaviour there is (wrapping? a panic?), so I simply allowed the lint for now, but somebody should definitely take a look at it in the future. cc @abacabadabacaba
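
For context, here is a generic illustration (not code from this repository) of the kind of arithmetic the lint flags and two ways to address it, one of which is the local allow used here for now:

```rust
// When `clippy::arithmetic_side_effects` is enabled (e.g. denied in the
// workspace lints), plain integer arithmetic like this gets flagged because
// it panics on overflow in debug builds and wraps in release builds:
fn total(a: u64, b: u64) -> u64 {
    a + b
}

// One option is an explicit overflow policy, e.g. saturating arithmetic:
fn total_saturating(a: u64, b: u64) -> u64 {
    a.saturating_add(b)
}

// Another is to allow the lint locally until the intended behaviour
// (wrapping? checked? panic?) is decided:
#[allow(clippy::arithmetic_side_effects)]
fn total_unchecked(a: u64, b: u64) -> u64 {
    a + b
}

fn main() {
    assert_eq!(total(2, 3), 5);
    assert_eq!(total_saturating(u64::MAX, 1), u64::MAX);
    assert_eq!(total_unchecked(2, 3), 5);
}
```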

* fix(state-sync): Always use flat storage when catching up (near#9311)

The original code made the use of flat storage conditional on the node tracking that shard this epoch.
If a node prepares to track shard S in the next epoch E, then it downloads that shard's state during epoch E-1 and applies chunks in order. To apply chunks correctly, in a way compatible with the rest of the network, it needs to be using flat storage.

Also add a metric for the latest block processed during catchup.
Also fix `view-state apply-range` tool not to fail because of getting delayed indices.
Also reduce verbosity of the inlining migration.

* fix(state-snapshot): Tool to make DB snapshots (near#9308)

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

* refactor: improvements to logging (near#9309)

There are plenty of log lines that don't fit on a single line, even on a quite wide monitor. This is an attempt to improve that.
- Removed a few variables in tracing spans that were redundant, i.e. already included in the parent span.
- Removed the `apply_transactions_with_optional_storage_proof` span that immediately enters `process_state_update` and doesn't provide much value.
- Set the test formatter to use a new custom time formatter that only prints seconds and milliseconds since the test started. The default one prints the full date, time, and nanoseconds. (A rough sketch of such a formatter is included at the end of this note.)
- Mini refactor of sharding_upgrade.rs that I'm just trying to sneak through. These tests are the inspiration for improving the log spam since I can't parse it.
- **RFC: changed the log level of the `process_receipt` log to `trace!`. This is very subjective, but my reasoning is that if a log line appears more than a few times per block, then it should have the trace level.** Since it's runtime related, cc @jakmeier @nagisa, are you fine with that change?

I can be convinced otherwise on any of those, so please shout.

new log lines look like this:

```
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: is next_block_epoch_start false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=2}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=39.2µs time.idle=3.04µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update: runtime: epoch_height=4 epoch_id=EpochId(4kD9) current_protocol_version=48 is_first_block_of_version=false
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=1}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=71.0µs time.idle=2.67µs
 1.075s DEBUG do_apply_chunks{block_height=23 block_hash=9yH4}:new_chunk{shard_id=3}:process_state_update:apply{num_transactions=0}: runtime: close time.busy=62.2µs time.idle=3.58µs
```

(with the exception of hashes, I have them shortened locally, but I'm not including that in this PR) 

On a side note, I quite like tracing spans, but we may be overdoing it a bit.
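
For reference, a compact elapsed-time formatter along these lines can be written against tracing-subscriber's `FormatTime` trait; this is a rough sketch assuming the `tracing` and `tracing-subscriber` crates, not the exact formatter added in this PR:

```rust
use std::fmt::{self, Write as _};
use std::time::Instant;
use tracing_subscriber::fmt::format::Writer;
use tracing_subscriber::fmt::time::FormatTime;

/// Prints seconds and milliseconds elapsed since the formatter was created,
/// instead of the default full date/time with nanoseconds.
struct Uptime {
    start: Instant,
}

impl FormatTime for Uptime {
    fn format_time(&self, w: &mut Writer<'_>) -> fmt::Result {
        let elapsed = self.start.elapsed();
        write!(w, "{}.{:03}s", elapsed.as_secs(), elapsed.subsec_millis())
    }
}

fn main() {
    tracing_subscriber::fmt()
        .with_timer(Uptime { start: Instant::now() })
        .init();
    tracing::info!("log lines now start with a compact elapsed time");
}
```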

* fix(state-sync): Test showing that state sync can't always generate state parts (near#9294)

Extracted a test from near#9237 . No fix is available yet.

* Merge

* Merge

* fmt

* fmt

* fmt

* fmt

* fmt

* fmt

---------

Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: robin-near <111538878+robin-near@users.noreply.github.com>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Yasir <goodwonder5@gmail.com>
Co-authored-by: Aleksandr Logunov <alex.logunov@near.org>
Co-authored-by: Razvan Barbascu <razvan@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
In near#9187 we introduced the first new PeerMessage variant in a long time, called DistanceVector.

I got a little over-zealous about our plans to deprecate borsh and [skipped implementing borsh support for the new message variant](https://github.com/saketh-are/nearcore/blob/2093819d414bd38c73574c681715e3a544daa945/chain/network/src/network_protocol/borsh_conv.rs#L180-L182).

However, it turns out we have some test infrastructure still reliant on borsh-encoded connections:
https://github.com/near/nearcore/blob/6cdee7cc123bdeb00f0d9029b10f8c1448eab54f/pytest/lib/proxy.py#L89-L90

In particular, the nayduck test `pytest/tests/sanity/sync_chunks_from_archival.py` makes use of the proxy tool and [is failing](https://nayduck.near.org/#/test/497500) after near#9187.

This PR implements borsh support for DistanceVector as an immediate fix for the failing test.

In the long run we aim to deprecate borsh entirely, at which time this code (and a bunch of other code much like it) will be removed.
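
For reference, here is a minimal sketch of the kind of borsh round-trip this enables, using hypothetical pared-down types rather than the real `PeerMessage`/`DistanceVector` definitions:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Hypothetical, simplified stand-in for an advertised route entry.
#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
struct AdvertisedPeerDistance {
    peer_id: u64,
    distance: u32,
}

/// Hypothetical, pared-down stand-in for the PeerMessage enum.
#[derive(BorshSerialize, BorshDeserialize, Debug, PartialEq)]
enum PeerMessage {
    // ... existing variants would live here ...
    /// The new variant also needs a borsh representation so that
    /// borsh-encoded (proxy-based) test connections can carry it.
    DistanceVector { distances: Vec<AdvertisedPeerDistance> },
}

fn main() -> std::io::Result<()> {
    let msg = PeerMessage::DistanceVector {
        distances: vec![AdvertisedPeerDistance { peer_id: 1, distance: 2 }],
    };
    // Round-trip through borsh.
    let mut bytes = Vec::new();
    msg.serialize(&mut bytes)?;
    let decoded = PeerMessage::try_from_slice(&bytes)?;
    assert_eq!(msg, decoded);
    Ok(())
}
```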