Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V2 Routing: Refresh distance vector nonces #9651

Merged
merged 5 commits into from
Oct 9, 2023

Conversation

saketh-are
Copy link
Collaborator

@saketh-are saketh-are commented Oct 6, 2023

DistanceVector messages contain timestamped edges, which by default expire after 30 minutes. However, transmission of DistanceVectors is triggered only by changes to the network topology. As a result, nodes in a stable network will needlessly expire their DistanceVectors after 30 minutes. When a DistanceVector expires, the connection with the associated peer is terminated, resulting in unnecessary churn of network connections.

RoutingTableUpdate is an existing message type used by the V1 protocol to periodically flood refreshed edge nonces to the whole network. This PR passes the refreshed nonces from RoutingTableUpdate messages into the local storage for the V2 protocol so that the timestamps for stable DistanceVectors can be kept up to date.

@saketh-are saketh-are marked this pull request as ready for review October 8, 2023 19:15
@saketh-are saketh-are requested a review from a team as a code owner October 8, 2023 19:15
@saketh-are saketh-are requested a review from wacban October 8, 2023 19:15
Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
How damaging was the issue? How did you find it out?

@saketh-are
Copy link
Collaborator Author

The only impact I am aware of is unnecessary termination of connections (not too damaging).

It was noticeable in localnet, which tends to have a very stable topology.

@saketh-are saketh-are added this pull request to the merge queue Oct 9, 2023
Merged via the queue into near:master with commit ae10bee Oct 9, 2023
9 checks passed
@saketh-are saketh-are deleted the routing-timestamps branch October 9, 2023 15:51
nikurt pushed a commit that referenced this pull request Oct 10, 2023
DistanceVector messages contain timestamped edges, which by default
expire after 30 minutes. However, transmission of DistanceVectors is
triggered only by changes to the network topology. As a result, nodes in
a stable network will needlessly expire their DistanceVectors after 30
minutes. When a DistanceVector expires, the connection with the
associated peer is terminated, resulting in unnecessary churn of network
connections.

RoutingTableUpdate is an existing message type used by the V1 protocol
to periodically flood refreshed edge nonces to the whole network. This
PR passes the refreshed nonces from RoutingTableUpdate messages into the
local storage for the V2 protocol so that the timestamps for stable
DistanceVectors can be kept up to date.
nikurt pushed a commit that referenced this pull request Oct 11, 2023
DistanceVector messages contain timestamped edges, which by default
expire after 30 minutes. However, transmission of DistanceVectors is
triggered only by changes to the network topology. As a result, nodes in
a stable network will needlessly expire their DistanceVectors after 30
minutes. When a DistanceVector expires, the connection with the
associated peer is terminated, resulting in unnecessary churn of network
connections.

RoutingTableUpdate is an existing message type used by the V1 protocol
to periodically flood refreshed edge nonces to the whole network. This
PR passes the refreshed nonces from RoutingTableUpdate messages into the
local storage for the V2 protocol so that the timestamps for stable
DistanceVectors can be kept up to date.
marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this pull request Oct 18, 2023
DistanceVector messages contain timestamped edges, which by default
expire after 30 minutes. However, transmission of DistanceVectors is
triggered only by changes to the network topology. As a result, nodes in
a stable network will needlessly expire their DistanceVectors after 30
minutes. When a DistanceVector expires, the connection with the
associated peer is terminated, resulting in unnecessary churn of network
connections.

RoutingTableUpdate is an existing message type used by the V1 protocol
to periodically flood refreshed edge nonces to the whole network. This
PR passes the refreshed nonces from RoutingTableUpdate messages into the
local storage for the V2 protocol so that the timestamps for stable
DistanceVectors can be kept up to date.
marcelo-gonzalez added a commit to marcelo-gonzalez/nearcore that referenced this pull request Apr 8, 2024
the test pytest/tools/mirror/offline_test.py often fails with
many fewer transactions observed than wanted, and the logs
reveal that many transactions are invalid because the access keys
used do not exist in the target chain. This happens because some
early transactions that should have added those keys never make it
on chain. These transactions are sent successfully from the perspective
of the ClientActor, but the logs show that they're dropped by the peer manager:

```
DEBUG handle{handler="PeerManagerMessageRequest" actor="PeerManagerActor" msg_type="NetworkRequests"}: network: Failed sending message: peer not connected to=ed25519:Fz7d1xkkt3XsvTPiwk4JRhMuPru4Ss7cLS8fdhshDRj3 num_connected_peers=1 msg=Routed(RoutedMessageV2 { msg: RoutedMessage { ... body: tx GFW8HgTndXVKdcLHdsCXxURjHxDnnEqHadrbxvsLKVQb ...
```

So, the peer manager is dropping the transaction instead of routing it, and
the test fails because many subsequent transactions depended on that one. A git
bisect shows that this behavior starts after near#9651.
It seems that this failure to route messages happens for a bit longer after startup after
that PR.

The proper way to handle this might be to implement a mechanism whereby these
messages won't just silently be dropped, and the ClientActor can receive a notification
that it wasn't successful so that we can retry it later. But for now a workaround
is to just wait a little bit before sending transactions. So we'll set a 15 second
timer for the first batch of transactions, and then proceed normally with the others
github-merge-queue bot pushed a commit that referenced this pull request Apr 10, 2024
The test pytest/tools/mirror/offline_test.py often fails with many fewer
transactions observed than wanted, and the logs reveal that many
transactions are invalid because the access keys used do not exist in
the target chain. This happens because some early transactions that
should have added those keys never make it on chain. These transactions
are sent successfully from the perspective of the ClientActor, but the
logs show that they're dropped by the peer manager:

```
DEBUG handle{handler="PeerManagerMessageRequest" actor="PeerManagerActor" msg_type="NetworkRequests"}: network: Failed sending message: peer not connected to=ed25519:Fz7d1xkkt3XsvTPiwk4JRhMuPru4Ss7cLS8fdhshDRj3 num_connected_peers=1 msg=Routed(RoutedMessageV2 { msg: RoutedMessage { ... body: tx GFW8HgTndXVKdcLHdsCXxURjHxDnnEqHadrbxvsLKVQb ...
```

So, the peer manager is dropping the transaction instead of routing it,
and the test fails because many subsequent transactions depended on that
one. A git bisect shows that this behavior starts after
#9651. It seems that this failure
to route messages happens for a bit longer after startup after that PR.
The proper way to handle this might be to implement a mechanism whereby
these messages won't just silently be dropped, and the ClientActor can
receive a notification that it wasn't successful so that we can retry it
later. But for now a workaround is to just wait a little bit before
sending transactions. So we'll set a 15 second timer for the first batch
of transactions, and then proceed normally with the others
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants