
authority-discovery: Make changing of peer-id while active a bit more robust #3786

Open

alexggh wants to merge 46 commits into master
Conversation

@alexggh (Contributor) commented Mar 21, 2024

When nodes don't persist their node-key, or they generate a new one while in the active set, things go wrong: both the old addresses and the new ones remain present in the DHT, and because of its distributed nature both survive in the network until the old records expire, which takes 36 hours. In the meantime, nodes in the network will randomly resolve the authority-id to either the old address or the new one.

More details in: #3673

This PR proposes we mitigate this problem by:

  1. Letting the query for a DHT key retrieve more than one result (4), bounded by the replication factor, which is 20; currently we interrupt the query at the first result.
  2. Modifying the authority-discovery service to keep all discovered addresses around for 24h after an address was last seen.
  3. Plumbing this through the other subsystems where the assumption was that an authority-id resolves to exactly one PeerId. Currently, authority-discovery keeps just the last record it received from the DHT and queries the DHT every 10 minutes, so a node could keep receiving only the old address, only the new address, or a flip-flop between them, depending on which node wins the race to provide the record.
  4. Extending the SignedAuthorityRecord with a signed creation_time.
  5. Modifying authority-discovery to keep track of the nodes that sent us an old record and, once we learn of a newer record, updating those nodes with it.
  6. Updating gossip-support to try to resolve authorities more often than once per session.

Together (see the sketch below), this gives nodes in the network many more chances to discover not only a node's old address but also its new one, and it should improve the time it takes for a node to become properly connected. The behaviour won't be deterministic, because there is no guarantee that all nodes will see the new record at least once; they could query only nodes that hold the old one.
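To make the bookkeeping in points 2, 4 and 5 concrete, here is a minimal, hypothetical sketch; the names (`AddressBook`, `DiscoveredRecord`, `prune_expired`) are illustrative stand-ins, not the actual types in this PR:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Stand-ins for the real authority-discovery / network identifier types.
type AuthorityId = [u8; 32];
type PeerId = [u8; 32];

/// Keep an address for 24h after it was last seen (point 2).
const ADDRESS_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// One discovered record: where an authority can be reached, the signed
/// creation time carried in the record (point 4), and when we last saw it.
struct DiscoveredRecord {
    peer_id: PeerId,
    creation_time: u64,
    last_seen: Instant,
}

#[derive(Default)]
struct AddressBook {
    records: HashMap<AuthorityId, Vec<DiscoveredRecord>>,
}

impl AddressBook {
    /// Keep every address we discover, not just the latest one, so an
    /// authority may temporarily resolve to more than one PeerId.
    fn on_record(&mut self, authority: AuthorityId, record: DiscoveredRecord) {
        let entries = self.records.entry(authority).or_default();
        match entries.iter().position(|r| r.peer_id == record.peer_id) {
            Some(i) => entries[i].last_seen = record.last_seen,
            None => entries.push(record),
        }
    }

    /// Drop addresses that have not been seen for 24h.
    fn prune_expired(&mut self, now: Instant) {
        for entries in self.records.values_mut() {
            entries.retain(|r| now.duration_since(r.last_seen) < ADDRESS_TTL);
        }
        self.records.retain(|_, v| !v.is_empty());
    }

    /// When we must pick one record, prefer the newest signed creation_time.
    fn best(&self, authority: &AuthorityId) -> Option<&DiscoveredRecord> {
        self.records
            .get(authority)?
            .iter()
            .max_by_key(|r| r.creation_time)
    }
}
```

Point 5 then amounts to remembering which peers served a record older than `best(...)` and pushing the newer record back to them.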

TODO

  • Add unit tests for the new paths.
  • Make sure the implementation is backwards compatible.
  • Evaluate whether there are any bad consequences of letting the query continue rather than terminating it at the first record found.
  • Bake the new changes on Versi.

@alexggh added the T0-node and T8-polkadot labels May 1, 2024
@@ -646,3 +646,6 @@ wasmi = { opt-level = 3 }
x25519-dalek = { opt-level = 3 }
yamux = { opt-level = 3 }
zeroize = { opt-level = 3 }

[patch."https://github.com/paritytech/litep2p"]
@alexggh (Contributor Author): Will be removed before merging, once paritytech/litep2p#96 gets merged.

github-merge-queue bot pushed a commit that referenced this pull request May 2, 2024
This PR updates the litep2p crate to the latest version.

This fixes the build for developers that want to perform `cargo update`
on all their dependencies:
#4343, by porting the
latest changes.

The peer records were introduced to litep2p to be able to distinguish
and update peers with outdated records.
It is going to be properly used in substrate via #3786; however, that
is pending paritytech/litep2p#96 being merged on litep2p master.

Closes: #4343

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@paritytech paritytech deleted a comment from paritytech-cicd-pr May 2, 2024
@alexggh (Contributor Author) commented May 2, 2024

  • Added parity between the litep2p and libp2p backends; thank you again @lexnv for adding the missing functionality in litep2p.

Ran some tests where a peer is restarted with a different PeerID, and all peers converge to the new address.

@dmitry-markin @lexnv @bkchr when you've got time, can I get your reviews? Thank you!

jmg-duarte pushed a commit to eigerco/polkadot-sdk that referenced this pull request May 8, 2024 (the same litep2p update commit as above).
@alexggh (Contributor Author) commented May 10, 2024

Ran the pull request on Versi, things work as expected.

bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request May 10, 2024 (the same litep2p update commit as above).
@dmitry-markin (Contributor) left a comment
LGTM 👍

@@ -92,8 +92,12 @@ const MAX_KNOWN_EXTERNAL_ADDRESSES: usize = 32;
/// record is replicated to.
pub const DEFAULT_KADEMLIA_REPLICATION_FACTOR: usize = 20;

// The minimum number of peers from which we expect an answer before we terminate the request.
const GET_RECORD_REDUNDANCY_FACTOR: u32 = 4;
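For illustration, a minimal sketch of the behaviour this constant changes (the types here are stand-ins, not the actual sc-network query machinery): previously a GET_VALUE query resolved at the first record found; with a redundancy factor, the query keeps accepting records until it has collected four of them or otherwise finishes.

```rust
const GET_RECORD_REDUNDANCY_FACTOR: u32 = 4;

/// Stand-in for a DHT record returned by a peer during a GET_VALUE query.
struct Record {
    value: Vec<u8>,
}

/// Collects the records returned for one query and decides when to stop.
struct GetRecordQuery {
    found: Vec<Record>,
}

impl GetRecordQuery {
    fn new() -> Self {
        Self { found: Vec::new() }
    }

    /// Called for every record a peer returns. Returns `true` once the
    /// query has enough results and can be terminated early, instead of
    /// terminating at the very first record as before.
    fn on_record(&mut self, record: Record) -> bool {
        self.found.push(record);
        self.found.len() as u32 >= GET_RECORD_REDUNDANCY_FACTOR
    }
}
```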
@dmitry-markin: Maybe create an issue about making this configurable per query, as it's done now in litep2p? Then we'll be able to move this constant and the constant in the litep2p backend to a single place in authority-discovery.

@alexggh (Contributor Author), May 12, 2024: Will add a ticket for this future improvement before merging.

koushiro pushed a commit to koushiro-contrib/polkadot-sdk that referenced this pull request May 11, 2024 (the same litep2p update commit as above).
alexggh and others added 5 commits May 12, 2024
@bkchr (Member) left a comment

Changing the network protocol should be done via an RFC.

@@ -0,0 +1,33 @@
syntax = "proto3";

package authority_discovery_v3;
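For orientation, a hypothetical Rust sketch of what the v3 record conceptually carries per the PR description (the field names are illustrative, not the actual protobuf schema): the addresses plus a creation_time that sits inside the signed payload, so peers can order competing records for the same authority.

```rust
/// Conceptual shape of a v3 authority record; illustrative only.
struct AuthorityRecord {
    /// Multiaddresses the authority is reachable on, encoded as bytes.
    addresses: Vec<Vec<u8>>,
    /// When the record was created. Because it is part of the signed
    /// payload, third parties cannot forge a "newer" stale record.
    creation_time: u64,
}

/// The signed wrapper that travels through the DHT.
struct SignedAuthorityRecord {
    /// Serialized `AuthorityRecord` (addresses + creation_time).
    record: Vec<u8>,
    /// Signature by the authority's authority-discovery key over `record`.
    auth_signature: Vec<u8>,
}

/// Of two already-verified records for the same authority,
/// keep the one with the newer signed creation time.
fn pick_newer(a: AuthorityRecord, b: AuthorityRecord) -> AuthorityRecord {
    if a.creation_time >= b.creation_time { a } else { b }
}
```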
@bkchr: Yes, this should go through an RFC. While it stays compatible, it is still a change to the protocol.

@alexggh commented May 15, 2024

#3786 (comment): "Yes, this should go through an RFC. While it stays compatible, it is still a change to the protocol."

@bkchr: That's unfortunate, I was hoping to fix this scenario sooner rather than later; somehow I missed that this change would warrant an RFC. I'll prepare one in the coming days.

@bkchr (Member) commented May 15, 2024

The RFC should be quite straightforward. This changes the protocol, and while it stays compatible it should go through an RFC.

@lexnv also had this thought. Sorry for taking so long to look at this!

@alexggh commented May 20, 2024

"The RFC should be quite straightforward. This changes the protocol, and while it stays compatible it should go through an RFC. @lexnv also had this thought. Sorry for taking so long to look at this!"

Posted the RFC here: polkadot-fellows/RFCs#91

Labels
T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

4 participants