New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
authorithy-discovery: Make changing of peer-id while active a bit more robust #3786
base: master
Are you sure you want to change the base?
Conversation
…e robust In the case when nodes don't persist their node-key or they want to generate a new one while being in the active set, things go wrong because both the old addresses and the new ones will still be present in DHT, so because of the distributed nature of the DHT both will survive in the network untill the old ones expires which is 36 hours. Nodes in the network will randomly resolve the authorithy-id to the old address or the new one. More details in: #3673 This PR proposes we mitigate this problem, by: 1. Let the query for a DHT key retrieve all the results, that is usually bounded by the replication factor which is 20, currently we interrupt the querry on the first result. 2. Modify the authority-discovery service to keep all the discovered addresses around for 24h since they last seen an address. 3. Plumb through other subsystems where the assumption was that an authorithy-id will resolve only to one PeerId. Currently, the authorithy-discovery keeps just the last record it received from DHT and queries the DHT every 10 minutes. But they could always receive only the old address, only the new address or a flip-flop between them depending on what node wins the race to provide the record 4. Update gossip-support to try resolve authorities more often than every session. This would gives us a lot more chances for the nodes in the networks to also discover not only the old address of the node but also the new one and should improve the time it takes for a node to be properly connected in the network. The behaviour won't be deterministic because there is no guarantee the all nodes will see the new record at least once, since they could query only nodes that have the old one. Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
…_node_id_at_restart
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
…_node_id_at_restart
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@@ -646,3 +646,6 @@ wasmi = { opt-level = 3 } | |||
x25519-dalek = { opt-level = 3 } | |||
yamux = { opt-level = 3 } | |||
zeroize = { opt-level = 3 } | |||
|
|||
[patch."https://github.com/paritytech/litep2p"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be remove before merging once: paritytech/litep2p#96, gets merged.
This PR updates the litep2p crate to the latest version. This fixes the build for developers that want to perform `cargo update` on all their dependencies: #4343, by porting the latest changes. The peer records were introduced to litep2p to be able to distinguish and update peers with outdated records. It is going to be properly used in substrate via: #3786, however that is pending the commit to merge on litep2p master: paritytech/litep2p#96. Closes: #4343 --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
…_node_id_at_restart
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Ran some tests where peer is restarted with a different PeerID and all peers coverge to the new address. @dmitry-markin @lexnv @bkchr when you've got time, can I get your reviews, thank you! |
This PR updates the litep2p crate to the latest version. This fixes the build for developers that want to perform `cargo update` on all their dependencies: paritytech#4343, by porting the latest changes. The peer records were introduced to litep2p to be able to distinguish and update peers with outdated records. It is going to be properly used in substrate via: paritytech#3786, however that is pending the commit to merge on litep2p master: paritytech/litep2p#96. Closes: paritytech#4343 --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
…_node_id_at_restart
Ran the pull request on Versi, things work as expected. |
This PR updates the litep2p crate to the latest version. This fixes the build for developers that want to perform `cargo update` on all their dependencies: paritytech#4343, by porting the latest changes. The peer records were introduced to litep2p to be able to distinguish and update peers with outdated records. It is going to be properly used in substrate via: paritytech#3786, however that is pending the commit to merge on litep2p master: paritytech/litep2p#96. Closes: paritytech#4343 --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
@@ -92,8 +92,12 @@ const MAX_KNOWN_EXTERNAL_ADDRESSES: usize = 32; | |||
/// record is replicated to. | |||
pub const DEFAULT_KADEMLIA_REPLICATION_FACTOR: usize = 20; | |||
|
|||
// The minimum number of peers we expect an answer before we terminate the request. | |||
const GET_RECORD_REDUNDANCY_FACTOR: u32 = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be create an issue about making this configurable per query, as it's done now in litep2p
? Then we'll be able to move this constant and the constant in litep2p-backend to a single place in authority-discovery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a ticket for this future improvement before merging.
This PR updates the litep2p crate to the latest version. This fixes the build for developers that want to perform `cargo update` on all their dependencies: paritytech#4343, by porting the latest changes. The peer records were introduced to litep2p to be able to distinguish and update peers with outdated records. It is going to be properly used in substrate via: paritytech#3786, however that is pending the commit to merge on litep2p master: paritytech/litep2p#96. Closes: paritytech#4343 --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the network protocol should be done via a RFC.
@@ -0,0 +1,33 @@ | |||
syntax = "proto3"; | |||
|
|||
package authority_discovery_v3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this should go through an RFC. While this being compatible, it is still a change to the protocol.
@bkchr: That's unfortunate, I was hoping to fix this scenario sooner rather than later, somehow I missed that this change would warrant an RFC, I'll prepare one these days. |
The RFC should be quite forward. This changes the protocol and while it stays compatible it should go through a RFC. @lexnv also had this thought. Sorry for taking that long to look at this! |
Posted the RFC here: polkadot-fellows/RFCs#91 |
In the case when nodes don't persist their node-key or they want to generate a new one while being in the active set, things go wrong because both the old addresses and the new ones will still be present in DHT, so because of the distributed nature of the DHT both will survive in the network untill the old ones expires which is 36 hours. Nodes in the network will randomly resolve the authorithy-id to the old address or the new one.
More details in: #3673
This PR proposes we mitigate this problem, by:
Let the query for a DHT key retrieve more than one results(4), that is also bounded by the replication factor which is 20, currently we interrupt the querry on the first result.
2. Modify the authority-discovery service to keep all the discovered addresses around for 24h since they last seen an address.3. Plumb through other subsystems where the assumption was that an authorithy-id will resolve only to one PeerId. Currently, the authorithy-discovery keeps just the last record it received from DHT and queries the DHT every 10 minutes. But they could always receive only the old address, only the new address or a flip-flop between them depending on what node wins the race to provide the recordExtend the
SignedAuthorityRecord
with a signed creation_time.Modify authority discovery to keep track of nodes that sent us old record and once we are made aware of a new record update the nodes we know about with the new record.
Update gossip-support to try resolve authorities more often than every session.
This would gives us a lot more chances for the nodes in the networks to also discover not only the old address of the node but also the new one and should improve the time it takes for a node to be properly connected in the network. The behaviour won't be deterministic because there is no guarantee the all nodes will see the new record at least once, since they could query only nodes that have the old one.TODO