Local nodes keep disconnecting. #4272
Comments
For what it's worth, disabling the discovery mechanism fixes the issue. It's still unclear to me what is happening.
What seems to happen is the following:
This sequence more or less repeats ad infinitum, since Node 2's discovery will continue to encounter these peer IDs differing from that of Node 1 under the same address. If I'm not mistaken, this seems to be an interesting way for a node C to directly influence the connectivity between some nodes A and B, e.g. by advertising public addresses of A in the DHT under its own peer ID. When node B picks these up during lookups, it would disturb the connection between A and B in the above manner. This may be primarily a pitfall of the single-connection-per-node policy, though it may possibly be prevented by establishing a preference for the old (existing) connection if a node receives a second connection as a listener while it is already the listener on the existing connection, but I'm not entirely clear about all the possible consequences at the moment.
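For illustration, a minimal sketch of that tie-break idea, i.e. keeping the old connection when the node is already the listener on it; the `Role`/`Conn` types and `resolve_duplicate` are hypothetical and not part of rust-libp2p's actual API:

```rust
// Hypothetical sketch of the tie-break suggested above. Not rust-libp2p code.

#[derive(Clone, Copy)]
enum Role {
    Dialer,
    Listener,
}

struct Conn {
    role: Role,
}

/// Returns `true` if a newly established duplicate connection should replace
/// the existing one, `false` if the existing connection should be kept.
fn resolve_duplicate(existing: &Conn, new: &Conn) -> bool {
    match (existing.role, new.role) {
        // Already the listener on the existing connection and again the
        // listener on the new one: keep the old connection, so a third party
        // advertising this peer's address cannot disturb it.
        (Role::Listener, Role::Listener) => false,
        // Otherwise fall back to preferring the newer connection, on the
        // assumption that the old one may be dead.
        _ => true,
    }
}
```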
The reason is that typically, when a node opens a new connection, it's because the old one is dead. Example situation: Node 1 and Node 2 are connected. Node 2 loses its Internet connection, realizes it, and kills all existing sockets. No FIN is actually sent, because there is no Internet access. Node 2 then gains back its Internet connection and tries to re-connect to Node 1. Node 1 isn't aware that the previous connection is dead.
In my opinion the solution is to handle multiple simultaneous connections per node (libp2p/rust-libp2p#912).
Doesn't libp2p have a keep-alive or ping protocol to handle this?
It does, but it takes something like 30 seconds to trigger. I'm not actually sure that my scenario above is realistic, but the general idea is that we expect that when a node opens a second connection, it is because the existing one is unusable.
Note though that even when permitting multiple connections per peer, which I'm currently looking into, you will want to have a configurable limit (per peer). In a sense, the current single-connection-per-peer policy can be seen as a hard-coded limit of 1. Whatever the limit, I don't think it is a good idea to enforce it by dropping existing connections in favor of new ones at the lower networking layers. Rather, timely detection of broken connections is up to the application protocols (or to timeouts configured on a lower-layer protocol), and in particular, what "timely" means exactly is defined by the requirements of the protocol. The ping protocol can be aptly configured and used for this purpose, if desired.
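As an illustration of that last point, here is a rough sketch of a tightened ping configuration so that dead connections are detected well before the ~30 seconds mentioned earlier. The `PingConfig` builder methods reflect the rust-libp2p API of roughly this era; the exact names and defaults may differ between versions, and the concrete values are arbitrary:

```rust
use std::{num::NonZeroU32, time::Duration};

use libp2p::ping::{Ping, PingConfig};

/// Build a ping behaviour that notices dead connections within a few seconds.
fn make_ping() -> Ping {
    Ping::new(
        PingConfig::new()
            .with_interval(Duration::from_secs(5)) // send a ping every 5 s
            .with_timeout(Duration::from_secs(5)) // consider a ping failed after 5 s
            .with_max_failures(NonZeroU32::new(2).unwrap()) // close after 2 consecutive failures
            .with_keep_alive(true), // keep otherwise idle connections alive
    )
}
```

The point being that how quickly a broken connection must be noticed is an application/protocol decision, rather than something the connection manager should guess at by replacing connections.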
Instead of trying to enforce a single connection per peer, which involves quite a bit of additional complexity e.g. to prioritise simultaneously opened connections and can have other undesirable consequences [1], we now make multiple connections per peer a feature. The gist of these changes is as follows:

The concept of a "node" with an implicit 1-1 correspondence to a connection has been replaced with the "first-class" concept of a "connection". The code from `src/nodes` has moved (with varying degrees of modification) to `src/connection`. A `HandledNode` has become a `Connection`, a `NodeHandler` a `ConnectionHandler`, the `CollectionStream` was the basis for the new `connection::Pool`, and so forth.

Conceptually, a `Network` contains a `connection::Pool` which in turn internally employs the `connection::Manager` for handling the background `connection::manager::Task`s, one per connection, as before. These are all considered implementation details. On the public API, `Peer`s are managed as before through the `Network`, except now the API has changed with the shift of focus to (potentially multiple) connections per peer. The `NetworkEvent`s have accordingly also undergone changes.

The Swarm APIs remain largely unchanged, except for the fact that `inject_replaced` is no longer called. It may now practically happen that multiple `ProtocolsHandler`s are associated with a single `NetworkBehaviour`, one per connection. If implementations of `NetworkBehaviour` rely somehow on communicating with exactly one `ProtocolsHandler`, this may cause issues, but it is unlikely.

[1]: paritytech/substrate#4272
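For `NetworkBehaviour` implementations that previously assumed exactly one connection (and thus one `ProtocolsHandler`) per peer, the practical consequence is that per-peer state has to tolerate several connections. A minimal, hypothetical bookkeeping sketch, not part of the libp2p API, could look like this:

```rust
use std::collections::HashMap;

use libp2p::PeerId;

/// Hypothetical per-peer connection bookkeeping for a behaviour that must not
/// treat a peer as disconnected while it still has other live connections.
#[derive(Default)]
struct PeerConnections {
    /// Number of currently established connections per peer.
    connections: HashMap<PeerId, usize>,
}

impl PeerConnections {
    /// Call when a connection to `peer` has been established.
    fn on_connection_established(&mut self, peer: PeerId) {
        *self.connections.entry(peer).or_insert(0) += 1;
    }

    /// Call when a connection to `peer` has closed; returns `true` if this was
    /// the last connection, i.e. the peer is now fully disconnected.
    fn on_connection_closed(&mut self, peer: &PeerId) -> bool {
        let remaining = self.connections.get(peer).copied().unwrap_or(1) - 1;
        if remaining == 0 {
            self.connections.remove(peer);
            true
        } else {
            self.connections.insert(peer.clone(), remaining);
            false
        }
    }
}
```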
How exactly would allowing multiple connections solve this issue? This is a fairly straightforward scenario, where none of the peers misbehaves or loses connectivity. Why would we want multiple connections here?
It seems to me that it should not attempt dialing an address that's already connected in the first place.
As I explained in an earlier comment, the immediate cause of this issue is that the "listener" closes its existing connection, preferring the new over the old connection in the attempt to enforce a single connection per peer (the "dialer" then later, upon discovering the peer ID mismatch, closes the new connection as well, and the connect/disconnect dance begins in this way). While my first reaction was that it doesn't seem right that new connections are preferred over old ones in this scheme, and I'd rather swap that around, @tomaka had some concerns about doing that. In any case, removing the single-connection-per-peer policy is a strictly more general and desirable solution, not just in light of this issue. In this particular scenario the "listener" then no longer has to make a choice between these connections: the old connection remains unaffected, and the "dialer" eventually closes its new connection attempt upon discovering the peer ID mismatch.
In general, and at the level of libp2p, I don't think it is desirable to disallow multiple connections to the same address. Of course - and I think that is what you are referring to - in the context of a specific protocol, like Kademlia, one could argue that connections should be uniquely identified by such an address. However, the (logical) overlay network of Kademlia only operates opaquely on a uniformly distributed keyspace, which also contains the node/peer IDs. Kademlia only uniquely identifies peers by these IDs; addresses to connect to are a secondary implementation artifact. In the scenario here, Kademlia sees a different peer ID for the same address, i.e. another peer that supposedly also has that address (among others, possibly). While you could argue that it should disregard the peer ID in this case, seeing that it already has a connection to the same address, even though with a different peer ID, I'm really not sure this is a good idea. If multiple peers are seen advertising the same address, who is to decide which is "right", i.e. which one to connect to, and which connection to keep and which others to ignore?
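To make that distinction concrete, here is a much simplified, hypothetical model of the routing information: it is keyed by peer ID, with addresses as secondary attributes, so nothing in this model prevents two peer IDs from (legitimately or maliciously) advertising the same multiaddress. This is not Kademlia's actual data structure:

```rust
use std::collections::{HashMap, HashSet};

use libp2p::{Multiaddr, PeerId};

/// Hypothetical view of discovered routing information, keyed by peer identity.
#[derive(Default)]
struct RoutingView {
    /// Known addresses per peer.
    addresses: HashMap<PeerId, HashSet<Multiaddr>>,
}

impl RoutingView {
    /// Record an address advertised for `peer`.
    fn add_address(&mut self, peer: PeerId, addr: Multiaddr) {
        self.addresses.entry(peer).or_default().insert(addr);
    }

    /// All peers currently advertising a given address; this may yield more
    /// than one peer, and nothing here decides which of them is "right".
    fn peers_for_address<'a>(
        &'a self,
        addr: &'a Multiaddr,
    ) -> impl Iterator<Item = &'a PeerId> + 'a {
        self.addresses
            .iter()
            .filter(move |(_, addrs)| addrs.contains(addr))
            .map(|(peer, _)| peer)
    }
}
```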
I'd argue that a new connection should not replace an existing one. Dead connections will eventually drop because we have keep-alive or ping protocols. Waiting for 30 seconds to restore connectivity is fine for substrate.
You don't drop existing connections. Otherwise it's an attack vector. I'd like to clarify that the "multiple connections" being discussed are to support connections to the same address with different node IDs. Multiple connections to the same address/node_id still won't be allowed, right? Regarding multiple connections to the same node ID, we probably don't want that in substrate/polkadot. The proposed use case, "let's allow the second connection because the first one might actually be dead", sounds like a hack. What if the first one never closes after all? It looks like we are struggling with managing connections even now, when duplicates are not allowed. This looks like it will introduce a lot of unneeded complexity for no good reason. Additionally, could an additional authentication mechanism be introduced at the Kademlia layer? Devp2p discovery would not propagate unconfirmed addresses. "Confirmed" here means that there was a signed UDP ping/pong exchange with that address first.
I agree, hence my first reaction was to change that, as I mentioned at the end of my first comment. The same thing (i.e. not dropping the existing connection) also happens with
Sure, I hinted at the same thing at the end of my first comment. I think we are on the same page here.
I see no reason for a general-purpose networking library like
It's fine if substrate/polkadot do not intentionally make use of multiple connections per peer. Indeed, in libp2p/rust-libp2p#1440 even
That would need to be laid out in more detail in order for me to make an informed comment. In general, I have expressed a desire in the past to allow better curation of Kademlia's k-buckets through the public API offered by
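As for what such curation combined with the "confirmed address" idea above could look like, here is a purely hypothetical sketch (the `AddressBook` type is not an existing rust-libp2p API): an address learned from discovery is only promoted, and only then used for routing, after an authenticated dial to it has succeeded with a matching peer ID.

```rust
use std::collections::{HashMap, HashSet};

use libp2p::{Multiaddr, PeerId};

/// Hypothetical two-stage address book: addresses learned from discovery are
/// kept aside until an authenticated connection confirms them.
#[derive(Default)]
struct AddressBook {
    unconfirmed: HashMap<PeerId, HashSet<Multiaddr>>,
    confirmed: HashMap<PeerId, HashSet<Multiaddr>>,
}

impl AddressBook {
    /// Record an address learned from discovery; it is not used for routing yet.
    fn learn(&mut self, peer: PeerId, addr: Multiaddr) {
        self.unconfirmed.entry(peer).or_default().insert(addr);
    }

    /// Promote an address once a dial to `addr` completed the authenticated
    /// handshake and the reported peer ID matched `peer`.
    fn confirm(&mut self, peer: PeerId, addr: Multiaddr) {
        if let Some(set) = self.unconfirmed.get_mut(&peer) {
            set.remove(&addr);
        }
        self.confirmed.entry(peer).or_default().insert(addr);
    }

    /// Only confirmed addresses would be used to populate k-buckets or be
    /// advertised to other peers.
    fn routable<'a>(&'a self, peer: &PeerId) -> impl Iterator<Item = &'a Multiaddr> + 'a {
        self.confirmed.get(peer).into_iter().flatten()
    }
}
```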
We are seeing this behavior in our private network as well, due to IP reuse of the nodes; here's a way of reproducing it using Docker:
* Allow multiple connections per peer in libp2p-core. Instead of trying to enforce a single connection per peer, which involves quite a bit of additional complexity e.g. to prioritise simultaneously opened connections and can have other undesirable consequences [1], we now make multiple connections per peer a feature. The gist of these changes is as follows: The concept of a "node" with an implicit 1-1 correspondence to a connection has been replaced with the "first-class" concept of a "connection". The code from `src/nodes` has moved (with varying degrees of modification) to `src/connection`. A `HandledNode` has become a `Connection`, a `NodeHandler` a `ConnectionHandler`, the `CollectionStream` was the basis for the new `connection::Pool`, and so forth. Conceptually, a `Network` contains a `connection::Pool` which in turn internally employs the `connection::Manager` for handling the background `connection::manager::Task`s, one per connection, as before. These are all considered implementation details. On the public API, `Peer`s are managed as before through the `Network`, except now the API has changed with the shift of focus to (potentially multiple) connections per peer. The `NetworkEvent`s have accordingly also undergone changes. The Swarm APIs remain largely unchanged, except for the fact that `inject_replaced` is no longer called. It may now practically happen that multiple `ProtocolsHandler`s are associated with a single `NetworkBehaviour`, one per connection. If implementations of `NetworkBehaviour` rely somehow on communicating with exactly one `ProtocolsHandler`, this may cause issues, but it is unlikely. [1]: paritytech/substrate#4272
* Fix intra-rustdoc links.
* Update core/src/connection/pool.rs Co-Authored-By: Max Inden <mail@max-inden.de>
* Address some review feedback and fix doc links.
* Allow responses to be sent on the same connection.
* Remove unnecessary remainders of inject_replaced.
* Update swarm/src/behaviour.rs Co-Authored-By: Pierre Krieger <pierre.krieger1708@gmail.com>
* Update swarm/src/lib.rs Co-Authored-By: Pierre Krieger <pierre.krieger1708@gmail.com>
* Update core/src/connection/manager.rs Co-Authored-By: Pierre Krieger <pierre.krieger1708@gmail.com>
* Update core/src/connection/manager.rs Co-Authored-By: Pierre Krieger <pierre.krieger1708@gmail.com>
* Update core/src/connection/pool.rs Co-Authored-By: Pierre Krieger <pierre.krieger1708@gmail.com>
* Incorporate more review feedback.
* Move module declaration below imports.
* Update core/src/connection/manager.rs Co-Authored-By: Toralf Wittner <tw@dtex.org>
* Update core/src/connection/manager.rs Co-Authored-By: Toralf Wittner <tw@dtex.org>
* Simplify as per review.
* Fix rustdoc link.
* Add try_notify_handler and simplify.
* Relocate DialingConnection and DialingAttempt. For better visibility constraints.
* Small cleanup.
* Small cleanup. More robust EstablishedConnectionIter.
* Clarify semantics of `DialingPeer::connect`.
* Don't call inject_disconnected on InvalidPeerId. To preserve the previous behavior and ensure calls to `inject_disconnected` are always paired with calls to `inject_connected`.
* Provide public ConnectionId constructor. Mainly needed for testing purposes, e.g. in substrate.
* Move the established connection limit check to the right place.
* Clean up connection error handling. Separate connection errors into those occurring during connection setup or upon rejecting a newly established connection (the `PendingConnectionError`) and those errors occurring on previously established connections, i.e. for which a `ConnectionEstablished` event has been emitted by the connection pool earlier.
* Revert change in log level and clarify an invariant.
* Remove inject_replaced entirely.
* Allow notifying all connection handlers. Thereby simplify by introducing a new enum `NotifyHandler`, used with a single constructor `NetworkBehaviourAction::NotifyHandler`.
* Finishing touches. Small API simplifications and code deduplication. Some more useful debug logging.

Co-authored-by: Max Inden <mail@max-inden.de>
Co-authored-by: Pierre Krieger <pierre.krieger1708@gmail.com>
Co-authored-by: Toralf Wittner <tw@dtex.org>
Should have been fixed by #5278, although I didn't verify that it actually is.
It is now much worse.
The peer is immediately reported as disconnected after connecting. After that, the TCP connection stays open, and block requests from that peer come through, so they still consider the connection to be active. Also, there are multiple disconnect notifications. Update: apparently this is resolved by #5595.
Does this affect nodes behind NAT? E.g. if we have two nodes running behind a NAT router, they share the same IP address, but of course with different port numbers and peer IDs. Is there any reference on how the DHT stores the peer ID?
Closing as stale and probably resolved.
Two nodes running on localhost keep disconnecting from each other. The disconnect is apparently initiated by libp2p and not by sync or a reputation change. The logs have no useful information on why the disconnect happened.
Node 1 started as:
Node 2 started as:
The nodes stay connected for about 10 seconds before a disconnect and reconnect happens.