Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(swarm)!: Allow NetworkBehaviours to manage incoming connections #3099

Closed
wants to merge 27 commits into from

Conversation

thomaseizinger
Copy link
Contributor

@thomaseizinger thomaseizinger commented Nov 9, 2022

Description

Previously, ConnectionHandler employed the prototype pattern via the IntoConnectionHandler abstraction. This allowed the Swarm to construct an instance of IntoConnectionHandler before the connection was established. This abstraction is however unnecessary. We can model all existing usecases by delaying the call to NetworkBehavour::new_handler until the connection is established. Not only does this delete a lot of code, it also makes several APIs simpler:

  • NetworkBehaviours can track state for future connections within themselves (indexed by PeerId) and pass it into the ConnectionHandler upon construction. This removes the need for passing along a handler in NetworkBehaviourAction::Dial.
  • Removing Handler from NetworkBehaviourAction::Dial also avoids the need to pass the handler back into the behaviour in inject_dial_failure.
  • By making NetworkBehaviour::new_handler fallible, NetworkBehaviours can implement almost arbitrary connection management policies by denying the construction of a ConnectionHandler for a newly established connection.

Resolves #2824.

Links to any relevant issues

Open Questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@thomaseizinger

This comment was marked as outdated.

@thomaseizinger

This comment was marked as outdated.

@rkuhn
Copy link
Contributor

rkuhn commented Nov 13, 2022

The answer also depends on what inject_dial_failure shall mean, and to whom. Currently it is called upon any failure for all composed behaviours, but that won’t work anymore if every behaviour then expects its own payload — the dial was caused by one behaviour, with only one specific payload. So the composed behaviour needs to dispatch the failure to only exactly that sub-behaviour which requested the dial. While this makes some amount of sense, it is also weird, because dial failures are not specific to a particular sub-behaviour, their scope is the target peer.

@rkuhn
Copy link
Contributor

rkuhn commented Nov 13, 2022

BTW: Swarm::dial could also be kept in this case, and it could carry its own payload that is returned to the swarm poller upon dial failure — and none of the composed behaviours are informed in this case.

@thomaseizinger
Copy link
Contributor Author

The answer also depends on what inject_dial_failure shall mean, and to whom. Currently it is called upon any failure for all composed behaviours, but that won’t work anymore if every behaviour then expects its own payload — the dial was caused by one behaviour, with only one specific payload. So the composed behaviour needs to dispatch the failure to only exactly that sub-behaviour which requested the dial. While this makes some amount of sense, it is also weird, because dial failures are not specific to a particular sub-behaviour, their scope is the target peer.

That is right! This is more tricky than I thought. If I wanted to inject a DialPayload on every on_new_outbound call, I'd have to make that one up for all other behaviours that are composed because I need a handler for every behaviour.

@thomaseizinger

This comment was marked as outdated.

@thomaseizinger

This comment was marked as outdated.

@rkuhn
Copy link
Contributor

rkuhn commented Nov 14, 2022

My primary worry right now is not the structure of IntoConnectionHandler, so I don’t have good inputs there, but the proposed DialSource enum sounds very useful to me!

@thomaseizinger
Copy link
Contributor Author

I just had another idea which I think is much better:

The reason for passing a handler along with Dial was to enable users to directly inject a message into the new connection.

Previously, that required really complex tracking of state in the behaviour because of the IntoConnectionHandler abstraction and the delayed access to PeerId.

If we remove this abstraction, it is really easy for NetworkBehaviours to track initial state for a new connection: Just make a HashMap indexed by PeerId and pick out the data whenever new_outbound_handler gets called!

That is functionally equivalent to what we have today and makes several APIs a lot simpler.

@thomaseizinger

This comment was marked as outdated.

@thomaseizinger thomaseizinger changed the title WIP: Introduce NetworkBehaviour::DialPayload refactor!(swarm): Remove IntoConnectionHandler abstraction Nov 14, 2022
@thomaseizinger thomaseizinger changed the title refactor!(swarm): Remove IntoConnectionHandler abstraction refactor(swarm)!: Remove IntoConnectionHandler abstraction Nov 15, 2022
swarm/src/lib.rs Outdated
Comment on lines 1364 to 1373
let supported_protocols = todo!();

// let supported_protocols = self
// .behaviour
// .new_handler()
// .inbound_protocol()
// .protocol_info()
// .into_iter()
// .map(|info| info.protocol_name().to_vec())
// .collect();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to be difficult.

@mxinden I remember you saying that PollParameters was always something that didn't appeal to you. Did you ever think about alternative design? This might be a good time to bring them up :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now initialising these to an empty vector and they are updated every time a new connection is made.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxinden I remember you saying that PollParameters was always something that didn't appeal to you.

Yes, very much. I would like to not have PollParameters.

I am now initialising these to an empty vector and they are updated every time a new connection is made.

In my eyes that is the better behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxinden I remember you saying that PollParameters was always something that didn't appeal to you.

Yes, very much. I would like to not have PollParameters.

Tracking this here: #3124

///
/// # Example carrying state in the handler
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we replace this example? Now that new_handler gives you a PeerId, I think it is quite easy to realize that you can store data for a future connection in the behaviour and just pass it into the handler here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • NetworkBehaviours can track state for future connections within themselves (indexed by PeerId) and pass it into the ConnectionHandler upon construction. This removes the need for passing along a handler in NetworkBehaviourAction::Dial.

Say there are two NetworkBehaviour implementations, each requesting a new connection to a new peer. In such case, when NetworkBehaviour::inject_connection_established is called, neither of them knows whether this new connection corresponds to their dial request.

We could still allow the user to provide user data via DialOpts. For the case of Swarm::dial and NetworkBehaviour::inject_dial_failure we could wrap this in an Option.

I am not sure whether we need to design for the above race condition. Your proposal might be just fine. What do folks think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ipfs-embed I validate advertised peer addresses by dialling them — and closing such connections upon success. This is only reliably possible if I can detect whether the connection resulted from my own dialling attempt. OTOH, another cause may have resulted in dialling that same peer with that same address for other reasons, which would then possibly have said “nah, attempt is already underway”. But all of this would happen in addition to existing connections, so it might not be a huge problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to actively close them? Assuming reasonable keep_alive implementations of your ConnectionHandlers, the connection should get closed automatically if it is not in use.

OTOH, another cause may have resulted in dialling that same peer with that same address for other reasons, which would then possibly have said “nah, attempt is already underway”.

Can this be expressed with PeerCondition::NotDialing?

Note that new_handler also gives you access to ConnectedPoint. So in addition to storing data in your behaviour based on PeerId, you could also index it by the dialed address. new_handler being called for an address you wanted to validate IS the validation that a connection was made to this address. If you have a dedicated AddressValidationBehaviour, that behaviour could then straight up deny that connection which instantly closes it again.

@thomaseizinger thomaseizinger marked this pull request as ready for review November 15, 2022 08:28
@thomaseizinger

This comment was marked as outdated.

@thomaseizinger
Copy link
Contributor Author

thomaseizinger commented Dec 6, 2022

TODO:

  • Always box denied reason to avoid generating an enum
  • Figure out how to manage pending connections

@thomaseizinger thomaseizinger force-pushed the 2824-remove-into-connection-handler branch from eed846b to 7369cc1 Compare December 7, 2022 01:53
@thomaseizinger thomaseizinger force-pushed the 2824-remove-into-connection-handler branch from 7369cc1 to dcb4f96 Compare December 7, 2022 01:54
@thomaseizinger
Copy link
Contributor Author

Here is a proposal regarding pending connection management:

  • Remove addresses_of_peer function
  • Add:
    • fn new_outgoing_connection(Option<PeerId>, &[Multiaddr]) -> Result<Vec<Multiaddr>, Box<dyn std::error::Error + Send + 'static>>
    • fn new_incoming_connection(local_addr: Multiaddr, send_back_addr: Multiaddr) -> Result<(), Box<dyn std::error::Error + Send + 'static>>

The new_outgoing_connection replaces the addresses_of_peer function. It is given the PeerId - if known - and the addresses we already have that we will try to dial. It can decline the dial by returning an error or provide even more addresses to use.

The new_incoming_connection is given the local and remote's address. It doesn't return anything usually as there is no information needed to upgrade a connection.

I am reasonably happy with this design. We are only extending the NetworkBehaviour by one function, all three new_ functions follow a similar naming pattern and function signature.

Thoughts @mxinden?

@mergify
Copy link

mergify bot commented Dec 9, 2022

This pull request has merge conflicts. Could you please resolve them @thomaseizinger? 🙏

Copy link
Contributor

@elenaf9 elenaf9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much in favor of this! Did not do a full review yet.

Comment on lines +612 to +616
pub struct ConnectionClosed<'a, Handler> {
pub peer_id: PeerId,
pub connection_id: ConnectionId,
pub endpoint: &'a ConnectedPoint,
pub handler: <Handler as IntoConnectionHandler>::Handler,
pub handler: Handler,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR, but I do wonder if we really need the handler field on ConnectionClosed.
The handler property was added in #2191 together with adding it to Dial, DialFailure and ListenFailure. If I understand it correctly, the main motivation behind it was:

it allows one to attach state to a dial request, thus not having to keep track of it within a NetworkBehaviour implementation

Per #3099 (comment) attaching a state to a dial is not needed anymore. What's the use-case of still returning the handler in ConnectionClosed? As far as I can tell, none of our behaviours ever use it.

Motivation behind the question is that it would be great if we could get rid of the duplication between SwarmEvent and FromSwarm.
Until now the main difference was that FromSwarm also contains the handler, which should not be exposed in the SwarmEvent1: #2423 (comment). If we could remove it here, we might be able to do something like:

enum SwarmEvent<TBehaviourOutEvent> {
    Behaviour(TBehaviourOutEvent),
    FromSwarm(FromSwarmEvent)
}

Footnotes

  1. There are also some other differences; FromSwarm contains ConnectionIds, and the NewExternalAddr and ExpiredExternalAddr variants. But I don't see see a drawback in exposing them to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: #3046

@thomaseizinger
Copy link
Contributor Author

Any ideas on how to make this less breaking would be gladly appreciated! :)

@thomaseizinger
Copy link
Contributor Author

Here is a proposal regarding pending connection management:

  • Remove addresses_of_peer function
  • Add:
    • fn new_outgoing_connection(Option<PeerId>, &[Multiaddr]) -> Result<Vec<Multiaddr>, Box<dyn std::error::Error + Send + 'static>>
    • fn new_incoming_connection(local_addr: Multiaddr, send_back_addr: Multiaddr) -> Result<(), Box<dyn std::error::Error + Send + 'static>>

The new_outgoing_connection replaces the addresses_of_peer function. It is given the PeerId - if known - and the addresses we already have that we will try to dial. It can decline the dial by returning an error or provide even more addresses to use.

The new_incoming_connection is given the local and remote's address. It doesn't return anything usually as there is no information needed to upgrade a connection.

I am reasonably happy with this design. We are only extending the NetworkBehaviour by one function, all three new_ functions follow a similar naming pattern and function signature.

Thoughts @mxinden?

One slight variation on this: new_handler could now be called new_established_connection to be more consistent with the two others. We could also split it into new_established_inbound_connection and new_established_outbound_connection. It is a bit odd at the moment that for pending connections, the function is split by inbound and outbound but for the established one it is not.

@thomaseizinger
Copy link
Contributor Author

I think we can limit the breaking changes of this PR to removing the handler from Dial. Everything else should be possible to do in a non-breaking way.

@@ -96,6 +97,10 @@

- Update `rust-version` to reflect the actual MSRV: 1.62.0. See [PR 3090].

- Remove `IntoConnectionHandler` abstraction and change the signature of `NetworkBehaviour::new_handler` to accept `PeerId` and `ConnectedPoint`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of a small before and after code snippet to help folks upgrade?

fn new_handler(&mut self) -> Self::ConnectionHandler;
fn new_handler(
&mut self,
peer: &PeerId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
peer: &PeerId,
peer: PeerId,

Why provide a reference to a Copy type?

@mxinden
Copy link
Member

mxinden commented Dec 12, 2022

One slight variation on this: new_handler could now be called new_established_connection to be more consistent with the two others.

Sounds good to me.

We could also split it into new_established_inbound_connection and new_established_outbound_connection. It is a bit odd at the moment that for pending connections, the function is split by inbound and outbound but for the established one it is not.

Do I understand correctly that both methods would have the same signature and that we don't foresee the two signatures to diverge any time soon? If so, I suggest not splitting it. While nice for the sake of consistency, I don't think it justifies the additional complexity.

@@ -96,6 +99,8 @@ pub struct Client {
/// connection.
directly_connected_peers: HashMap<PeerId, Vec<ConnectionId>>,

initial_events: HashMap<PeerId, handler::In>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
initial_events: HashMap<PeerId, handler::In>,
/// [`handler::In`] event to be provided to a handler in `NetworkBehaviour::new_handler` once the corresponding connection is established.
initial_events: HashMap<PeerId, handler::In>,

What do you think?

Comment on lines +182 to +185
if let Some(event) = self.initial_events.remove(peer) {
#[allow(deprecated)]
handler.inject_event(event)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Injecting events through two ways, i.e. (1) at creation and (2) regularly during the lifetime of the handler via Swarm, seems inconsistent to me. That said, I can not think of a better way. I suggest keeping as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could hold the event back here and instead listen for the ConnectionEstablished event and then emit an event later?

Comment on lines +165 to +169
log::debug!(
"Established relayed instead of direct connection to {:?}, \
dropping initial in event {:?}.",
peer,
event
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know that there is no second direct connection in progress of establishment to the given peer? How do we know that this event was predetermined for this connection and not another connection still being established?

This is rather complex. I rewrote this comment 4 times. I might be missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can index the HashMap by the dialed address instead of the peer ID. That should fix things I guess?

Comment on lines +308 to 314
self.initial_events
.insert(relay_peer_id, handler::In::Reserve { to_listener });

NetworkBehaviourAction::Dial {
opts: DialOpts::peer_id(relay_peer_id)
.addresses(vec![relay_addr])
.extend_addresses_through_behaviour()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that some NetworkBehaviour might add a relayed address here.

The relay protocol should never be spoken on top of a relayed connection. We should never do nested relaying as otherwise one could build circular relayed connections that could be misused for a DOS attack. Thus returning a dummy::ConnectionHandler on relayed connections is critical.

That said, we might be establishing both a relayed and a direct connection to a given peer in parallel. How do we know which one the initial_events event was destined for?

Not sure how to prevent this. The solution you do above, namely to drop the initial_events event in case it turns out to be a relayed connection covers the case where there is only a single connection attempt in progress. What about the case where both (relayed and direct) are in progress?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only 1-to-1 ported the code that already existed, as far as I know, no functionality should currently change in this PR.

@thomaseizinger
Copy link
Contributor Author

We could also split it into new_established_inbound_connection and new_established_outbound_connection. It is a bit odd at the moment that for pending connections, the function is split by inbound and outbound but for the established one it is not.

Do I understand correctly that both methods would have the same signature and that we don't foresee the two signatures to diverge any time soon? If so, I suggest not splitting it. While nice for the sake of consistency, I don't think it justifies the additional complexity.

If split, I would inline the variants of the ConnectedPoint enum:

pub enum ConnectedPoint {

So no, the signatures would not be the same.

@divagant-martian
Copy link
Contributor

Sorry for the delay. As far as lighthouse is concerned, this would work really well in particular having the ConnectedPoint, since most of the time we decide based on percentage of inbound and outbound connections. Regarding the possibly generic error, I see why it would be a nice to have, but at least in our case that can be directly handled when creating the handler and wouldn't really need the error reported by the swarm.

Also of course, thanks for all these efforts!

@thomaseizinger
Copy link
Contributor Author

@mxinden: Instead of a new DialId or DialToken, we could also create the ConnectionId as part of the dial attempt. The Pool would later allocate one anyway so we could actually directly use that instead. This gives us one identifier for the entire connection from the moment a dial is issued up until the connection gets closed.

@thomaseizinger
Copy link
Contributor Author

@mxinden: Instead of a new DialId or DialToken, we could also create the ConnectionId as part of the dial attempt. The Pool would later allocate one anyway so we could actually directly use that instead. This gives us one identifier for the entire connection from the moment a dial is issued up until the connection gets closed.

I think with that idea, we might be able to ship the removal of Dial::Handler as a separate patch. That would be nice.

@thomaseizinger
Copy link
Contributor Author

@mxinden: Instead of a new DialId or DialToken, we could also create the ConnectionId as part of the dial attempt. The Pool would later allocate one anyway so we could actually directly use that instead. This gives us one identifier for the entire connection from the moment a dial is issued up until the connection gets closed.

I think with that idea, we might be able to ship the removal of Dial::Handler as a separate patch. That would be nice.

Turns out we can't but overall the version in #3254 is a lot less breaking which is nice :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

swarm/: Support generic connection management
5 participants