Add a "transient" network connectivity state #2696

Stebalien · 2024-01-29T22:50:14Z

Previously, we'd consider "transiently" connected peers to be connected. This meant:

1. We wouldn't fire a second event when transitioning to "really connected". There wasn't a way for users to wait for a "full" connection without using the old-style Notify interface.
2. "Connectedness" checks would be a little too eager to treat a peer as connected.

For 99% of users, "transient" peers should be treated as disconnected. So while it's technically a breaking change to split out "transient" connectivity into a separate state, I expect it's more likely to fix bugs than anything.

Unfortunately, this change did require several changes to go-libp2p itself because go-libp2p does care about transient connections:

1. We want to keep peerstore information for transient peers.
2. We may sometimes want to treat peers as "connected" in the host.
3. Identify still needs to run over transient connections.

fixes #2692

core/network/network.go

p2p/host/basic/basic_host.go

Stebalien · 2024-01-29T22:55:51Z

p2p/host/routed/routed.go

 	if !forceDirect {
-		if rh.Network().Connectedness(pi.ID) == network.Connected {
+		connectedness := rh.Network().Connectedness(pi.ID)


Ideally we wouldn't try to perform routing if we're just going to wait on an ongoing connection. But that's not a new issue.

sukunrt

I'm fine with these changes. We just need to:

1. Fix this for the close path.
2. Keep the existing Connect behavior. I'll raise a PR to make Connect consistent with NewStream and return network.ErrTransientConn when appropriate.
3. Deprecate the items in network.Connectedness that we don't care about.

core/network/network.go

sukunrt · 2024-02-05T10:21:33Z

p2p/net/swarm/swarm.go

-	// Notify goroutines waiting for a direct connection
-	if !c.Stat().Transient {
+	newState := network.Transient


We should calculate newState with connectednessUnlocked just after we add the connection to the conns map. We already hold the lock, and it's a small set to check anyway.
There's a problem here: if someone makes a relayed connection after we already have a direct connection, this will send a limited connectivity event.

I am not sure I addressed it correctly. Can you please double-check, @sukunrt?

We need:

	oldState := s.connectednessUnlocked(p)
	s.conns.m[p] = append(s.conns.m[p], c)
	newState := s.connectednessUnlocked(p)

This prevents us from sending a PeerConnectedness Limited event when we make a relayed connection after a direct connection.
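
The before/after comparison above can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the swarm's code: `Connectedness`, `conn`, `tracker`, and `addConn` are hypothetical stand-ins, and peers are plain strings. Only the idea of computing the state on both sides of the append comes from the comment above.

```go
package main

// Connectedness is a simplified stand-in for network.Connectedness.
type Connectedness int

const (
	NotConnected Connectedness = iota
	Limited
	Connected
)

// conn is a hypothetical connection; limited marks a relayed connection.
type conn struct{ limited bool }

// tracker is a hypothetical stand-in for the swarm's per-peer connection map.
type tracker struct {
	conns  map[string][]conn
	events []Connectedness // states that would be emitted, recorded for inspection
}

// connectedness derives a peer's state from its current connections:
// any unlimited connection means Connected, otherwise any limited
// connection means Limited, otherwise NotConnected.
func (t *tracker) connectedness(p string) Connectedness {
	state := NotConnected
	for _, c := range t.conns[p] {
		if !c.limited {
			return Connected
		}
		state = Limited
	}
	return state
}

// addConn computes the state before and after the append and records an
// event only when the state actually changed.
func (t *tracker) addConn(p string, c conn) {
	oldState := t.connectedness(p)
	t.conns[p] = append(t.conns[p], c)
	newState := t.connectedness(p)
	if oldState != newState {
		t.events = append(t.events, newState)
	}
}
```

With this shape, adding a relayed connection after a direct one records nothing, because the state is Connected both before and after the append. Adding a direct connection after a relayed one records Limited followed by Connected.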

p2p/net/swarm/swarm.go

lidel · 2024-04-16T13:23:27Z

@sukunrt is there any remaining work here, or awaiting your final review?

MarcoPolo · 2024-04-16T17:09:07Z

I'll review this as well

MarcoPolo · 2024-04-24T23:28:17Z

p2p/net/swarm/swarm.go

+// changed notifications consider closed connections. When remote closes a connection
+// the underlying transport connection is closed first, so tracking changes to the connectedness
+// state requires considering this recently closed connections impact on Connectedness.
+func (s *Swarm) connectednessUnlocked(p peer.ID, considerClosed bool) network.Connectedness {


This comment doesn't make sense to me. I don't really understand the point of the body either.

It seems like we have two cases:

1. We have multiple conns to a peer and we close one. If any other conn exists and is open, then we are connected.

2. We only have one conn to a peer and close it. The peer is no longer connected.

Why are we potentially returning Connected in the case that every conn in the map returns c.IsClosed()=true with considerClosed=true?

We may also have limited connections to the peer. So once a connection is closed, we check:

If there is at least 1 unlimited connection, return Connected

Else, if there is at least 1 limited connection, return Limited

Else, return NotConnected

> Why are we potentially returning Connected in the case that every conn in the map returns c.IsClosed()=true with considerClosed=true

It is true. However, considerClosed is true only when called from addConn and removeConn, which hold the lock. removeConn makes sure to remove the connection from s.conns.m, so the function will not return Connected if we removed the last unlimited connection.

Though, I don't really understand why considerClosed is needed at all. @sukunrt could you provide some context?

I agree this is confusing.

There are two different states this function is called from:

When closing the connection from our side by calling Close, this function is called when the underlying transport connection is not closed.

When closing the connection because the remote Closed the connection, this function is called when the underlying transport connection has already been closed.

Consider that we are connected to the peer over a QUIC and a relayed connection.

Now the peer closes the QUIC connection. So we have moved from a Connected state to Limited connectivity state.

To make this work, when calculating oldState in removeConn we need to consider connectivity assuming QUIC connection was still connected.

Alternatively we can store the last sent connectedness event for a peer. That would make this calculation cleaner at the expense of more state for an edge case.

> Alternatively we can store the last sent connectedness event for a peer. That would make this calculation cleaner at the expense of more state for an edge case.

I think we should do this.
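
A minimal sketch of the "store the last sent connectedness event" alternative, assuming string peer IDs and a simplified state type. `stateTracker`, `newStateTracker`, and `update` are hypothetical names chosen for illustration; the code that was eventually merged may be organized differently.

```go
package main

import "sync"

// Connectedness is a simplified stand-in for network.Connectedness.
type Connectedness int

const (
	NotConnected Connectedness = iota
	Limited
	Connected
)

// stateTracker remembers the last connectedness state emitted per peer.
type stateTracker struct {
	mu   sync.Mutex
	last map[string]Connectedness
}

func newStateTracker() *stateTracker {
	return &stateTracker{last: make(map[string]Connectedness)}
}

// update compares the freshly computed state with the last emitted one and
// reports whether an event should be emitted. Because the previous state is
// stored, the caller never has to reconstruct it from connections that may
// already be closed.
func (s *stateTracker) update(p string, current Connectedness) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.last[p] == current { // a missing entry reads as NotConnected
		return false
	}
	if current == NotConnected {
		delete(s.last, p) // drop the entry so the map tracks only live peers
	} else {
		s.last[p] = current
	}
	return true
}
```

In the QUIC plus relay scenario described above, the remote closing the QUIC connection leads to `update(p, Limited)`. The stored state is still Connected, so the call reports a change even though the transport connection is already closed.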

sukunrt · 2024-05-02T13:30:11Z

We have a deadlock in our event sending mechanism when the consumer closes the connection in the subscription goroutine.

Closing the connection causes the subscription goroutine to emit an event, and if the subscription queue is full, this will deadlock. See the test in 57c688e.

There are two options here:

1. We disallow Close in subscription.
2. We enqueue the events in a slice and send them from a separate goroutine. This slice will only take 2*NumConns space, since for a given connection we add a connectedness event at most twice. This is also the current memory usage, since the goroutines block on sending the connectedness event.

We also need option 2 to ensure that we send Limited vs NotConnected events in the right order.

If we are connected to a peer via both a relay and a direct connection, and both connections close concurrently, there is currently a race over which event is sent last. Option 2 ensures that NotConnected is always sent last.

I have implemented option 2 in 0dcc73b.

I'll refactor a bit if we do decide to go ahead with option 2.
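
Option 2 can be sketched roughly as follows. This is a simplified illustration, not the code in 0dcc73b: `event`, `emitter`, `enqueue`, `run`, and `stop` are hypothetical names, and the subscriber is modeled as a plain callback. Producers append to a slice and return immediately, while a single goroutine delivers events in order without holding the lock.

```go
package main

import "sync"

// event is a hypothetical stand-in for a connectedness-changed event.
type event struct {
	peer  string
	state string
}

// emitter queues events in a slice so that producers never block on a slow
// or re-entrant subscriber; one goroutine delivers them in FIFO order.
type emitter struct {
	mu     sync.Mutex
	cond   *sync.Cond
	queue  []event
	closed bool
	done   chan struct{}
}

func newEmitter(deliver func(event)) *emitter {
	e := &emitter{done: make(chan struct{})}
	e.cond = sync.NewCond(&e.mu)
	go e.run(deliver)
	return e
}

// enqueue appends an event and returns immediately. Events enqueued after
// stop has been called are dropped.
func (e *emitter) enqueue(ev event) {
	e.mu.Lock()
	if !e.closed {
		e.queue = append(e.queue, ev)
	}
	e.mu.Unlock()
	e.cond.Signal()
}

// run delivers queued events one at a time. The lock is released before
// calling deliver, so deliver may itself call enqueue without deadlocking.
func (e *emitter) run(deliver func(event)) {
	defer close(e.done)
	for {
		e.mu.Lock()
		for len(e.queue) == 0 && !e.closed {
			e.cond.Wait()
		}
		if len(e.queue) == 0 { // closed and fully drained
			e.mu.Unlock()
			return
		}
		ev := e.queue[0]
		e.queue = e.queue[1:]
		e.mu.Unlock()
		deliver(ev)
	}
}

// stop rejects new events and waits until already queued ones are delivered.
func (e *emitter) stop() {
	e.mu.Lock()
	e.closed = true
	e.mu.Unlock()
	e.cond.Signal()
	<-e.done
}
```

A subscriber that reacts to a Connected event by closing the connection corresponds to `deliver` calling `enqueue` with a NotConnected event. Since `run` does not hold the lock during delivery, that call returns immediately and the NotConnected event is delivered next, after the event that triggered it.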

MarcoPolo · 2024-05-03T00:23:34Z

p2p/net/swarm/swarm.go

+	// This ensures that if a event subscriber closes the connection from the subscription goroutine
+	// this doesn't deadlock
+	s.refs.Add(1)
+	go func() {


I would really like to avoid spawning a goroutine on close.

Yeah, we shouldn't do this. I haven't read the rest of the patch so I might be missing an obvious solution, but it looks like we're just trying to "raise" a flag for this peer.

I'd just avoid channels entirely and use the pattern I used in https://github.com/ipfs/boxo/blob/2559ec2c7b93ac6905fe83d7a4b38b8e9a703b0a/bitswap/network/connecteventmanager.go#L69-L81, https://github.com/ipfs/boxo/blob/2559ec2c7b93ac6905fe83d7a4b38b8e9a703b0a/bitswap/network/connecteventmanager.go#L97-L122. It's not simple, but it works and avoids stacking goroutines.

I haven’t looked closely, but it seems like changeQueue is unbounded. Is that right?

Technically, but it shouldn't be an issue in practice. We only add a peer to the queue the first time the connectivity state changes. If we end up with a lot of "dead" peers in this queue, we should burn through them very quickly (transition from disconnected -> disconnected doesn't require any events).

We can do better by removing peers that have transitioned back to the same state from the queue, but it's likely not worth the complexity.

I think a queue with this will work
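
The de-duplicated queue discussed above can be sketched as follows. This is a minimal illustration of the general pattern under stated assumptions, not the boxo or go-libp2p implementation: `changeQueue`, `peerState`, `set`, and `next` are hypothetical names, and peers are plain strings. A peer enters the queue at most once while it is pending, and peers that flapped back to their last emitted state are skipped when popped.

```go
package main

import "sync"

// Connectedness is a simplified stand-in for network.Connectedness.
type Connectedness int

const (
	NotConnected Connectedness = iota
	Limited
	Connected
)

type peerState struct {
	current Connectedness // latest observed state
	emitted Connectedness // last state reported to subscribers
	pending bool          // true while the peer sits in the queue
}

// changeQueue holds each peer at most once, so its length is bounded by the
// number of distinct peers with unprocessed changes.
type changeQueue struct {
	mu    sync.Mutex
	peers map[string]*peerState
	queue []string
}

func newChangeQueue() *changeQueue {
	return &changeQueue{peers: make(map[string]*peerState)}
}

// set records the latest state and queues the peer unless it is already queued.
func (q *changeQueue) set(p string, c Connectedness) {
	q.mu.Lock()
	defer q.mu.Unlock()
	st, ok := q.peers[p]
	if !ok {
		st = &peerState{}
		q.peers[p] = st
	}
	st.current = c
	if !st.pending {
		st.pending = true
		q.queue = append(q.queue, p)
	}
}

// next pops peers until it finds one whose state differs from the last
// emitted state. ok is false once the queue is empty.
func (q *changeQueue) next() (p string, c Connectedness, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.queue) > 0 {
		p = q.queue[0]
		q.queue = q.queue[1:]
		st := q.peers[p]
		st.pending = false
		changed := st.current != st.emitted
		st.emitted = st.current
		c = st.current
		if c == NotConnected {
			delete(q.peers, p) // forget peers that are fully disconnected
		}
		if changed {
			return p, c, true
		}
	}
	return "", NotConnected, false
}
```

A peer that connects and disconnects before the consumer runs occupies one queue slot and produces no event, since its state on pop equals the last emitted state.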

> We can do better by removing peers that have transitioned back to the same state from the queue, but it's likely not worth the complexity.

We do need the complexity, because a malicious peer can connect and disconnect very quickly, causing this queue to grow quickly if a consumer is slow. It can also do this with different peer IDs.

As a general rule, we should have no unbounded queues in the core logic. In this case I can think of a way that a reasonable application can fall victim to a DoS attack.

Example: An application opens a new stream and does an exchange with a newly connected peer. If this is done by the same goroutine that reads events, then the attacker controls the rate at which events are popped off, and can thus exploit this.

Ideally, I'd agree. But in cases like this we can argue that the queue will hit a steady-state max size and stop growing (assuming the read side isn't deadlocked forever). It's not great, but if someone can DoS us due to this queue, they can also DoS us by running us out of memory before GC can kick in.

But ideally we'd just have some form of backpressure somewhere. And I think that's what we'll need given #2696 (comment)

I am trying an approach where we have backpressure on the add side and an unbounded queue on the remove side (close connection). The remove side is effectively bounded by the total number of connections, since the add side has backpressure.

How does ~~5271dd8~~ 1abc063 look?

We apply backpressure when adding connections to the swarm. This allows us to enqueue remove conn events in a slice without spawning a goroutine. Backpressure on adding connections limits the slice length. We also ensure that we send at least 1 disconnection event for every peer we connect to.

I don't really have time for a full review but:

I'd be careful to avoid too many channels/goroutines. Go tends to push you in that direction, but it can quickly get harder to understand than simple locks & conditions.

Make sure backpressure is applied all the way to accepting connections in the first place. It sounds like this might be a gap in libp2p and may be something we want to fix in a separate patch.

(I defer to @MarcoPolo on all of this, I'm just kibitzing from the sidelines)

p2p/net/swarm/connectedness_event_emitter.go

MarcoPolo

Just one question, but I think this looks a lot cleaner than what was there before. Thanks!

p2p/net/swarm/connectedness_event_emitter.go

p2p/net/swarm/swarm.go

p2p/net/swarm/swarm_event_test.go

p2p/net/swarm/connectedness_event_emitter.go

Stebalien force-pushed the steb/connected-v-transient branch from d624653 to 11515f0 Compare January 29, 2024 22:50

Stebalien commented Jan 29, 2024

Stebalien force-pushed the steb/connected-v-transient branch from 11515f0 to 57736f1 Compare January 29, 2024 22:57

Stebalien requested review from sukunrt and aschmahmann January 29, 2024 22:57

Stebalien mentioned this pull request Jan 29, 2024

Improve peer connection handling ipfs/boxo#526

Open

sukunrt requested changes Feb 5, 2024

sukunrt mentioned this pull request Feb 21, 2024

v0.34 #2704

Closed

22 tasks

guillaumemichel reviewed Mar 22, 2024

p2p/net/swarm/swarm.go

sukunrt self-requested a review March 22, 2024 15:28

sukunrt force-pushed the steb/connected-v-transient branch from 03131e9 to 495cd64 Compare March 22, 2024 16:22

lidel added the need/maintainer-input Needs input from the current maintainer(s) label Apr 16, 2024

MarcoPolo self-requested a review April 16, 2024 17:08

MarcoPolo reviewed Apr 24, 2024

sukunrt removed the need/maintainer-input Needs input from the current maintainer(s) label Apr 30, 2024

sukunrt requested a review from MarcoPolo April 30, 2024 16:28

MarcoPolo approved these changes May 1, 2024

sukunrt requested a review from MarcoPolo May 2, 2024 13:33

sukunrt force-pushed the steb/connected-v-transient branch 3 times, most recently from 12d5826 to c68a035 Compare May 2, 2024 16:32

MarcoPolo reviewed May 3, 2024

sukunrt force-pushed the steb/connected-v-transient branch 5 times, most recently from cee9aa1 to 02a74d8 Compare May 4, 2024 20:00

sukunrt force-pushed the steb/connected-v-transient branch 3 times, most recently from 9a247a9 to a16b08d Compare May 5, 2024 08:44

Stebalien and others added 17 commits May 5, 2024 14:23

rename Transient to Limited and addressed review

14024ef

updated removeConn

e3215d9

corrected removeConn

5ee3942

restored comment

bed69dc

updated connectednessUnlocked

678d8b4

add option to consider closed connections for connectedness

9b9556a

More renaming

84bc0b1

replaced GetUseTransient with GetAllowLimitedConn

98191a4

improve connectedness check

d52e900

store connectedness state on swarm

eb65e0f

check for limited connectivity event

89a8442

close emitter after the events are sent

ac195c5

connectedness event deadlock test

d8e7c31

fix deadlock in sending connectedness events

02b210d

simplify connectedness events

4573ce6

Always send 1 event for a connection

25ccbf2

sukunrt force-pushed the steb/connected-v-transient branch from a16b08d to 25ccbf2 Compare May 5, 2024 08:54

MarcoPolo reviewed May 6, 2024

p2p/net/swarm/connectedness_event_emitter.go

MarcoPolo reviewed May 6, 2024

review comments

b542b15

MarcoPolo approved these changes May 7, 2024

sukunrt force-pushed the steb/connected-v-transient branch from e9b2300 to b542b15 Compare May 7, 2024 19:03

sukunrt approved these changes May 8, 2024

sukunrt merged commit af0161e into master May 8, 2024
10 of 11 checks passed

sebastianst mentioned this pull request May 30, 2024

go: bump libp2p deps ethereum-optimism/optimism#10687

Merged

2color mentioned this pull request Aug 14, 2024

Peer routing fails for peers behind NAT (with /p2p-circuit) ipfs/someguy#78

Open
