Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bitswap/network: refactor connectEventManager more simply in in bsnet #436

Closed
wants to merge 1 commit into from

Conversation

Jorropo
Copy link
Contributor

@Jorropo Jorropo commented Aug 17, 2023

I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned.

This new code also have similar dedup logic and state transitions.

Fixes #432

@codecov
Copy link

codecov bot commented Aug 17, 2023

Codecov Report

Merging #436 (8ba57bd) into main (dd32d67) will increase coverage by 0.13%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #436      +/-   ##
==========================================
+ Coverage   49.80%   49.94%   +0.13%     
==========================================
  Files         249      248       -1     
  Lines       29972    29888      -84     
==========================================
- Hits        14928    14927       -1     
+ Misses      13615    13532      -83     
  Partials     1429     1429              
Files Changed Coverage Δ
bitswap/network/ipfs_impl.go 0.00% <0.00%> (ø)

... and 8 files with indirect coverage changes

@Jorropo Jorropo force-pushed the fix/432 branch 2 times, most recently from 30693e7 to aca0769 Compare August 17, 2023 02:16
@Jorropo
Copy link
Contributor Author

Jorropo commented Aug 17, 2023

FWIW

hugo@rikus ~/k/boxo (fix/432)> RUN_FLAKY_TESTS=1 go test -count=1 ./bitswap/...
ok  	github.com/ipfs/boxo/bitswap	0.699s
ok  	github.com/ipfs/boxo/bitswap/client	0.289s
?   	github.com/ipfs/boxo/bitswap/client/internal	[no test files]
ok  	github.com/ipfs/boxo/bitswap/client/internal/blockpresencemanager	0.025s
?   	github.com/ipfs/boxo/bitswap/client/internal/getter	[no test files]
ok  	github.com/ipfs/boxo/bitswap/client/internal/messagequeue	1.355s
ok  	github.com/ipfs/boxo/bitswap/client/internal/notifications	0.029s
ok  	github.com/ipfs/boxo/bitswap/client/internal/peermanager	0.022s
ok  	github.com/ipfs/boxo/bitswap/client/internal/providerquerymanager	0.187s
ok  	github.com/ipfs/boxo/bitswap/client/internal/session	0.654s
ok  	github.com/ipfs/boxo/bitswap/client/internal/sessioninterestmanager	0.006s
ok  	github.com/ipfs/boxo/bitswap/client/internal/sessionmanager	0.037s
ok  	github.com/ipfs/boxo/bitswap/client/internal/sessionpeermanager	0.007s
?   	github.com/ipfs/boxo/bitswap/client/traceability	[no test files]
ok  	github.com/ipfs/boxo/bitswap/client/wantlist	0.005s
?   	github.com/ipfs/boxo/bitswap/decision	[no test files]
?   	github.com/ipfs/boxo/bitswap/internal	[no test files]
?   	github.com/ipfs/boxo/bitswap/internal/defaults	[no test files]
ok  	github.com/ipfs/boxo/bitswap/internal/testutil	0.003s
ok  	github.com/ipfs/boxo/bitswap/message	0.003s
ok  	github.com/ipfs/boxo/bitswap/message/pb	0.004s
?   	github.com/ipfs/boxo/bitswap/metrics	[no test files]
ok  	github.com/ipfs/boxo/bitswap/network	0.388s
?   	github.com/ipfs/boxo/bitswap/network/internal	[no test files]
?   	github.com/ipfs/boxo/bitswap/server	[no test files]
ok  	github.com/ipfs/boxo/bitswap/server/internal/decision	1.376s
?   	github.com/ipfs/boxo/bitswap/testinstance	[no test files]
ok  	github.com/ipfs/boxo/bitswap/testnet	0.007s
?   	github.com/ipfs/boxo/bitswap/tracer	[no test files]
?   	github.com/ipfs/boxo/bitswap/wantlist	[no test files]

I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned.

This new code also have similar dedup logic and state transitions.

Fixes #432
Copy link
Contributor Author

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was written the way it was to decouple connect/disconnect notifications from actually processing those notifications. When overloaded (lots of connect/disconnect events), this code is "well behaved" and will simply end up skipping over short-lived connections.

Specifically:

  1. We put the connected/disconnected/unresponsive states into a map when we receive a relevant event.
  2. Asynchronously, we update the system state to reflect this.
  3. Importantly, we don't spend any time processing "stale" events.

This change will:

  1. Block all new connections in libp2p until bitswap can react to them. Blocking in Connected and Disconnected event handlers is absolutely forbidden.
  2. Cause the system to grind to a halt under load as (dis)connect events get delayed behind other (dis)connect events.

So yeah, don't do it.

Also note, this kind of change should be tested on a gateway and bootstrapper before merging. Testing there and then looking at stack dumps, mutex profiles, etc. will usually reveal problems like this.

@Jorropo
Copy link
Contributor Author

Jorropo commented Aug 17, 2023

Block all new connections in libp2p until bitswap can react to them. Blocking in Connected and Disconnected event handlers is absolutely forbidden.

We already did that in #435. This is a more efficient and simpler way to do this.
I'm not against not blocking libp2p but this means completely rethinking the connection state flow to let the connect calls from the client go through connectEventManager be required and unblocked from the callback.


We could also change the sessions's internals to sort it out with the old behaviour but this isn't the surgical win @rvagg and @hannahhoward were searchig.

@rvagg
Copy link
Member

rvagg commented Aug 17, 2023

OK, well the key thing we need to solve is in the s.findMorePeers->sws.Update flow:

for p := range s.providerFinder.FindProvidersAsync(ctx, k) {
// When a provider indicates that it has a cid, it's equivalent to
// the providing peer sending a HAVE
s.sws.Update(p, nil, []cid.Cid{c}, nil)
}

We need to either make sure that when calling sws.Update with a new peer, that it either knows about it already, or is prepared to handle an as-yet unknown peer. That's where the race exists because of the async disconnect.

Some options for exploration that come to mind for me:

  • Block in that loop for each peer until we know it's been registered with PeerConnected, perhaps sws.Update could block, or perhaps we could have a notifier
  • Put the peers into a queue that we process async only when they have been properly connected
  • Make sws.Update OK with receiving a peer that it doesn't already know about, preemptively
  • Make only ProviderFinder.FindProvidersAsync sync when it comes to performing the Connect, i.e. some kind of notifier system within the ProviderFinder so that it only returns a peer once the Connect->PeerConnected flow is known to be complete.

connectEventManager can handle multiple listeners, so we should be able to surgically inject a listener just for the purpose of watching this flow.

The problem with all of these of course is in dealing with cleanup of peers we shouldn't have to care about for various reasons.

@rvagg
Copy link
Member

rvagg commented Aug 17, 2023

@Stebalien for reference, this is what we're trying to solve: #432 (comment)

@Stebalien
Copy link
Member

So, what we should be doing there is:

  1. Discover the peer.
  2. Record that the session is interested in the peer (and record that we think the peer has some CID X). I believe we do the former in the "PeerManager", I don't think we do the latter.
  3. Asynchronously try to connect to the peer.

Independently:

  1. If/when we get a connection from the peer, find any sessions interested in that peer.
  2. Notify the session of the peer.
  3. The session will now decide which "wants" it wants to send to the peer, if any. In this case, it may send X, but it may not (because, e.g., it may no longer care about X).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bitswap peer connection race
3 participants