
fix: wantlist overflow handling to select newer entries #629

Draft: gammazero wants to merge 1 commit into main
Conversation

gammazero (Contributor) opened this pull request:

Wantlist overflow handling now cancels existing entries to make room for newer requests. This fix prevents the wantlist from filling up with CIDs that the server does not have.

Fixes #527

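For illustration only, here is a minimal sketch of the policy described above, using string stand-ins for CIDs and an assumed maxEntries cap; the actual change works through the engine's peerLedger and peerRequestQueue rather than plain slices, and the function name is hypothetical.

```go
package main

import "fmt"

// applyOverflowPolicy keeps a wantlist within maxEntries by dropping
// (cancelling) the oldest existing entries to make room for newer incoming
// wants, so the list cannot stay filled with CIDs the server never has.
// Both slices are assumed to be ordered oldest-first.
func applyOverflowPolicy(existing, incoming []string, maxEntries int) []string {
	if len(incoming) > maxEntries {
		// Incoming alone exceeds the cap; keep only its newest entries.
		incoming = incoming[len(incoming)-maxEntries:]
	}
	room := maxEntries - len(incoming)
	if len(existing) > room {
		// Cancel the oldest existing entries to make room for the new wants.
		existing = existing[len(existing)-room:]
	}
	return append(existing, incoming...)
}

func main() {
	existing := []string{"cid-a", "cid-b", "cid-c"}
	incoming := []string{"cid-d", "cid-e"}
	fmt.Println(applyOverflowPolicy(existing, incoming, 4))
	// Output: [cid-b cid-c cid-d cid-e]
}
```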
@gammazero gammazero requested a review from a team as a code owner June 23, 2024 20:23

codecov bot commented Jun 23, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 3 lines in your changes missing coverage. Please review.

Project coverage is 59.87%. Comparing base (dfd4a53) to head (713faee).


@@            Coverage Diff             @@
##             main     #629      +/-   ##
==========================================
+ Coverage   59.79%   59.87%   +0.07%     
==========================================
  Files         238      238              
  Lines       29984    30014      +30     
==========================================
+ Hits        17930    17971      +41     
+ Misses      10434    10425       -9     
+ Partials     1620     1618       -2     
Files                                              Coverage Δ
bitswap/message/message.go                         81.97% <100.00%> (+0.97%) ⬆️
bitswap/server/internal/decision/peer_ledger.go    94.23% <100.00%> (+0.05%) ⬆️
bitswap/server/internal/decision/engine.go         91.78% <94.64%> (+0.55%) ⬆️

... and 12 files with indirect coverage changes

@lidel changed the title from "Fix wantlist overflow handling to select newer entries." to "fix: wantlist overflow handling to select newer entries" on Jun 24, 2024
	// are not in the new request.
	for _, entry := range wants {
		if e.peerLedger.CancelWant(p, entry.Cid) {
			e.peerRequestQueue.Remove(entry.Cid, p)
gammazero (Contributor, Author) commented:

It may be better to not remove the entry from peerRequestQueue here, since this one is being replaced with an identical want from the new request.
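A toy illustration of that point, with string stand-ins for CIDs and hypothetical names: queued tasks are removed only for cancelled CIDs that the new request does not re-ask for, since an identical task would just be re-added.

```go
package main

import "fmt"

// tasksToRemove returns the cancelled CIDs whose queued tasks should actually
// be removed, skipping any CID that the new request re-includes.
func tasksToRemove(cancelled []string, newRequest map[string]bool) []string {
	var remove []string
	for _, c := range cancelled {
		if !newRequest[c] {
			remove = append(remove, c)
		}
	}
	return remove
}

func main() {
	cancelled := []string{"cid-a", "cid-b", "cid-c"}
	newRequest := map[string]bool{"cid-b": true}
	fmt.Println(tasksToRemove(cancelled, newRequest)) // [cid-a cid-c]
}
```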

@Wondertan (Member) commented:

This option will need a minor change

	if e.maxCidSize != 0 && uint(entry.Cid.ByteLen()) > e.maxCidSize {
		// Ignore requests about CIDs that big.
		continue
wants = filteredWants
Contributor commented:

Not sure which line to leave this comment on, but adding some thoughts based on conversations with @gammazero and @Stebalien

It might be helpful if we logically had two "wantlists" here:

  1. The set of blocks that the server is currently in the middle of processing responses for (i.e. request-response style semantics)
  2. The set of blocks that the server doesn't currently have, but if it receives them it will notify / send them out to requesters (i.e. subscription semantics)

We currently have the wantlist and the taskqueue, which might serve as these two lists, but they also might not, given that we want to be able to do things like cancel tasks in the taskqueue without re-enumerating the entire queue in case it's large. It might also be fine if capacity for subscription-wants only really exists once the request-response wants have been satisfied: if tons of new requests are coming in, flushing out all the old subscriptions is probably fine, but it seems OK to give more capacity to request-response wants, which are supposed to be "moving", than to subscriptions that can stay and take up memory indefinitely. (A rough sketch of the two-list idea follows at the end of this comment.)

Also, for others watching: I think it's fairly clear that some protocol changes to help with backpressure here would be great, but since they're protocol changes those will have to wait for another time 😅.
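As a rough sketch of what those two logical wantlists could look like (all names and types below are illustrative, not existing engine code):

```go
package main

import "fmt"

// CID is a stand-in for a real content identifier.
type CID string

// peerWants separates a peer's wants into the two logical sets described
// above; both fields and their names are illustrative only.
type peerWants struct {
	// active: blocks the server has (or is fetching) and is currently
	// responding with, i.e. request-response semantics.
	active map[CID]struct{}
	// subscribed: blocks the server does not have yet; if one arrives later,
	// the requester is notified, i.e. subscription semantics.
	subscribed map[CID]struct{}
}

// admit records a new want. Subscription capacity is bounded, and an old
// subscription is evicted to make room, so subscriptions cannot accumulate
// and hold memory indefinitely; active (request-response) wants are not
// bounded here because they are expected to keep moving.
func (w *peerWants) admit(c CID, haveBlock bool, maxSubscribed int) {
	if haveBlock {
		w.active[c] = struct{}{}
		return
	}
	if len(w.subscribed) >= maxSubscribed {
		for old := range w.subscribed { // evict an arbitrary old subscription
			delete(w.subscribed, old)
			break
		}
	}
	w.subscribed[c] = struct{}{}
}

func main() {
	w := &peerWants{active: map[CID]struct{}{}, subscribed: map[CID]struct{}{}}
	w.admit("have-1", true, 2)
	w.admit("missing-1", false, 2)
	w.admit("missing-2", false, 2)
	w.admit("missing-3", false, 2) // evicts one earlier subscription
	fmt.Println(len(w.active), len(w.subscribed)) // 1 2
}
```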

@Wondertan (Member) commented on Jun 25, 2024:

(slightly tangential)
We use Bitswap in a case where more than 1024 (the default max) CIDs are requested in a single GetBlocks call. When we do that, Bitswap hangs because of this limit and only recovers after about a minute, once rebroadcasting kicks in. It would be great if Bitswap clients could handle that case and avoid sending more than the protocol-wide constant by backpressuring the caller.

@aschmahmann (Contributor) commented on Jun 25, 2024:

> It would be great if Bitswap clients could handle that case and avoid sending more than the protocol-wide constant by backpressuring the caller.

Could you clarify your suggestion? Reading it I see 3 options (but these might not even be correct 😅):

  1. Have the server backpressure the client on the other side of the wire.
    • Similar to my comment above, and AFAICT this requires some protocol changes to accommodate.
  2. Have the bitswap client internally split GetBlocks calls larger than maxWantlistSize into batches, returning blocks only as each batch is fully served, in the hope that a batch completes before the rebroadcast interval?
    • Doable if it'll be helpful, although it's a bit gross since A) maxWantlistSize shouldn't really be a protocol/network-wide thing and should instead be per-client, and B) this batching can be done outside of the bitswap client package (a rough sketch of that approach follows this list).
  3. The bitswap client backpressuring the caller (e.g. code that's walking a DAG).
    • My suspicion is this would best be served by a streaming version of GetBlocks that you could block on, which is a separate problem covered by [ipfs/go-bitswap] Proposal: Streaming GetBlocks #121. If you're interested in that, let's chat on the issue there.
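A rough sketch of option 2 done outside the bitswap client package: split a large key list into batches no larger than an assumed per-peer cap and drain each batch before requesting the next. The blockFetcher interface, maxWantlistSize constant, getAllBatched helper, and mockFetcher are all illustrative; only the GetBlocks(ctx, []cid.Cid) (<-chan blocks.Block, error) signature mirrors the fetcher discussed here, and it assumes the returned channel closes once the batch is served or the context is cancelled.

```go
package main

import (
	"context"
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// blockFetcher is the subset of the bitswap client API used here.
type blockFetcher interface {
	GetBlocks(ctx context.Context, keys []cid.Cid) (<-chan blocks.Block, error)
}

// maxWantlistSize mirrors the per-peer server-side limit discussed above.
const maxWantlistSize = 1024

// getAllBatched requests keys in batches that stay under the server-side
// wantlist cap, draining each batch fully before requesting the next, so the
// remote wantlist is never asked to hold more than one batch at a time.
func getAllBatched(ctx context.Context, f blockFetcher, keys []cid.Cid, out chan<- blocks.Block) error {
	defer close(out)
	for start := 0; start < len(keys); start += maxWantlistSize {
		end := start + maxWantlistSize
		if end > len(keys) {
			end = len(keys)
		}
		ch, err := f.GetBlocks(ctx, keys[start:end])
		if err != nil {
			return err
		}
		for blk := range ch { // drain this batch before requesting the next
			select {
			case out <- blk:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
	return nil
}

// mockFetcher serves every requested CID immediately, just for the demo below.
type mockFetcher struct{ store map[cid.Cid]blocks.Block }

func (m mockFetcher) GetBlocks(ctx context.Context, keys []cid.Cid) (<-chan blocks.Block, error) {
	ch := make(chan blocks.Block, len(keys))
	for _, k := range keys {
		if b, ok := m.store[k]; ok {
			ch <- b
		}
	}
	close(ch)
	return ch, nil
}

func main() {
	store := map[cid.Cid]blocks.Block{}
	var keys []cid.Cid
	for i := 0; i < 3000; i++ {
		b := blocks.NewBlock([]byte(fmt.Sprintf("block-%d", i)))
		store[b.Cid()] = b
		keys = append(keys, b.Cid())
	}
	out := make(chan blocks.Block)
	go getAllBatched(context.Background(), mockFetcher{store}, keys, out) // error ignored for brevity
	n := 0
	for range out {
		n++
	}
	fmt.Println("received", n, "blocks") // received 3000 blocks
}
```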

Member commented:

Thank you for the elaborate response!

My suggestion was some form of 2 and 3. The client should be able to deal with server-side request rate-limiting; otherwise, the client gets stuck expecting to be served while the server simply cuts off its wants. The rebroadcast after one minute helps, but that's still a minute of delay plus request-spamming overhead over the wire.

Ideally, we would do a protocol change as described in 1, but as it's breaking, we may consider other, less clean options like 2 or something similar. Setting a protocol-wide maxWantlistSize is gross, I agree. Another option might be negotiating the limit between the client and the server, so the client knows it should never exceed it.

Option 3 is complementary and provides a new, powerful way to interface with the client. However, I just realized that it is not necessary for our case if the client is smart enough. We have a flat structure, i.e. we don't traverse a DAG where we unpack IPLD nodes to get more CID links, fetch them, and unpack again to get to the data. Essentially, we know all the CIDs in advance and could simply ask Bitswap to get all of them via GetBlocks, as long as the client is smart enough not to run into the limitations of its immediate peers, which is exactly what we are facing with this issue.

Member commented:

Re the protocol-breaking option 1: we are actually fine with this being a protocol-breaking change, as we are building a new Bitswap-based protocol that hasn't been deployed yet, and I believe there is a more or less clear way to handle protocol version bumps in the bitswap network component.

Successfully merging this pull request may close these issues:

bitswap/server: wantlist overflow fails in a toxic manner, preventing any data transfer
3 participants