
feat: set Bitswap.ProviderSearchDelay to 0s #9530

Closed
lidel wants to merge 2 commits from feat/remove-bitswap-routing-delay

Conversation

@lidel (Member) commented Jan 5, 2023

This PR sets Bitswap.ProviderSearchDelay to 0s,
effectively removing the delay described in #8807

Rationale

Kubo 0.18 makes two changes that make 1s no longer a viable default:

  1. Connection manager will now aim to maintain ~100 peers instead of ~800 (feat: adjust ConnMgr to 32/96 #9483), which makes bitswap-based routing less appealing (Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807 (comment))
  2. Default Routing.Type=auto (feat: Routing.Type=auto (DHT+IPNI) #9475) will now use both DHT and IPNI (HTTP router) in parallel, which makes non-bitswap routing more appealing, especially for content that is indexed by IPNI.

This PR removes the delay by setting Bitswap.ProviderSearchDelay to 0s.
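
For orientation, the delay in question is how long bitswap waits before falling back to its content routers (DHT/IPNI) at all. Below is a minimal sketch of constructing a bitswap client with that delay set to zero. The `ProviderSearchDelay` option and the `New` signature are assumptions based on go-bitswap as it existed around the Kubo 0.18 era; this is illustrative, not Kubo's actual wiring code.

```go
package example

import (
	"context"
	"time"

	bitswap "github.com/ipfs/go-bitswap"
	bsnet "github.com/ipfs/go-bitswap/network"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	exchange "github.com/ipfs/go-ipfs-exchange-interface"
)

// newExchange builds a bitswap client that asks its content routers
// immediately, instead of waiting ~1s for connected peers to answer the
// wantlist broadcast first.
func newExchange(ctx context.Context, net bsnet.BitSwapNetwork, bs blockstore.Blockstore) exchange.Interface {
	return bitswap.New(ctx, net, bs,
		bitswap.ProviderSearchDelay(0*time.Second), // the default this PR proposes
	)
}
```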

Why 0s and not 200ms

  • By default, Kubo 0.18 will keep 8x fewer connections than 0.17.
    This means the additional DHT/IPNI lookup cost (~4 extra requests per CID) is still smaller than the savings from bitswap spam dropping to 1/8 of the old default (Kubo 0.18 no longer sends ~700 bitswap requests per CID).
  • Removing all artificial delays sounds like a good rule of thumb. We can refine this after benchmarks are run with DHT+IPNI. Nodes that need a different value can still override it locally (see the sketch after this list).
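
If 0s turns out to be too aggressive for a particular deployment, the value remains overridable per node. A minimal sketch of such an override, assuming the user-facing key is `Internal.Bitswap.ProviderSearchDelay` (as described in Kubo's docs/config.md) and that it accepts a duration string:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch: print the JSON fragment a node operator would merge into the Kubo
// config file (~/.ipfs/config) to pick their own provider search delay.
// The key path Internal.Bitswap.ProviderSearchDelay is an assumption taken
// from Kubo's config documentation; verify against docs/config.md.
func main() {
	override := map[string]any{
		"Internal": map[string]any{
			"Bitswap": map[string]any{
				"ProviderSearchDelay": "200ms", // or "0s", "1s", ...
			},
		},
	}
	out, err := json.MarshalIndent(override, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The equivalent CLI form would presumably be `ipfs config --json Internal.Bitswap.ProviderSearchDelay '"200ms"'`, again assuming that key path.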

TODO

Closes #8807
cc @guillaumemichel @aschmahmann

@lidel changed the title from "feat: remove Bitswap.ProviderSearchDelay" to "feat: set Bitswap.ProviderSearchDelay to 0s" Jan 5, 2023
@lidel self-assigned this Jan 5, 2023
@BigLep (Contributor) left a comment

Great PR description. I'm aligned with the change.

Maybe the other thing to call out is that while the resource overhead for the client node is minimal, it does drive a ~16x load increase to the indexers per #8807 (comment). IPNI believes they can handle it: #8807 (comment)

@aschmahmann (Contributor) commented:

Left some informational comments. However, I haven't been involved with the developments here over the past few months so do with this information what you will.

  1. AFAICT from Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807, when someone actually changed this number to zero on a running kubo instance (thunderdome experiments) the TTFB was worse, and yet we're going ahead with this change anyway, which seems strange.
    • Generally speaking, the only performance metrics we track are related to how a bunch of infrastructure performs (e.g. ipfs.io, preload nodes, cluster nodes, etc.). If we're not going to set this number to 0 in infra because it performs badly (e.g. the thunderdome test), how are we going to find out if it's hurting users (clients) by, say, increasing their CPU usage?
  2. Related to the above, it's not clear what implementation bugs/side-effects will crop up from this, especially given it hasn't been successfully run with this configuration in a production-like environment. Some examples could include:
    • Improperly propagated sessions resulting in extra lookups that wouldn't have existed before, with instances like:
      • I wouldn't be surprised if the current setup, where the gateway code uses go-ipld-prime for path resolution (via go-path) and go-unixfs is used for actually fetching the data, results in two different sessions, which would add an extra routing request per gateway user request
      • IIRC the gateway UnixFS code path also results in an extra lookup differentiating the first block from the rest, which is another unnecessary content routing lookup per user request
      • Overall this ends up being 3x overhead on requests, on top of whatever percentage of requests wouldn't have needed to be made in the first place (e.g. all requests to Pinata, which IIRC doesn't advertise data anywhere)
    • The limited pool of FindProvider workers in go-bitswap ends up getting clogged by unnecessary requests, leading to slower resolution for data that actually needed those content routing requests. This might not be a problem but idk 🤷
  3. As was made obvious with some of the libp2p changes last year, a number of our protocol implementations don't deal super well with DoS/fairness issues. Having spammier clients could lead to issues if/when servers decide to either limit spam or stop being servers entirely. Given that the Hydras were taken down as too expensive to run, consider what would've happened if they were still being run but receiving >10x the number of queries. This is aside from how happy most users running DHT servers will be to take the higher load.
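
On the session point (2) above: the extra lookups come from each session doing its own provider discovery. A purely illustrative sketch, assuming the `SessionExchange`/`Fetcher` shapes from go-ipfs-exchange-interface of that era; reusing one session across path resolution and data fetching is what avoids the duplicated routing requests:

```go
package example

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	cid "github.com/ipfs/go-cid"
	exchange "github.com/ipfs/go-ipfs-exchange-interface"
)

// fetchAll pulls a set of blocks through a single bitswap session. The point
// of the sketch: if path resolution and data fetching each open their own
// session, every provider lookup (DHT/IPNI) can happen twice per request;
// sharing one session lets discovered providers and want-lists be reused.
func fetchAll(ctx context.Context, ex exchange.SessionExchange, cids []cid.Cid) ([]blocks.Block, error) {
	session := ex.NewSession(ctx) // one session for the whole user request

	out := make([]blocks.Block, 0, len(cids))
	for _, c := range cids {
		blk, err := session.GetBlock(ctx, c)
		if err != nil {
			return nil, err
		}
		out = append(out, blk)
	}
	return out, nil
}
```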

@lidel (Member, Author) commented Jan 10, 2023

Ack on the lack of measurements backing this change, other than the assumption that the decrease in bitswap spam cost makes it net positive.
This PR puts some pressure on us and is forcing us to make a decision / fix implementation issues.

My understanding is that the TTFB tests no longer apply, as multiple characteristics change in Kubo 0.18: we lowered the default number of bitswap peers AND added HTTP routing via IPNI – that is why I asked in #8807 (comment) whether we should measure before making the change, or remove this artificial limit and then measure.

@BigLep we've (@aschmahmann, @guseggert, infra) briefly discussed this during Content Routing WG #3 and IIUC the takeaway is:

  1. we want to park this at least until the impact on cid.contact is better understood (RC2 running with 0s, confirming we don't have an unexpected code path that introduces an amplification vector on their side when we t)
  2. (TBD, ideally we also do this) we add metrics and benchmark defaults in 0.17 vs 0.18 to confirm there is no catastrophic regression with/without cid.contact (Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807 (comment))

@lidel added the status/blocked label (Unable to be worked further until needs are met) Jan 10, 2023
@guillaumemichel commented:

We should set a different ProviderSearchDelay for the DHT and for the Indexers #8807 (comment).

The network overhead of setting the DHT's ProviderSearchDelay to 0 has already been measured in the Bitswap Discovery Effectiveness study, and is minimal (~7 additional messages sent). Moreover, as the Bitswap broadcast will be reduced (from ~850 peers to ~100), Bitswap is expected to find fewer CIDs with its broadcast and to be more dependent on the DHT. Hence, we really don't want to keep this artificial 1s delay before querying the DHT.

IMO it doesn't make sense to penalize DHT content routing just because IPNI has a lower capacity than the DHT. The DHT and IPNI should be treated as 2 independent Content Routers. We shouldn't need to wait on IPNI, in order to ship improvements for other content routers (Bitswap, DHT). IPNI can continue having a ProviderSearchDelay of 1s and reduce it once ready to handle more requests.

@lidel (Member, Author) commented Jan 17, 2023

I've set ExecuteAfter=1s for the default IPNI router in 0724d2a. This means cid.contact will see no difference, and is no longer blocking this PR (cc @willscott)

Current state of this PR:

  • Default IPNI delay is 1s
  • Default DHT delay is 0s
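
For readers unfamiliar with ExecuteAfter: in the composable router setup it delays when a given router is queried, while the routers still run in parallel. The snippet below is a hypothetical, self-contained illustration of that behavior; names like routerEntry and findProvidersParallel are made up for this sketch, and Kubo's real implementation lives in go-libp2p-routing-helpers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// routerEntry is a hypothetical stand-in for one entry of a parallel
// composite router: a name, a per-router delay (what this PR calls
// ExecuteAfter), and the lookup itself.
type routerEntry struct {
	name         string
	executeAfter time.Duration
	lookup       func(cidStr string)
}

// findProvidersParallel starts every router concurrently, but each one only
// after its own ExecuteAfter delay. Result merging and cancellation are
// omitted; the sketch is only about the per-router delay.
func findProvidersParallel(cidStr string, routers []routerEntry) {
	var wg sync.WaitGroup
	for _, r := range routers {
		wg.Add(1)
		go func(r routerEntry) {
			defer wg.Done()
			time.Sleep(r.executeAfter)
			r.lookup(cidStr)
		}(r)
	}
	wg.Wait()
}

func main() {
	routers := []routerEntry{
		// Mirrors the state of this PR: DHT queried immediately, IPNI after 1s.
		{name: "dht", executeAfter: 0, lookup: func(c string) { fmt.Println("DHT FindProviders for", c) }},
		{name: "ipni", executeAfter: time.Second, lookup: func(c string) { fmt.Println("cid.contact FindProviders for", c) }},
	}
	findProvidersParallel("bafy...", routers)
}
```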

@willscott (Contributor) commented:

Do you have an idea of what the increase in DHT/indexer query traffic will be?
We'd like to have the delay removed for the indexers as well; I don't see where the indexer has been blocking this, although we are hoping it gets tested on gateways before it goes into a full public release, where we don't have control to roll back if it ends up being an overwhelming multiplier of requests.
Presumably that would be an equal concern for the DHT though, so we're okay with you including us without delay at the same time.

@guillaumemichel commented:

We expect to see an increase in the number of DHT content lookup requests of ~16x compared to what we currently have (source: #8807 (comment)).

Each content request is expected to generate roughly 750 fewer Bitswap requests and 7 additional DHT request messages. So it shouldn't be a concern for DHT server nodes.

The content routing load in Bitswap is expected to drop ~8x (850 -> 100 messages per request), and the load on the DHT/Indexers to increase ~16x (~0.4 -> 7 messages per request for the DHT, 1 message per request for the indexers?). Most DHT server nodes handle both Bitswap and DHT traffic, so the proposed change should reduce their overall load. However, the Indexers (which don't run Bitswap, AFAIU) will see a ~16x load increase (compared to the current DHT load).
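
A back-of-the-envelope sketch of those per-request numbers, purely illustrative and based on the estimates quoted in this thread rather than on measurements:

```go
package main

import "fmt"

func main() {
	// Per-request estimates quoted in this thread (not measured values).
	const (
		oldBitswapBroadcast = 850 // Kubo 0.17: wantlist broadcast to ~850 peers
		newBitswapBroadcast = 100 // Kubo 0.18: ~100 connected peers
		oldDHTMessages      = 0.4 // today only a fraction of requests falls through to the DHT
		newDHTMessages      = 7.0 // with a 0s delay, ~7 DHT messages per request
		newIPNIMessages     = 1.0 // roughly one HTTP query to the indexer per request
	)

	bitswapSaved := oldBitswapBroadcast - newBitswapBroadcast // ~750 fewer bitswap messages
	broadcastDrop := float64(oldBitswapBroadcast) / float64(newBitswapBroadcast)
	dhtIncrease := newDHTMessages / oldDHTMessages // the thread rounds this to ~16x

	fmt.Printf("bitswap messages saved per request: ~%d (~%.1fx fewer)\n", bitswapSaved, broadcastDrop)
	fmt.Printf("extra DHT messages per request: ~%.1f (~%.1fx more)\n", newDHTMessages-oldDHTMessages, dhtIncrease)
	fmt.Printf("extra indexer (IPNI) messages per request: ~%.1f\n", newIPNIMessages)
}
```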

@lidel (Member, Author) commented Feb 7, 2023

Closing due to #8807 (comment) (it does more harm than good; more research/refactoring is needed).
Let's continue the discussion in #8807.

@lidel closed this Feb 7, 2023
@lidel deleted the feat/remove-bitswap-routing-delay branch February 7, 2023