
feat: set Bitswap.ProviderSearchDelay to 0s #9530

Closed
lidel wants to merge 2 commits from feat/remove-bitswap-routing-delay

Conversation

@lidel (Member) commented Jan 5, 2023

This PR sets Bitswap.ProviderSearchDelay to 0s,
effectively removing the delay described in #8807

Rationale

Kubo 0.18 makes two changes that make 1s no longer a viable default:

  1. Connection manager will now aim to maintain ~100 peers instead of ~800 (feat: adjust ConnMgr to 32/96 #9483), which makes bitswap-based routing less appealing (Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807 (comment))
  2. Default Routing.Type=auto (feat: Routing.Type=auto (DHT+IPNI) #9475) will now use both DHT and IPNI (HTTP router) in parallel, which makes non-bitswap routing more appealing, especially for content that is indexed by IPNI.

This PR removes the delay by setting Bitswap.ProviderSearchDelay to 0s.
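
For orientation, the delay in question is how long bitswap waits before falling back to its content routers (DHT/IPNI) at all. Below is a minimal sketch of constructing a bitswap client with that delay set to zero. The `ProviderSearchDelay` option and the `New` signature are assumptions based on go-bitswap as it existed around the Kubo 0.18 era; this is illustrative, not Kubo's actual wiring code.

```go
package example

import (
	"context"
	"time"

	bitswap "github.com/ipfs/go-bitswap"
	bsnet "github.com/ipfs/go-bitswap/network"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	exchange "github.com/ipfs/go-ipfs-exchange-interface"
)

// newExchange builds a bitswap client that asks its content routers
// immediately, instead of waiting ~1s for connected peers to answer the
// wantlist broadcast first.
func newExchange(ctx context.Context, net bsnet.BitSwapNetwork, bs blockstore.Blockstore) exchange.Interface {
	return bitswap.New(ctx, net, bs,
		bitswap.ProviderSearchDelay(0*time.Second), // the default this PR proposes
	)
}
```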

Why 0s and not 200ms

  • By default, Kubo 0.18 will keep 8x fewer connections than 0.17.
    This means the additional DHT/IPNI lookup cost (~4 extra requests per CID) is still smaller than the savings from bitswap spam dropping to 1/8 of the old default (Kubo 0.18 no longer sends ~700 bitswap requests per CID).
  • Removing all artificial delays sounds like a good rule of thumb. We can refine this after benchmarks are run with DHT+IPNI. Nodes that need a different value can still override it locally (see the sketch after this list).
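
If 0s turns out to be too aggressive for a particular deployment, the value remains overridable per node. A minimal sketch of such an override, assuming the user-facing key is `Internal.Bitswap.ProviderSearchDelay` (as described in Kubo's docs/config.md) and that it accepts a duration string:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sketch: print the JSON fragment a node operator would merge into the Kubo
// config file (~/.ipfs/config) to pick their own provider search delay.
// The key path Internal.Bitswap.ProviderSearchDelay is an assumption taken
// from Kubo's config documentation; verify against docs/config.md.
func main() {
	override := map[string]any{
		"Internal": map[string]any{
			"Bitswap": map[string]any{
				"ProviderSearchDelay": "200ms", // or "0s", "1s", ...
			},
		},
	}
	out, err := json.MarshalIndent(override, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The equivalent CLI form would presumably be `ipfs config --json Internal.Bitswap.ProviderSearchDelay '"200ms"'`, again assuming that key path.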

TODO

Closes #8807
cc @guillaumemichel @aschmahmann

@lidel changed the title from "feat: remove Bitswap.ProviderSearchDelay" to "feat: set Bitswap.ProviderSearchDelay to 0s" Jan 5, 2023
@lidel self-assigned this Jan 5, 2023
@BigLep (Contributor) left a comment

Great PR description. I'm aligned with the change.

Maybe the other thing to call out is that while the resource overhead for the client node is minimal, it does drive a ~16x load increase to the indexers per #8807 (comment). IPNI believes they can handle it: #8807 (comment)

@aschmahmann (Contributor) commented:

Left some informational comments. However, I haven't been involved with the developments here over the past few months so do with this information what you will.

  1. AFAICT from Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807, when someone actually changed this number to zero on a running kubo instance (thunderdome experiments) the TTFB was worse, and yet we're going ahead with this change anyway, which seems strange.
    • Generally speaking, the only performance metrics we track are related to how a bunch of infrastructure performs (e.g. ipfs.io, preload nodes, cluster nodes, etc.). If we're not going to set this number to 0 in infra because it performs badly (e.g. the thunderdome test), how are we going to find out if it's hurting users (clients) by, say, increasing their CPU usage?
  2. Related to the above, it's not clear what implementation bugs/side-effects will crop up from this, especially given it hasn't been successfully run with this configuration in a production-like environment. Some examples could include:
    • Improperly propagated sessions resulting in extra lookups that wouldn't have existed before, with instances like:
      • I wouldn't be surprised if the current setup, where the gateway code uses go-ipld-prime for path resolution (via go-path) and go-unixfs is used for actually fetching the data, results in two different sessions, which would add an extra routing request per gateway user request
      • IIRC the gateway UnixFS code path also results in an extra lookup differentiating the first block from the rest, which is another unnecessary content routing lookup per user request
      • Overall this ends up being 3x overhead on requests, on top of whatever percentage of requests wouldn't have needed to be made in the first place (e.g. all requests to Pinata, which IIRC doesn't advertise data anywhere)
    • The limited pool of FindProvider workers in go-bitswap ends up getting clogged by unnecessary requests, leading to slower resolution for data that actually needed those content routing requests. This might not be a problem but idk 🤷
  3. As was made obvious with some of the libp2p changes last year, a number of our protocol implementations don't deal super well with DoS/fairness issues. Having spammier clients could lead to issues if/when servers decide to either limit spam or stop being servers entirely. Given that the Hydras were taken down as too expensive to run, consider what would've happened if they were still being run but receiving >10x the number of queries. This is aside from how happy most users running DHT servers will be to take the higher load.
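
On the session point (2) above: the extra lookups come from each session doing its own provider discovery. A purely illustrative sketch, assuming the `SessionExchange`/`Fetcher` shapes from go-ipfs-exchange-interface of that era; reusing one session across path resolution and data fetching is what avoids the duplicated routing requests:

```go
package example

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	cid "github.com/ipfs/go-cid"
	exchange "github.com/ipfs/go-ipfs-exchange-interface"
)

// fetchAll pulls a set of blocks through a single bitswap session. The point
// of the sketch: if path resolution and data fetching each open their own
// session, every provider lookup (DHT/IPNI) can happen twice per request;
// sharing one session lets discovered providers and want-lists be reused.
func fetchAll(ctx context.Context, ex exchange.SessionExchange, cids []cid.Cid) ([]blocks.Block, error) {
	session := ex.NewSession(ctx) // one session for the whole user request

	out := make([]blocks.Block, 0, len(cids))
	for _, c := range cids {
		blk, err := session.GetBlock(ctx, c)
		if err != nil {
			return nil, err
		}
		out = append(out, blk)
	}
	return out, nil
}
```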

@lidel (Member, Author) commented Jan 10, 2023

Ack on the lack of measurements backing this change, other than the assumption that the decrease in bitswap spam cost makes it net positive.
This PR puts some pressure on us and is forcing us to make a decision / fix implementation issues.

My understanding is that the TTFB tests no longer apply, as multiple characteristics change in Kubo 0.18: we lowered the default number of bitswap peers AND added HTTP routing via IPNI – that is why I asked in #8807 (comment) whether we should measure before making the change, or remove this artificial limit and then measure.

@BigLep we've (@aschmahmann, @guseggert, infra) briefly discussed this during Content Routing WG #3 and IIUC the takeaway is:

  1. we want to park this at least until the impact on cid.contact is better understood (RC2 running with 0s, confirming we don't have an unexpected code path that introduces an amplification vector on their side when we t)
  2. (TBD, ideally we also do this) we add metrics and benchmark defaults in 0.17 vs 0.18 to confirm there is no catastrophic regression with/without cid.contact (Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807 (comment))

@lidel added the status/blocked label (Unable to be worked further until needs are met) Jan 10, 2023
@guillaumemichel commented:

We should set a different ProviderSearchDelay for the DHT and for the Indexers #8807 (comment).

The network overhead of setting the DHT's ProviderSearchDelay to 0 has already been measured in the Bitswap Discovery Effectiveness study, and is minimal (~7 additional messages sent). Moreover, as the Bitswap broadcast will be reduced (from ~850 peers to ~100), Bitswap is expected to find fewer CIDs with its broadcast and to be more dependent on the DHT. Hence, we really don't want to keep this artificial 1s delay before querying the DHT.

IMO it doesn't make sense to penalize DHT content routing just because IPNI has a lower capacity than the DHT. The DHT and IPNI should be treated as 2 independent Content Routers. We shouldn't need to wait on IPNI, in order to ship improvements for other content routers (Bitswap, DHT). IPNI can continue having a ProviderSearchDelay of 1s and reduce it once ready to handle more requests.

@lidel (Member, Author) commented Jan 17, 2023

I've set ExecuteAfter=1s for the default IPNI router in 0724d2a. This means cid.contact will see no difference, and is no longer blocking this PR (cc @willscott)

Current state of this PR:

  • Default IPNI delay is 1s
  • Default DHT delay is 0s
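
For readers unfamiliar with ExecuteAfter: in the composable router setup it delays when a given router is queried, while the routers still run in parallel. The snippet below is a hypothetical, self-contained illustration of that behavior; names like routerEntry and findProvidersParallel are made up for this sketch, and Kubo's real implementation lives in go-libp2p-routing-helpers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// routerEntry is a hypothetical stand-in for one entry of a parallel
// composite router: a name, a per-router delay (what this PR calls
// ExecuteAfter), and the lookup itself.
type routerEntry struct {
	name         string
	executeAfter time.Duration
	lookup       func(cidStr string)
}

// findProvidersParallel starts every router concurrently, but each one only
// after its own ExecuteAfter delay. Result merging and cancellation are
// omitted; the sketch is only about the per-router delay.
func findProvidersParallel(cidStr string, routers []routerEntry) {
	var wg sync.WaitGroup
	for _, r := range routers {
		wg.Add(1)
		go func(r routerEntry) {
			defer wg.Done()
			time.Sleep(r.executeAfter)
			r.lookup(cidStr)
		}(r)
	}
	wg.Wait()
}

func main() {
	routers := []routerEntry{
		// Mirrors the state of this PR: DHT queried immediately, IPNI after 1s.
		{name: "dht", executeAfter: 0, lookup: func(c string) { fmt.Println("DHT FindProviders for", c) }},
		{name: "ipni", executeAfter: time.Second, lookup: func(c string) { fmt.Println("cid.contact FindProviders for", c) }},
	}
	findProvidersParallel("bafy...", routers)
}
```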

@willscott (Contributor) commented:

Do you have an idea of what the increase in DHT/indexer query traffic will be?
We'd like to have the delay removed for the indexers as well; I don't see where the indexer has been blocking this, although we are hoping it gets tested on gateways before it goes into a full public release, where we don't have control to roll back if it ends up being an overwhelming multiplier of requests.
Presumably that would be an equal concern for the DHT though, so we're okay with you including us without delay at the same time.

@guillaumemichel commented:

We expect to see an increase in the number of DHT content lookup requests of ~16x compared to what we currently have (source: #8807 (comment)).

Each content request is expected to generate roughly 750 fewer Bitswap requests and 7 additional DHT request messages. So it shouldn't be a concern for DHT server nodes.

The content routing load in Bitswap is expected to drop ~8x (850 -> 100 messages per request), and the load on the DHT/Indexers to increase ~16x (~0.4 -> 7 messages per request for the DHT, 1 message per request for the indexers?). Most DHT server nodes handle both Bitswap and DHT traffic, so the proposed change should reduce their overall load. However, the Indexers (which don't run Bitswap, AFAIU) will see a ~16x load increase (compared to the current DHT load).
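
A back-of-the-envelope sketch of those per-request numbers, purely illustrative and based on the estimates quoted in this thread rather than on measurements:

```go
package main

import "fmt"

func main() {
	// Per-request estimates quoted in this thread (not measured values).
	const (
		oldBitswapBroadcast = 850 // Kubo 0.17: wantlist broadcast to ~850 peers
		newBitswapBroadcast = 100 // Kubo 0.18: ~100 connected peers
		oldDHTMessages      = 0.4 // today only a fraction of requests falls through to the DHT
		newDHTMessages      = 7.0 // with a 0s delay, ~7 DHT messages per request
		newIPNIMessages     = 1.0 // roughly one HTTP query to the indexer per request
	)

	bitswapSaved := oldBitswapBroadcast - newBitswapBroadcast // ~750 fewer bitswap messages
	broadcastDrop := float64(oldBitswapBroadcast) / float64(newBitswapBroadcast)
	dhtIncrease := newDHTMessages / oldDHTMessages // the thread rounds this to ~16x

	fmt.Printf("bitswap messages saved per request: ~%d (~%.1fx fewer)\n", bitswapSaved, broadcastDrop)
	fmt.Printf("extra DHT messages per request: ~%.1f (~%.1fx more)\n", newDHTMessages-oldDHTMessages, dhtIncrease)
	fmt.Printf("extra indexer (IPNI) messages per request: ~%.1f\n", newIPNIMessages)
}
```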

@lidel (Member, Author) commented Feb 7, 2023

Closing due to #8807 (comment) (it does more harm than good; more research/refactoring is needed).
Let's continue the discussion in #8807.

@lidel closed this Feb 7, 2023
@lidel deleted the feat/remove-bitswap-routing-delay branch February 7, 2023