
feat: configurable ProviderSearchDelay in bitswap #9053

Merged (4 commits, merged into ipfs:master on Oct 13, 2022)
Conversation

@iand (Contributor) commented Jun 22, 2022

Fixes #8807

@iand requested a review from aschmahmann June 22, 2022 16:13
core/node/bitswap.go (review thread: outdated, resolved)
@Jorropo (Contributor) commented Jun 22, 2022

Also to be clear I pulled 20ms out of my ass, I have no idea what it should be.

20ms sounds short because I guess most IPFS nodes take longer than 20ms to respond to a bitswap request (overhead of opening a new stream, ...).

But personally I'm fine with 20ms.

@iand (Contributor, PR author) commented Jun 22, 2022

> Also to be clear I pulled 20ms out of my ass, I have no idea what it should be.

Yep. The original issue suggests removing it, i.e. setting it to 0, but a small step up to 20ms seems sensible. It can always be adjusted in the future.

@aschmahmann (Contributor) left a comment

Thanks for pushing on this. Left some comments asking why the given number was chosen, as well as whether the evidence here indicates we should be exposing this as an Internal.Bitswap option.

@@ -21,6 +22,7 @@ const (
DefaultTaskWorkerCount = 8
DefaultEngineTaskWorkerCount = 8
DefaultMaxOutstandingBytesPerPeer = 1 << 20
DefaultProviderSearchDelay = 20 * time.Millisecond
Review comment (Contributor)

My understanding from #8807 is that there's some analysis being done to decide on the magic number here.

In particular, is there any analysis of the amount of added network traffic that would result from reducing the number here, and should we expose this number in the config file?

If I'm reading the graph in #8807 (comment) correctly, it looks like almost no peers respond in under 100ms and some large fraction (80%?) respond within 1 second.

Dropping to 0ms, 20ms, or anything that's before Bitswap responses come back means we're sending out more requests than we do today. Understanding how many extra requests we're making vs how much time we're saving seems useful. This could also help users who want to tune the trade-off between keeping network chatter lower and waiting longer on Bitswap responses (e.g. gateway vs IPFS Desktop kinds of settings).
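For context, a simplified sketch of what this delay gates (this is not the actual go-bitswap session code; `askConnectedPeers` and `searchProvidersViaDHT` are hypothetical stand-ins): the want goes to already-connected peers first, and a provider (DHT) search only starts if no block has arrived before the delay expires.

```go
package sketch

import (
	"context"
	"time"
)

// fetchBlock is a toy model of the behaviour ProviderSearchDelay gates, not
// the real go-bitswap session code. askConnectedPeers and
// searchProvidersViaDHT are hypothetical stand-ins.
func fetchBlock(
	ctx context.Context,
	want string,
	providerSearchDelay time.Duration,
	askConnectedPeers func(string) <-chan []byte,
	searchProvidersViaDHT func(context.Context, string),
) ([]byte, error) {
	blocks := askConnectedPeers(want) // broadcast the want to connected peers; no DHT traffic yet

	timer := time.NewTimer(providerSearchDelay)
	defer timer.Stop()

	select {
	case b := <-blocks:
		return b, nil // a connected peer had it; the DHT was never queried
	case <-timer.C:
		// Nothing arrived within the delay: fall back to content routing.
		go searchProvidersViaDHT(ctx, want)
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Keep waiting, now for blocks from connected peers or newly found providers.
	select {
	case b := <-blocks:
		return b, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

With the current 1s default the DHT is rarely consulted when a connected peer already has the block; with 0ms or 20ms the two lookups effectively run in parallel, which is the extra chatter being discussed here.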

Review comment (Member)

There will certainly be more chatter, but if the values at the end of #8807 (comment) are at all valid, it's going to be a very small fraction. According to the comment, we're talking about roughly 20k out of 12 million requests (~0.1%) where Bitswap comes back positive. Even if that value is off by a factor of 10, we're still talking about roughly 1% of cases where we make two requests when one would be enough. But we're also saving a minimum of one RTT (cases where we receive DONT_HAVE) and a maximum of 1s (the timeout) in 99% of cases.
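Working the quoted ratio out explicitly (numbers taken from the comment above):

$$\frac{20\,000}{12\,000\,000} \approx 0.17\%$$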

Overall, my view is that if we start in parallel (or almost in parallel) we don't have much to lose. We can also spin up a node (or a few nodes, not connected to each other) and do the measurement study in the real network when this is merged.

The node(s) would:

  • start connections in the network, as they would normally.
  • listen for CIDs requested through Bitswap.
  • ask for the same CIDs at regular intervals thereafter, excluding the peers they received the request from (presumably the peer that asked for a CID will eventually get it, so we want to avoid counting false positives).

@guseggert (Contributor) commented Jul 25, 2022

Has anyone corroborated that 0.1% number? Seems like a pretty critical measurement. The degree to which we make unnecessary reqs is the degree to which we'd further increase the load on DHT servers, which AFAIU is already so high that many folks run in client-only mode, so I'm really loath to further increase DHT server load.

Review comment (Member)

@mrd0ll4r this relates to your previous comment here: #8807 (comment) - any easy way you can instrument this measurement (success/failure of Bitswap discovery) from the infrastructure you already have in place? :)

@mrd0ll4r (Contributor) commented Jul 27, 2022

Hmm, we are working on a paper where we'll be able to do this, but I don't think that's going to happen in the near future. The methodology will be the same as I used for the comment, i.e., manually sending stuff on Bitswap and collecting responses. So no, I can't get this from the passive measurements we're doing at the moment, sorry...

EDIT: also, importantly, we explicitly do not use the usual request mechanism, which uses this timeout. We send stuff on Bitswap directly, without ever triggering a DHT search. So I won't be able to test the impact of this.

About the 20k positive responses/12 million requests number: Keep in mind that

  1. I had to read those off the graph; they're not exact values from the dataset,
  2. I can't ascertain the popularity of the content I tested with in a larger context, i.e., I don't know if other content is spread much more widely, generating more positive responses (or the opposite), and
  3. Those 20k requests were the positive ones, but we did receive a response to almost all requests we sent out. This is because we set the send_dont_have flag, in order to also receive negative responses. This, critically, allows us to not rely on timeouts. (I had a number on what portion of peers respect send_dont_have, but I don't have it at hand. I believe it was the vast majority.)

I suppose something like that would be possible here, too (although that's another issue): set send_dont_have for all requests, and start searching the DHT as soon as some portion of our neighbors has responded without any positive response (roughly as sketched below).

In general, I agree with 20ms being too low as a timeout, because it's unlikely that any meaningful number of responses will have been received. It effectively behaves as 0ms, i.e., do DHT and Bitswap discovery in parallel. If that's the goal, setting it to zero seems less confusing.
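A rough sketch of that send_dont_have-based trigger, purely to make the idea concrete (none of these names exist in go-bitswap, and the 75% threshold in the example is arbitrary):

```go
package main

import "fmt"

// shouldStartProviderSearch is an illustrative take on the idea above: rather
// than a fixed delay, kick off the DHT provider search once some fraction of
// the connected peers we asked has explicitly answered DONT_HAVE. This relies
// on setting send_dont_have so that negative answers arrive without timeouts.
func shouldStartProviderSearch(peersAsked, dontHaves, haves int, threshold float64) bool {
	if haves > 0 {
		return false // some connected peer has the block; no DHT lookup needed
	}
	if peersAsked == 0 {
		return true // nobody to ask over Bitswap; go straight to the DHT
	}
	return float64(dontHaves)/float64(peersAsked) >= threshold
}

func main() {
	// Example: 100 peers asked, 80 answered DONT_HAVE, none answered HAVE;
	// start the DHT search once 75% have responded negatively.
	fmt.Println(shouldStartProviderSearch(100, 80, 0, 0.75)) // true
}
```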

Review comment (Contributor)

> I can't ascertain the popularity of the content I tested with in a larger context, i.e., I don't know if other content is spread much more widely, generating more positive responses (or the opposite).

Totally. Some data processing required here, but I expect that ipfs.io gets a lot of requests for data stored with popular pinning services, some of which they're peered with (more than 0.1%).

This knob trades off resource consumption against performance, with the added caveat that the resource consumption falls both on the DHT server nodes and on the client node itself.

Let's say we take the view that we're not concerned about additional DHT server load (idk that that's a good idea, but let's say). How does a user figure out what they should set the magic number to? Have we tried running this in places like ipfs.io to see what seems optimal at least for that deployment?

While we're figuring this out it seems pretty reasonable to expose a knob in the Internal.Bitswap config, although it should come with some guidance around tuning it.


As an aside, as we introduce additional routing systems (e.g. Reframe delegated routers, which are already supported) we'll likely need to figure out other ways to handle this. For example, maybe it makes sense to wait 500ms before a DHT request but 0s for a delegated routing request to cid.contact?

There's probably a more ideal dynamic system here than hard coding in some numbers, but if there are real performance benefits to be gained we can hard code some numbers and encourage some follow-up work on more dynamic/adaptable systems here.
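To make that aside concrete, here is one hand-wavy way a per-router delay could be layered on top of libp2p's ContentRouting interface. Nothing like this exists in the codebase today; `delayedRouter` is invented for illustration, and the import paths may differ depending on the go-libp2p version.

```go
package sketch

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/routing"
)

// delayedRouter wraps any ContentRouting implementation (DHT, delegated
// router, ...) and holds back provider lookups by a per-router delay, so the
// DHT could e.g. wait 500ms while a delegated router is queried immediately.
type delayedRouter struct {
	routing.ContentRouting
	delay time.Duration
}

func (r delayedRouter) FindProvidersAsync(ctx context.Context, c cid.Cid, count int) <-chan peer.AddrInfo {
	out := make(chan peer.AddrInfo)
	go func() {
		defer close(out)
		// Wait out this router's delay (or give up if the caller cancels).
		select {
		case <-time.After(r.delay):
		case <-ctx.Done():
			return
		}
		// Then forward results from the wrapped router.
		for p := range r.ContentRouting.FindProvidersAsync(ctx, c, count) {
			select {
			case out <- p:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```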

Review comment (Contributor, PR author)

Added ProviderSearchDelay to Internal.Bitswap config but left the default at the low value of 20ms. We could restore that to 1s and allow experiments to test the effect of changing it before enshrining a new default in code.
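For anyone following along, a minimal sketch of how the new knob plugs together, assuming the config field reads as a duration and falls back to the default when unset (go-bitswap already takes a ProviderSearchDelay option; the exact wiring in core/node/bitswap.go may differ):

```go
package sketch

import (
	"time"

	"github.com/ipfs/go-bitswap"
)

// Default value; this PR ultimately leaves it at 1s and only exposes the knob.
const DefaultProviderSearchDelay = 1000 * time.Millisecond

// bitswapOpts shows roughly how a configured Internal.Bitswap.ProviderSearchDelay
// (zero meaning "unset") would be turned into a go-bitswap option.
func bitswapOpts(configuredDelay time.Duration) []bitswap.Option {
	delay := configuredDelay
	if delay == 0 {
		delay = DefaultProviderSearchDelay
	}
	return []bitswap.Option{
		bitswap.ProviderSearchDelay(delay),
	}
}
```

Users would then override it from the config file, e.g. with something like `ipfs config --json Internal.Bitswap.ProviderSearchDelay '"500ms"'` (the exact value format depends on how the field ends up being serialized).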

Review comment (Member)

Would that ^ help get results from more existing nodes @guillaumemichel?

As a note, we are starting the Bitswap Discovery Success Rate study this week and expect to have first results in the next couple of weeks. Work is being tracked here: https://pl-strflt.notion.site/Effectiveness-of-Bitswap-Discovery-Process-aab515ebc1774bee8193da415bfb98f4 - any input more than welcome.

Review comment (Contributor)

For now, can we expose this config but leave the default alone, at least until we review @guillaumemichel's results?

Review comment (Contributor, PR author)

Yes. Will change it to 1s.

@BigLep (Contributor) commented Sep 1, 2022

@guseggert : are you able to clarify what next steps are needed here?

I think if we aren't touching the defaults but we make this configurable we should be fine to ship.

We can come back later to adjust the default after more of the experiments have run.

@yiannisbot (Member)

Quick update from ProbeLab side: we've started looking into this in more detail and we expect some first results to come out this week. @guillaumemichel to provide a more specific update on current progress and experiments targeted. All the action is happening here if anyone wants to contribute: https://pl-strflt.notion.site/Effectiveness-of-Bitswap-Discovery-Process-aab515ebc1774bee8193da415bfb98f4

I would personally definitely support proceeding with a config setting and a recommendation until we have more stable results.

@guseggert (Contributor)

> @guseggert : are you able to clarify what next steps are needed here?
>
> I think if we aren't touching the defaults but we make this configurable we should be fine to ship.
>
> We can come back later to adjust the default after more of the experiments have run.

Yes this sounds good to me too, let's just expose the values in config and leave the defaults alone, to give us some breathing room to understand the network impact of adjusting them.

@iand (Contributor, PR author) commented Sep 1, 2022

Will also be testing this in Thunderdome to compare the following values: 0ms, 20ms, 50ms, 100ms, 200ms, 500ms, 750ms, 1000ms

probe-lab/thunderdome#29

@BigLep (Contributor) commented Sep 22, 2022

@Jorropo will take care of getting this merged (but we'll drop the change to the default).

@BigLep requested a review from Jorropo September 22, 2022 16:16
@@ -21,6 +22,7 @@ const (
DefaultTaskWorkerCount = 8
DefaultEngineTaskWorkerCount = 8
DefaultMaxOutstandingBytesPerPeer = 1 << 20
DefaultProviderSearchDelay = 20 * time.Millisecond
@Jorropo (Contributor) commented Sep 22, 2022

Can you revert this to 1 second please? Let's try having this as an option before changing defaults.

@BigLep mentioned this pull request Sep 22, 2022
@iand (Contributor, PR author) commented Sep 26, 2022

@Jorropo changed default back to 1s

@Jorropo (Contributor) commented Sep 26, 2022

Thx, I will review this after the code freeze.

@Jorropo self-requested a review September 26, 2022 16:07
@BigLep (Contributor) commented Sep 26, 2022

@Jorropo : we can talk during the 2022-09-27 triage, but if this is exposing a configurable parameter and not changing any defaults, I can see a case for getting this into 0.16 (even though it wasn't in the RC).

@BigLep (Contributor) commented Sep 27, 2022

2022-09-27 conversation: this will land in 0.17

@BigLep removed the request for review from guseggert October 4, 2022 16:39
@Jorropo (Contributor) left a comment

LGTM 🎉

@BigLep (Contributor) commented Oct 12, 2022

@Jorropo : are you going to merge? I didn't think we were waiting on anyone else.

@Jorropo merged commit 066a0b9 into ipfs:master Oct 13, 2022
@lidel changed the title from "feat: remove provider delay interval in bitswap" to "feat: configurable ProviderSearchDelay in bitswap" Jan 4, 2023
Successfully merging this pull request may close these issues: Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap (#8807)

7 participants