
Bitswap: better diagnostics and observability on client #209

Closed
hannahhoward opened this issue Mar 16, 2023 · 20 comments · Fixed by #308
Labels
need/triage Needs initial labeling and prioritization

Comments

@hannahhoward
Contributor

When I call GetBlocks on the Bitswap client, I get back a channel and a synchronous error that is almost always nil. The primary way bitswap "errors" is by the context timing out. I don't know if we couldn't find peers with the block, or we couldn't connect to them, or whether, when we did, they responded with DONT_HAVE. So I have no way to track "what went wrong". I also don't have much of a way to track what went right -- if I get a block back, I don't know whom I got it from, the latency, the throughput, anything really.

This makes it very hard for consumers of the Bitswap library to measure performance or debug errors. Right now I am just turning on a bunch of debug logs, which is fine but means I can't do anything with those errors in code, or report them for metrics.

I realize that, as asynchronous code that ultimately talks to a broadcast-style protocol, returning errors or even channels of errors from GetBlocks may be hard. However, callers should at least be provided hooks so they can see more about why their requests are failing.

This hook seems like a good start: https://github.com/ipfs/go-libipfs/blob/main/bitswap/client/client.go#LL78C8-L78C8

For starters, we could have a hook that offered the same information, but also exposed HAVEs and DONT_HAVEs.

Additionally though, it would be nice to be able to observe:

  • FindProvider calls to the DHT and what they returned
  • Message sending errors at the network level

Maybe such hooks exist somewhere? At the top level bitswap there is https://github.com/ipfs/go-libipfs/blob/main/bitswap/tracer/tracer.go#L10 but that seems not to exist for just the client? Even the tracer appears to only track successfully sent messages, not errors.
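
For context, hooking that top-level tracer looks roughly like the sketch below (assuming the two-method Tracer interface and the message accessors Haves() / DontHaves() are shaped the way they currently appear in the repo) -- it does let you see HAVE / DONT_HAVE entries on received messages, but nothing about send failures:

```go
package bswatch

import (
	"fmt"

	bsmsg "github.com/ipfs/boxo/bitswap/message"
	"github.com/libp2p/go-libp2p/core/peer"
)

// loggingTracer implements the two-method Tracer interface from
// bitswap/tracer. It can observe HAVE / DONT_HAVE entries on incoming
// messages, but it never sees send errors.
type loggingTracer struct{}

func (loggingTracer) MessageReceived(p peer.ID, msg bsmsg.BitSwapMessage) {
	fmt.Printf("recv from %s: %d blocks, %d HAVEs, %d DONT_HAVEs\n",
		p, len(msg.Blocks()), len(msg.Haves()), len(msg.DontHaves()))
}

func (loggingTracer) MessageSent(p peer.ID, msg bsmsg.BitSwapMessage) {
	fmt.Printf("sent to %s: %d wantlist entries\n", p, len(msg.Wantlist()))
}

// Registration (assuming the top-level constructor still accepts WithTracer):
//
//	bs := bitswap.New(ctx, network, bstore, bitswap.WithTracer(loggingTracer{}))
```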

@hannahhoward added the need/triage Needs initial labeling and prioritization label Mar 16, 2023
@Jorropo
Contributor

Jorropo commented Mar 16, 2023

Maybe such hooks exist somewhere? At the top level bitswap there is https://github.com/ipfs/go-libipfs/blob/main/bitswap/tracer/tracer.go#L10 but that seems not to exist for just the client? Even the tracer appears to only track successfully sent messages, not errors.

I don't know if that's what you are asking for, but for message tracking you can register a notifier for just the client: https://github.com/ipfs/go-libipfs/blob/e45eb146ba119528076276b4efb4f6123786eb29/bitswap/client/client.go#L72-L76
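
Concretely, registering on those lines looks roughly like this (a sketch; it assumes the BlockReceivedNotifier interface and the WithBlockReceivedNotifier option keep their current names):

```go
package bswatch

import (
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/libp2p/go-libp2p/core/peer"
)

// receiptLogger satisfies the client's BlockReceivedNotifier interface: it is
// called with the sending peer and the blocks that peer delivered.
type receiptLogger struct{}

func (receiptLogger) ReceivedBlocks(from peer.ID, blks []blocks.Block) {
	for _, b := range blks {
		fmt.Printf("got %s (%d bytes) from %s\n", b.Cid(), len(b.RawData()), from)
	}
}

// Registration (assuming the option keeps its current name):
//
//	c := client.New(ctx, network, bstore, client.WithBlockReceivedNotifier(receiptLogger{}))
```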

@BigLep
Contributor

BigLep commented Mar 20, 2023

@hannahhoward : ACK on the request for better observability.

If I understand correctly, message tracking on the client above keeps you unblocked for Rhea. Please share if that isn't the case. Also, I'm happy to get a synchronous meeting set up to figure out how to solve specific blocking use cases.

For things like "FindProvider calls to the DHT and what they returned", I wonder if we should be developing a strategy of external tooling where we can do deeper diagnostics (along the lines of IPFS Check, vole, etc.).

Rewriting the Bitswap client is in the plans for 2023 in light of the operational events and continued pain with this code. Traceability will be an important element. The specifics and timeline haven't been fleshed out, but cross-linking to #73.

@willscott
Contributor

I chatted briefly about this recently.
Being able at the end of a session to learn

  • has any / how many peers have been discovered that might have this cid?
  • how many of those peers are / have been connected to?

would be enough for useful triage / returned error messages

@BigLep
Contributor

BigLep commented Mar 31, 2023

@Jorropo @guseggert : I'm not committing to when we will do this work yet (as a lot is happening right now before IPFS Thing!) but it will be important/useful to get a sense of:

  1. Is this on the path of work we were planning to do anyways, or is this "throwaway" work?
  2. How much effort (ideal developer days) would it take to do this right?

I'm happy to talk in Slack or sync if we need some more back and forth.

I'd like to have a sense of this before 2023-04-03 Rhea meeting. Thank you!

@willscott
Contributor

This came up again this week as something pretty important for Rhea: we need visibility into what blocks are coming from a specific peer.

@BigLep
Contributor

BigLep commented Apr 12, 2023

Ack. Two things here:

  1. Seems worthwhile to get the specific interfaces defined while we're all together in Brussels. I'll put that on my list of things to schedule.
  2. This will be in the 0.21 release cycle, which will start after IPFS Thing.

@Jorropo
Contributor

Jorropo commented Apr 13, 2023

@willscott @hannahhoward

I have a few questions:

  • Do you want a programmatic API or could this be done with OTel tracing? (OTel is easier to set up and view.)

FindProvider calls to the DHT and what they returned

Assuming contexts are correctly passed (I need to test #172 with tracing), this could be done easily by viewing events in the traces. Would that be fine?
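
For illustration, consuming that from the caller's side would look roughly like the following sketch, using the standard OTel Go SDK with a stdout exporter (the assumption being that boxo records its spans/events against whatever context the caller passes in):

```go
package bswatch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// fetchWithTracing wires a tracer provider (stdout exporter here; OTLP/Jaeger
// in practice) and opens a parent span, so any spans or events bitswap records
// against the passed-in context nest under "fetch".
func fetchWithTracing(ctx context.Context, fetch func(context.Context) error) error {
	exp, err := stdouttrace.New()
	if err != nil {
		return err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp) // normally done once at startup
	defer tp.Shutdown(context.Background())

	ctx, span := otel.Tracer("example").Start(ctx, "fetch")
	defer span.End()
	return fetch(ctx)
}
```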

Message sending errors at the network level

You can wrap the bitswap network object to do this. I don't think it makes sense for this to be exposed in the API because GetBlock{,s} doesn't care if a message failed to send: it is eventually consistent and will retry later or with other peers.
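
A rough sketch of that wrapping, assuming the network interface keeps a SendMessage method with roughly this signature (embedding the interface leaves every other method untouched):

```go
package bswatch

import (
	"context"

	bsmsg "github.com/ipfs/boxo/bitswap/message"
	bsnet "github.com/ipfs/boxo/bitswap/network"
	"github.com/libp2p/go-libp2p/core/peer"
)

// observedNetwork reports message-send errors to a callback and delegates
// everything else to the embedded BitSwapNetwork.
type observedNetwork struct {
	bsnet.BitSwapNetwork
	onSendError func(peer.ID, error)
}

func (n *observedNetwork) SendMessage(ctx context.Context, to peer.ID, msg bsmsg.BitSwapMessage) error {
	err := n.BitSwapNetwork.SendMessage(ctx, to, msg)
	if err != nil {
		n.onSendError(to, err)
	}
	return err
}
```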

If you want those errors at the GetBlock{,s} level, then I think the best solution is to errors.Join the network errors with ctx.Err() when ctx.Done() expires.
In other words, keep the eventually consistent behaviour, but also decorate ctx.Err() with the reasons why the request might not have been executed in time when the context.Context expires.

Which of the propositions above would solve your problem?

has any / how many peers have been discovered that might have this cid?

That is easy to do; this already exists in the SessionPeerManager but is not exposed.
I guess this won't be high quality though, because not all downloads from a session go through the session, but this should help with finding bugs.

how many of those peers are / have been connected to?

This is not a metric bitswap keeps track of internally, nor uses in the algorithm, though it wouldn't be very hard to expose.
But I think that exposing metrics which are not used to make decisions about the download would lead to a misrepresentation of how the bitswap client works.

@willscott
Contributor

Do you want a programmatic API or could this be done with OTel tracing?

Hannah's original post mentions "Right now I am just turning on a bunch of debug logs, which is fine but means I can't do anything with those errors in code, or report them for metrics."

At the core we'd like to be able to correlate the peer-block-fetch activity back to sessions / requests.

I think if instrumentation were through tracing, we'd need to set up a custom exporter on each call to retrieve the spans? I don't know if that ends up being practical/efficient -- my sense is that tracing is meant to be sampled for efficiency when run at scale, but I might be wrong.

it is eventually consistent and will retry later or with other peers.

I think I agree that a decorated context error is a reasonable way to surface what happened in timeout cases.
My main question is: how does that work?
Isn't it the caller that will cancel? How does bitswap decorate that error? Per https://pkg.go.dev/context#CancelCauseFunc it looks like errors set when cancelling a child context will not propagate back up to the parent.
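
For reference, a minimal stdlib-only illustration of that behaviour:

```go
package bswatch

import (
	"context"
	"errors"
)

var errNoProviders = errors.New("no providers found")

func causeExample() {
	parent, cancelParent := context.WithCancel(context.Background())
	defer cancelParent()

	child, cancelChild := context.WithCancelCause(parent)
	cancelChild(errNoProviders)

	_ = context.Cause(child)  // errNoProviders: visible on the child and its descendants
	_ = context.Cause(parent) // nil: the parent is not cancelled and carries no cause
}
```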

would lead to a misrepresentation of how the bitswap client works.

Some visibility into connectability is the main thing here: separating the cases into:

  • we couldn't find peers
  • we found a peer, but couldn't connect
  • we connected, but couldn't get the block

@Jorropo
Contributor

Jorropo commented Apr 13, 2023

My main question is: how does that work?
Isn't it the caller that will cancel? How does bitswap decorate that error? Per https://pkg.go.dev/context#CancelCauseFunc it looks like errors set when cancelling a child context will not propagate back up to the parent.

We would use errors.Join or write a custom wrapping error type, returning some errors alongside ctx.Err(), and return this decorated error from GetBlock{,s}:
https://pkg.go.dev/github.com/ipfs/boxo@v0.8.0/exchange#Fetcher.GetBlock

This isn't propagated using the context; it is returned, but we wrap ctx.Err() in the decorating error type before returning it.


This will correctly support https://pkg.go.dev/context#CancelCauseFunc, assuming consumers use errors.Is and errors.As as they should.
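
A minimal sketch of the shape I mean, using only the standard library (the function names here are made up for illustration, not an existing boxo API):

```go
package bswatch

import (
	"context"
	"errors"
	"fmt"
)

// getBlockShape shows the proposal's shape: when the caller's context expires,
// join the context error with whatever network/session errors were gathered,
// so errors.Is(err, context.DeadlineExceeded) still works while the underlying
// causes stay inspectable.
func getBlockShape(ctx context.Context, gathered []error) error {
	select {
	case <-ctx.Done():
		return errors.Join(append([]error{ctx.Err()}, gathered...)...)
	default:
		return nil
	}
}

func demo() {
	ctx, cancel := context.WithTimeout(context.Background(), 0)
	defer cancel()
	<-ctx.Done()

	err := getBlockShape(ctx, []error{errors.New("dial to peer failed")})
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
	fmt.Println(err)                                       // both errors, newline-joined
}
```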

@hannahhoward
Contributor Author

I don't think wrapping a context error is the right approach, because we need metrics information on all requests, not just failures.

For example, in the current cost challenges we had around web3.storage, the metric we needed to track is the total downloaded from web3.storage. But there's no public way to know "when I called GetBlock or GetBlocks, and got data successfully, who actually sent me the block?" -- having been in the bitswap code myself, I know this is knowable, it's just not exposed. It would also be great to know -- who did I ask for the block? How many peers did I send a want-have vs a want-block to?

Ultimately, Lassie wants to aggregate these values into metrics to surface things like:

  • "How much data did we download from this peer in this time period"?
  • "How often did this peer respond successfully to a request for data it claimed to have?"
  • "How quickly does this peer respond (when already connected) ? -- i.e. TTFB"

If we don't want to modify the GetBlock signature, then we could certainly return information through a listener of some sort.

I agree that these are slightly different requests than in the original ticket, because our understanding has evolved over time. One thing that might be useful to look at is the SP Impact section of the Lassie metrics dashboard -- this is what we're able to assemble from GraphSync requests -- what would it take to assemble the same precision of data from Bitswap? But yes, I agree with @BigLep that the best thing would be to discuss this synchronously at IPFS Thing.
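
As a sketch of the kind of aggregation Lassie is after, building on the same block-received hook mentioned above (interface and method names assumed unchanged; the actual Prometheus wiring is omitted):

```go
package bswatch

import (
	"sync"

	blocks "github.com/ipfs/go-block-format"
	"github.com/libp2p/go-libp2p/core/peer"
)

// peerByteCounter accumulates how many bytes each peer has delivered, which is
// enough to answer "how much data did we download from this peer in this time
// period" once exported as a metric.
type peerByteCounter struct {
	mu    sync.Mutex
	bytes map[peer.ID]uint64
}

func newPeerByteCounter() *peerByteCounter {
	return &peerByteCounter{bytes: make(map[peer.ID]uint64)}
}

// ReceivedBlocks lets peerByteCounter be plugged in as the client's
// block-received notifier.
func (c *peerByteCounter) ReceivedBlocks(from peer.ID, blks []blocks.Block) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, b := range blks {
		c.bytes[from] += uint64(len(b.RawData()))
	}
}

// Total reports the bytes received from a single peer so far.
func (c *peerByteCounter) Total(p peer.ID) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.bytes[p]
}
```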

@BigLep
Contributor

BigLep commented Apr 17, 2023

@Jorropo
Contributor

Jorropo commented Apr 18, 2023

I'm going to submit a pull request that creates a custom block type which indicates which peer we downloaded a block from and how long it took from first attempt to successful download, because it's not that hard and can give us some valuable information.
For a long-term solution solving most problems we are looking at a rewrite, either with something like the selfish bitswap client, rapide, or a new lower-level network API which helps with building new clients like rapide or the selfish bitswap client. Technically it's also possible to hack content-routing awareness into the current client, but it would make the code much worse given that it's broadcast and timing oriented right now.
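
For illustration, such a wrapper could be as small as the following (a hypothetical sketch, not necessarily the exact shape the PR will take):

```go
package bswatch

import (
	"time"

	blocks "github.com/ipfs/go-block-format"
	"github.com/libp2p/go-libp2p/core/peer"
)

// tracedBlock wraps a regular block with where it came from and how long the
// fetch took; callers could type-assert blocks coming out of GetBlock{,s} to
// recover this without changing the method signatures.
type tracedBlock struct {
	blocks.Block
	From     peer.ID       // peer that delivered the block
	Duration time.Duration // time from first attempt to successful download
}
```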

@BigLep
Contributor

BigLep commented May 17, 2023

For visibility, kubo/boxo maintainers are planning to do the short term item @Jorropo discussed here as part of this current iteration cycle (0.21). This issue is on the board.

@BigLep
Contributor

BigLep commented Jun 2, 2023

2023-06-02 update on where we're at. This issue hasn't been forgotten. During the week of 2023-06-05, @Jorropo will:

  1. submit a PR on Lassie implementing the required metrics as he understands them (using what already exists in boxo/bitswap).
  2. complete bitswap/client: add basic traceable blocks #308, which isn't critical for the short-term ask.

@Jorropo will do this once he completes the Kubo 0.21 release blocker items on his plate (the first 3 "required" items in ipfs/kubo#9814).


@BigLep
Contributor

BigLep commented Jun 8, 2023

Per 2023-06-08 maintainer conversation, @Jorropo still expects to complete this by end of week (2023-06-09).

@BigLep
Contributor

BigLep commented Jun 20, 2023

Notes from 2023-06-20 maintainer conversation:

There was a communication gap in acknowledging and updating on the date slippage.

What will be delivered: given Lassie having a custom metric setup, a PR won't be made in Lassie. Instead @Jorropo will produce a Prometheus example of bitswap metrics in Kubo.

This has been delayed due to fixing goroutine leaks and deadlocks discovered in the latest Kubo 0.21 RC. @Jorropo is doing the Thunderdome testing to complete the 0.21 release (tracking issue).

The only other item ahead of the bitswap work above is to complete PL EngRes performance review activities.

We expect this to be completed by the end of the week (2023-06-23). @Jorropo will update otherwise.

@BigLep
Contributor

BigLep commented Jul 10, 2023

@Jorropo: please update if I have this incorrect, but my understanding from last week is that this will get done this week (week of 2023-07-10), and will happen after ipfs/kubo#9927.

Jorropo added a commit that referenced this issue Jul 20, 2023
@Jorropo mentioned this issue Jul 20, 2023
@BigLep
Contributor

BigLep commented Jul 20, 2023

@Jorropo has posted a PR: #413

@willscott
Contributor

Is there some background / motivation behind the new approach in #413 vs the previous iteration in #308?
It seemed like we already had what we needed in #308, so I wonder if we are just adding extra work by iterating again.

cc @davidd8

@BigLep
Contributor

BigLep commented Jul 25, 2023

Is there some background / motivation behind the new approach in #413 vs the previous iteration in #308? It seemed like we already had what we needed in #308, so I wonder if we are just adding extra work by iterating again.

Per Slack, "boxo 308 and boxo 413 solve completely different problems". My understanding is that #413 is doing what we said we'd do in this issue here.

Anyway, #308 will happen this week given the feedback that it is the higher priority.
