Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

relay: don't report errors for failed requests as events but log them instead as warn! #4757

Open
thomaseizinger opened this issue Oct 29, 2023 · 4 comments

Comments

@thomaseizinger
Copy link
Contributor

Description

As part of developing #4718, the question came up whether we actually need to report the following as events:

  • ReservationReqAcceptFailed
  • ReservationReqDenyFailed
  • CircuitReqDenyFailed
  • CircuitReqAcceptFailed
  • CircuitReqOutboundConnectFailed

The first four can only happen as part of a protocol violation of the remote, i.e. sending badly formatted messages. The last one might occur if the remote rejects the circuit we are trying to establish for another node.

Hypothesis

The hypothesis is: These events are unnecessary. Instead of surfacing them to the user, we should log them on an appropriate level.

Events are useful to programmatically tie into the behaviour of libp2p. The question is, what would you want to do about these errors other than logging them? Chances are, because a typical rust-libp2p application is emitting a lot of events, you are not handling all events individually but instead have a e => log::debug!("{e:?}" catch-all branch at the end of your event loop. That doesn't surface errors like this using an appropriate level.

These events are all about errors before we created some kind of resource or entity (a circuit or a reservation). The lack of state changes to the system make it unlikely that you need to perform some kind of corrective action when encountering such an event.

Suggestion

My suggestion is to remove these events and instead log them at the warn level. Most users will enable all warn messages within a system as those mean that nothing bad has happened yet but an increased number of warn logs might indicate a problem down the line.

From the PoV of a relay-node operator, an increased number of warnings for failed reservations or circuits means that for some reason, those other nodes are not using a compatible wire-format or - in the case of CircuitReqOutboundConnectFailed are denying inbound circuits with an increased rate. On a higher level, this means the relay server is not fulfilling its job by connecting other nodes with each other. Whilst not fatal to the system, this will likely lead to connectivity problems in a larger network. Thus, I think the use of warn! is well justified here.

Alternatives

Do you (reader) use these events for something else other than logging them? Please speak up if removing these somehow conflicts with your usecase.

Motivation

Simplify the public API for our users and notify them about potential problems using appropriate log levels.

Current Implementation

We just forward the errors to the user.

Are you planning to do it yourself in a pull request ?

Yes

@mxinden
Copy link
Member

mxinden commented Oct 30, 2023

I exposed these events in the initial implementation in order to track them via libp2p-metrics. I believe that alerting based on metrics is the preferred way over logs. The former are cheap and structured, the latter are expensive and unstructured.

Short term I would like to keep the events for libp2p-metrics. Long term I would like to integrate metric instrumentation into libp2p-* directly and would thus be fine with removing the events.

@thomaseizinger
Copy link
Contributor Author

thomaseizinger commented Oct 30, 2023

I believe that alerting based on metrics is the preferred way over logs. The former are cheap and structured, the latter are expensive and unstructured.

Is that universally true? I think alerting based on metrics is powerful but I am not sure if it should be applied to errors like this? The key difference is that metrics are always sampled whereas logs are precise because every instance shows up.

Following from this, I think alerting via metrics make sense if you already have a repeated reporting of something, like request duration for example. Then, it is easy and effective to create an alert that fires when that duration deviates from the usual measurement.

With (unexpected) errors like this, the base line really should be 0 and depending on your sampling interval and actual rate of errors, a single one for example might not show up in a way that one can reliably create an alert for it. My experience with grafana and prometheus is limited here so happy to be corrected :)

Thus, if the baseline is zero, it is much easier to create an alert when a certain module emits a warn log. If end up adopting tracing, those are structured too.

@drHuangMHT
Copy link
Contributor

I believe those events are from the server side, right? Since reservations are handled automatically, I can't really come up with a use case for them other than metrics. And for servers, those errors should be put into logs.

@thomaseizinger
Copy link
Contributor Author

I believe those events are from the server side, right?

Correct, they concern the server only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants