relay: don't report errors for failed requests as events but log them instead as `warn!` #4757

thomaseizinger · 2023-10-29T23:05:28Z

Description

As part of developing #4718, the question came up whether we actually need to report the following as events:

ReservationReqAcceptFailed
ReservationReqDenyFailed
CircuitReqDenyFailed
CircuitReqAcceptFailed
CircuitReqOutboundConnectFailed

The first four can only happen as part of a protocol violation of the remote, i.e. sending badly formatted messages. The last one might occur if the remote rejects the circuit we are trying to establish for another node.

Hypothesis

The hypothesis is: These events are unnecessary. Instead of surfacing them to the user, we should log them on an appropriate level.

Events are useful to programmatically tie into the behaviour of libp2p. The question is, what would you want to do about these errors other than logging them? Chances are, because a typical rust-libp2p application is emitting a lot of events, you are not handling all events individually but instead have a e => log::debug!("{e:?}" catch-all branch at the end of your event loop. That doesn't surface errors like this using an appropriate level.

These events are all about errors before we created some kind of resource or entity (a circuit or a reservation). The lack of state changes to the system make it unlikely that you need to perform some kind of corrective action when encountering such an event.

Suggestion

My suggestion is to remove these events and instead log them at the warn level. Most users will enable all warn messages within a system as those mean that nothing bad has happened yet but an increased number of warn logs might indicate a problem down the line.

From the PoV of a relay-node operator, an increased number of warnings for failed reservations or circuits means that for some reason, those other nodes are not using a compatible wire-format or - in the case of CircuitReqOutboundConnectFailed are denying inbound circuits with an increased rate. On a higher level, this means the relay server is not fulfilling its job by connecting other nodes with each other. Whilst not fatal to the system, this will likely lead to connectivity problems in a larger network. Thus, I think the use of warn! is well justified here.

Alternatives

Do you (reader) use these events for something else other than logging them? Please speak up if removing these somehow conflicts with your usecase.

Motivation

Simplify the public API for our users and notify them about potential problems using appropriate log levels.

Current Implementation

We just forward the errors to the user.

Are you planning to do it yourself in a pull request ?

Yes

The text was updated successfully, but these errors were encountered:

mxinden · 2023-10-30T10:08:37Z

I exposed these events in the initial implementation in order to track them via libp2p-metrics. I believe that alerting based on metrics is the preferred way over logs. The former are cheap and structured, the latter are expensive and unstructured.

Short term I would like to keep the events for libp2p-metrics. Long term I would like to integrate metric instrumentation into libp2p-* directly and would thus be fine with removing the events.

thomaseizinger · 2023-10-30T11:26:03Z

I believe that alerting based on metrics is the preferred way over logs. The former are cheap and structured, the latter are expensive and unstructured.

Is that universally true? I think alerting based on metrics is powerful but I am not sure if it should be applied to errors like this? The key difference is that metrics are always sampled whereas logs are precise because every instance shows up.

Following from this, I think alerting via metrics make sense if you already have a repeated reporting of something, like request duration for example. Then, it is easy and effective to create an alert that fires when that duration deviates from the usual measurement.

With (unexpected) errors like this, the base line really should be 0 and depending on your sampling interval and actual rate of errors, a single one for example might not show up in a way that one can reliably create an alert for it. My experience with grafana and prometheus is limited here so happy to be corrected :)

Thus, if the baseline is zero, it is much easier to create an alert when a certain module emits a warn log. If end up adopting tracing, those are structured too.

drHuangMHT · 2023-11-01T03:17:24Z

I believe those events are from the server side, right? Since reservations are handled automatically, I can't really come up with a use case for them other than metrics. And for servers, those errors should be put into logs.

thomaseizinger · 2023-11-01T06:40:58Z

I believe those events are from the server side, right?

Correct, they concern the server only.

thomaseizinger mentioned this issue Oct 29, 2023

feat(relay): don't close connections upon errors in relay server #4718

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

relay: don't report errors for failed requests as events but log them instead as `warn!` #4757

relay: don't report errors for failed requests as events but log them instead as `warn!` #4757

thomaseizinger commented Oct 29, 2023

mxinden commented Oct 30, 2023

thomaseizinger commented Oct 30, 2023 •

edited

Loading

drHuangMHT commented Nov 1, 2023

thomaseizinger commented Nov 1, 2023

relay: don't report errors for failed requests as events but log them instead as warn! #4757

relay: don't report errors for failed requests as events but log them instead as warn! #4757

Comments

thomaseizinger commented Oct 29, 2023

Description

Hypothesis

Suggestion

Alternatives

Motivation

Current Implementation

Are you planning to do it yourself in a pull request ?

mxinden commented Oct 30, 2023

thomaseizinger commented Oct 30, 2023 • edited Loading

drHuangMHT commented Nov 1, 2023

thomaseizinger commented Nov 1, 2023

relay: don't report errors for failed requests as events but log them instead as `warn!` #4757

relay: don't report errors for failed requests as events but log them instead as `warn!` #4757

thomaseizinger commented Oct 30, 2023 •

edited

Loading