-
Notifications
You must be signed in to change notification settings - Fork 950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
relay: don't report errors for failed requests as events but log them instead as warn!
#4757
Comments
I exposed these events in the initial implementation in order to track them via Short term I would like to keep the events for |
Is that universally true? I think alerting based on metrics is powerful but I am not sure if it should be applied to errors like this? The key difference is that metrics are always sampled whereas logs are precise because every instance shows up. Following from this, I think alerting via metrics make sense if you already have a repeated reporting of something, like request duration for example. Then, it is easy and effective to create an alert that fires when that duration deviates from the usual measurement. With (unexpected) errors like this, the base line really should be 0 and depending on your sampling interval and actual rate of errors, a single one for example might not show up in a way that one can reliably create an alert for it. My experience with grafana and prometheus is limited here so happy to be corrected :) Thus, if the baseline is zero, it is much easier to create an alert when a certain module emits a |
I believe those events are from the server side, right? Since reservations are handled automatically, I can't really come up with a use case for them other than metrics. And for servers, those errors should be put into logs. |
Correct, they concern the server only. |
Description
As part of developing #4718, the question came up whether we actually need to report the following as events:
ReservationReqAcceptFailed
ReservationReqDenyFailed
CircuitReqDenyFailed
CircuitReqAcceptFailed
CircuitReqOutboundConnectFailed
The first four can only happen as part of a protocol violation of the remote, i.e. sending badly formatted messages. The last one might occur if the remote rejects the circuit we are trying to establish for another node.
Hypothesis
The hypothesis is: These events are unnecessary. Instead of surfacing them to the user, we should log them on an appropriate level.
Events are useful to programmatically tie into the behaviour of libp2p. The question is, what would you want to do about these errors other than logging them? Chances are, because a typical rust-libp2p application is emitting a lot of events, you are not handling all events individually but instead have a
e => log::debug!("{e:?}"
catch-all branch at the end of your event loop. That doesn't surface errors like this using an appropriate level.These events are all about errors before we created some kind of resource or entity (a circuit or a reservation). The lack of state changes to the system make it unlikely that you need to perform some kind of corrective action when encountering such an event.
Suggestion
My suggestion is to remove these events and instead log them at the
warn
level. Most users will enable allwarn
messages within a system as those mean that nothing bad has happened yet but an increased number ofwarn
logs might indicate a problem down the line.From the PoV of a relay-node operator, an increased number of warnings for failed reservations or circuits means that for some reason, those other nodes are not using a compatible wire-format or - in the case of
CircuitReqOutboundConnectFailed
are denying inbound circuits with an increased rate. On a higher level, this means the relay server is not fulfilling its job by connecting other nodes with each other. Whilst not fatal to the system, this will likely lead to connectivity problems in a larger network. Thus, I think the use ofwarn!
is well justified here.Alternatives
Do you (reader) use these events for something else other than logging them? Please speak up if removing these somehow conflicts with your usecase.
Motivation
Simplify the public API for our users and notify them about potential problems using appropriate log levels.
Current Implementation
We just forward the errors to the user.
Are you planning to do it yourself in a pull request ?
Yes
The text was updated successfully, but these errors were encountered: