MM-51700: Drop websocket slow/full warns to debug #24207
Conversation
This reverts commit 2287dff. In addition it drops the `websocket.slow` and `websocket.full` logging to Debug.
@grubbins: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.
@grubbins - Thanks for the PR. The idea behind this was to prevent this log line from flooding the logs and reduce noise for the admin. Is the interval of one minute still too small for you? Have you run into issues with this in production? From your forum post, it looks like you haven't deployed it yet.
> With this change, the logging is even less informative since we discard most of the info and pick one random warning every minute

The user_id will always be the same. Only the event will be sampled at random, which I think is fine. We might even remove the event type to make all lines exactly the same. The event name isn't really important; the user_id is.
> This updated logging, while using fewer log lines, is even less actionable than before. Since site admins really can't act on this, then please let's make it Debug...

As I mentioned earlier, the purpose behind this was to reduce the spammy nature of the log. But it's still actionable. I don't think the frequency of a warn log should decide whether it's more or less actionable: all warn logs should be actionable. In this case, the admin should check the user_id and see whether only a particular user is facing problems or whether it's coming from all users. The difference might mean either that the server is under load or that the issue is limited to a single user's network.

So unless the frequency of one minute is still too high, and your log lines are still flooded with this message, I would still lean toward keeping this as a warning. Let me know your thoughts.
Hey, thanks for the response!

I had overlooked the fact that the filtering is actually applied per user; sorry, I thought the idea was for lastLogTimeSlow to be global across all web connections.

But even if this is rate-limited per user, I have the same complaints:

1. We still lose the information of the count/rate of these events. I personally find it just generally bad logging design to say in the log "one event like this just happened, and there may be more, but I won't tell you about any of them for another 60s". If we do it this way I would very much rather report a count.
2. If there are, say, 20 users with slow connections, I will still have 1200 warnings per hour from this, and I do think it's way too much. Especially because...
3. ...this is just not actionable! When trying to push for a clean, actionable log of warnings and errors, I think we're better off with this as Debug. After all, if for some reason I'm looking into things at this level of detail, I will just enable Debug logs; but the recommended course of action is to ignore these anyway.

If you really want to *warn* a sysadmin about the situation, surely it's better to expose a metric, or perhaps gather hourly stats in the log as Info, or better still, offer a way to query for a user's connection stats?

Hope that clarifies!

Dave
Hey Dave,

We do have a metric for this: the global websocket broadcast buffer. Spikes in it indicate that the system is generating more websocket events than it can push out as a whole. We also have websocket event counters and broadcast counters. If you are on the Enterprise edition, you should be able to get these on a Prometheus/Grafana setup.

As I mentioned before, the recommended course of action is *not* to ignore this if it is seen for a *large* portion of users. If it's limited to one or two users, then it's likely those users' networks that are slow. Following your example, if you have 20 users consistently on a slow connection for a full hour, that's not good! And it's not something we have seen in practice. So if you *do* see something like that, then something is definitely not right, and you'll need to investigate further.

I would recommend that you upgrade to the latest MM version and see how frequently these come up for you. If the one-minute interval is still not right, we can look into tuning it further.
What are the relevant metric names? Are they on the standard Grafana dashboards that Mattermost suggests setting up? (We have all of those.)
They aren't. It's pending on me to add those panels to the v2 monitoring dashboard (https://grafana.com/grafana/dashboards/15582-mattermost-performance-monitoring-v2/). Here they are if you want to add them right away:

1. `sum(mattermost_websocket_broadcast_buffer_size) by (instance)`
2. `sum(rate(mattermost_websocket_broadcasts_total[5m])) by (name)`
3. `sum(rate(mattermost_websocket_event_total{type=~"${websocket_event}"}[5m])) by (type)`

The last one points to a `websocket_event` variable which is declared with the query `topk(20, sum(rate(mattermost_websocket_event_total{namespace="${namespace}"}[${__range_s}s])) by (type))` and filtered with a regex of `.*type="(?!custom_)(.*?)".*`.

I have added these to our dashboard, although I couldn't get the topk in the websocket_event query working (I can plot it as a set of time series, but as a variable it returns no values...).

So: which of these metrics would correspond to the websocket.slow warning for a user? Does that only apply to broadcasts?
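As an aside on the variable returning no values: in Grafana, a Prometheus query variable generally cannot evaluate an arbitrary instant expression like `topk(...)` directly; such expressions usually need to be wrapped in `query_result(...)`, after which the variable's regex extracts the label from the returned sample strings (which is what the `.*type="(?!custom_)(.*?)".*` regex above appears designed for). A sketch of how the variable declaration might look under that assumption; this is not taken from the dashboard itself:

```
# Grafana dashboard variable "websocket_event" (type: Query),
# assuming a Prometheus datasource
Query:
  query_result(topk(20, sum(rate(mattermost_websocket_event_total{namespace="${namespace}"}[${__range_s}s])) by (type)))

# Regex applied to the results, keeping the "type" label and
# excluding custom_ plugin events:
Regex:
  .*type="(?!custom_)(.*?)".*
```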
We log warnings whenever our websocket buffer sizes exceed certain thresholds. The problem is that when this happens, the logs are completely spammed with these lines, making it annoying for the customer. To improve the situation, we use a timer that only gets reset every minute.

https://mattermost.atlassian.net/browse/MM-51700

```release-note
NONE
```
This PR has been automatically labelled "stale" because it hasn't had recent activity.
Summary

Rather than sample `websocket.full|slow` warnings at one per minute, make them Debug logging.

Ticket Link

A proposed change to MM-51700. See also this forum post. Rationale is on the forum post, and also on this comment on the original diff.