MM-51700: Drop websocket slow/full warns to debug #24207
Conversation
This reverts commit 2287dff. In addition it drops the `websocket.slow` and `websocket.full` logging to Debug.
@grubbins: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.
@grubbins - Thanks for the PR. The idea behind this was to prevent this log line from flooding the logs and reduce noise for the admin. Is the interval of one minute still too small for you? Have you run into issues with this in production? From your forum post, it looks like you haven't deployed it yet.
> With this change, the logging is even less informative since we discard most of the info and pick one random warning every minute

The user_id will always be the same. Only the event will be sampled at random, which I think is fine. We might even remove the event type to make all lines exactly the same. The event name isn't really important; the user_id is.
> This updated logging, while using fewer log lines, is even less actionable than before. Since site admins really can't act on this, then please let's make it Debug...

As I mentioned earlier, the purpose behind this was to reduce the spammy nature of the log. But it's still actionable. I don't think the frequency of a warn log should decide whether it's more or less actionable: all warn logs should be actionable. In this case, the admin should check the user_id and see whether only a particular user is facing problems or whether it's coming from all users. The difference might mean either that the server is under load or that the issue is limited to a single user's network.

So unless the frequency of one minute is still too high, and your log lines are still flooded with this message, I would still lean toward keeping this as a warning. Let me know your thoughts.
Hey, thanks for the response!

I had overlooked the fact that the filtering is actually applied per user; sorry, I thought the idea was for lastLogTimeSlow to be global across all web connections.

But even if this is rate-limited per user, I have the same complaints:

1. We still lose the information of the count/rate of these events. I personally find it just generally bad logging design to say in the log "one event like this just happened, and there may be more, but I won't tell you about any of them for another 60s". If we do it this way I would very much rather report a count.
2. If there are, say, 20 users with slow connections, I will still have 1200 warnings per hour from this, and I do think it's way too much. Especially because...
3. ...this is just not actionable! When trying to push for a clean, actionable log of warnings and errors, I think we're better off with this as Debug. After all, if for some reason I'm looking into things at this level of detail, I will just enable Debug logs; but the recommended course of action is to ignore these anyway.

If you really want to *warn* a sysadmin about the situation, surely it's better to expose a metric, or perhaps gather hourly stats in the log as Info, or better still, offer a way to query for a user's connection stats?

Hope that clarifies!

Dave
Hey Dave,

We do have a metric for this: the global websocket broadcast buffer. Spikes in it indicate that the system is generating more websocket events than it can push out as a whole. We also have websocket event counters and broadcast counters. If you are on the Enterprise edition, you should be able to get these on a Prometheus/Grafana setup.

As I mentioned before, the recommended course of action is *not* to ignore this if it is seen for a *large* portion of users. If it's limited to one or two users, then it's likely those users' networks that are slow. Following your example, if you have 20 users consistently on a slow connection for a full hour, that's not good! And it's not something we have seen in practice. So if you *do* see something like that, then something is definitely not right, and you'll need to investigate further.

I would recommend that you upgrade to the latest MM version and see how frequently these come up for you. If the one-minute interval is still not right, we can look into tuning it further.
What are the relevant metric names? Are they on the standard Grafana dashboards that Mattermost suggests setting up? (We have all of those.)
They aren't. It's pending on me to add those panels to the v2 monitoring dashboard (https://grafana.com/grafana/dashboards/15582-mattermost-performance-monitoring-v2/). Here they are if you want to add them right away:

1. `sum(mattermost_websocket_broadcast_buffer_size) by (instance)`
2. `sum(rate(mattermost_websocket_broadcasts_total[5m])) by (name)`
3. `sum(rate(mattermost_websocket_event_total{type=~"${websocket_event}"}[5m])) by (type)`

The last one points to a `websocket_event` variable which is declared with the query `topk(20, sum(rate(mattermost_websocket_event_total{namespace="${namespace}"}[${__range_s}s])) by (type))` and filtered with a regex of `.*type="(?!custom_)(.*?)".*`.

I have added these to our dashboard, although I couldn't get the topk in the websocket_event query working (I can plot it as a set of time series, but as a variable it returns no values...).

So: which of these metrics would correspond to the websocket.slow warning for a user? Does that only apply to broadcasts?
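As an aside on the variable returning no values: in Grafana, a Prometheus query variable generally cannot evaluate an arbitrary instant expression like `topk(...)` directly; such expressions usually need to be wrapped in `query_result(...)`, after which the variable's regex extracts the label from the returned sample strings (which is what the `.*type="(?!custom_)(.*?)".*` regex above appears designed for). A sketch of how the variable declaration might look under that assumption; this is not taken from the dashboard itself:

```
# Grafana dashboard variable "websocket_event" (type: Query),
# assuming a Prometheus datasource
Query:
  query_result(topk(20, sum(rate(mattermost_websocket_event_total{namespace="${namespace}"}[${__range_s}s])) by (type)))

# Regex applied to the results, keeping the "type" label and
# excluding custom_ plugin events:
Regex:
  .*type="(?!custom_)(.*?)".*
```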
We log warnings whenever our websocket buffer sizes exceed certain thresholds. The problem is that when this happens, the logs are completely spammed with these lines, making it annoying for the customer. To improve the situation, we use a timer that only gets reset every minute.

https://mattermost.atlassian.net/browse/MM-51700

```release-note
NONE
```
This PR has been automatically labelled "stale" because it hasn't had recent activity.
Summary

Rather than sample `websocket.full|slow` warnings at one per minute, make them Debug logging.

Ticket Link

A proposed change to MM-51700. See also this forum post. Rationale is on the forum post, and also on this comment on the original diff.