
MM-51700: Drop websocket slow/full warns to debug #24207

Closed
wants to merge 1 commit

Conversation


grubbins commented Aug 8, 2023

Summary

Rather than sampling the `websocket.full`/`websocket.slow` warnings at one per minute, make them debug logging.

Ticket Link

A proposed change to MM-51700.

See also this forum post.

The rationale is in the forum post, and also in this comment on the original diff.

This reverts commit 2287dff.

In addition, it drops the `websocket.slow` and `websocket.full` logging to Debug.
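
For illustration, a minimal, self-contained sketch of the kind of change proposed, using the standard library's log/slog as a stand-in for Mattermost's mlog; the message text and field names are illustrative, not the actual call sites:

```go
package main

// Sketch only: log the "websocket.slow" / "websocket.full" conditions at
// Debug instead of Warn. log/slog stands in for Mattermost's mlog package;
// userID and eventType are invented names for the demo.

import (
	"log/slog"
	"os"
)

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelDebug, // show Debug output for the demo
	}))

	userID, eventType := "u123", "posted"

	// Previously (after MM-51700): a rate-limited Warn, emitted at most once per minute.
	// logger.Warn("websocket.slow: dropping message", "user_id", userID, "type", eventType)

	// With this PR: every occurrence is logged, but only at Debug level, so it
	// stays out of the admin's face while remaining available for diagnosis.
	logger.Debug("websocket.slow: dropping message", "user_id", userID, "type", eventType)
}
```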
@mm-cloud-bot

@grubbins: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

I understand the commands that are listed here

@mattermost-build
Contributor

Hello @grubbins,

Thanks for your pull request! A Core Committer will review your pull request soon. For code contributions, you can learn more about the review process here.

Member

agnivade commented Aug 8, 2023

@grubbins - Thanks for the PR. The idea behind this was to prevent this log line from flooding the logs and reduce noise for the admin. Is the interval of one minute still too small for you? Have you run into issues with this in production? From your forum post, it looks like you haven't deployed it yet.

> With this change, the logging is even less informative since we discard most of the info and pick one random warning every minute.

The user_id will always be the same. Only the event will be sampled at random, which I think is fine. We might even remove the event type to make all lines exactly the same. The event name isn't really important. The user_id is.

> This updated logging, while using fewer log lines, is even less actionable than before. Since site admins really can't act on this, then please let's make it Debug...

As I mentioned earlier, the purpose behind this was to reduce the spammy nature of the log. But it's still actionable. I don't think the frequency of a warn log should decide whether it's more or less actionable. All warn logs should be actionable. In this case, the admin should check the user_id and see if it's only a particular user who is facing problems or whether it's coming from all users. The difference might indicate either that the server is under load or that the issue is limited to a single user's network.
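
For illustration, a rough sketch of doing that check, assuming JSON-formatted server logs with `msg` and `user_id` fields (which may not match your log settings):

```go
package main

// Illustrative only: count how many "websocket.slow"/"websocket.full" warnings
// each user_id produced, reading JSON log lines from stdin. The "msg" and
// "user_id" field names are assumptions about the log format.

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	counts := map[string]int{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for sc.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip non-JSON lines
		}
		msg, _ := entry["msg"].(string)
		if !strings.Contains(msg, "websocket.slow") && !strings.Contains(msg, "websocket.full") {
			continue
		}
		if uid, ok := entry["user_id"].(string); ok {
			counts[uid]++
		}
	}
	for uid, n := range counts {
		fmt.Printf("%s\t%d\n", uid, n) // one user per line: concentrated or spread out?
	}
}
```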

So unless the one-per-minute frequency is still too high and your logs are still being flooded with this message, I would still lean towards keeping this as a warning.

Let me know your thoughts.

Author

grubbins commented Aug 8, 2023 via email

Member

agnivade commented Aug 8, 2023

Hey Dave,

We do have a metric for this, and that's the global websocket broadcast buffer. Spikes in it indicate that the system is generating more websocket events than it can push out as a whole. We also have websocket event counters and broadcast counters. If you are on the Enterprise Edition, you should be able to get these on a Prometheus/Grafana setup.

As I mentioned before, the recommended course of action is not to ignore this if it's seen for a large portion of users. If it's limited to one or two users, then it's likely those users' networks that are slow. Following your example, if you have 20 users consistently on a slow connection for a full hour, that's not good! And it's not something we have seen in practice. So if you do see something like that, then something is definitely not right, and you'll need to investigate further.

I would recommend that you upgrade to the latest MM version and see how frequently these come up for you. If the 1-minute interval is still not right, we can look into tuning it further.

Author

grubbins commented Aug 9, 2023 via email

@agnivade
Member

They aren't. It's still on me to add those panels to the v2 monitoring dashboard.

Here they are if you want to add them right away:

  1. `sum(mattermost_websocket_broadcast_buffer_size) by (instance)`
  2. `sum(rate(mattermost_websocket_broadcasts_total[5m])) by (name)`
  3. `sum(rate(mattermost_websocket_event_total{type=~"${websocket_event}"}[5m])) by (type)`

The last one points to a `websocket_event` variable, which is declared with the query `topk(20,sum(rate(mattermost_websocket_event_total{namespace="${namespace}"}[${__range_s}s])) by (type))` and filtered with the regex `.*type="(?!custom_)(.*?)".*`.
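
If it helps, a rough sketch of running the first of these queries from Go with the official Prometheus client library (the server address is a placeholder, not part of any Mattermost setup):

```go
package main

// Illustrative only: evaluate one of the PromQL queries above against a
// Prometheus server using github.com/prometheus/client_golang's API client.

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"}) // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	const query = `sum(mattermost_websocket_broadcast_buffer_size) by (instance)`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Println(result) // per-instance broadcast buffer sizes; spikes suggest the server can't keep up
}
```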

Author

grubbins commented Aug 15, 2023 via email

agnivade referenced this pull request Aug 23, 2023
We log warnings whenever our websocket buffer sizes exceed
certain thresholds. The problem with that is, when this happens,
the logs are completely spammed with these lines,
making it annoying for the customer.

To improve the situation, we use a timer that only gets reset
every minute.

https://mattermost.atlassian.net/browse/MM-51700

```release-note
NONE
```
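
For illustration, a rough, self-contained sketch of the once-per-minute sampling the referenced commit describes, using a timestamp check rather than the actual Mattermost timer, with invented names:

```go
package main

// Sketch only, not the Mattermost implementation: emit a warning at most once
// per interval; occurrences inside the window are suppressed.

import (
	"log"
	"time"
)

type sampledLogger struct {
	interval time.Duration
	last     time.Time
}

// warnOncePer logs msg at most once per l.interval; other calls are dropped.
func (l *sampledLogger) warnOncePer(msg string) {
	now := time.Now()
	if now.Sub(l.last) < l.interval {
		return // suppressed: still inside the sampling window
	}
	l.last = now
	log.Printf("WARN %s", msg)
}

func main() {
	l := &sampledLogger{interval: time.Minute}
	for i := 0; i < 5; i++ {
		l.warnOncePer("websocket.full: discarding message") // only the first call prints
	}
}
```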
@mattermost-build
Contributor

This PR has been automatically labelled "stale" because it hasn't had recent activity.
A core team member will check in on the status of the PR to help with questions.
Thank you for your contribution!

@agnivade
Member

Hey @grubbins - I have a strong feeling that most of the logs were spuriously generated due to 0e75982. This is available from 8.1.1 onwards. I'd suggest upgrading to that version and reporting back if you still see a lot of these logs.

I will go ahead and close this out.

@agnivade agnivade closed this Aug 26, 2023