[Bug] failed to update status with content for dashboard #789
Comments
I think we should be careful about what we store in the status field, because etcd limits object storage size: https://etcd.io/docs/v3.3/dev-guide/limit/
If the content is already in the spec (either json, gzipJson, or jsonnet), it shouldn't be duplicated in the status as well. The status content is only intended as a cache for fetched dashboards. Luckily, that should be an easy fix. I don't quite understand why the error would cause excessive load on the API server. Is the operator retrying a lot? As for fetched dashboards, my first thought is to just gzip the content in the status. Dashboard definitions should compress pretty nicely, right?
I believe so; the log was filled with errors relating to updating the status. The CPU usage of the operator pod spiked quite high after the upgrade, ~0.5 cores, while it ran at 0.02 prior to the update. It could likely be reproduced by configuring a dashboard via url that exceeds the etcd object size limit for the current cluster.
I had another look at the original dashboard that was failing to update and found the user had set both json and url in the GrafanaDashboard CR, which would explain why the content was duplicated. Had they only configured it via url, it should have fit into the one CR. I guess the point remains that if the remote dashboard exceeded the max size for etcd then the same issue could be hit, but the gzip solution you proposed sounds like it would solve that.
I've had a quick look through and created a preliminary PR in #790. If you want to give it a spin, there's an image here: I haven't investigated the error-retry CPU usage part yet, though that PR should keep the error from occurring in the first place.
@addreas The docs say the following:
https://github.com/grafana-operator/grafana-operator/blob/master/documentation/dashboards.md#dashboard-properties
Fixed with #790
Describe the bug
Version 4.5 of the operator adds a status field to the Dashboard CRD. For large dashboards, duplicating the `spec.json` content into `status` might exceed the maximum object size in etcd, causing the update to fail. I'm not too familiar with the operator and whether the same status content field is set for dashboards configured via URL, but I assume the error/limitation could be hit if any dashboard exceeds the etcd maximum size, for any of the supported ways of configuring dashboards, if the contents are also stored in the status field.

After updating the operator, this caused excessive load on the Kubernetes api-server, and the top `apiserver_watch_events_sizes_sum` in the cluster was for `grafanadashboards.integreatly.org`.
Grafana operator log
Version
4.5.0
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The fact that it does not update due to limitations on etcd object size (that's my assumption anyway) is not really an issue. It's more that the failure caused excessive load on the cluster apiserver and etcd. If there were a way to handle the error more cleanly, that would be ideal; perhaps even suggest a solution so the user is aware of why it's not working properly. The excessive load on the apiserver caused instability on our control plane and etcd.
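Handling the error more cleanly might mean failing fast instead of hot-retrying. A hedged sketch, assuming etcd's default ~1.5 MiB request limit (`maxObjectSize` and `updateStatus` are hypothetical names, not the operator's actual API):

```go
package main

import (
	"fmt"
	"log"
)

// maxObjectSize approximates etcd's default request size limit (1.5 MiB).
// The real limit is cluster-configurable, so this constant is an assumption.
const maxObjectSize = 1536 * 1024

// updateStatus sketches the suggested behavior: if the serialized object
// would exceed the etcd limit, log a clear message (or emit a Kubernetes
// event) and skip the update, rather than retrying in a tight loop.
func updateStatus(name string, serialized []byte) error {
	if len(serialized) > maxObjectSize {
		log.Printf("dashboard %s: status content (%d bytes) exceeds assumed etcd object size limit (%d bytes); skipping status update",
			name, len(serialized), maxObjectSize)
		return fmt.Errorf("object too large for etcd")
	}
	// ...perform the actual status update here...
	return nil
}

func main() {
	big := make([]byte, 2*1024*1024) // a 2 MiB payload, over the limit
	if err := updateStatus("huge-dashboard", big); err != nil {
		fmt.Println("update skipped:", err)
	}
}
```

Returning a terminal error like this would also address the CPU spike described above, since the reconciler would stop attempting updates that can never succeed.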
If using `gzipJson` is the suggested solution, the operator could propagate the message in the log or generate an event and stop attempting to update the dashboard. I'm not sure if there is a way to catch it before the user applies it; perhaps an admission webhook that checks the size?

Suspect component/Location where the bug might be occurring:
The addition of the status field was introduced in PR #689.
Screenshots
If applicable, add screenshots to help explain your problem.
`apiserver_watch_events_sizes_sum`
Runtime (please complete the following information):