Understanding the KubeAPIErrorBudgetBurn Alert Reason #464
Comments
If I'm not mistaken, this particular alert is only about errors, meaning that if errors keep occurring at their current rate, the SLO target will not be met. @metalmatze sounds like the descriptions of the alerts need to be improved.
Looking at the recording rule it uses, https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/rules/kube_apiserver.libsonnet#L23, it compares both slow requests and errors over time. This means that if the error budget alert fires, it could be caused by either of those. There used to be separate alerts for errors and for latency, but now they are combined, which makes it difficult to track down the issue.
I think the alerts are fine in this case, as they are typical SLO alerts: https://landing.google.com/sre/workbook/chapters/alerting-on-slos/#4-alert-on-burn-rate. Have you looked at the API server dashboard? I think it has separate columns for errors and latency, as well as write/read groups, etc. It should be pretty clear which part of the system is lagging or throwing errors, and from there you need to dive deeper into logs / traces / pprofs.
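For context, the burn-rate thresholds in the SRE workbook chapter linked above come from simple arithmetic. Here is a minimal sketch of that math; the function name and example values are illustrative, not the mixin's exact configuration:

```python
# Sketch of the burn-rate math from the SRE workbook, not the mixin's
# actual rules. All names and example values here are illustrative.

def burn_rate_threshold(window_hours: float,
                        budget_fraction: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the error budget for a
    `period_hours`-long SLO period is consumed within `window_hours`."""
    return budget_fraction * period_hours / window_hours

# Consuming 2% of a 30-day budget within 1 hour implies a 14.4x burn rate,
# the classic "page immediately" threshold from the workbook.
print(burn_rate_threshold(window_hours=1, budget_fraction=0.02))
```

The point is that the alert fires on how fast the budget is burning, not on which component (errors or latency) is burning it, which is why the dashboard is needed to attribute the cause.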
Yes, the alerts fire if there are too many errors or too many slow requests. The reasoning behind this is: requests that return errors (5xx) fail for the user, but requests that are too slow are likely to impact users too, as controllers and operators might not be able to reconcile in time. As @povilasv said, you generally want to check the APIServer dashboard and see if there are too many errors; if not, then your APIServer is probably too slow to guarantee proper service. In that case, check etcd and see if it's too slow as well, which is most likely what is happening. Give etcd more CPU or memory, which is often the bottleneck, or give more to the APIServer itself (less common).
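If you suspect etcd, one way to confirm is to look at etcd request latency as observed by the API server. A query along these lines should help (the metric name and labels are my assumption and may differ between Kubernetes versions):

```promql
# 99th percentile etcd request latency per operation, as seen by the apiserver
histogram_quantile(0.99,
  sum(rate(etcd_request_duration_seconds_bucket{job="apiserver"}[5m])) by (le, operation)
)
```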
I've since created a runbook for the alert in the kube-prometheus wiki |
We have updated to the latest version of the mixin and now we are receiving the KubeAPIErrorBudgetBurn alerts, up to the critical one; however, the API server seems fine and it's hard for me to track down the reason for the alert.
Is there a way to determine the exact cause of the alert, since it compares at least two different metrics (slow requests and errors)? There used to be separate Latency and Error alerts, but those seem to have been removed, which now makes it difficult to know which is the problem.
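One way to tell which side is burning the budget is to query the two components separately. Roughly (assuming the post-1.14 metric names `apiserver_request_total` and `apiserver_request_duration_seconds`, and a 1s latency threshold; adjust both to match your cluster and SLO):

```promql
# Fraction of API server requests returning 5xx over the last hour
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[1h]))
/
sum(rate(apiserver_request_total{job="apiserver"}[1h]))

# Fraction of read requests slower than 1s over the last hour
1 - (
  sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",le="1"}[1h]))
  /
  sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1h]))
)
```

Whichever of the two ratios dominates tells you whether to chase errors or latency first.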