Understanding the KubeAPIErrorBudgetBurn Alert Reason #464

Closed
mitchellmaler opened this issue Jul 15, 2020 · 5 comments

Comments

@mitchellmaler
Contributor

mitchellmaler commented Jul 15, 2020

We have updated to the latest version of the mixin and are now receiving the KubeAPIErrorBudgetBurn alerts, up to the critical severity. However, the API server seems fine, and it's hard for me to track down the reason for the alert.

Is there a way to track down the exact reason for the alert, since it compares at least two different metrics (slow requests and errors)? There used to be separate Latency and Error alerts, but those seem to have been removed, which makes it difficult to know which one is the issue.

@brancz
Member

brancz commented Jul 15, 2020

If I'm not mistaken, this particular alert is only about errors, meaning that if errors continue at their current rate, the SLO target will not be met.

@metalmatze it sounds like the descriptions of the alerts need to be improved.

@mitchellmaler
Contributor Author

Looking at the recording rule that it uses (https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/rules/kube_apiserver.libsonnet#L23), it compares both slow requests and errors over time. This means that if the error budget alert fires, it could be caused by either of those. There used to be alerts specifically for errors or latency, but now they are combined, which makes it difficult to track down the issue.
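For anyone else trying to follow this, here is a rough PromQL sketch of the shape of that rule for read requests. The actual mixin expression splits things further by verb and scope and uses several windows and latency thresholds; the `1s` cutoff and `1h` window below are purely illustrative:

```promql
# Sketch: fraction of read requests that were either too slow or returned 5xx.
(
  (
    # read requests slower than an illustrative 1s threshold
    sum(rate(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[1h]))
    -
    sum(rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET", le="1"}[1h]))
  )
  +
  # read requests that returned a 5xx error
  sum(rate(apiserver_request_total{verb=~"LIST|GET", code=~"5.."}[1h]))
)
/
sum(rate(apiserver_request_total{verb=~"LIST|GET"}[1h]))
```

So both failure modes end up in the same "unavailable" ratio, which is why the alert alone doesn't tell you which one is burning the budget.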

@povilasv
Contributor

I think the alerts are fine in this case, as they are typical SLO burn-rate alerts: https://landing.google.com/sre/workbook/chapters/alerting-on-slos/#4-alert-on-burn-rate

Have you looked at the API server dashboard? I think it has separate columns for errors and latency, as well as read/write groups, etc. It should make it pretty clear which part of the system is lagging or throwing errors, and from there you can dive deeper into logs / traces / pprofs.
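If you'd rather query it directly than go through the dashboard, something along these lines separates the two failure modes (metric names are the upstream apiserver ones; the window and the 1s threshold are just examples):

```promql
# Overall 5xx error ratio over the last hour
sum(rate(apiserver_request_total{code=~"5.."}[1h]))
/
sum(rate(apiserver_request_total[1h]))
```

```promql
# Share of read requests slower than an example 1s threshold
1 -
  sum(rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET", le="1"}[1h]))
  /
  sum(rate(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[1h]))
```

Whichever of the two is elevated tells you whether to chase errors or latency first.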

@metalmatze
Member

Yes, the alerts fire if there are too many errors or too many slow requests. The reasoning behind this is: if requests return errors (5xx), they fail for the user, but requests that are too slow most likely impact users as well, as controllers and operators might not be able to reconcile in time.

As @povilasv said, you generally want to check the API server dashboard and see if there are too many errors; if not, then your API server is probably too slow to guarantee proper service. In that case, check whether etcd is too slow as well, which is most likely what is happening. Give etcd more CPU or memory (it is often the bottleneck), or do the same for the API server itself (less common).
This should most likely be a runbook, indeed.
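In the meantime, one quick way to check the etcd angle from the API server's perspective is a query along these lines (label names can differ between Kubernetes versions, so treat this as a starting point):

```promql
# p99 of apiserver -> etcd request latency, per operation
histogram_quantile(0.99,
  sum by (le, operation) (
    rate(etcd_request_duration_seconds_bucket[5m])
  )
)
```

If these latencies are high while the 5xx ratio is low, the burn is almost certainly coming from slow requests rather than errors.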

@metalmatze
Member

I've since created a runbook for the alert in the kube-prometheus wiki
https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn
