Understanding the KubeAPIErrorBudgetBurn Alert Reason #464

Closed
mitchellmaler opened this issue Jul 15, 2020 · 5 comments

Comments

@mitchellmaler
Contributor

mitchellmaler commented Jul 15, 2020

We have updated to the latest version of the mixin and are now receiving the KubeAPIErrorBudgetBurn alerts, up to the critical severity. However, the API server seems fine, and it's hard for me to track down the reason for the alert.

Is there a way to track down the exact reason for the alert, since it compares at least two different metrics (slow requests and errors)? There used to be separate Latency and Error alerts, but those seem to have been removed, which makes it difficult to know which one is the issue.

@brancz
Member

brancz commented Jul 15, 2020

If I'm not mistaken, this particular alert is only about errors, meaning that if errors continue at their current rate, the SLO target will not be met.

@metalmatze it sounds like the descriptions of the alerts need to be improved.

@mitchellmaler
Contributor Author

Looking at the recording rule that it uses (https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/rules/kube_apiserver.libsonnet#L23), it compares both slow requests and errors over time. This means that if the error budget alert fires, it could be caused by either of those. There used to be alerts specifically for errors or latency, but now they are combined, which makes it difficult to track down the issue.
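For anyone else trying to follow this, here is a rough PromQL sketch of the shape of that rule for read requests. The actual mixin expression splits things further by verb and scope and uses several windows and latency thresholds; the `1s` cutoff and `1h` window below are purely illustrative:

```promql
# Sketch: fraction of read requests that were either too slow or returned 5xx.
(
  (
    # read requests slower than an illustrative 1s threshold
    sum(rate(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[1h]))
    -
    sum(rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET", le="1"}[1h]))
  )
  +
  # read requests that returned a 5xx error
  sum(rate(apiserver_request_total{verb=~"LIST|GET", code=~"5.."}[1h]))
)
/
sum(rate(apiserver_request_total{verb=~"LIST|GET"}[1h]))
```

So both failure modes end up in the same "unavailable" ratio, which is why the alert alone doesn't tell you which one is burning the budget.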

@povilasv
Contributor

I think the alerts are fine in this case, as they are typical SLO burn-rate alerts: https://landing.google.com/sre/workbook/chapters/alerting-on-slos/#4-alert-on-burn-rate

Have you looked at the API server dashboard? I think it has separate columns for errors and latency, as well as read/write groups, etc. It should make it pretty clear which part of the system is lagging or throwing errors, and from there you can dive deeper into logs / traces / pprofs.
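If you'd rather query it directly than go through the dashboard, something along these lines separates the two failure modes (metric names are the upstream apiserver ones; the window and the 1s threshold are just examples):

```promql
# Overall 5xx error ratio over the last hour
sum(rate(apiserver_request_total{code=~"5.."}[1h]))
/
sum(rate(apiserver_request_total[1h]))
```

```promql
# Share of read requests slower than an example 1s threshold
1 -
  sum(rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET", le="1"}[1h]))
  /
  sum(rate(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[1h]))
```

Whichever of the two is elevated tells you whether to chase errors or latency first.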

@metalmatze
Member

Yes, the alerts fire if there are too many errors or too many slow requests. The reasoning behind this is: if requests return errors (5xx), they fail for the user, but requests that are too slow most likely impact users as well, as controllers and operators might not be able to reconcile in time.

As @povilasv said, you generally want to check the API server dashboard and see if there are too many errors; if not, then your API server is probably too slow to guarantee proper service. In that case, check whether etcd is too slow as well, which is most likely what is happening. Give etcd more CPU or memory (it is often the bottleneck), or do the same for the API server itself (less common).
This should most likely be a runbook, indeed.
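In the meantime, one quick way to check the etcd angle from the API server's perspective is a query along these lines (label names can differ between Kubernetes versions, so treat this as a starting point):

```promql
# p99 of apiserver -> etcd request latency, per operation
histogram_quantile(0.99,
  sum by (le, operation) (
    rate(etcd_request_duration_seconds_bucket[5m])
  )
)
```

If these latencies are high while the 5xx ratio is low, the burn is almost certainly coming from slow requests rather than errors.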

@metalmatze
Member

I've since created a runbook for the alert in the kube-prometheus wiki
https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn
