
KubeAPIErrorBudgetBurn Alert Reason #615

Open

d-m opened this issue May 29, 2021 · 5 comments


d-m commented May 29, 2021

Hello all,

I was hoping that someone might be able to help me understand why the KubeAPIErrorBudgetBurn alert (long: 3d, short: 6h) was firing.
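For reference, my understanding of the alert condition for that long=3d / short=6h window pair is roughly the following (an approximation based on the kubernetes-mixin rules shipped with kube-prometheus; the recording-rule names and factors may differ in my version):

```
# Approximate alert condition for the 3d/6h KubeAPIErrorBudgetBurn pair
# (reconstructed from the kubernetes-mixin rules; names and factors are assumptions).
sum(apiserver_request:burnrate3d) > (1.00 * 0.01)
and
sum(apiserver_request:burnrate6h) > (1.00 * 0.01)
```

i.e. both the 3d and the 6h error-budget burn rates have to exceed the threshold at the same time.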

I reviewed the API Server dashboard and noticed that there were large spikes for an entry with no resource label:

[Screenshot: API Server dashboard showing latency spikes for an entry with no resource label]

The dashboard uses the query cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{verb="read"}.
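As far as I can tell, that recording rule boils down to something like this (an approximation taken from the kubernetes-mixin rules; the exact definition may differ in my version):

```
# Approximate definition of the "read" quantile recording rule used by the
# dashboard (quantile 0.99 shown; 0.9 and 0.5 variants exist as well).
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb=~"LIST|GET"}[5m])) without (instance, pod)
)
```

so the resource label should normally survive that aggregation.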

I also read through #464 and the very helpful runbook mentioned in a comment in that ticket. The only example query in the runbook that returned any results was the resource-scoped slow read request query, but it didn't have a resource name either:

[Screenshot: resource-scoped slow read request query results, with an empty resource label]

Any suggestions for next steps would be appreciated.

Thanks.

metalmatze (Member) commented:

You should be able to figure out the slow resource by removing the sum() so that you only have the rate, which won't be aggregated anymore.
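For example, something roughly like this (just a sketch; the selectors and the 1s threshold are taken from the kubernetes-mixin SLO rules and may differ in your version):

```
# Per-series rate of slow (>1s) resource-scoped read requests, with the
# outer sum() removed so the resource/verb labels are preserved.
  rate(apiserver_request_duration_seconds_count{job="apiserver", verb=~"LIST|GET", scope="resource"}[6h])
- ignoring(le)
  rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb=~"LIST|GET", scope="resource", le="1"}[6h])
```

If that produces too many series, wrapping both sides in sum by (resource, verb) is a middle ground that still keeps the interesting labels.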

paulfantom (Member) commented:

We might want to link to https://github.com/prometheus-operator/kube-prometheus/wiki/KubeAPIErrorBudgetBurn somewhere and/or improve it.

mihail-velikov commented:

Hello everyone,

Since last week I have also been getting this alert, and I am pretty much clueless about how to proceed.

Our cluster is deployed using kubespray on:
k8s 1.19.2
OS: Ubuntu 20.04
3 masters - 4 CPU/16 GB RAM
20 workers - 8 CPU/64 GB RAM
All of this is hosted on-premise with VMware as the underlying hypervisor and Calico as the network plugin, with VXLAN and IPinIP disabled. The master nodes are disabled for scheduling and thus run only the cluster components + etcd.

Looking at the API dashboard, I noticed that we have slow write SLI requests:
[Screenshot: API dashboard write SLI panel, 2021-08-26]

The two slow queries seem to be related to "ingresses" and "pods". I checked the API server logs and saw that some "Patch" requests for ingresses take a very long time. Example:
I0826 10:17:07.864302 1 trace.go:205] Trace[162001560]: "Patch" url:/apis/extensions/v1beta1/namespaces/ews-int/ingresses/ews-int-redis-commander-generic,user-agent:kubectl/v1.21.0 (linux/amd64) kubernetes/cb303e6,client:172.17.42.247 (26-Aug-2021 10:16:59.314) (total time: 8549ms):

I suspect that this is related to the old API endpoint "apis/extensions/v1beta1/", and I will double-check that by removing these specific ingresses. I have already checked node CPU/RAM usage on the masters and it is very low. I have also checked the etcd logs and they don't show any obvious issues - no slow queries, disk sync problems, etc.

Regarding the slow pod write requests: I have no idea how to investigate this further besides enabling "profiling" for the API server.
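Is something along these lines a reasonable way to at least break the slow write rate out by resource and verb (a rough sketch on my side; the 1s threshold and the selectors are assumptions taken from the mixin's write SLO rules), or is there a better approach?

```
# Rough sketch: rate of slow (>1s) write requests broken out by resource
# and verb; threshold and selectors assumed from the kubernetes-mixin rules.
  sum by (resource, verb) (rate(apiserver_request_duration_seconds_count{job="apiserver", verb=~"POST|PUT|PATCH|DELETE"}[6h]))
-
  sum by (resource, verb) (rate(apiserver_request_duration_seconds_bucket{job="apiserver", verb=~"POST|PUT|PATCH|DELETE", le="1"}[6h]))
```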

Any hints will be greatly appreciated.

Kind Regards,
Mihail Velikov

mihail-velikov commented:

Update:
It seems that my suspicion was incorrect. We updated all ingresses to the latest API version, but the problem persists.
Additionally, I tried enabling profiling on the API server, but not much more information about the slow requests showed up in the logs.

povilasv (Contributor) commented:

One approach would be to use tracing, if you are running a newer k8s version -> https://kubernetes.io/blog/2021/09/03/api-server-tracing/
