Metric kube_pod_container_status_terminated_reason doesn't detect all events #2153

Closed
lombardialess opened this issue Aug 18, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@lombardialess

What happened:
The metric kube_pod_container_status_terminated_reason is still marked experimental, and has been for a very long time.
This metric could be very useful for monitoring alerts, but it does not detect every "Error" or "OOMKilled" termination; often the event is not collected.

What you expected to happen:
A record of every termination event.

How to reproduce it (as minimally and precisely as possible):
I don't think it is reliably reproducible, but OOMKilled, for example, is very often not reflected in the metric.

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.9.2
  • Kubernetes version (use kubectl version): 1.25.6
  • Cloud provider or hardware configuration: Azure AKS
  • Other info:
@lombardialess lombardialess added the kind/bug Categorizes issue or PR as related to a bug. label Aug 18, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 18, 2023
@dashpole

cc @CatherineF-dev
/triage accepted
/assign @dgrisonnet

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 24, 2023
@dobesv

dobesv commented Aug 30, 2023

Have you tried kube_pod_container_status_last_terminated_reason? It might be that the pod was restarted and the state was moved to the "last" field.

@lombardialess
Author

lombardialess commented Aug 30, 2023

Yes, but I cannot set an alert on that metric: once the pod has been OOM-killed, the last reason is always OOMKilled and the value stays at 1, even if this happens multiple times.
It would be better if kube_pod_container_status_terminated_reason recorded the event, so that an alert could be triggered every time the issue happens.
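
To make the limitation concrete, here is a minimal sketch with made-up scrape samples for a single container. The field names `last_oom` and `restarts` are hypothetical stand-ins for kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} and kube_pod_container_status_restarts_total. It assumes the "last" series is absent before the first kill and then stays at 1.

```python
# Hypothetical scrape samples for one container (values are made up).
# "last_oom" stands in for
# kube_pod_container_status_last_terminated_reason{reason="OOMKilled"};
# None means the series is absent. "restarts" stands in for
# kube_pod_container_status_restarts_total.
samples = [
    {"last_oom": None, "restarts": 0},
    {"last_oom": 1, "restarts": 1},  # first OOM kill
    {"last_oom": 1, "restarts": 1},
    {"last_oom": 1, "restarts": 2},  # second OOM kill
    {"last_oom": 1, "restarts": 2},
]


def change_points(samples, field):
    """Return the indices at which `field` differs from the previous sample."""
    return [
        i
        for i in range(1, len(samples))
        if samples[i][field] != samples[i - 1][field]
    ]


gauge_events = change_points(samples, "last_oom")    # changes in the "last" gauge
restart_events = change_points(samples, "restarts")  # changes in the restart counter
print(gauge_events, restart_events)
```

In this sketch the gauge changes only once, at the first kill, while the restart counter changes at both kills, which is why the "last" metric alone cannot distinguish repeated OOM kills.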

@TadeoCloud

Same problem here. When an OOMKill is detected, kube-state-metrics keeps showing the value "1", even though no more OOMKills have occurred.
image

And if I test the increase function, it only detects a change from null to zero.
image
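
One possible explanation for the null-to-zero behaviour, sketched under simplifying assumptions: Prometheus's increase() needs at least two samples in its range, so a series that appears at 1 and stays at 1 never yields a positive increase. The helper `simple_increase` below is a hypothetical, simplified model (no extrapolation, no counter-reset handling), not the Prometheus implementation.

```python
def simple_increase(window_samples):
    """Simplified model of increase() over one range window.

    Returns None (no result) when fewer than two samples are present,
    otherwise last value minus first value. Extrapolation and counter
    resets are deliberately ignored in this sketch.
    """
    present = [v for v in window_samples if v is not None]
    if len(present) < 2:
        return None
    return present[-1] - present[0]


# Hypothetical scrapes: the series is absent (None), then appears at 1
# after an OOM kill and stays there.
scrapes = [None, None, 1, 1, 1, 1]
window = 3  # number of scrapes covered by the range selector

results = [
    simple_increase(scrapes[max(0, i - window + 1): i + 1])
    for i in range(len(scrapes))
]
print(results)
```

Under these assumptions the result goes from no data to 0 and never becomes positive, so an alert on `increase(...) > 0` against this series would not fire.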

@TadeoCloud

When using kube-state-metrics v2.4.1, the sum value goes from "null" to "1" and then back to "null".
image

@dobesv

dobesv commented Sep 11, 2023

It would be nice to find a good alert query for this.

Maybe when querying the kube_pod_container_status_last_terminated_reason you also need to query whether the pod has restarted recently?

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} and on(namespace, pod, container) (increase(kube_pod_container_status_restarts_total[30m]) > 0)
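
As a rough illustration of what this query expresses, the sketch below models it in plain Python: keep a container whose last termination reason is OOMKilled only if its restart counter also increased within the window. All names and sample data (`firing`, `restart_increase`, the pods) are hypothetical, and the increase is simplified to last minus first sample with no extrapolation or reset handling.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]            # (namespace, pod, container)
Series = List[Tuple[float, float]]    # (timestamp in seconds, value)


def restart_increase(series: Series, now: float, window: float) -> float:
    """Simplified increase(): last minus first sample inside the window."""
    in_window = [v for t, v in series if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def firing(oom_last_reason: Dict[Key, float],
           restarts: Dict[Key, Series],
           now: float,
           window: float = 1800.0) -> List[Key]:
    """Model of `A and on(namespace, pod, container) (increase(B[30m]) > 0)`.

    A left-hand series is kept only when a right-hand series with the
    same key exists and its increase over the window is positive.
    """
    return sorted(
        key
        for key in oom_last_reason
        if key in restarts and restart_increase(restarts[key], now, window) > 0
    )


# Made-up data: api-0 was OOM-killed and restarted recently, worker-0 has a
# stale OOMKilled reason, web-0 restarted but has no OOMKilled series.
oom = {("prod", "api-0", "app"): 1.0, ("prod", "worker-0", "app"): 1.0}
restarts = {
    ("prod", "api-0", "app"): [(0, 3), (600, 3), (1200, 4), (1800, 4)],
    ("prod", "worker-0", "app"): [(0, 2), (600, 2), (1200, 2), (1800, 2)],
    ("prod", "web-0", "app"): [(0, 0), (600, 1), (1200, 1), (1800, 1)],
}

alerts = firing(oom, restarts, now=1800.0)
print(alerts)
```

Only the container that has both an OOMKilled last reason and a recent restart is returned, which is the combination the query above is meant to catch.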

@TadeoCloud

> It would be nice to find a good alert query for this.
>
> Maybe when querying the kube_pod_container_status_last_terminated_reason you also need to query whether the pod has restarted recently?
>
> kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} and on(namespace, pod, container) (increase(kube_pod_container_status_restarts_total[30m]) > 0)

It seems to be working! I'm testing it. Thanks

@CatherineF-dev
Contributor

QQ: is it working?

If you don't have other questions, we will close it.

@TadeoCloud

Confirmed, it's working for me. Thanks!

@dgrisonnet
Member

FWIW, there is also a new metric in the kubelet to better detect OOMKilled containers: kubernetes/kubernetes#108004

Closing since the initial problem seems to have been resolved.
