
"Evicted" pods don't register metrics #1389

Closed
yfried opened this issue Feb 21, 2021 · 26 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@yfried

yfried commented Feb 21, 2021

What happened:
I have many pods with Evicted state:

kubectl get pod -A  | grep Evicted | wc -l
     117

But there is no metric with reason="Evicted". The following query returns no results for the last week:

{job="kube-state-metrics", reason="Evicted"} == 1

What you expected to happen:
I expected the above query to return with metrics of the evicted pods.
How to reproduce it (as minimally and precisely as possible):
Check that evicted pods exist and then query Prometheus as shown above.
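
For example (a minimal sketch; the query above uses a bare label selector, and kube_pod_container_status_terminated_reason is used below only as an example metric name):

kubectl get pod -A | grep Evicted | wc -l
# in Prometheus, this should return one series per evicted container but comes back empty:
kube_pod_container_status_terminated_reason{job="kube-state-metrics", reason="Evicted"} == 1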

Environment:

  • kube-state-metrics version: 1.9.7
  • Kubernetes version (use kubectl version): 1.18
  • Cloud provider or hardware configuration: EKS on AWS
@yfried added the kind/bug label on Feb 21, 2021
@brancz
Member

brancz commented Feb 22, 2021

Which metric did you expect to be there but wasn't?

@yfried
Author

yfried commented Feb 24, 2021

Which metric did you expect to be there but wasn't?

@brancz All {job="kube-state-metrics", reason="Evicted"} metrics are 0 even though there are eviction events

@cfindmypast

cfindmypast commented Mar 17, 2021

Hi, I can confirm we are also experiencing the same issue with kube_pod_container_status_terminated_reason. We have evicted pods in our cluster, but the query never returns the metric. It works fine with reason="Completed" etc., but reason="Evicted" unfortunately does not work.

@shabbskagalwala

shabbskagalwala commented Mar 18, 2021

Can confirm this is happening in our cluster right now too

➜ (⎈ gke:test) tmp  ✗ k get po | grep -i evicted | wc -l
     996

But the metrics never show a value of 1 as far as we can tell. Our current workaround is to use

kube_pod_status_phase{namespace="test",phase="Failed"}
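
As a sketch, the same workaround aggregated per namespace (kube_pod_status_phase carries no reason label, so this counts every Failed pod, not only evicted ones):

sum by (namespace) (kube_pod_status_phase{phase="Failed"}) > 0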

@m1o1

m1o1 commented Apr 8, 2021

Same, but we are using kube_pod_container_status_last_terminated_reason (#344), though since evicted pods seem to stick around and a new one is created, either one should work I guess.

We have some alerts, and reason="OOMKilled" works fine, but reason="Evicted" does not.

@CarpathianUA

Experiencing the same!

@strangeman

We are experiencing the same problem.

@ritheshgm

+1

@qingguee

+1 It looks like we have hit this issue as well.

@yfried
Author

yfried commented May 23, 2021

Hi @brancz
Looks like this is affecting many users.
Did you get a chance to look into this?

@saltbo
Member

saltbo commented Jun 17, 2021

Any progress on this issue?

@sedflix

sedflix commented Jun 23, 2021

+1

@cnelson

cnelson commented Jun 24, 2021

Same issue with reason="Shutdown", which went beta in v1.21.

Looking at a pod on the cluster I see:

status:
  message: Node is shutting, evicting pods
  phase: Failed
  reason: Shutdown

and I see the pods in these metrics:

kube_pod_status_phase{phase="Failed"}

but I do not see these pods when looking at terminated metrics like

kube_pod_container_status_terminated_reason{reason="Shutdown"}
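
A possible alternative sketch, assuming a kube-state-metrics 2.x release where kube_pod_status_reason exists and exposes a Shutdown reason label (not verified here):

kube_pod_status_reason{reason="Shutdown"} == 1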

@slashpai
Contributor

I tried max_over_time, avg_over_time, and sum_over_time for kube_pod_container_status_last_terminated_reason; all queries returned 0 even though pods with reason="Evicted" exist. So this value never seems to be set to 1.

Example query
max_over_time(kube_pod_container_status_last_terminated_reason{reason="Evicted"}[2h])

I will try to figure out why the Evicted reason is not registered in the metric like the other reasons are.

@fpetkovski
Contributor

fpetkovski commented Jun 30, 2021

@yfried This is a bug in the version you are running, where KSM conflates pod and container states. This should be fixed in versions 2.0.0 and later. The metric containing this information is kube_pod_status_reason.

Since 1.x.x has other issues according to the compatibility matrix, are you able to upgrade KSM to one of the 2.x.x versions?

Also keep in mind that only pods can have the Evicted reason in their status. Metrics about pod containers will likely not reflect this information.
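
For example, on a 2.x release a query along these lines should return the evicted pods (a sketch; its aggregated form is confirmed further down in this thread):

kube_pod_status_reason{reason="Evicted"} == 1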

@saltbo
Member

saltbo commented Jun 30, 2021

Note: The v2.0.0-alpha.2+ and master releases of kube-state-metrics work on Kubernetes v1.17 and v1.18 excluding Ingress or CertificateSigningRequest resource metrics. If you require those metrics and are on an older Kubernetes version, use v2.0.0-alpha.1 or v1.9.8 kube-state-metrics release.

Does v1.9.8 fix this problem? And can v1.9.8 be used on k8s v1.16?

@fpetkovski
Contributor

fpetkovski commented Jun 30, 2021

Unfortunately v1.9.8 does not fix this problem, it's only fixed in 2.0.0 and onward. Are you running Kubernetes 1.16?

@saltbo
Member

saltbo commented Jun 30, 2021

Yeah, we are running v1.16, and have no plan to update the k8s version currently...

@fpetkovski
Contributor

This is the commit which adds the kube_pod_status_reason metric that you are looking for: 8f45cd8#diff-9f1d27fdf1cb96ba2fc5bde3d90760eec19e66ee632a5756bf8da5ec28b54ab6.

We can try to backport parts of it to v1.9.8 but I am not sure what the current support policy is for pre 2.0.0 versions. Maybe @tariq1890 or @mrueg could provide their input before we raise the PR.

@mrueg
Member

mrueg commented Jun 30, 2021

My personal thoughts here: I would still consider v1.9.x "community supported", which means no active feature development, backports, etc.
Backports for bug fixes can be contributed through the community and releases happen upon request.

@marcusio888

marcusio888 commented Jul 6, 2021

Good night. I was able to do it with the following metric (kube-state-metrics version 2.0.0):
sum by (namespace) (kube_pod_status_reason{reason="Evicted"}) > 0
I hope it helps you. Additionally, I am leaving the alerting rule below in case you want to use it:

- name: pods-evicted
  rules:
  - alert: PodsEvicted
    annotations:
      description: Pods with evicted status detected.
      summary: Pods with evicted status detected.
    expr: |
      sum by (namespace) (kube_pod_status_reason{reason="Evicted"}) > 0
    for: 15m
    labels:
      severity: warning

@leepatrick-goop

+1

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 6, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 5, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to the /close command in the triage comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
