Auto-delete failed Pods #99986
Comments
/sig node
/cc
Additional consideration here: have different thresholds for different failure reasons. Pods terminated because of graceful node shutdown will likely be less interesting for troubleshooting than pods that failed on their own. See #102820
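To make the per-reason idea concrete, here is a minimal client-go sketch that buckets failed pods by status reason. The `Terminated` reason string is an assumption: the exact reason set by graceful node shutdown has varied across kubelet versions, so verify it against your cluster before relying on it.

```go
// Minimal sketch: list failed pods and bucket them by status reason, so that
// shutdown victims and "real" failures could get different thresholds.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Field selectors on status.phase are supported for pods, so the server
	// only returns pods that are already in the Failed phase.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{FieldSelector: "status.phase=Failed"})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		switch pod.Status.Reason {
		case "Terminated": // assumed graceful-node-shutdown reason; verify per version
			fmt.Printf("shutdown victim: %s/%s\n", pod.Namespace, pod.Name)
		default:
			fmt.Printf("other failure (%q): %s/%s\n", pod.Status.Reason, pod.Namespace, pod.Name)
		}
	}
}
```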
/triage accepted
/remove-kind support
I don't think this is a support request; it's a feature request for making this threshold configurable.
/kind feature
The threshold is actually configurable (the kube-controller-manager --terminated-pod-gc-threshold flag). The problem is that setting it low potentially makes Jobs unusable.
Not sure if this is the best place, but we're being affected by this topic and I'd like to add some context. GKE users making use of preemptible nodes, which have a lifetime of ~24 hours, are affected by the change in graceful node shutdown behavior introduced in 1.20.5-gke.500: pods scheduled on preemptible nodes that have been shut down do not get deleted. As issue #102820 notes, I was indeed confused by this behavior, since the pods and nodes are working as intended, yet the pods are considered "failed". This is also causing such nodes themselves to not get deleted, and I've seen new pods get scheduled onto them as well.
I'm in the same boat as @schellj, besides the …
Note that if we deliver a solution just for shutdown pods in Kubernetes, it would arrive in 1.24 at the earliest (it's a new feature and needs to go through the KEP process). By then, the Job API will be fixed and we can already lower the threshold.
@alculquicondor Understood. At least in my mind, having a lower gc threshold doesn't entirely solve the issue with shutdown pods, as those are pods that have behaved as intended and shouldn't be considered failed and kept around in the first place.
cc @bobbypage to answer why the pods are not simply deleted.
This statement is questionable. We just don't know in some cases. If it was a Job that hasn't finished yet, it is failed, even with clean termination. If the pod failed to terminate cleanly, we also want to keep information about that pod around. It is clear that pods terminated via graceful termination are "happier" than ones that crashed on their own. But it is clearly not 100% expected behavior in the general sense. That said, better cleanup rules might well be beneficial here.
The fixed Job controller will consider any pod deletion a failure, even after the pod is completely removed from the API.
Thanks @schellj for the feedback. We are looking into the behavior for handling pods on shutdown. In the original design of the graceful node shutdown KEP it was decided not to explicitly delete pods, but rather to put them into the Failed phase. This followed the pattern of kubelet evictions and also made it possible for users to see why their pods were terminated. If they were deleted, pods could appear to vanish suddenly, without any explanation, which could also be confusing. We are evaluating whether this makes sense as long-term behavior.
I'm not super clear what you're referring to here... what mechanism are you suggesting was used to delete pods prior to 1.20.5?
@bobbypage I'm not entirely sure, but I'm guessing that on GKE prior to 1.20.5-gke.500, there was a different process that would cordon and drain their preemptible nodes.
Having a terminated-pod-gc-expiration configuration option in conjunction with the terminated-pod-gc-threshold would work for us: we could still debug the failed pods, but they would go away after some time.
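For what it's worth, a rough sketch of what such an expiration could look like if implemented client-side today. The function names, the ttl parameter, and the use of the last container finish time as the pod's "finished" timestamp are all illustrative choices, not an existing API:

```go
// Sketch of a client-side stand-in for the proposed
// terminated-pod-gc-expiration: delete failed pods whose containers finished
// more than ttl ago, keeping recent failures around for debugging.
package podgc

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func cleanupExpiredFailedPods(ctx context.Context, client kubernetes.Interface, ttl time.Duration) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx,
		metav1.ListOptions{FieldSelector: "status.phase=Failed"})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		finished := podFinishTime(&pod)
		if finished.IsZero() || time.Since(finished) < ttl {
			continue // no finish time recorded, or still fresh enough to keep
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// podFinishTime returns the latest container termination time recorded in the
// pod status, or the zero time if none is recorded.
func podFinishTime(pod *corev1.Pod) time.Time {
	var latest time.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.FinishedAt.Time.After(latest) {
			latest = t.FinishedAt.Time
		}
	}
	return latest
}
```

A real controller would also paginate the list and rate-limit deletions, but the shape is the same.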
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules: after 90 days of inactivity it applies lifecycle/stale; after 30 further days of inactivity it applies lifecycle/rotten; after 30 further days it closes the issue. You can mark this issue as fresh with /remove-lifecycle stale. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

/remove-lifecycle stale
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. You can mark this issue as fresh with /remove-lifecycle rotten. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot closes rotten issues after a further 30 days of inactivity. Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen |
@pcj: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen |
@zorgzerg: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Overview
Currently, the only mechanism to auto-delete failed pods is garbage collection. However, the default threshold is incredibly high (12500 [1]), so in practice it is useless for most customers.
There are circumstances where pods become failed due to a k8s bug. I have seen two cases so far: "Predicate NodeAffinity failed" pods [2] and OutOfCPU pods [3].
So my question is whether we want to make some components auto-delete failed pods, such as some controllers or even the kubelet.
Additional Context
Updates
04/12/2021
A user has to implement a "watcher" to periodically detect and delete failed pods. That is bad, because k8s should be able to take on that responsibility.
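As a concrete illustration of that workaround, here is a bare-bones version of such a watcher using client-go. Deleting every failed pod immediately is just the simplest possible policy; in practice you would filter by failure reason or age as discussed above, and a production version would use an informer rather than a raw watch so it survives watch expiry.

```go
// Bare-bones sketch of the watcher users currently have to run themselves:
// watch the cluster for pods entering the Failed phase and delete them.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The field selector restricts the watch to pods already in the Failed phase.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(),
		metav1.ListOptions{FieldSelector: "status.phase=Failed"})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// Simplest possible policy: delete every failed pod on sight.
		err := client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
		if err != nil {
			log.Printf("deleting %s/%s: %v", pod.Namespace, pod.Name, err)
		}
	}
}
```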