Auto-delete failed Pods #99986

Closed
qiutongs opened this issue Mar 9, 2021 · 37 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@qiutongs
Contributor

qiutongs commented Mar 9, 2021

Overview

Currently, the only mechanism to auto-delete failed pods is garbage collection. However, the default threshold is incredibly high (12500 [1]), so in practice it is useless for most users.
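
(For reference, that threshold is the --terminated-pod-gc-threshold flag on kube-controller-manager; a minimal sketch of lowering it, with a placeholder value, is below.)

```sh
# Minimal sketch, not a recommendation: the pod GC controller only starts
# deleting terminated (Failed/Succeeded) pods once their count exceeds this
# kube-controller-manager flag (default 12500). Other flags omitted.
kube-controller-manager --terminated-pod-gc-threshold=100
```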

There are circumstances in which pods become failed due to a Kubernetes bug. I have seen two cases so far: "Predicate NodeAffinity failed" pods [2] and OutOfCPU pods [3].

So my question is whether we want some component, such as a controller or even the kubelet, to auto-delete failed pods.

Additional Context

  1. Lower the default GC threshold: have a saner default value of --terminated-pod-gc-threshold #78693
  2. Kubelet bug leading to failed pods with a "Node affinity" error: Kubelet rejects pod scheduled based on newly added node labels which have not been observed by the kubelet yet #93338
  3. When kube-proxy runs as a static pod, there can be a short window during which the scheduler is not yet aware of it. The scheduler then does not account for the resources taken by kube-proxy and schedules another pod onto the node, but the node rejects that pod with OutOfCPU.

Updates

04/12/2021

Today a user has to implement a "watcher" that periodically detects and deletes failed pods. That is unfortunate: Kubernetes should be able to take on that responsibility itself.
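
A minimal sketch of such a watcher, assuming cluster-admin credentials and a cron-style scheduler of the operator's choosing; only the field selector below is standard kubectl, the wrapping is up to the user:

```sh
# Delete every pod that is stuck in the Failed phase, across all namespaces.
# Typically run from a CronJob or external cron with suitable RBAC.
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
```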

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 9, 2021
@alculquicondor
Member

/sig node
/sig apps

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 9, 2021
@alculquicondor
Member

cc @kow3ns @derekwaynecarr

@bobbypage
Member

/cc

@SergeyKanzhelev
Member

/cc

@SergeyKanzhelev
Member

Additional consideration here: have different thresholds for different failure reasons. In the case of pods terminated because of graceful node shutdown, the pods will likely be less interesting for troubleshooting. See #102820

@jinleizh

/triage accepted
/kind support

@k8s-ci-robot
Contributor

@ctrlzhang: The label triage/accepted cannot be applied. Only GitHub organization members can add the label.

In response to this:

/triage accepted
/kind support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/support Categorizes issue or PR as a support question. label Jun 24, 2021
@wzshiming
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2021
@ehashman
Member

/remove-kind support

I don't think this is a support request; I think it is a feature request for making this threshold configurable.

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/support Categorizes issue or PR as a support question. labels Jun 24, 2021
@alculquicondor
Member

The threshold is actually configurable. The problem is that setting it low potentially makes Jobs unusable.
I think we can only solve this in 1.23 when kubernetes/enhancements#2307 graduates to beta.

@schellj

schellj commented Aug 12, 2021

Additional consideration here: have different thresholds for different failure reasons. In the case of pods terminated because of graceful node shutdown, the pods will likely be less interesting for troubleshooting. See #102820

Not sure if this is the best place, but we're affected by this topic and I'd like to add some context.

GKE users making use of preemptible nodes, which have a lifetime of ~24 hours, are affected by the change in graceful node shutdown behavior introduced in 1.20.5-gke.500: pods scheduled on preemptible nodes that have been shut down do not get deleted. As issue #102820 notes, I was indeed confused by this behavior, since the pods and nodes are working as intended, yet the pods are considered "failed". This also causes the nodes themselves not to be deleted, and I've seen new pods get scheduled onto them as well.

@ejose19

ejose19 commented Sep 18, 2021

I'm in the same boat as @schellj. Besides the kubectl get pod clutter, it also stops other commands (like kubectl logs) from working without extra steps, so one has to manually remove all shutdown pods or register a job to do it. A configurable threshold just for shutdown pods (as @SergeyKanzhelev mentioned) seems like a must, and most users would probably expect shutdown pods to be auto-deleted by default.
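
For anyone scripting that cleanup today, a rough sketch: list Failed pods together with their status.reason so a follow-up delete can be limited to shutdown-terminated pods. The reason string written by graceful node shutdown varies by Kubernetes version, so treat any exact match as an assumption to verify against your own cluster first.

```sh
# List Failed pods along with the reason recorded in their status, so a
# targeted delete can be restricted to shutdown-terminated pods only.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.reason}{"\n"}{end}'
```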

@alculquicondor
Member

Note that if we deliver a solution just for shutdown pods in Kubernetes, it would arrive in 1.24 at the earliest (it's a new feature and needs to go through the KEP process). By then, the Job API will be fixed and we can lower the threshold.

@schellj

schellj commented Sep 20, 2021

@alculquicondor Understood. At least in my mind, a lower GC threshold doesn't entirely solve the issue with shutdown pods, as those are pods that have behaved as intended and shouldn't be considered failed and kept around in the first place.

@alculquicondor
Member

cc @bobbypage to answer why the pods are not simply deleted.

@SergeyKanzhelev
Member

SergeyKanzhelev commented Sep 20, 2021

those are pods that have behaved as intended and shouldn't be considered failed

This statement is questionable; in some cases we just don't know. If it was a job that hasn't finished yet, it is failed, even with a clean termination. If the pod failed to terminate cleanly, we also want to keep information about that pod around. Pods that were terminated via graceful termination are clearly "happier" than ones that crashed on their own, but in a general sense this is still not 100% expected behavior.

That said, better cleanup rules might well be beneficial here.

@alculquicondor
Member

If it was a job that hasn't finished yet, it is failed, even with a clean termination.

The fixed Job controller will consider any pod deletion a failure, even after the pod is completely removed from the API.

@bobbypage
Member

bobbypage commented Sep 20, 2021

Thanks @schellj for the feedback. We are looking into the behavior for handling pods on shutdown.

In the original design of the graceful node shutdown KEP, it was decided not to explicitly delete pods but rather to put them into the Failed phase. This followed the pattern of kubelet evictions and also made it possible for users to see why their pods were terminated. If they were deleted instead, pods could appear to vanish suddenly without any explanation, which could be just as confusing. We are evaluating whether this makes sense as the long-term behavior.

preferable for the pods to simply be deleted, as they were in versions of kubernetes prior to 1.20.5.

I'm not entirely clear on what you're referring to here... what mechanism are you suggesting was used to delete pods prior to 1.20.5?

@schellj

schellj commented Sep 20, 2021

@bobbypage I'm not entirely sure, but I'm guessing that on GKE prior to 1.20.5-gke.500, there was a different process that would cordon and drain their preemptible nodes.

@jameskunc

@bobbypage

This followed the pattern of kubelet evictions and also made it possible for users to see why their pods were terminated. If they were deleted instead, pods could appear to vanish suddenly without any explanation, which could be just as confusing.

Having a terminated-pod-gc-expiration configuration option in conjunction with terminated-pod-gc-threshold would work for us; then we could still debug the failed pods, but they would go away after some time.
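
To make the request concrete, a sketch of how that might look on kube-controller-manager. --terminated-pod-gc-threshold exists today; --terminated-pod-gc-expiration is only the proposal from this comment and is not an actual Kubernetes flag.

```sh
# Existing flag: count-based GC of terminated pods (default 12500).
# Proposed flag (hypothetical, does not exist): age-based expiry of failed pods.
kube-controller-manager \
  --terminated-pod-gc-threshold=12500 \
  --terminated-pod-gc-expiration=24h
```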

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2022
@schellj

schellj commented Jan 19, 2022

/remove-lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2022
@schellj

schellj commented Apr 19, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2022
@schellj

schellj commented Jul 18, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 15, 2022
@pcj

pcj commented Apr 10, 2023

/reopen

@k8s-ci-robot
Contributor

@pcj: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zorgzerg

/reopen

@k8s-ci-robot
Contributor

@zorgzerg: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
