
Add a TTL for Pods on workloads other than Jobs #122187

Open
kannon92 opened this issue Dec 5, 2023 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@kannon92
Contributor

kannon92 commented Dec 5, 2023

What would you like to be added?

Batch users can specify a ttlSecondsAfterFinished for a Job. This means that when the Job is complete, its Pods will be garbage collected after the specified time.

The TTL KEP mentions adapting TTLAfterFinished for Pods. Its Future Work section has some details on what would be needed to extend the TTLAfterFinished controller to other pods.

The controller that handles this is located here.
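For reference, the existing Job-level field looks like this (a minimal sketch; the image and names are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo
spec:
  # Job object (and its Pods) are garbage collected 100s after the Job finishes
  ttlSecondsAfterFinished: 100
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]
```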

Why is this needed?

Generally, PodGC is handled at the cluster level (for objects other than Jobs), and there have been requests to set this on certain workloads. It would be nice to have a way for Pods or other sig-apps workloads to be garbage collected once the pods are complete. When pods terminate, they are left behind and are only garbage collected via --terminated-pod-gc-threshold. One issue with cluster settings is that not all users have permission to set this and tune it for their workloads. See aws/containers-roadmap#1544 for an example.

https://kubernetes.slack.com/archives/C0BP8PW9G/p1701683843554669 is another example.
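To illustrate, a Pod-level field could look something like the sketch below. This field does not exist today; the name and placement are purely hypothetical and only mirror the Job API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: one-shot
spec:
  # HYPOTHETICAL field, not part of the Pod API today; shown only to
  # illustrate what this proposal might look like
  ttlSecondsAfterFinished: 600
  restartPolicy: Never
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "echo done"]
```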

@kannon92 kannon92 added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 5, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 5, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 5, 2023
@kannon92
Contributor Author

kannon92 commented Dec 5, 2023

/sig node
/sig apps

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 5, 2023
@AxeZhan
Member

AxeZhan commented Dec 6, 2023

/cc
I wonder which workloads can benefit from this (other than jobs)🤔️
Maybe cronJobs and orphan pods?

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

/cc I wonder which workloads can benefit from this (other than jobs)🤔️ Maybe cronJobs and orphan pods?

I think Deployments, DaemonSets and orphan pods.

CronJob uses the Job template, and since it composes a Job, you can already use the Job's ttlSecondsAfterFinished there.
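For completeness, this is how the existing field is used from a CronJob today (a minimal sketch; the schedule, image, and names are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      # applies to each Job (and its Pods) created by this CronJob
      ttlSecondsAfterFinished: 300
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo done"]
```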

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

/cc @alculquicondor @pacoxu

WDYT about this?

I am also a bit confused about who owns the TTLAfterFinished controller. Is that sig-apps? Since this would require a new Pod field, sig-node should also be involved.

@dejanzele
Contributor

I had a brief chat with @kannon92 and I have capacity to help with this issue

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

I had a brief chat with @kannon92 and I have capacity to help with this issue

I think we should float this idea and see if there is interest in it.

@AxeZhan
Member

AxeZhan commented Dec 6, 2023

I think we should float this idea and see if there is interest in it.

Well, I'm interested in it :).

I think Deployments, DaemonSets and orphan pods.

I'm confused about how Deployments/DaemonSets can benefit from this; do they ever complete?
I think what we currently have is Job GC (delete a Job after it finishes), and this issue aims to implement a Pod GC, not a Deployment GC, right? And what's the point of a pod template for a Deployment that will complete? 🤔 Won't the Deployment just create a new identical pod?

@alculquicondor
Member

I also would like to understand the use case better, given that most Pods are long running, unless they are Jobs.

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

I think we should float this idea and see if there is interest in it.

Well, I'm interested in it :).

I think Deployments, DaemonSets and orphan pods.

I'm confused about how Deployments/DaemonSets can benefit from this; do they ever complete? I think what we currently have is Job GC (delete a Job after it finishes), and this issue aims to implement a Pod GC, not a Deployment GC, right? And what's the point of a pod template for a Deployment that will complete? 🤔 Won't the Deployment just create a new identical pod?

The cases where I see users running into this are node shutdowns, draining nodes, etc. You are correct that this shouldn't impact most users.

For example, #122122 (comment) is one area where this could benefit users.

Another one from the slack link:

Is there a better way to regularly clean up pods in the Failed state than a CronJob executing kubectl delete pods -A --field-selector status.phase=Failed ? I am pondering --terminated-pod-gc-threshold but this feels a little wrong to set at a cluster level, e.g. if I pick 100 now, maybe in 3 years time we have 150 CronJobs on the cluster and we clean them up immediately. Is that the wrong way to think about it?
The context is that descheduler is causing a bunch of evictions as nodes are restarted to keep things balanced, which is fine and the desired behaviour, but it results in a bunch of evicted/unknown pods (unknown due to this bug) that hog resources and need to be cleaned up
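The workaround described in that Slack thread can be sketched as a CronJob running the cleanup command. This is an illustration of the workaround, not a recommended design; the name, schedule, image, and the pod-cleaner ServiceAccount (which would need RBAC permission to delete pods cluster-wide) are all assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failed-pod-cleanup   # illustrative name
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      # the cleanup Jobs themselves are GC'd via the existing Job TTL
      ttlSecondsAfterFinished: 600
      template:
        spec:
          serviceAccountName: pod-cleaner   # assumed SA with delete-pods RBAC
          restartPolicy: Never
          containers:
          - name: cleanup
            image: bitnami/kubectl
            command: ["kubectl", "delete", "pods", "-A",
                      "--field-selector", "status.phase=Failed"]
```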

@alculquicondor
Member

In that case, it might be preferred to make it a gc setting, rather than a per-pod API.

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

True, I see that this change requires admin rights and doesn't really allow one to tune it per workload. But I don't know if it's pressing.

@AxeZhan
Member

AxeZhan commented Dec 7, 2023

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/stateful_set_control.go#L386-L400
It seems we now delete completed pods controlled by a StatefulSet by default? Shouldn't we unify this behavior across all workloads? Either make Deployments delete completed pods by default, or make both Deployments and StatefulSets delete completed pods based on ttlSecondsAfterFinished.

@kannon92
Contributor Author

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/stateful_set_control.go#L386-L400 It seems we now delete completed pods controlled by a StatefulSet by default? Shouldn't we unify this behavior across all workloads? Either make Deployments delete completed pods by default, or make both Deployments and StatefulSets delete completed pods based on ttlSecondsAfterFinished.

The main issue here is a design decision for Set-based app workloads: StatefulSet and DaemonSet require that there are no duplicate pods. Deployment could allow this, as there is no uniqueness guarantee for pod names.

@AxeZhan
Member

AxeZhan commented Dec 13, 2023

I see.
So for StatefulSets, completed pods are deleted immediately so that new pods can be created, reducing the time the service is unavailable.
For Deployments, we would make the TTL configurable in case users want to run diagnostics on completed pods?
Sgtm.

@atiratree
Member

We do not have completed pods in Deployments, because we enforce restartPolicy: Always on their pods. In some circumstances (e.g., when the kubelet decides it cannot run the pod), even these pods will move to the Failed phase. But it is hard for users to foresee these problems and apply the TTL pre-emptively. I suppose you could have a custom admission webhook that adds the TTL to selected problematic pods.

In general, the feature might be useful for other workloads (custom controllers, or plain pods) that leave their pods behind.

@shlevy

shlevy commented Jan 6, 2024

I'd like this for plain pods as well. Currently I'd have to use a Job (which makes it more complex to wait for the pod to be ready and stream its logs) to ensure a completed pod gets cleaned up.
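The Job-wrapping workaround described above can be sketched as follows (the name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot
spec:
  # the Job and its Pod are cleaned up 300s after completion
  ttlSecondsAfterFinished: 300
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo running; sleep 5"]
```

Streaming logs then requires an extra step of waiting for the Job's pod, e.g. `kubectl wait --for=condition=ready pod -l job-name=one-shot` followed by `kubectl logs -f -l job-name=one-shot`, which is the extra complexity mentioned above.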

@alculquicondor
Member

And are you using bare Pods (no Deployment or anything else)? Why not use a Job for your workload?

@adilGhaffarDev
Contributor

This would be very helpful in the following scenario too:

Because of this change, in cases where application pods are started before the device plugin pod (say, after a node reboot),
the pods fail with an UnexpectedAdmissionError because the devices are not healthy. If the pod is part of a Deployment,
another pod is created, but it stays in the Pending state, waiting to be scheduled to the node. This
pod goes Running after the node capacity is updated, following the startup and registration of the device plugin pod.

The pod that fails at admission time continues to exist on the node and needs to be removed manually:

kubectl delete pods --field-selector status.phase=Failed

ref: #116376

@edwardzjl

I have a specific use case to address:

Our current setup uses Jupyter Enterprise Gateway on Kubernetes, initiating kernels within plain pods. This system enables our data scientists to efficiently carry out their tasks on these kernels.

The Jupyter Enterprise Gateway comes equipped with a culling mechanism designed to remove idle kernels after a default inactive period of 3600 seconds. Unfortunately, we have encountered issues (mostly network-related) during the culling process. Specifically, the kernel gets deleted, but the associated pod persists.

When this happens, those orphan pods are hard to detect and may persist in the cluster indefinitely.

I know that this scenario is somewhat uncommon, but it would greatly enhance our system if Kubernetes could introduce a designated type of pod that undergoes GC after a specified period of inactivity, particularly in terms of network activity.

@rishiraj88

Very useful suggestion by @edwardzjl. We should try this GC config out. Thanks.

@adilGhaffarDev
Contributor

Can we get this triaged and start work on it?

@alculquicondor
Member

I suggest you attend a SIG Apps meeting to present a proposal.
