
Add a TTL for Pods on workloads other than Jobs #122187

Open
kannon92 opened this issue Dec 5, 2023 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@kannon92
Contributor

kannon92 commented Dec 5, 2023

What would you like to be added?

Batch users can specify a ttlSecondsAfterFinished for a Job. This means that when the Job is complete, its Pods will be garbage collected after the specified time.

The TTL KEP mentions adapting TTLAfterFinished for Pods. Its Future Work section has some details on what would be needed to extend the TTLAfterFinished controller to other pods.

The controller that handles this is located here.
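For reference, the existing Job-level field looks like this (a minimal sketch; the image and names are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo
spec:
  # Job object (and its Pods) are garbage collected 100s after the Job finishes
  ttlSecondsAfterFinished: 100
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo done"]
```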

Why is this needed?

Generally, PodGC is handled at the cluster level (for objects other than Jobs), and there have been requests to set this on certain workloads. It would be nice to have a way for Pods or other sig-apps workloads to be garbage collected once the pods are complete. When pods terminate, they are left behind and are only garbage collected via --terminated-pod-gc-threshold. One issue with cluster settings is that not all users have permission to set this and tune it for their workloads. See aws/containers-roadmap#1544 for an example.

https://kubernetes.slack.com/archives/C0BP8PW9G/p1701683843554669 is another example.
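To illustrate, a Pod-level field could look something like the sketch below. This field does not exist today; the name and placement are purely hypothetical and only mirror the Job API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: one-shot
spec:
  # HYPOTHETICAL field, not part of the Pod API today; shown only to
  # illustrate what this proposal might look like
  ttlSecondsAfterFinished: 600
  restartPolicy: Never
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "echo done"]
```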

@kannon92 kannon92 added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 5, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 5, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 5, 2023
@kannon92
Contributor Author

kannon92 commented Dec 5, 2023

/sig node
/sig apps

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 5, 2023
@AxeZhan
Member

AxeZhan commented Dec 6, 2023

/cc
I wonder which workloads can benefit from this (other than jobs)🤔️
Maybe cronJobs and orphan pods?

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

/cc I wonder which workloads can benefit from this (other than jobs)🤔️ Maybe cronJobs and orphan pods?

I think Deployments, DaemonSets and orphan pods.

CronJob uses the Job template, and since it composes a Job, you can already use the Job's ttlSecondsAfterFinished there.
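For completeness, this is how the existing field is used from a CronJob today (a minimal sketch; the schedule, image, and names are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      # applies to each Job (and its Pods) created by this CronJob
      ttlSecondsAfterFinished: 300
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo done"]
```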

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

/cc @alculquicondor @pacoxu

WDYT about this?

I am also a bit confused about who owns the TTLAfterFinished controller. Is that sig-apps? Since this would require a new Pod field, sig-node should also be involved.

@dejanzele
Contributor

I had a brief chat with @kannon92 and I have capacity to help with this issue

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

I had a brief chat with @kannon92 and I have capacity to help with this issue

I think we should float this idea and see if there is interest in it.

@AxeZhan
Member

AxeZhan commented Dec 6, 2023

I think we should float this idea and see if there is interest in it.

Well, I'm interested in it :).

I think Deployments, DaemonSets and orphan pods.

I'm confused about how Deployments/DaemonSets can benefit from this; do they ever complete?
I think what we currently have is Job GC (delete a Job after it finishes), and this issue aims to implement a Pod GC, not a Deployment GC, right? And what's the point of a pod template for a Deployment that will complete? 🤔 Won't the Deployment just create a new identical pod?

@alculquicondor
Member

I also would like to understand the use case better, given that most Pods are long running, unless they are Jobs.

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

I think we should float this idea and see if there is interest in it.

Well, I'm interested in it :).

I think Deployments, DaemonSets and orphan pods.

I'm confused about how Deployments/DaemonSets can benefit from this; do they ever complete? I think what we currently have is Job GC (delete a Job after it finishes), and this issue aims to implement a Pod GC, not a Deployment GC, right? And what's the point of a pod template for a Deployment that will complete? 🤔 Won't the Deployment just create a new identical pod?

The cases where I see users running into this are node shutdowns, draining nodes, etc. You are correct that this shouldn't impact most users.

For example, #122122 (comment) is one area where this could benefit users.

Another one from the slack link:

Is there a better way to regularly clean up pods in the Failed state than a CronJob executing kubectl delete pods -A --field-selector status.phase=Failed ? I am pondering --terminated-pod-gc-threshold but this feels a little wrong to set at a cluster level, e.g. if I pick 100 now, maybe in 3 years time we have 150 CronJobs on the cluster and we clean them up immediately. Is that the wrong way to think about it?
The context is that descheduler is causing a bunch of evictions as nodes are restarted to keep things balanced, which is fine and the desired behaviour, but it results in a bunch of evicted/unknown pods (unknown due to this bug) that hog resources and need to be cleaned up
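The workaround described in that Slack thread can be sketched as a CronJob running the cleanup command. This is an illustration of the workaround, not a recommended design; the name, schedule, image, and the pod-cleaner ServiceAccount (which would need RBAC permission to delete pods cluster-wide) are all assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failed-pod-cleanup   # illustrative name
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      # the cleanup Jobs themselves are GC'd via the existing Job TTL
      ttlSecondsAfterFinished: 600
      template:
        spec:
          serviceAccountName: pod-cleaner   # assumed SA with delete-pods RBAC
          restartPolicy: Never
          containers:
          - name: cleanup
            image: bitnami/kubectl
            command: ["kubectl", "delete", "pods", "-A",
                      "--field-selector", "status.phase=Failed"]
```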

@alculquicondor
Member

In that case, it might be preferred to make it a gc setting, rather than a per-pod API.

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

True, I see that this change requires admin rights and doesn't really allow one to tune it per workload. But I don't know if it's pressing.

@AxeZhan
Member

AxeZhan commented Dec 7, 2023

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/stateful_set_control.go#L386-L400
It seems we now delete completed pods controlled by a StatefulSet by default? Shouldn't we unify this behavior across all workloads? Either make Deployments delete completed pods by default, or make both Deployments and StatefulSets delete completed pods based on ttlSecondsAfterFinished.

@kannon92
Contributor Author

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/stateful_set_control.go#L386-L400 It seems we now delete completed pods controlled by a StatefulSet by default? Shouldn't we unify this behavior across all workloads? Either make Deployments delete completed pods by default, or make both Deployments and StatefulSets delete completed pods based on ttlSecondsAfterFinished.

The main issue here is a design decision for Set-based app workloads: StatefulSet and DaemonSet require that there are no duplicate pods. Deployment could allow this, as there is no uniqueness guarantee for pod names.

@AxeZhan
Member

AxeZhan commented Dec 13, 2023

I see.
So for StatefulSets, completed pods are deleted immediately so that new pods can be created, reducing the time the service is unavailable.
For Deployments, we would make the TTL configurable in case users want to run diagnostics on completed pods?
Sgtm.

@atiratree
Member

We do not have completed pods in Deployments, because we enforce restartPolicy: Always on their pods. In some circumstances (e.g., when the kubelet decides it cannot run the pod), even these pods will move to the Failed phase. But it is hard for users to foresee these problems and apply the TTL pre-emptively. I suppose you could have a custom admission webhook that adds the TTL to selected problematic pods.

In general, the feature might be useful for other workloads (custom controllers, or plain pods) that leave their pods behind.

@shlevy

shlevy commented Jan 6, 2024

I'd like this for plain pods as well. Currently I'd have to use a Job (which makes it more complex to wait for the pod to be ready and stream its logs) to ensure a completed pod gets cleaned up.
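The Job-wrapping workaround described above can be sketched as follows (the name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot
spec:
  # the Job and its Pod are cleaned up 300s after completion
  ttlSecondsAfterFinished: 300
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo running; sleep 5"]
```

Streaming logs then requires an extra step of waiting for the Job's pod, e.g. `kubectl wait --for=condition=ready pod -l job-name=one-shot` followed by `kubectl logs -f -l job-name=one-shot`, which is the extra complexity mentioned above.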

@alculquicondor
Member

And are you using bare Pods (no Deployment or anything else)? Why not use a Job for your workload?

@adilGhaffarDev
Contributor

This would be very helpful in the following scenario too:

Because of this change, in cases where application pods are started before the device plugin pod (say, after a node reboot),
the pods fail with an UnexpectedAdmissionError because the devices are not healthy. If the pod is part of a Deployment,
another pod is created, but it stays in the Pending state, waiting to be scheduled to the node. This
pod goes Running after the node capacity is updated, following the startup and registration of the device plugin pod.

The pod that fails at admission time continues to exist on the node and needs to be removed manually:

kubectl delete pods --field-selector status.phase=Failed

ref: #116376

@edwardzjl

I have a specific use case to address:

Our current setup uses Jupyter Enterprise Gateway on Kubernetes, initiating kernels within plain pods. This system enables our data scientists to efficiently carry out their tasks on these kernels.

The Jupyter Enterprise Gateway comes equipped with a culling mechanism designed to remove idle kernels after a default inactive period of 3600 seconds. Unfortunately, we have encountered issues (mostly network-related) during the culling process. Specifically, the kernel gets deleted, but the associated pod persists.

When this happens, those orphan pods are hard to detect and may persist in the cluster indefinitely.

I know that this scenario is somewhat uncommon, but it would greatly enhance our system if Kubernetes could introduce a designated type of pod that undergoes GC after a specified period of inactivity, particularly in terms of network activity.

@rishiraj88

Very useful suggestion by @edwardzjl. We should try this GC config out. Thanks.

@adilGhaffarDev
Contributor

Can we get this triaged and start work on it?

@alculquicondor
Member

I suggest you attend a SIG Apps meeting to present a proposal.
