
Scale down a deployment by removing specific pods (PodDeletionCost) #2255

Open · 8 tasks done · ahg-g opened this issue Jan 12, 2021 · 53 comments
Labels
sig/apps Categorizes an issue or PR as relevant to SIG Apps. stage/beta Denotes an issue tracking an enhancement targeted for Beta status tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team

Comments

ahg-g (Member) commented Jan 12, 2021

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 12, 2021
ahg-g (Member, Author) commented Jan 12, 2021

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 12, 2021
@annajung annajung added stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Jan 27, 2021
@annajung annajung added this to the v1.21 milestone Jan 27, 2021
ahg-g (Member, Author) commented Feb 3, 2021

@annajung @JamesLaverack James, you mentioned in the SIG Apps Slack channel that this enhancement is at risk. Can you clarify why? It meets the criteria.

JamesLaverack (Member) commented Feb 4, 2021

@ahg-g Just to follow up here too, we discussed in Slack and this was due to a delay in reviewing. We've now marked this as "Tracked" on the enhancements spreadsheet for 1.21.

Thank you for getting back to us. :)

@ahg-g ahg-g changed the title Scale down a deployment by removing specific pods Scale down a deployment by removing specific pods (PodDeletionCost) Feb 17, 2021
JamesLaverack (Member) commented Feb 19, 2021

Hi @ahg-g,

Since your Enhancement is scheduled to be in 1.21, please keep in mind the important upcoming dates:

  • Tuesday, March 9th: Week 9 — Code Freeze
  • Tuesday, March 16th: Week 10 — Docs Placeholder PR deadline
    • If this enhancement requires new docs or modification to existing docs, please follow the steps in the Open a placeholder PR doc to open a PR against k/website repo.

As a reminder, please link all of your k/k PR(s) and k/website PR(s) to this issue so we can track them.

Thanks!

ahg-g (Member, Author) commented Feb 26, 2021

> [quotes @JamesLaverack's Feb 19 reminder above about the Code Freeze and Docs Placeholder PR deadlines]

done.

JamesLaverack (Member) commented Mar 2, 2021

Hi @ahg-g

The Enhancements team is currently tracking the following PRs:

As this PR is merged, can we mark this enhancement complete for code freeze, or do you have other PR(s) being worked on as part of the release?

ahg-g (Member, Author) commented Mar 2, 2021

Hi @JamesLaverack, yes, the k/k code is merged; the docs PR is still open though.

@JamesLaverack JamesLaverack added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team and removed tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Apr 25, 2021
ahg-g (Member, Author) commented May 5, 2021

/stage beta

@k8s-ci-robot k8s-ci-robot added stage/beta Denotes an issue tracking an enhancement targeted for Beta status and removed stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels May 5, 2021
ahg-g (Member, Author) commented May 5, 2021

/milestone v1.22

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2021
ahg-g (Member, Author) commented Nov 18, 2021

> @ahg-g I still want to see a couple of things:

Feel free to make a concrete proposal.

> 1 - The ability for a Pod to (safely) update its own pod-deletion-cost:

Right now this is never going to be safe: updating the annotation requires a PATCH API call, so running a sidecar container to update the pod-deletion-cost would overload the apiserver. The KEP made it clear that this is not meant to be updated frequently.

> 2 - Some level of integration with HorizontalPodAutoscaler:

Right now, HPAs can't be used with pod-deletion-cost, as there is no way to intercept a scale-down event and annotate Pods before it happens. See kubernetes/kubernetes#45509 (comment) for a way to integrate this feature with HPA.

> 3 - The possibility of non-ReplicaSet controllers taking pod-deletion-cost into account:
>
> Like what @khenidak mentioned with cluster-autoscaler taking the Pod cost into account when deciding which Node to remove.

pod-deletion-cost is a pod-level value; the ReplicaSet controller is aware of it and takes it into account on scale down. Nothing prevents CA from taking that value into account as well on node scale-downs.
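The beta semantics under discussion can be sketched in a few lines. This is a simplified Python model, not the actual controller code: the real ReplicaSet controller also ranks pods by readiness, age, and other criteria, with deletion cost as one tiebreaker among them.

```python
# Simplified model of deletion-cost-aware victim selection: pods with a lower
# controller.kubernetes.io/pod-deletion-cost are removed first; a missing or
# empty annotation is treated as cost 0.

ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"

def deletion_cost(pod: dict) -> int:
    value = pod.get("metadata", {}).get("annotations", {}).get(ANNOTATION, "")
    try:
        return int(value) if value else 0
    except ValueError:
        return 0  # tolerate invalid values here; the apiserver validates on write

def pick_victims(pods: list, count: int) -> list:
    # Sort ascending by cost so the cheapest pods are deleted first.
    return sorted(pods, key=deletion_cost)[:count]

pods = [
    {"metadata": {"name": "a", "annotations": {ANNOTATION: "100"}}},
    {"metadata": {"name": "b", "annotations": {}}},                # cost 0
    {"metadata": {"name": "c", "annotations": {ANNOTATION: "-5"}}},
]
print([p["metadata"]["name"] for p in pick_victims(pods, 2)])  # lowest cost first: c, then b
```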

SwarajShekhar commented Dec 2, 2021

Is the pod-deletion-cost feature enabled by default as a beta feature in 1.22?

liggitt (Member) commented Dec 2, 2021

> Is the pod-deletion-cost feature enabled by default as a beta feature in 1.22?

Yes.
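For reference, the documented rule for the annotation's value is that it must parse as a 32-bit integer. A rough sketch of that check (illustrative only, not the apiserver's actual validation code):

```python
# Validate a pod-deletion-cost annotation value: it must be an integer in the
# int32 range [-2147483648, 2147483647].

INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1

def valid_pod_deletion_cost(value: str) -> bool:
    try:
        cost = int(value)
    except ValueError:
        return False
    return INT32_MIN <= cost <= INT32_MAX

print(valid_pod_deletion_cost("10"))        # True
print(valid_pod_deletion_cost("3.5"))       # False: not an integer
print(valid_pod_deletion_cost(str(2**40)))  # False: outside the int32 range
```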

ahg-g (Member, Author) commented Dec 9, 2021

I would like to start a conversation around GAing the current basic semantics.

@liggitt regarding the API, I wonder if we should stay with the current approach (the annotation) or revisit this decision and move it to status.

liggitt (Member) commented Dec 9, 2021

I'd hesitate to promote it to GA as-is without more consensus from the folks that weren't in favor of the annotation-based approach.

I would not expect to move it to pod status (unless it was part of a bigger change to have the kubelet populate it based on signals from the pod itself), since we don't intend to give write access to pod status to the types of clients that would want to set this.

ahg-g (Member, Author) commented Dec 9, 2021

> I would not expect to move it to pod status (unless it was part of a bigger change to have the kubelet populate it based on signals from the pod itself), since we don't intend to give write access to pod status to the types of clients that would want to set this.

OK, that is a good point.

So the GA path is either:

  1. Get consensus on the annotation with the current semantics.
  2. Make a bigger change in semantics that moves the "cost" to status, updated only by the kubelet.

thesuperzapper commented Dec 14, 2021

@ahg-g @liggitt @khenidak @thockin @JamesLaverack @SwarajShekhar @reylejano

The controller.kubernetes.io/pod-deletion-cost annotation feels like a "bolt-on" rather than a first-class concept of the Pod, because it decouples setting the pod-deletion-cost from the Pod itself, requiring that a non-Kubernetes system update it.

I am strongly against the controller.kubernetes.io/pod-deletion-cost annotation going GA, because this implementation ignores the fact that the Pod itself is the best source of truth about its own deletion cost (for example, it knows how many clients it's serving, or what % of a job is complete).


Therefore, I propose we don't GA controller.kubernetes.io/pod-deletion-cost, and instead do the following.

First, I want to highlight that the only time the pod-deletion-cost needs to be correct is just before a Pod is removed. Therefore, it makes very little sense to store this cost anywhere, as it's immediately out of date, irrespective of whether it's in a status OR annotation.

The two big questions then become:

  1. How do we efficiently "ask" a pod for its deletion cost?
  2. How do we make the kubelet only "ask" just before an event that is removing Pods?

The answer to 1 is likely a new exec/http probe defined in the ContainerSpec, this probe could return one of the following statuses when "asked" by the kubelet:

  • Cost = 0 (I am able to be killed immediately - allows us to not check all Pods)
  • Cost = n (I cost n much to be killed - assume that all Pods return comparable values - kill lowest)
  • Cost = +Inf (I am not able to be killed - if all pods are unkillable, revert to current behavior)

The answer to 2 is a little more complex, as there are a number of situations where a Pod is deleted. But if we just aim to fill the same gap as the current annotation, we can initially only check the probe when the replicas of a ReplicaSet are decreased.

In the longer term, things like cluster-autoscaler may want to consider this cost probe, but this would have to be discussed, as not all Pods are going to have a probe defined, and the Cost = n values will not be comparable between Pods running different applications.
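The probe semantics proposed above could be modeled roughly as follows. This is purely hypothetical: no such probe exists in Kubernetes today, and the `pick_victim`/`probe` names are invented for illustration.

```python
# Hypothetical "cost probe" selection: probe each pod, exit early on a zero
# cost, and fall back to current behavior when every pod reports +Inf.
import math

def pick_victim(pods, probe):
    """pods: list of names; probe: name -> float cost (0, n, or math.inf)."""
    best, best_cost = None, math.inf
    for pod in pods:
        cost = probe(pod)
        if cost == 0:
            return pod  # killable immediately; skip probing the remaining pods
        if cost < best_cost:
            best, best_cost = pod, cost
    # If every pod is "unkillable" (+inf), revert to the existing behavior,
    # modeled here as simply taking the first pod.
    return best if best is not None and best_cost != math.inf else pods[0]

costs = {"a": math.inf, "b": 7.0, "c": 3.0}
print(pick_victim(["a", "b", "c"], costs.get))  # "c": the lowest finite cost
```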

minimonsters commented Dec 24, 2021

> [quotes @thesuperzapper's Dec 14, 2021 comment above in full]

This approach assumes that the pod is the only source of information about which pods should be cleaned up. There is a whole set of cases where external factors come into play: for example, the node the pod runs on may matter because it is a more expensive cloud-based burst node. These types of values only need to be set once.

thesuperzapper commented Jan 11, 2022

@minimonsters using the approach I laid out in #2255 (comment), there is nothing preventing you from giving the Pod information about its node type with the Downward API and using that information to decrease the "cost" for Pods on more expensive nodes (note: lower-"cost" Pods get killed first).

The whole idea behind what I proposed is that you can do pretty much anything with it.

thesuperzapper commented Jan 11, 2022

@ahg-g can you clarify whether the plan is still to promote the current "Pod Deletion Cost" to GA?

Personally, I think we should leave it in Beta for now and consider implementing something like what I proposed in #2255 (comment); I am really uncomfortable with marking the current implementation GA and supporting it forever.

ahg-g (Member, Author) commented Jan 11, 2022

I am not pursuing graduation to GA this cycle because we need to resolve the concerns posted here.

> @minimonsters using the approach I laid out in #2255 (comment), there is nothing preventing you from giving the Pod information about its node type with the Downward API and using that information to decrease the "cost" for Pods on more expensive nodes (note: lower-"cost" Pods get killed first).

This still assumes that the information is local; I am not sure that is always the case.

> The answer to 2 is a little more complex, as there are a number of situations where a Pod is deleted. But if we just aim to fill the same gap as the current annotation, we can initially only check the probe when the replicas of a ReplicaSet are decreased.

I don't think we can have the ReplicaSet controller calling each pod to get the costs on every scale down (if that is what you meant); this is neither scalable nor a proper design pattern for controllers.

I buy Tim and Kal's argument for needing a native way for the pod to declare its cost, but I think we need that to propagate to the pod status for controllers to act on it. I am wondering how that compares to liveness checks; @thockin do we update the pod status on each liveness check?

thesuperzapper commented Jan 12, 2022

> This still assumes that the information is local; I am not sure that is always the case.

If a use-case needs non-local information, the application designer can use some other method to make the application running in the Pod aware of it, and then expose the calculated "cost" in its exec/http probe as normal.

> I don't think we can have the ReplicaSet controller calling each pod to get the costs on every scale down (if that is what you meant); this is neither scalable nor a proper design pattern for controllers.

@ahg-g there are two obvious protection methods for this:

  1. A Pod can return 0 as its "cost", allowing the ReplicaSet controller to exit early and just kill that Pod.
  2. Have a configurable timeout (possibly per ReplicaSet, or possibly cluster-wide), after which the ReplicaSet just kills the lowest-"cost" Pod it has found so far.

> I buy Tim and Kal's argument for needing a native way for the pod to declare its cost, but I think we need that to propagate to the pod status for controllers to act on it. I am wondering how that compares to liveness checks; @thockin do we update the pod status on each liveness check?

I still think storing the "cost" anywhere is a mistake, because the only time the "cost" needs to be correct is just before a Pod is removed; the stored value is immediately out of date, whether it lives in a status or an annotation.

Further, there is going to be some overhead to checking the "cost" of a Pod, and this goes to waste if we aren't going to scale down the number of replicas in the near future. (This also raises the question of "cost" probes that take a long time, but I think my timeout suggestion above addresses this neatly.)
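The two protections described in this comment can be sketched together. Again, this is hypothetical: the time budget and probe interface are invented for illustration, not part of any Kubernetes API.

```python
# Hypothetical victim selection with both protections: a zero-cost early exit
# plus a time budget, after which the controller kills the lowest-cost pod
# found so far.
import time

def pick_victim(pods, probe, budget_seconds=1.0):
    deadline = time.monotonic() + budget_seconds
    best, best_cost = pods[0], float("inf")
    for pod in pods:
        if time.monotonic() > deadline:
            break                      # budget exhausted; use the best so far
        cost = probe(pod)
        if cost == 0:
            return pod                 # early exit: killable immediately
        if cost < best_cost:
            best, best_cost = pod, cost
    return best

costs = {"a": 5.0, "b": 2.0, "c": 9.0}
print(pick_victim(["a", "b", "c"], costs.get))  # "b": lowest cost within budget
```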

ahg-g (Member, Author) commented Jan 12, 2022

> > This still assumes that the information is local; I am not sure that is always the case.
>
> If a use-case needs non-local information, the application designer can use some other method to make the application running in the Pod aware of it, and then expose the calculated "cost" in its exec/http probe as normal.

Looks like we are simplifying one case at the expense of the other, but maybe this simplifies the most common case, so I will have to come back to this with a more concrete example.

> > I don't think we can have the ReplicaSet controller calling each pod to get the costs on every scale down (if that is what you meant); this is neither scalable nor a proper design pattern for controllers.
>
> @ahg-g there are two obvious protection methods for this:
>
>   1. A Pod can return 0 as its "cost", allowing the ReplicaSet controller to exit early and just kill that Pod.
>   2. Have a configurable timeout (possibly per ReplicaSet, or possibly cluster-wide), after which the ReplicaSet just kills the lowest-"cost" Pod it has found so far.

This doesn't change the asymptotic complexity; the ReplicaSet controller still has to make on the order of one call per Pod before each scale down. Having the control plane call into the workloads is generally not something we consider scalable. @wojtek-t and @liggitt, please chime in if you have an opinion on this.

> > I buy Tim and Kal's argument for needing a native way for the pod to declare its cost, but I think we need that to propagate to the pod status for controllers to act on it. I am wondering how that compares to liveness checks; @thockin do we update the pod status on each liveness check?
>
> I still think storing the "cost" anywhere is a mistake, because the only time the "cost" needs to be correct is just before a Pod is removed; the stored value is immediately out of date, whether it lives in a status or an annotation.

The most scalable scenario for the ReplicaSet controller is to have the cost set on the pod.

> Further, there is going to be some overhead to checking the "cost" of a Pod, and this goes to waste if we aren't going to scale down the number of replicas in the near future. (This also raises the question of "cost" probes that take a long time, but I think my timeout suggestion above addresses this neatly.)

The current approach doesn't suffer from that overhead: the controller that decides when to scale down sets the costs right before issuing the scale down, and it can do that based on metrics exposed by the pods, for example. Granted, it is not quite a built-in approach and requires building a controller, but it doesn't have this problem.

thesuperzapper commented Jan 17, 2022

@ahg-g (and others watching), I have just written up a pretty comprehensive proposal in kubernetes/kubernetes#107598, that proposes an extension to the ReplicaSet and Deployment spec for configuring down-scaling behavior.

Within that proposal, I have fixed your issues with my "cost probe" idea from #2255 (comment) in two ways:

  1. I have made it heuristic, in that it will now only "probe" a sample of the Pods.
    • (Note: this is called HeuristicCostProbeScaleConfig in the new proposal.)
  2. I have introduced the idea of a "central cost probe" API, which returns the costs of many Pods at the same time.
    • (Note: this is called CostAPIScaleConfig in the new proposal.)

cc @liggitt @khenidak @thockin @JamesLaverack @SwarajShekhar @reylejano

thesuperzapper commented Jan 27, 2022

Hi all, I have now raised KEP-3189, which takes much of the discussion from my proposal in kubernetes/kubernetes#107598 and simplifies it into a single change:

Extend Deployments/ReplicaSets with a downscalePodPicker field that specifies a user-provided REST API to help decide which Pods are removed when replicas is decreased.

k8s-triage-robot commented Apr 27, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
k8s-triage-robot commented May 27, 2022

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2022
thesuperzapper commented May 30, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 30, 2022
thesuperzapper commented Aug 10, 2022

Hey all watching! After thinking more about how we can make pod-deletion-cost GA, I believe I have an idea that will address most of the annotation-related concerns of the current implementation (while still maintaining backward compatibility with annotations, if they are present).

I still need to write up a full proposal and KEP, but my initial thoughts can be found at:

The gist of the idea is that we can make pod-deletion-cost a more transient value (rather than only storing it in annotations) by extending the /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale API: when a caller sends a PATCH that reduces replicas, they can include the pod-deletion-cost of one or more Pods. These costs affect only the current down-scale (unlike the annotations, which must be manually cleared after scaling to remove their effect).
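The transient-cost idea might look roughly like this in controller logic. The `podDeletionCosts` field name and the payload shape are entirely hypothetical, invented here for illustration; no such field exists on the scale subresource.

```python
# Hypothetical handling of an extended scale PATCH: the caller attaches one-off
# deletion costs that apply only to this down-scale; nothing is persisted on
# the pods themselves.

def apply_scale_patch(current_pods, patch):
    """current_pods: list of pod names;
    patch: {'replicas': int, 'podDeletionCosts': {name: int}} (hypothetical)."""
    delta = len(current_pods) - patch["replicas"]
    if delta <= 0:
        return current_pods, []        # not a down-scale; nothing to delete
    costs = patch.get("podDeletionCosts", {})
    # Pods without a supplied cost default to 0, so they are preferred victims.
    ranked = sorted(current_pods, key=lambda p: costs.get(p, 0))
    victims = ranked[:delta]
    survivors = [p for p in current_pods if p not in victims]
    return survivors, victims

pods = ["web-0", "web-1", "web-2"]
survivors, victims = apply_scale_patch(
    pods, {"replicas": 2, "podDeletionCosts": {"web-0": 10, "web-1": 10}})
print(victims)  # only "web-2" carries the default cost 0, so it is removed
```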

remiville commented Aug 26, 2022

Many thanks for all your work and reflection on this subject. I'm strongly interested in the capacity to choose which pods are evicted during scale-in, and I try to follow the corresponding discussions, feature developments, and proposals.

I searched for a long time for a way to achieve this correctly. I was happy with PodDeletionCost, but now I am a little disappointed, as it seems it will stay in beta (please do not remove this feature until an equivalent one is released).
To give my two cents, I will share my understanding of the issue and maybe, I hope, help to solve it in a simple and globally compatible manner (maybe I should post elsewhere; I'm not familiar with your processes).

My need (which may be different from yours) is to selectively evict or replace terminated pods, keeping a dynamic number of fresh pod replicas without terminating potentially running pods (I mean pods whose applications are currently processing something).
It is more or less a pool of pods with minimum and maximum replicas, a current replica count varying as a function of external demand, and a rule forbidding the termination of a pod with activity inside.

I may be wrong, but I think the root cause of the problem is the incompatibility between automatic pod restart and the scale-in features.
If the ReplicaSet automatically restarts terminated pods, then it gives the application itself no chance to indicate which pod should be evicted during scale-in (I mean without using the API).

Without PodDeletionCost, one known workaround is to:

  • stop or delete the ReplicaSet
  • delete selected pods
  • decrease replica count accordingly
  • start or recreate the ReplicaSet.

For me, this workaround speaks in favor of the incompatibility between the ReplicaSet and scale-in features for selecting pods to be evicted: currently they cannot work when mixed together.

Also, I think one should avoid having any controller terminate a pod; it should be the application inside the pod that terminates, causing its pod to terminate, and then a controller could evict only already-terminated pods.

Here is my proposal:

  • Add an option to ReplicaSet, Deployment, etc. to not restart terminated pods (Succeeded and/or Failed).
    Currently the restart policy can only be Always.
  • During scale-in, prioritize terminated pods for eviction (maybe that's already the case?).

With these behaviors, scale-in will select pods to evict based on the termination status of the applications inside the pods (here Succeeded or Failed) instead of external indicators.
If a custom controller is used to maintain a dynamic number of replicas, it will be able to remove or replace terminated pods just by decreasing the replica count or deleting them.
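The second bullet of this proposal (prioritizing terminated pods on scale-in) can be sketched as a sort key. This is hypothetical: ReplicaSets do not support such behavior today, and the function name is invented.

```python
# Hypothetical scale-in ordering: evict pods in phase Succeeded or Failed
# before touching Running ones.

def scale_in_order(pods):
    """pods: list of (name, phase) tuples; terminated pods sort first."""
    terminated = {"Succeeded", "Failed"}
    # sorted() is stable, so ties keep their original relative order.
    return sorted(pods, key=lambda p: 0 if p[1] in terminated else 1)

pods = [("worker-0", "Running"), ("worker-1", "Succeeded"), ("worker-2", "Failed")]
print([name for name, _ in scale_in_order(pods)])  # terminated pods come first
```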

If this proposal is acceptable and can work, it may be achievable with minimal coding effort.

What do you think?

remiville commented Aug 30, 2022

Maybe my need is different because I need to automatically replace or delete terminated pods.
I have been able to select pods to remove from the ReplicaSet by not terminating the pods but setting the pod-deletion-cost annotation instead; a custom controller then decreases the replica count or deletes pods accordingly.
As evoked in "PROPOSAL: configurable down-scaling behaviour in ReplicaSets & Deployments", something like a pod-deletion-cost probe would be better than the annotation, letting the application indicate by itself that it should be prioritized for deletion.

I think there are two cases to distinguish during scale-in: the capacity to remove terminated pods from the ReplicaSet (without replacing them, which implies a ReplicaSet restartPolicy other than Always), and the capacity to remove running pods (using the probe).

rhockenbury commented Oct 1, 2022

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.22 milestone Oct 1, 2022