
Scale down a deployment by removing specific pods (PodDeletionCost) #2255

Open
8 tasks done
ahg-g opened this issue Jan 12, 2021 · 58 comments
Labels
sig/apps Categorizes an issue or PR as relevant to SIG Apps. stage/beta Denotes an issue tracking an enhancement targeted for Beta status

Comments

@ahg-g
Member

ahg-g commented Jan 12, 2021

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
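
For context, the mechanism being tracked here is the controller.kubernetes.io/pod-deletion-cost annotation: when a ReplicaSet is scaled down, pods with a lower cost are preferred for deletion. A minimal client-go sketch of an external controller setting the cost on one pod (namespace and pod name are placeholders; error handling trimmed):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Mark "worker-3" (placeholder name) as the cheapest pod to remove on
	// the next scale-down; a lower cost means "delete me first".
	cost := -100
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	if _, err := cs.CoreV1().Pods("default").Patch(
		context.TODO(), "worker-3", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```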

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 12, 2021
@ahg-g
Member Author

ahg-g commented Jan 12, 2021

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 12, 2021
@annajung annajung added stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Jan 27, 2021
@annajung annajung added this to the v1.21 milestone Jan 27, 2021
@ahg-g
Member Author

ahg-g commented Feb 3, 2021

@annajung @JamesLaverack James, you mentioned in the SIG Apps Slack channel that this enhancement is at risk; can you clarify why? It meets the criteria.

@JamesLaverack
Member

@ahg-g Just to follow up here too, we discussed in Slack and this was due to a delay in reviewing. We've now marked this as "Tracked" on the enhancements spreadsheet for 1.21.

Thank you for getting back to us. :)

@ahg-g ahg-g changed the title Scale down a deployment by removing specific pods Scale down a deployment by removing specific pods (PodDeletionCost) Feb 17, 2021
@JamesLaverack
Member

Hi @ahg-g,

Since your Enhancement is scheduled to be in 1.21, please keep in mind the important upcoming dates:

  • Tuesday, March 9th: Week 9 — Code Freeze
  • Tuesday, March 16th: Week 10 — Docs Placeholder PR deadline
    • If this enhancement requires new docs or modification to existing docs, please follow the steps in the Open a placeholder PR doc to open a PR against k/website repo.

As a reminder, please link all of your k/k PR(s) and k/website PR(s) to this issue so we can track them.

Thanks!

@ahg-g
Member Author

ahg-g commented Feb 26, 2021

Hi @ahg-g,

Since your Enhancement is scheduled to be in 1.21, please keep in mind the important upcoming dates:

  • Tuesday, March 9th: Week 9 — Code Freeze

  • Tuesday, March 16th: Week 10 — Docs Placeholder PR deadline

    • If this enhancement requires new docs or modification to existing docs, please follow the steps in the Open a placeholder PR doc to open a PR against k/website repo.

As a reminder, please link all of your k/k PR(s) and k/website PR(s) to this issue so we can track them.

Thanks!

done.

@JamesLaverack
Member

Hi @ahg-g

The Enhancements team is currently tracking the following PRs.

As this PR is merged, can we mark this enhancement complete for code freeze or do you have other PR(s) that are being worked on as part of the release?

@ahg-g
Member Author

ahg-g commented Mar 2, 2021

Hi @JamesLaverack, yes, the k/k code is merged; the docs PR is still open though.

@JamesLaverack JamesLaverack added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team and removed tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Apr 25, 2021
@ahg-g
Member Author

ahg-g commented May 5, 2021

/stage beta

@k8s-ci-robot k8s-ci-robot added stage/beta Denotes an issue tracking an enhancement targeted for Beta status and removed stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels May 5, 2021
@ahg-g
Member Author

ahg-g commented May 5, 2021

/milestone v1.22

@thesuperzapper

@ahg-g can you clarify if the plan is still to promote the current "Pod Deletion Cost" to GA?

Personally, I think we should leave it at Beta for now and consider implementing something like I proposed in #2255 (comment). I am really uncomfortable with marking the current implementation as GA and supporting it forever.

@ahg-g
Member Author

ahg-g commented Jan 11, 2022

I am not pursuing the graduation to GA this cycle because we need to resolve the concerns posted here.

@minimonsters: using the approach I laid out in #2255 (comment), there is nothing preventing you from giving the Pod information about its node type with the downward API, and using this information to decrease the "cost" of pods on more expensive nodes (note, lower-"cost" pods get killed first).

This still assumes that the information is local; I am not sure that is always the case.
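
To make that suggestion concrete, here is a hedged sketch in which the pod adjusts its own cost based on where it landed. Beyond the annotation key, everything here is an assumption: NODE_NAME, POD_NAME, and POD_NAMESPACE are env vars injected via the downward API (spec.nodeName, metadata.name, metadata.namespace), the "gpu-" name prefix stands in for however expensive nodes are identified, and the pod's service account must be allowed to patch pods.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// NODE_NAME, POD_NAME and POD_NAMESPACE are assumed to be injected via
	// the downward API (spec.nodeName / metadata.name / metadata.namespace).
	node := os.Getenv("NODE_NAME")
	cost := 100 // default: prefer to keep this pod
	if strings.HasPrefix(node, "gpu-") {
		// Made-up convention: expensive node names start with "gpu-". A lower
		// cost means the ReplicaSet controller deletes this pod first.
		cost = 0
	}

	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	if _, err := cs.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).Patch(
		context.TODO(), os.Getenv("POD_NAME"), types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```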

The answer to 2 is a little more complex, as there are a number of situations where a Pod is deleted. But if we just aim to fill the same gap as the current annotation, we can initially only check the probe when the replicas of a ReplicaSet are decreased.

I don't think we can have the ReplicaSet controller calling each pod to get the costs on every scale down (if that is what you meant); this is neither scalable nor a proper design pattern for controllers.

I buy Tim and Kal's argument for needing a native way for the pod to declare its cost, but I think we need that to propagate to the pod status for controllers to act on it. I am wondering how that compares to liveness checks; @thockin, do we update the pod status on each liveness check?

@thesuperzapper

This still assumes that the information is local; I am not sure that is always the case.

If a use-case needs non-local information, the application designer can use some other method to make the application running in the Pod aware of it, and then expose the calculated "cost" via its exec/http probe as normal.

I don't think we can have the ReplicaSet controller calling each pod to get the costs on every scale down (if that is what you meant); this is neither scalable nor a proper design pattern for controllers.

@ahg-g there are two obvious protection methods for this:

  1. A Pod can return 0 as its "cost", allowing the ReplicaSet controller to exit early and just kill that Pod.
  2. Have a configurable timeout (possibly per ReplicaSet, or possibly cluster-wide); if it is reached, the ReplicaSet just kills the lowest-"cost" Pod it has found so far (sketched at the end of this comment).

I buy Tim and Kal's argument for needing a native way for the pod to declare its cost, but I think we need that to propagate to the pod status for controllers to act on it. I am wondering how that compares to liveness checks; @thockin, do we update the pod status on each liveness check?

I still think storing the "cost" anywhere is a mistake because the only time the "cost" needs to be correct is just before a Pod is removed. Therefore, it makes very little sense to store this cost anywhere, as it's immediately out of date, irrespective of whether it's in a status OR annotation.

Further, there is going to be some level of overhead to check the "cost" of a Pod, and this goes to waste if we aren't going to scale down the number of replicas in the near future. (This also raises the question of "cost" probes which take a long time, but I think my suggestion above for a timeout addresses this neatly.)
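
Nothing like a cost probe exists in Kubernetes today, so the following is only a hypothetical sketch of what a selection loop with the two protections above could look like (probeCost stands in for the imagined per-pod exec/http cost probe; all names are made up):

```go
package downscale // hypothetical package, for illustration only

import (
	"context"
	"math"
	"time"
)

// pickVictim chooses which pod to delete. Protection 1: a pod reporting
// cost 0 ends the search immediately. Protection 2: once the timeout is
// reached, the lowest-cost pod seen so far is returned.
func pickVictim(ctx context.Context, pods []string,
	probeCost func(context.Context, string) (int, error), timeout time.Duration) string {

	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	victim, lowest := "", math.MaxInt
	for _, p := range pods {
		cost, err := probeCost(ctx, p)
		if err != nil {
			continue // unreachable probe; skip it (or treat as a default cost)
		}
		if cost == 0 {
			return p // protection 1: zero-cost pod, stop probing
		}
		if cost < lowest {
			victim, lowest = p, cost
		}
		if ctx.Err() != nil {
			break // protection 2: timeout, settle for the best found so far
		}
	}
	return victim
}
```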

@ahg-g
Member Author

ahg-g commented Jan 12, 2022

This still assumes that the information is local; I am not sure that is always the case.

If a use-case needs non-local information, the application designer can use some other method to make the application running in the Pod aware of it, and then expose the calculated "cost" via its exec/http probe as normal.

Looks like we are simplifying one case at the expense of the other, but maybe this simplifies the most common case, so I will have to get back to this with a more concrete example.

I don't think we can have the ReplicaSet controller calling each pod to get the costs on every scale down (if that is what you meant); this is neither scalable nor a proper design pattern for controllers.

@ahg-g there are two obvious protection methods for this:

  1. A Pod can return 0 as its "cost", allowing the ReplicaSet controller to exit early and just kill that Pod.
  2. Have a configurable timeout (possibly per ReplicaSet, or possibly cluster-wide); if it is reached, the ReplicaSet just kills the lowest-"cost" Pod it has found so far.

This doesn't change the asymptotic complexity; the ReplicaSet controller still has to make on the order of one call per pod before each scale down. Having the control plane call into the workloads is generally not something we consider scalable; @wojtek-t and @liggitt, in case you have an opinion on this.

I buy Tim and Kal's argument for needing a native way for the pod to declare its cost, but I think we need that to propagate to the pod status for controllers to act on it. I am wondering how that compares to liveness checks; @thockin, do we update the pod status on each liveness check?

I still think storing the "cost" anywhere is a mistake because the only time the "cost" needs to be correct is just before a Pod is removed. Therefore, it makes very little sense to store this cost anywhere, as it's immediately out of date, irrespective of whether it's in a status OR annotation.

The most scalable scenario for the ReplicaSet controller is to have the cost set with the pod.

Further, there is going to be some level of overhead to check the "cost" of a Pod, and this goes to waste if we aren't going to scale down the number of replicas in the near future. (This also raises the question of "cost" probes which take a long time, but I think my suggestion above for a timeout addresses this neatly.)

The current approach doesn't suffer from that overhead. The controller that decides when to scale down sets the costs right before issuing a scale down, and it can do that based on metrics exposed by the pods, for example. Granted, it's not quite a built-in approach and requires building a controller, but it doesn't have this problem.
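
For anyone wanting to follow that pattern with their own controller today, a rough sketch of the scale-down step (the annotation key and the scale-subresource calls are real client-go APIs; getBusyness and all names are placeholders for whatever metric the pods expose):

```go
package downscale // hypothetical package, for illustration only

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// scaleDown annotates each pod with a cost derived from an
// application-specific metric, then lowers replicas through the scale
// subresource so the ReplicaSet controller removes the cheapest pods.
func scaleDown(ctx context.Context, cs kubernetes.Interface, ns, deployment string,
	pods []string, getBusyness func(pod string) int, newReplicas int32) error {

	for _, p := range pods {
		// Busier pods get a higher cost so idle pods are deleted first.
		patch := []byte(fmt.Sprintf(
			`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`,
			getBusyness(p)))
		if _, err := cs.CoreV1().Pods(ns).Patch(
			ctx, p, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
	}

	scale, err := cs.AppsV1().Deployments(ns).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = newReplicas
	_, err = cs.AppsV1().Deployments(ns).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{})
	return err
}
```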

@thesuperzapper

@ahg-g (and others watching), I have just written up a pretty comprehensive proposal in kubernetes/kubernetes#107598, which proposes an extension to the ReplicaSet and Deployment spec for configuring down-scaling behavior.

Within that proposal, I have fixed your issues with my "cost probe" idea from #2255 (comment) in two ways:

  1. I have made it heuristic, in that it will now only "probe" from a sample of the Pods.
    • (Note, this is called HeuristicCostProbeScaleConfig in the new proposal)
  2. I have introduced the idea of a "central cost probe" API, which returns the cost of many pods at the same time:
    • (Note, this is called CostAPIScaleConfig in the new proposal)

cc @liggitt @khenidak @thockin @JamesLaverack @SwarajShekhar @reylejano

@thesuperzapper

Hi All, I have now raised KEP-3189, which takes lots of the discussion from my proposal in kubernetes/kubernetes#107598, and simplifies it into a single change.

Extend Deployments/ReplicaSets with a downscalePodPicker field that specifies a user-provided REST API to help decide which Pods are removed when replicas is decreased.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2022
@thesuperzapper

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 30, 2022
@thesuperzapper

Hey all watching! After thinking more about how we can make pod-deletion-cost GA, I believe I have an idea that will address most of the annotation-related concerns of the current implementation (while still maintaining backward compatibility with annotations, if they are present).

I still need to write up a full proposal and KEP, but my initial thoughts can be found at:

The gist of the idea is to make pod-deletion-cost a more transient value (rather than storing it only in annotations) by extending the /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale API: when a caller sends a PATCH that reduces replicas, it can include the pod-deletion-cost of one or more Pods. These costs would only affect the current down-scale (unlike the annotations, which must be manually cleared after scaling to remove their effect).
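
Purely to illustrate the shape of that idea (none of this exists in the API today; the real autoscaling/v1 ScaleSpec carries only replicas), the extended scale spec might look something like this:

```go
package scaleext // hypothetical, sketching the proposal above

// ScaleSpecWithCosts is an imaginary extension of the scale subresource spec.
type ScaleSpecWithCosts struct {
	Replicas int32 `json:"replicas"`
	// PodDeletionCost maps pod names to a cost consulted only for the
	// down-scale triggered by this request, then discarded.
	PodDeletionCost map[string]int32 `json:"podDeletionCost,omitempty"`
}

// A caller shrinking a Deployment from 5 to 3 replicas might then PATCH the
// scale subresource with (again, hypothetical):
//   {"spec": {"replicas": 3, "podDeletionCost": {"worker-2": 0, "worker-4": 0}}}
```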

@remiville

Many thanks for all your work and reflection on this subject. I'm strongly interested in the ability to choose which pods are evicted during scale-in, and I try to follow the corresponding discussions, feature developments, and proposals.

I searched for a long time for a way to achieve this correctly. I was happy with PodDeletionCost, but now I am a little disappointed, as it seems it will stay in beta (please do not remove this feature until an equivalent one is released).
To give my two cents, I will share my understanding of the issue and maybe, I hope, help solve it in a simple and globally compatible manner (maybe I should post elsewhere; I'm not familiar with your processes).

My need (which may be different from yours) is to selectively evict or replace terminated pods, keeping a dynamic number of fresh pod replicas without terminating potentially running pods (I mean pods whose applications are currently processing something).
It is more or less a pool of pods with minimum and maximum replicas, a current replica count that varies with external demand, and a rule forbidding the termination of a pod with activity inside.

I may be wrong, but I think the root cause of the problem is the incompatibility between the automatic pod restart and scale-in features.
If the ReplicaSet automatically restarts terminated pods, then the application itself has no chance to indicate which pod should be evicted during scale-in (I mean without using the API).

Without PodDeletionCost, one known workaround is to:

  • stop or delete the ReplicaSet
  • delete selected pods
  • decrease replica count accordingly
  • start or recreate the ReplicaSet.

For me, this workaround is further evidence of the incompatibility between the ReplicaSet restart behavior and selecting pods to evict during scale-in: currently they cannot work when mixed together.

Also, I think we should avoid having any controller terminate a pod; it should be the application inside the pod that terminates, causing its pod to terminate, and then a controller could evict only already-terminated pods.

Here is my proposal:

  • Add an option to ReplicaSet, Deployment, etc. to not restart terminated pods (Succeeded and/or Failed).
    Currently the restart policy can only be Always.
  • During scale-in, prioritize terminated pods for eviction (maybe that's already the case?)

With these behaviors, scale-in would select pods to evict based on the termination status of the application inside the pod (here Succeeded or Failed) instead of external indicators.
If a custom controller is used to maintain a dynamic number of replicas, it will be able to remove or replace terminated pods just by decreasing the replica count or deleting them.

If this proposal is acceptable and can work, it might be achievable with minimal coding effort.

What do you think?

@remiville

Maybe my need is different because I need to automatically replace or delete terminated pods.
I have been able to select pods to remove from the ReplicaSet by not terminating pods but setting the pod-deletion-cost annotation instead; a custom controller then decreases the replica count or deletes pods accordingly.
As suggested in "PROPOSAL: configurable down-scaling behaviour in ReplicaSets & Deployments", something like a pod deletion cost probe would be better than the annotation, letting the application indicate by itself that it should be prioritized for deletion.

I think there are two cases to distinguish during scale-in: the ability to remove terminated pods from the ReplicaSet (without replacing them, which implies a ReplicaSet restartPolicy other than Always), and the ability to remove running pods (using the probe).

@rhockenbury

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.22 milestone Oct 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2022
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2023
@thockin
Member

thockin commented Jan 14, 2023

@ahg-g I'm not in love with annotations as APIs. Do we REALLY think this is the best answer?

@ahg-g
Member Author

ahg-g commented Jan 14, 2023

@ahg-g I'm not in love with annotations as APIs. Do we REALLY think this is the best answer?

I think we have a reasonable counter proposal in kubernetes/kubernetes#107598 (comment); can we hold this in its current beta state until that proposal makes progress?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2023
@hoerup

hoerup commented Apr 14, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2023
@Atharva-Shinde Atharva-Shinde removed the tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team label May 14, 2023