
Pod Topology Spread constraints should be taken into account on scale down #96748

Closed
ahg-g opened this issue Nov 20, 2020 · 21 comments
Labels: area/workload-api/deployment, kind/feature, lifecycle/rotten, sig/apps, sig/scheduling, triage/accepted

Comments

@ahg-g (Member) commented Nov 20, 2020

What would you like to be added:
Pod topology spread constraints are currently only evaluated when scheduling a pod.

The ask is to also evaluate them in kube-controller-manager when scaling down a ReplicaSet: when selecting a victim, the logic would prefer pods in the failure domain with the highest number of pods (a sketch is included at the end of this description).

The risk is impacting kube-controller-manager performance.

Why is this needed:
To maintain the spread.

This is a special case of kubernetes/enhancements#1828 and kubernetes/enhancements#1888
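
A minimal sketch of that victim-selection idea, assuming the controller already knows each Pod's topology domain. The function pickVictim and the nodeDomain map are hypothetical, not existing kube-controller-manager code:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// pickVictim returns the index of the Pod whose failure domain currently
// holds the most Pods, so deleting it preserves the spread as much as possible.
// nodeDomain maps a Node name to its domain label value, e.g. the value of
// topology.kubernetes.io/zone.
func pickVictim(pods []*v1.Pod, nodeDomain map[string]string) int {
    domainCount := map[string]int{}
    for _, p := range pods {
        domainCount[nodeDomain[p.Spec.NodeName]]++
    }
    victim, best := -1, -1
    for i, p := range pods {
        if c := domainCount[nodeDomain[p.Spec.NodeName]]; c > best {
            victim, best = i, c
        }
    }
    return victim
}

func main() {
    pods := []*v1.Pod{
        {Spec: v1.PodSpec{NodeName: "node-a"}},
        {Spec: v1.PodSpec{NodeName: "node-b"}},
        {Spec: v1.PodSpec{NodeName: "node-c"}},
    }
    nodeDomain := map[string]string{"node-a": "zone-1", "node-b": "zone-1", "node-c": "zone-2"}
    // zone-1 holds 2 Pods vs 1 in zone-2, so the chosen victim is in zone-1.
    fmt.Println(pods[pickVictim(pods, nodeDomain)].Spec.NodeName)
}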

/sig apps
/sig scheduling

ahg-g added the kind/feature label Nov 20, 2020
k8s-ci-robot added the sig/apps, sig/scheduling, and needs-triage labels Nov 20, 2020
@ahg-g (Member, Author) commented Nov 20, 2020

This means that the controller-manager would have to start watching Nodes; I am not sure whether it does that already.

Other Questions:

  • How would this combine with delete-priority (kubernetes/enhancements#1828)?
  • Should we also consider other constraints, like inter-pod (anti-)affinity, node affinity, and taints?
  • If so, how should the rankings of the different constraints be combined? Additively, as we do in the scheduler?

I am inclined to say that topology spread is a special enough case to warrant native support in core Kubernetes, since it is critical for HA; for all the other cases, perhaps delete-priority is the answer.

@alculquicondor (Member)

delete-priority should take precedence.

As for performance, the current logic already takes active Pods per Node into consideration, so it wouldn't be much of an increase to do the same for other topology domains.

@Huang-Wei (Member)

+1 to Aldo that delete-priority should take precedence.

I think delete-priority, if approved, acts as an extensible hook for third-party integrators to influence down-scaling, for any feature. Since it's not yet clear how compelling that would be, I think that for a GA feature like PodTopologySpread, which has clear API semantics, the down-scaling optimization logic should be implemented in a more controllable and deterministic way.

  • Should we also consider other constraints like inter-pod (anti)affinity, node affinity and taints?

I guess we have to consider case by case. We can start with PodTopologySpread.

  • If so, how should the rankings of different constraints be combined? additively just like we do in the scheduler?

Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.

@ahg-g (Member, Author) commented Nov 20, 2020

Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.

We should handle both in the same way: best effort.

@alculquicondor (Member) commented Nov 20, 2020

Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.

We could choose to somehow honor maxSkew, but that's just going to make the algorithm unnecessarily complicated. We can simply select the Pods from the topology domains with fewer instances. In that sense, there wouldn't be any difference between hard and soft constraints.

Going into implementation details, we only need to change the Rank comparison:

if s.Rank[i] != s.Rank[j] {

Today, the rank is the number of Pods on the same Node:

ranks[i] = podsOnNode[pod.Spec.NodeName]
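
A hedged sketch of what a topology-aware rank could look like, assuming a Node informer supplies a map from Node name to zone. rankByTopologyDomain and zoneOfNode are hypothetical names, not the actual controller code, and v1 is assumed to be k8s.io/api/core/v1:

// Hypothetical variant of the ranking step: instead of counting Pods on the
// same Node, count Pods in the same topology domain (e.g. zone), so Pods in
// the most-crowded domain get the highest rank and are preferred for deletion.
func rankByTopologyDomain(pods []*v1.Pod, zoneOfNode map[string]string) []int {
    podsPerDomain := map[string]int{}
    for _, p := range pods {
        podsPerDomain[zoneOfNode[p.Spec.NodeName]]++
    }
    ranks := make([]int, len(pods))
    for i, p := range pods {
        ranks[i] = podsPerDomain[zoneOfNode[p.Spec.NodeName]]
    }
    return ranks
}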

@ahg-g (Member, Author) commented Nov 20, 2020

One nit here: pod topology spread constraints could define different label selectors than the ReplicaSet does.

@alculquicondor (Member) commented Nov 26, 2020

I was poring over the code and started thinking that the overhead won't be negligible, as we would need to add a Node informer, and perhaps convert Node labels into a slice to reduce map accesses.

While that implementation is still worth exploring, I was playing around with an idea that I've had for a while: relax the timestamp comparisons to a logarithmic scale (rounded to the nearest integer). This allows for some level of randomness, which most of the time is probably sufficient. The changes are pretty compact: #96898. A rough sketch of the idea follows.
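
A minimal illustration of the general idea (not the actual change in #96898): bucket a Pod's age by the rounded logarithm of its duration, so Pods of roughly similar age compare as equal and the final ordering among them becomes effectively random.

package main

import (
    "fmt"
    "math"
    "time"
)

// logBucket maps a Pod's age to the rounded base-2 logarithm of its duration
// in seconds; Pods falling into the same bucket are treated as equally old.
func logBucket(age time.Duration) int {
    s := age.Seconds()
    if s <= 1 {
        return 0
    }
    return int(math.Round(math.Log2(s)))
}

func main() {
    fmt.Println(logBucket(100 * time.Second)) // 7
    fmt.Println(logBucket(120 * time.Second)) // 7 -> ties with the 100s-old Pod
    fmt.Println(logBucket(10 * time.Second))  // 3 -> still clearly younger
}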

@ahg-g (Member, Author) commented Dec 10, 2020

/cc @seanmalloy @damemi @soltysh

@damemi (Contributor) commented Dec 10, 2020

I will let @soltysh speak more on the controller implementation side, but for the selection algorithm I agree with @ahg-g that a best-effort case shouldn't need to differentiate between hard and soft constraints.

I also don't think considering maxSkew would be too difficult; it could probably use the same algorithm we added for this in the descheduler (or something similar). A rough sketch of that kind of skew check is below.
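
A rough sketch of that kind of skew check, assuming Pod counts per topology domain are already known. The helpers below are hypothetical and are not the descheduler's actual code:

package main

import "fmt"

// skew returns the difference between the most- and least-populated domains.
func skew(podsPerDomain map[string]int) int {
    first := true
    min, max := 0, 0
    for _, n := range podsPerDomain {
        if first {
            min, max = n, n
            first = false
            continue
        }
        if n < min {
            min = n
        }
        if n > max {
            max = n
        }
    }
    return max - min
}

// okToRemoveFrom reports whether deleting one Pod from the given domain
// keeps the skew within maxSkew.
func okToRemoveFrom(podsPerDomain map[string]int, domain string, maxSkew int) bool {
    counts := map[string]int{}
    for d, n := range podsPerDomain {
        counts[d] = n
    }
    counts[domain]--
    return skew(counts) <= maxSkew
}

func main() {
    counts := map[string]int{"zone-1": 3, "zone-2": 2}
    fmt.Println(okToRemoveFrom(counts, "zone-1", 1)) // true: 2 vs 2
    fmt.Println(okToRemoveFrom(counts, "zone-2", 1)) // false: 3 vs 1 -> skew 2
}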

It would be great to have this balancing react on scale-down, though. Perhaps this could also tie into our work to export the scheduling framework, so that the controller-manager could use the logic straight from the plugins?

@ahg-g (Member, Author) commented Dec 10, 2020

/assign @damemi

@alculquicondor (Member)

I think it might be too hard to reuse the plugins' logic while keeping the ReplicaSet controller performant. But I'd be very pleased if you proved me wrong 😅

@Huang-Wei (Member)

/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Dec 12, 2020
@alculquicondor (Member)

I'm starting a KEP on this: kubernetes/enhancements#2185.

I'll pass down the implementation

@alculquicondor (Member)

/area workload-api/deployment

@Huang-Wei (Member)

Are we going to treat this issue as completely resolved by #99212? Or would we prefer to leave it open and explore other solutions in the future?

@ahg-g (Member, Author) commented Mar 24, 2021

I think #99212 and, if necessary, https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost (with a webhook that randomly sets the cost) can address this issue reasonably well. A small sketch of that annotation mechanism is below.
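
For illustration, a minimal sketch of the pod-cost mechanism from that KEP: a mutating webhook could stamp each Pod with a random controller.kubernetes.io/pod-deletion-cost annotation so the ReplicaSet controller picks scale-down victims effectively at random. setRandomDeletionCost is a hypothetical helper, the webhook plumbing is omitted, and v1 is k8s.io/api/core/v1:

package main

import (
    "fmt"
    "math/rand"
    "strconv"

    v1 "k8s.io/api/core/v1"
)

// setRandomDeletionCost stamps the Pod with a random pod-deletion-cost
// annotation; Pods with lower cost are preferred for deletion, so random
// values make scale-down victims effectively random among equal replicas.
func setRandomDeletionCost(pod *v1.Pod) {
    if pod.Annotations == nil {
        pod.Annotations = map[string]string{}
    }
    pod.Annotations["controller.kubernetes.io/pod-deletion-cost"] = strconv.Itoa(rand.Intn(1000))
}

func main() {
    p := &v1.Pod{}
    setRandomDeletionCost(p)
    fmt.Println(p.Annotations)
}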

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jun 22, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 22, 2021
@ahg-g (Member, Author) commented Aug 20, 2021

/close

@k8s-ci-robot (Contributor)

@ahg-g: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
