
Pod Topology Spread constraints should be taken into account on scale down #96748

Closed
ahg-g opened this issue Nov 20, 2020 · 21 comments
Labels: area/workload-api/deployment, kind/feature, lifecycle/rotten, sig/apps, sig/scheduling, triage/accepted

Comments

@ahg-g (Member) commented Nov 20, 2020

What would you like to be added:
Pod topology spread constraints are currently only evaluated when scheduling a pod.

The ask is to also evaluate them in kube-controller-manager when scaling down a ReplicaSet: when selecting a victim, the logic would prefer pods in the failure domain with the highest number of pods (a sketch is included at the end of this description).

The risk is impacting kube-controller-manager performance.

Why is this needed:
To maintain the spread.

This is a special case of kubernetes/enhancements#1828 and kubernetes/enhancements#1888
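
A minimal sketch of that victim-selection idea, assuming the controller already knows each Pod's topology domain. The function pickVictim and the nodeDomain map are hypothetical, not existing kube-controller-manager code:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// pickVictim returns the index of the Pod whose failure domain currently
// holds the most Pods, so deleting it preserves the spread as much as possible.
// nodeDomain maps a Node name to its domain label value, e.g. the value of
// topology.kubernetes.io/zone.
func pickVictim(pods []*v1.Pod, nodeDomain map[string]string) int {
    domainCount := map[string]int{}
    for _, p := range pods {
        domainCount[nodeDomain[p.Spec.NodeName]]++
    }
    victim, best := -1, -1
    for i, p := range pods {
        if c := domainCount[nodeDomain[p.Spec.NodeName]]; c > best {
            victim, best = i, c
        }
    }
    return victim
}

func main() {
    pods := []*v1.Pod{
        {Spec: v1.PodSpec{NodeName: "node-a"}},
        {Spec: v1.PodSpec{NodeName: "node-b"}},
        {Spec: v1.PodSpec{NodeName: "node-c"}},
    }
    nodeDomain := map[string]string{"node-a": "zone-1", "node-b": "zone-1", "node-c": "zone-2"}
    // zone-1 holds 2 Pods vs 1 in zone-2, so the chosen victim is in zone-1.
    fmt.Println(pods[pickVictim(pods, nodeDomain)].Spec.NodeName)
}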

/sig apps
/sig scheduling

ahg-g added the kind/feature label Nov 20, 2020
k8s-ci-robot added the sig/apps, sig/scheduling, and needs-triage labels Nov 20, 2020
@ahg-g (Member, Author) commented Nov 20, 2020

This means that the controller-manager would have to start watching Nodes; I am not sure whether it does that already.

Other Questions:

  • How would this combine with delete-priority (kubernetes/enhancements#1828)?
  • Should we also consider other constraints, like inter-pod (anti-)affinity, node affinity, and taints?
  • If so, how should the rankings of the different constraints be combined? Additively, as we do in the scheduler?

I am inclined to say that topology spread is a special enough case to warrant native support in core Kubernetes, since it is critical for HA; for all the other cases, perhaps delete-priority is the answer.

@alculquicondor (Member)

delete-priority should take precedence.

As for performance, the current logic already takes active Pods per Node into consideration, so it wouldn't be much of an increase to do the same for other topology domains.

@Huang-Wei (Member)

+1 to Aldo that delete-priority should take precedence.

I think delete-priority, if approved, acts as an extensible hook for third-party integrators to influence down-scaling, for any feature. Since it's not yet clear how compelling that would be, I think that for a GA feature like PodTopologySpread, which has clear API semantics, the down-scaling optimization logic should be implemented in a more controllable and deterministic way.

  • Should we also consider other constraints like inter-pod (anti)affinity, node affinity and taints?

I guess we have to consider case by case. We can start with PodTopologySpread.

  • If so, how should the rankings of different constraints be combined? additively just like we do in the scheduler?

Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.

@ahg-g (Member, Author) commented Nov 20, 2020

Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.

We should handle both in the same way: best effort.

@alculquicondor (Member) commented Nov 20, 2020

Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.

We could choose to somehow honor maxSkew, but that's just going to make the algorithm unnecessarily complicated. We can simply select the Pods from the topology domains with fewer instances. In that sense, there wouldn't be any difference between hard and soft constraints.

Going into implementation details, we only need to change the Rank comparison:

if s.Rank[i] != s.Rank[j] {

Today, the rank is the number of Pods on the same Node:

ranks[i] = podsOnNode[pod.Spec.NodeName]
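
A hedged sketch of what a topology-aware rank could look like, assuming a Node informer supplies a map from Node name to zone. rankByTopologyDomain and zoneOfNode are hypothetical names, not the actual controller code, and v1 is assumed to be k8s.io/api/core/v1:

// Hypothetical variant of the ranking step: instead of counting Pods on the
// same Node, count Pods in the same topology domain (e.g. zone), so Pods in
// the most-crowded domain get the highest rank and are preferred for deletion.
func rankByTopologyDomain(pods []*v1.Pod, zoneOfNode map[string]string) []int {
    podsPerDomain := map[string]int{}
    for _, p := range pods {
        podsPerDomain[zoneOfNode[p.Spec.NodeName]]++
    }
    ranks := make([]int, len(pods))
    for i, p := range pods {
        ranks[i] = podsPerDomain[zoneOfNode[p.Spec.NodeName]]
    }
    return ranks
}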

@ahg-g (Member, Author) commented Nov 20, 2020

One nit here: pod topology spread constraints could define different label selectors than the ReplicaSet does.

@alculquicondor (Member) commented Nov 26, 2020

I was poring over the code and started thinking that the overhead won't be negligible, as we would need to add a Node informer, and perhaps convert Node labels into a slice to reduce map accesses.

While that implementation is still worth exploring, I was playing around with an idea that I've had for a while: relax the timestamp comparisons to a logarithmic scale (rounded to the nearest integer). This allows for some level of randomness, which most of the time is probably sufficient. The changes are pretty compact: #96898. A rough sketch of the idea follows.
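
A minimal illustration of the general idea (not the actual change in #96898): bucket a Pod's age by the rounded logarithm of its duration, so Pods of roughly similar age compare as equal and the final ordering among them becomes effectively random.

package main

import (
    "fmt"
    "math"
    "time"
)

// logBucket maps a Pod's age to the rounded base-2 logarithm of its duration
// in seconds; Pods falling into the same bucket are treated as equally old.
func logBucket(age time.Duration) int {
    s := age.Seconds()
    if s <= 1 {
        return 0
    }
    return int(math.Round(math.Log2(s)))
}

func main() {
    fmt.Println(logBucket(100 * time.Second)) // 7
    fmt.Println(logBucket(120 * time.Second)) // 7 -> ties with the 100s-old Pod
    fmt.Println(logBucket(10 * time.Second))  // 3 -> still clearly younger
}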

@ahg-g (Member, Author) commented Dec 10, 2020

/cc @seanmalloy @damemi @soltysh

@damemi (Contributor) commented Dec 10, 2020

I will let @soltysh speak more on the controller implementation side, but for the selection algorithm I agree with @ahg-g that a best-effort case shouldn't need to differentiate between hard and soft constraints.

I also don't think considering maxSkew would be too difficult; it could probably use the same algorithm we added for this in the descheduler (or something similar). A rough sketch of that kind of skew check is below.
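
A rough sketch of that kind of skew check, assuming Pod counts per topology domain are already known. The helpers below are hypothetical and are not the descheduler's actual code:

package main

import "fmt"

// skew returns the difference between the most- and least-populated domains.
func skew(podsPerDomain map[string]int) int {
    first := true
    min, max := 0, 0
    for _, n := range podsPerDomain {
        if first {
            min, max = n, n
            first = false
            continue
        }
        if n < min {
            min = n
        }
        if n > max {
            max = n
        }
    }
    return max - min
}

// okToRemoveFrom reports whether deleting one Pod from the given domain
// keeps the skew within maxSkew.
func okToRemoveFrom(podsPerDomain map[string]int, domain string, maxSkew int) bool {
    counts := map[string]int{}
    for d, n := range podsPerDomain {
        counts[d] = n
    }
    counts[domain]--
    return skew(counts) <= maxSkew
}

func main() {
    counts := map[string]int{"zone-1": 3, "zone-2": 2}
    fmt.Println(okToRemoveFrom(counts, "zone-1", 1)) // true: 2 vs 2
    fmt.Println(okToRemoveFrom(counts, "zone-2", 1)) // false: 3 vs 1 -> skew 2
}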

It would be great to have this balancing react on scale-down, though. Perhaps this could also tie into our work to export the scheduling framework, so that the controller-manager could use the logic straight from the plugins?

@ahg-g (Member, Author) commented Dec 10, 2020

/assign @damemi

@alculquicondor (Member)

I think it might be too hard to reuse the plugins' logic while keeping the ReplicaSet controller performant. But I'd be very pleased if you proved me wrong 😅

@Huang-Wei (Member)

/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Dec 12, 2020
@alculquicondor (Member)

I'm starting a KEP on this: kubernetes/enhancements#2185.

I'll pass down the implementation

@alculquicondor (Member)

/area workload-api/deployment

@Huang-Wei (Member)

Are we going to treat this issue as completely resolved by #99212? Or would we prefer to leave it open and explore other solutions in the future?

@ahg-g (Member, Author) commented Mar 24, 2021

I think #99212 and, if necessary, https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost (with a webhook that randomly sets the cost) can address this issue reasonably well. A small sketch of that annotation mechanism is below.
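
For illustration, a minimal sketch of the pod-cost mechanism from that KEP: a mutating webhook could stamp each Pod with a random controller.kubernetes.io/pod-deletion-cost annotation so the ReplicaSet controller picks scale-down victims effectively at random. setRandomDeletionCost is a hypothetical helper, the webhook plumbing is omitted, and v1 is k8s.io/api/core/v1:

package main

import (
    "fmt"
    "math/rand"
    "strconv"

    v1 "k8s.io/api/core/v1"
)

// setRandomDeletionCost stamps the Pod with a random pod-deletion-cost
// annotation; Pods with lower cost are preferred for deletion, so random
// values make scale-down victims effectively random among equal replicas.
func setRandomDeletionCost(pod *v1.Pod) {
    if pod.Annotations == nil {
        pod.Annotations = map[string]string{}
    }
    pod.Annotations["controller.kubernetes.io/pod-deletion-cost"] = strconv.Itoa(rand.Intn(1000))
}

func main() {
    p := &v1.Pod{}
    setRandomDeletionCost(p)
    fmt.Println(p.Annotations)
}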

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jun 22, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 22, 2021
@ahg-g (Member, Author) commented Aug 20, 2021

/close

@k8s-ci-robot (Contributor)

@ahg-g: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
