Pod Topology Spread constraints should be taken into account on scale down #96748
Comments
This means that controller-manager should start watching nodes; I am not sure if it does that already. Other questions:
I am inclined to say that topology spread is a special enough case to have native support for it in core k8s, since it is critical for HA; for all other cases, perhaps delete-priority is the answer?
delete-priority should take precedence. As for performance, the current logic already takes into consideration active Pods per Node, so it wouldn't be much of an increase to do the same for other topologies.
+1 to Aldo that delete-priority should take precedence. I think delete-priority, if approved, acts as an extensible hook for third-party integrators to influence down-scaling for any feature. Since it's not clear (yet) how compelling that could be, I think that for a GAed feature like PodTopologySpread, which has clear API semantics, the down-scaling optimization logic should be implemented in a more controllable and deterministic way.
I guess we have to consider case by case. We can start with PodTopologySpread.
Are we talking about hard constraints, or soft? It'd be easier to only cover the hard constraints.
We should do both in the same way, best effort.
We could choose to somehow honor MaxSkew. Going into implementation details, we only need to change the Rank (see kubernetes/pkg/controller/controller_utils.go, line 855 at 78318c7). Today, the rank is the number of Pods on the same Node.
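To make that rank change concrete, here is a minimal, self-contained sketch of a topology-aware rank; this is not the actual controller_utils.go code, and the function name, inputs, and the zone values are purely illustrative:

```go
package main

import "fmt"

// rankByTopologyDomain is a hypothetical helper: for each candidate pod
// (represented here only by the name of the node it runs on), it returns the
// number of candidate pods that share that node's topology domain (for
// example the value of topology.kubernetes.io/zone). A higher rank marks the
// pod as a better scale-down victim, analogous to today's pods-per-node rank.
func rankByTopologyDomain(podNodes []string, nodeToDomain map[string]string) []int {
	domainCount := map[string]int{}
	for _, node := range podNodes {
		domainCount[nodeToDomain[node]]++
	}
	ranks := make([]int, len(podNodes))
	for i, node := range podNodes {
		ranks[i] = domainCount[nodeToDomain[node]]
	}
	return ranks
}

func main() {
	// Three pods: two in zone-1, one in zone-2.
	podNodes := []string{"node-a", "node-b", "node-c"}
	zones := map[string]string{"node-a": "zone-1", "node-b": "zone-1", "node-c": "zone-2"}
	fmt.Println(rankByTopologyDomain(podNodes, zones)) // [2 2 1]
}
```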
One nit here is that pod topology spread constraints could define different selectors than the ReplicaSet.
I was digging through the code and started thinking that the overhead won't be negligible, as we would need to add a Node informer, and perhaps convert Node labels into a slice to reduce map accesses. While that implementation is still worth exploring, I was playing around with an idea that I've had for a while: relax the timestamp comparisons to a logarithmic scale (rounded to the nearest integer). This allows for some level of randomness, which is probably sufficient most of the time. The changes are pretty compact: #96898
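For illustration, a rough sketch of the logarithmic-bucket idea (this is not the code from #96898; the function name and the log base are assumptions): pod ages are rounded to the nearest integer on a log scale, so pods of roughly similar age compare as equal and the final tie-break becomes effectively random.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// logBucket maps a pod age to an integer bucket on a logarithmic scale.
// Pods whose ages land in the same bucket are treated as equally old.
func logBucket(age time.Duration) int {
	seconds := age.Seconds()
	if seconds < 1 {
		return 0
	}
	return int(math.Round(math.Log2(seconds)))
}

func main() {
	a := 50 * time.Minute
	b := 70 * time.Minute
	// Both land in the same bucket, so neither is preferred for deletion
	// based on age alone.
	fmt.Println(logBucket(a) == logBucket(b)) // true
}
```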
I will let @soltysh speak more on the controller implementation side, but for the selection algorithm I agree with @ahg-g that a best-effort case shouldn't need to differentiate between hard and soft constraints. I also don't think considering MaxSkew would be too difficult; it could probably use the same algorithm we added for this in descheduler (or something similar). It would be great to have this balancing able to react on scale-down though. Perhaps this could also tie into our work to export the scheduling framework, so that the controller-manager could use the logic right from the plugins?
/assign @damemi
I think it might be too hard to reuse plugins logic while keeping the replicaset controller performant. But I'd be very pleased if you prove me wrong 😅
/triage accepted
I'm starting a KEP on this: kubernetes/enhancements#2185. I'll pass down the implementation.
/area workload-api/deployment
Are we going to treat this issue as completely resolved by #99212? Or should we leave it open and explore other solutions in the future?
I think #99212 and, if necessary, https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost (with a webhook to randomly set the cost) can address this issue reasonably well.
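For context, KEP 2255 introduces the controller.kubernetes.io/pod-deletion-cost annotation, and a webhook or external controller can set it so the ReplicaSet controller prefers to delete lower-cost pods. A minimal client-go sketch, where the namespace, pod name, and cost value are made up for illustration:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Give a pod in an over-represented failure domain a lower deletion
	// cost, so the ReplicaSet controller prefers it as a scale-down victim.
	patch := []byte(`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"-10"}}}`)
	if _, err := client.CoreV1().Pods("default").Patch(
		context.TODO(), "my-app-7d4b9c-xyz12",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{},
	); err != nil {
		panic(err)
	}
}
```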
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/close
@ahg-g: Closing this issue. In response to this: /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What would you like to be added:
Pod topology spread constraints are currently only evaluated when scheduling a pod.
The ask is to also take them into account in kube-controller-manager when scaling down a ReplicaSet: the victim would be selected from the failure domain with the highest number of pods.
The risk is impacting kube-controller-manager performance.
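As a rough sketch of that selection logic (the helper below is illustrative, not a proposed API), the controller could pick the victim from whichever failure domain currently holds the most pods:

```go
package main

import "fmt"

// pickVictimDomain returns the failure domain (e.g. a zone) that currently
// holds the most pods of the ReplicaSet; a pod from that domain would be
// deleted first to preserve the spread.
func pickVictimDomain(podsPerDomain map[string]int) string {
	victim, max := "", -1
	for domain, count := range podsPerDomain {
		if count > max {
			victim, max = domain, count
		}
	}
	return victim
}

func main() {
	fmt.Println(pickVictimDomain(map[string]int{"zone-a": 3, "zone-b": 2, "zone-c": 2})) // zone-a
}
```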
Why is this needed:
To maintain the spread.
This is a special case of kubernetes/enhancements#1828 and kubernetes/enhancements#1888.
/sig apps
/sig scheduling