Add max number of replicas per node/topologyKey to pod anti-affinity #40358
A use case from us ("ignoring the applications running within the same cluster"): our Jenkins CI server is currently running on GKE with local SSD disks attached to the nodes. Our builds are generally blazing fast, but when multiple pods start stacking up on the same machine, disk I/O kills this performance. I would love to be able to tell Kubernetes to schedule no more than 2 pods per node.
We considered handling this use case @mwielgus described when we were designing pod anti-affinity, but we decided to wait until people really needed it, to avoid complicating the implementation. I think this is an important feature, but there are a couple of issues with implementing it:

(1) The easiest API change is to just add an integer field to PodAffinityTerm. This could be considered a backward-compatible change, since we would just be adding a new field and could ignore it when not set. But then it would be possible to use it with all four combinations of {hard, soft} x {affinity, anti-affinity}, and we only want this to be usable by one of those four combinations (hard anti-affinity). So we'd have to do something hacky like having validation reject setting this integer field when you are using anything other than hard anti-affinity (and of course add a comment to the API definition).

The "right" way to do it is to create a HardPodAntiAffinityTerm, which is the same as today's PodAffinityTerm except that it also has the new integer field. Then the old PodAffinityTerm would be used by WeightedPodAffinityTerm and by PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution, and the new HardPodAntiAffinityTerm would be used only by PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution. This is obviously a non-backward-compatible change, so we'd need to make it before moving to Beta. But this feature is too complicated to finish for 1.6, so we'd need to revert #39478, which moved pod affinity to Beta for 1.6. But wait, it gets even worse: since there is just a single Affinity type, if we decide to delay moving pod affinity to Beta, we should probably also delay moving node affinity to Beta, which would mean also rolling back #37299. (Otherwise we'd end up in the messy situation of having an Affinity in a pod annotation for pod affinity and an Affinity in PodSpec for node affinity in 1.6. It works, but is pretty confusing/messy.)

(2) We'd need to think carefully about how to implement this efficiently. The current way of specifying affinity/anti-affinity lets you match against arbitrary labels, so you can write things like "max of three pods that match this arbitrary label selector per rack", and I'm not sure how efficiently that can be implemented.

So we have two options:

Option A

Option B
My opinion is that we should go with Option B. Option A is going to be a ton of work for no benefit other than making the API maybe a little clearer. Though one could argue that Option A will actually obscure the similarity between the different types of affinity and thus make the API more confusing -- in which case it's completely pointless. Thoughts? cc/ @kubernetes/sig-scheduling-misc @derekwaynecarr @rrati
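For illustration, here is a rough sketch of what the simpler change from (1), adding an integer to PodAffinityTerm, might look like in a pod spec. The `maxCount` field name is hypothetical and is not part of the Kubernetes API; the labels and image are made up for the Jenkins use case above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent
  labels:
    app: jenkins-agent
spec:
  containers:
  - name: agent
    image: jenkins/inbound-agent   # illustrative image
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: jenkins-agent
        topologyKey: kubernetes.io/hostname
        # Hypothetical field discussed in this issue, NOT part of the real API:
        # would allow up to 2 matching pods per node instead of at most 1.
        maxCount: 2
```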
I agree this is needed, and a huge +1 to the builder use case. We have no IOPS fencing and this is probably the best knob we would have for a while. Option A isn't really an option ;-).
+1 for B; regarding timing, I'd like to work on this. @timothysc, would you shepherd it? I'll draft a doc for it.
cc/ @bsalamat
@davidopp, @mwielgus, @bsalamat, I drafted a design doc here; would you help review it? I'm going to create PRs for it. There are two points I'd like to highlight:
I haven't read the doc yet, but we should consider a couple of additional things beyond what this issue originally proposed, namely:
I'm not really sold on the second thing -- we already have preference versions for pod affinity and anti-affinity with N=1 and it's not clear to me that people would really need to combine N>1 with preference (OTOH combining N>1 with hard requirement makes some sense). I think that for now we should only implement the thing described originally in this issue, but should at least consider those other cases to make sure the design could accommodate them in the future if we decided to do them.
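For reference, the existing "preference version" mentioned above (soft anti-affinity, which spreads pods best-effort, effectively N=1 per topology domain) looks roughly like this in a pod spec; the label values are illustrative:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app                        # illustrative selector
        topologyKey: kubernetes.io/hostname    # prefer at most one matching pod per node
```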
Sure; in this issue, I'll only address the original requirements.
To me,
Original issue: #3945
/assign @k82cn |
/cc @alfred-huangjian @kubernetes/huawei
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Instead of the hard integer limit described for Option B, it would be nice to support a percentage as well: a max of n% of the RC/Deployment replicas per node.
/remove-lifecycle stale
According to wojtek-t@'s comments at #41718 (comment) and the following discussion with bsalamat@, we'd like to describe the requirement as "I want to have my app equally spread across zones". So I closed the PR based on AntiAffinity. I'll try to find another solution to address this requirement; it would be great if anyone has suggestions on that :)
@k82cn Does
Something similar with
In my case, a deployment should have a specific number (N>1) of replicas per node, regardless of the number of nodes. Is there any way to achieve this? And (if the answer is yes) is it also possible to have a different N for each node (or a set of nodes matching some selector)?
@BrendanThompson There was a design doc proposing "max pods per topology". But after brainstorming with Bobby, it turns out that might not be a good idea, especially regarding how to set that "max value"; see the discussion here. Recently I drafted a KEP to achieve "even pods spreading" so as to resolve this and similar problems. Any comments are welcome.
`Require` is good, and `prefer` is redundant. Just imagine that when we deploy a DaemonSet in our K8s cluster, we actually know the number of replicas. The resources and replica counts are deliberately decided by users with some tooling, so we don't need to think too much about a graceful solution.
/remove-lifecycle rotten |
/remove-lifecycle rotten |
PodTopologySpread (a.k.a. EvenPodsSpread) is implemented (alpha in 1.16, beta in 1.18) to resolve the issue described here. @mwielgus Can we close this issue?
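As a rough illustration of how PodTopologySpread addresses the original request (a sketch, not taken from this thread; names and replica counts are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                               # hard: per-node counts may differ by at most 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      - maxSkew: 1                               # soft: prefer even spread across zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx
```

Note that `maxSkew` bounds the difference in pod counts between topology domains rather than setting an absolute per-node maximum, which is what the spreading use cases in this thread generally need.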
Relevant: #87662
@fejta-bot: Closing this issue.
Are there any plans to fix this? I see the issue was automatically closed (again), but it's been open for 3+ years. Does anybody know a workaround? Thanks.
@j3j5 I think the solution is Pod Topology Spread Constraints and/or Horizontal Pod Autoscaler.
Thanks @adampl, I think the Topology Spread Constraint is what I was looking for. I guess I'll have to wait until an upgrade to 1.18 is available on my clusters.
In the current design there are basically two choices a user can make: either allow an unlimited number of pods to be scheduled on one node, or not allow more than 1 pod at all. In many cases it would be good to have a middle-ground option: a maximum of N pods scheduled per node/topologyKey.
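For context, the "not more than 1 pod" choice mentioned above is what hard pod anti-affinity expresses today; a minimal sketch, with an illustrative app label:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app                         # illustrative selector
      topologyKey: kubernetes.io/hostname     # at most one matching pod per node
```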
My primary use case is the ClusterAutoscaler. In CA, when a ReplicaSet/Deployment is created and its pods cannot be scheduled due to lack of capacity, new nodes are added. If the nodes used in the cluster are big enough, it may happen that all replicas end up on a single node, which is probably not what the user wanted when creating pods with multiple replicas. If the user specifies hard pod anti-affinity, they will get as many nodes as replicas, which may lead to poor node utilization.
The other use case is to ensure that not all pods go to a single rack, as the rack may die at some point and cause a big outage. On the other hand, we don't want to have only ONE pod per rack, as there might not be enough racks.
cc: @davidopp @wojtek-t @gmarek