
Add max number of replicas per node/topologyKey to pod anti-affinity #40358

Closed
mwielgus opened this issue Jan 24, 2017 · 43 comments
@mwielgus
Contributor

In the current design there are basically two choices a user can make: either allow an indefinite number of pods to be scheduled on one node, or not allow more than 1 pod at all. In many cases it would be good to have a middle-ground option: a maximum of N pods scheduled in one node/topologyKey.

My primary use case is ClusterAutoscaler. In CA, when a ReplicaSet/Deployment is created and its pods cannot be scheduled due to lack of capacity, new nodes are added. If the nodes used in the cluster are big enough, it may happen that all replicas end up on a single node, which is probably not what the user wanted when creating a workload with multiple replicas. If the user specifies hard pod anti-affinity, they will get as many nodes as replicas, which may lead to poor node utilization.

The other use case is to ensure that not all pods go to a single rack, as the rack may die at some point, causing a big outage. On the other hand, we don't want to have only ONE pod per rack, as there might not be enough racks.
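For reference, a minimal sketch of what is expressible today (shown in the field form of the API; pod names, labels, and images are illustrative): hard pod anti-affinity caps matching pods at one per topology domain, and there is no field to relax that to "at most N".

```yaml
# Today's hard pod anti-affinity: at most ONE matching pod per node.
# There is no field to say "at most N"; names and labels are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx
```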

cc: @davidopp @wojtek-t @gmarek

@mwielgus mwielgus added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 24, 2017
@jredl-va

A use case from us (ignoring the applications running within the same cluster): our Jenkins CI server is currently running on GKE with local SSD disks attached to the nodes. Our builds are generally blazing fast; however, when multiple pods start stacking up on the same machine, disk I/O kills this performance. I would love to be able to tell Kubernetes to schedule no more than 2 pods per node.

@davidopp davidopp self-assigned this Jan 24, 2017
@davidopp
Member

davidopp commented Jan 24, 2017

We considered handling this use case @mwielgus described when we were designing pod anti-affinity, but we decided to wait until people really needed it, to avoid complicating the implementation.

I think this is an important feature, but there are a couple of issues with implementing it:

(1) The easiest API change is to just add an integer field to PodAffinityTerm. This could be considered a backward compatible change since we would just be adding a new field and could ignore it when not set. But then it will be possible to use it with all four combinations of {hard, soft} x {affinity, anti-affinity} and we only want this to be usable by one of those four combinations (hard anti-affinity). So we'd have to do something hacky like having validation reject setting this integer field when you are using anything other than hard anti-affinity (and of course add a comment to the API definition).

So the "right" way to do it is to create a HardPodAntiAffinityTerm, which is the same as today's PodAffinityTerm except it also has the new integer field. Then the old PodAffinityTerm would be used by WeightedPodAffinityTerm and by PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution, and the new HardPodAntiAffinityTerm would be used only by PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution. This is obviously a non-backward-compatible change, so we'd need to make it before moving to Beta. But this feature is too complicated to finish for 1.6, so we'd need to revert #39478 which moved pod affinity to Beta for 1.6.

But wait, it gets even worse -- since there is just a single Affinity type, if we decide to delay moving pod affinity to Beta, we should probably also delay moving node affinity to Beta, which would mean also rolling back #37299. (Otherwise we'd end up in the messy situation of having an Affinity in pod annotation for pod affinity and an Affinity in PodSpec for node affinity in 1.6. It works but is pretty confusing/messy.)

(2) We'd need to think carefully about how to implement this efficiently. The current way of specifying affinity/anti-affinity lets you match against arbitrary labels so you can write things like "max of three pods that match this arbitrary label selector per rack" and I'm not sure how efficiently that can be implemented.

So we have two options:

Option A

  • revert #39478 (and probably also #37299), keeping affinity out of Beta for 1.6
  • implement this feature the "right" way, with a separate HardPodAntiAffinityTerm used only by hard pod anti-affinity, before moving pod affinity to Beta

Option B

  • don't roll back anything
  • start working on this feature whenever we have time; implement this feature using the "hacky" approach (add an integer to PodAffinityTerm, and have validation reject use of this field except with hard pod anti-affinity); this is backward-compatible so does not block us from moving affinity to beta in 1.6 (see the sketch below)
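To make the "hacky" approach concrete, a hypothetical sketch of how the extra integer might surface in a pod spec; the field name maxReplicas is an assumption, not an accepted API, and the labels are illustrative:

```yaml
# Hypothetical sketch only: "maxReplicas" is not an accepted API field.
# Under Option B it would live on PodAffinityTerm, with validation rejecting it
# anywhere other than hard pod anti-affinity.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web
      topologyKey: kubernetes.io/hostname
      maxReplicas: 2   # hypothetical: allow at most 2 matching pods per node
```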

My opinion is that we should go with Option B. Option A is going to be a ton of work for no benefit other than making the API maybe a little clearer. Though one could argue that actually Option A will obscure the similarity between the different types of affinity and thus will actually make the API more confusing -- in which case it's completely pointless.

Thoughts?

cc/ @kubernetes/sig-scheduling-misc @derekwaynecarr @rrati

@davidopp
Member

@jredl-va I think your use case can already be handled using PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution. See this page of the user guide for more details.
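For reference, a minimal sketch of that suggestion, assuming the build pods carry an app: jenkins-agent label (the label and weight are illustrative):

```yaml
# Preferred (soft) anti-affinity: the scheduler tries to spread matching pods
# across nodes, but will still co-locate them if it has no other option.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: jenkins-agent
        topologyKey: kubernetes.io/hostname
```

This does not enforce a hard cap of 2 pods per node, but it biases the scheduler toward spreading the builds out.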

@timothysc
Member

I agree this is needed, and a huge +1 to the builder use case. We have no IOPS fences, and this is probably the best knob we would have for a while.

Option A isn't really an option ;-).
+1 to Option B, when time permits.

@k82cn
Member

k82cn commented Feb 2, 2017

+1 for B; when time permits, I'd like to work on this. @timothysc, would you shepherd it? I'll draft a doc for that.

@davidopp
Member

cc/ @bsalamat

@k82cn
Member

k82cn commented Feb 19, 2017

@davidopp, @mwielgus, @bsalamat, I drafted a design doc here; would you help review it? I'm going to create PRs for it. There are two points I'd like to highlight:

  1. When checking anti-affinity, "max number of replicas" will be counted at the pod level, e.g. if a pod matches 3 anti-affinity terms, the scheduler will count it as 1 (one pod) instead of 3.
  2. The other concern is performance: in satisfiesExistingPodsAntiAffinity, we need to count replicas toward the "max number of replicas" among the existing pods, and then account for the current pod when checking anti-affinity. The counting among existing pods is the overhead (performance impact).

@davidopp
Member

I haven't read the doc yet, but we should consider a couple of additional things beyond what this issue originally proposed, namely:

  • add a number to pod affinity that means "minimum of N per topology value", similar to this proposal of adding a number to pod anti-affinity that means "maximum of N per topology value" (a hypothetical form is sketched at the end of this comment)
  • preference versions of both (with associated weights, as with preference today)

I'm not really sold on the second thing -- we already have preference versions for pod affinity and anti-affinity with N=1 and it's not clear to me that people would really need to combine N>1 with preference (OTOH combining N>1 with hard requirement makes some sense).

I think that for now we should only implement the thing described originally in this issue, but should at least consider those other cases to make sure the design could accommodate them in the future if we decided to do them.
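A hypothetical serialized form of the first bullet, purely to visualize it; the field name minReplicas, the topology key, and the labels are assumptions, not an accepted API:

```yaml
# Hypothetical sketch only: "minReplicas" is not an accepted API field.
# It mirrors the proposed anti-affinity maximum, but for pod affinity:
# require at least N matching pods per topology domain (here, per zone).
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web
      topologyKey: failure-domain.beta.kubernetes.io/zone
      minReplicas: 2   # hypothetical: want at least 2 matching pods per zone
```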

@k82cn
Member

k82cn commented Feb 19, 2017

Sure; in this issue, I'll only address the original requirements.

preference versions of both (with associated weights, as with preference today)

To me, N>1 doesn't seem necessary for preference; is there a use case for it?

@bgrant0607
Member

Original issue: #3945

@k82cn
Member

k82cn commented Mar 8, 2017

/assign @k82cn

@kevin-wangzefeng
Member

/cc @alfred-huangjian @kubernetes/huawei

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2017
@Bekt

Bekt commented Jan 3, 2018

Instead of the hard integer limit described for Option B, it would be nice to be able to do this based on a percentage as well: a max of n% of the RC/Deployment replicas.

@bgrant0607
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2018
@k82cn
Member

k82cn commented Feb 28, 2018

According to wojtek-t@'s comments at #41718 (comment) and the following discussion with bsalamat@, we'd like to describe the requirement as "I want to have my app equally spread across zones". So I closed the PR based on AntiAffinity. I'll try to find another solution to address this requirement; it'd be great if anyone has suggestions on that :)

@wackxu
Contributor

wackxu commented Jun 4, 2018

"I want to have my app equally spreaded across zones"

@k82cn Can SelectorSpreadPriority implement this?

@k82cn
Member

k82cn commented Jun 4, 2018

Something similar to SelectorSpreadPriority may be helpful. SelectorSpreadPriority depends on ReplicaSet, StatefulSet and so on, which should be avoided. I think there was also a proposal/option to use OwnerReference, but I cannot remember the comments on it :).

@adampl

adampl commented Jul 18, 2018

In my case, a deployment should have a specific number (N>1) of replicas per node, regardless of the number of nodes. Is there any way to achieve this? And (if the answer is yes) is it also possible to have a different N for each node (or a set of nodes matching some selector)?

@Huang-Wei
Member

@BrendanThompson There was a design doc approaching "max pods per topology", but after brainstorming with Bobby, it turned out that might not be a good idea, especially regarding how to set that "max value"; see the discussion here.

Recently I drafted a KEP to achieve "even pods spreading" to resolve this and similar problems. Any comments are welcome.

@zhangxiaoyu-zidif
Contributor

zhangxiaoyu-zidif commented Feb 22, 2019 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 22, 2019
@Huang-Wei
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 9, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 6, 2019
@adampl

adampl commented Nov 7, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2020
@Huang-Wei
Member

PodTopologySpread (a.k.a. EvenPodsSpread) has been implemented (alpha in 1.16, beta in 1.18) to resolve the issue described here. @mwielgus Can we close this issue?
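For anyone landing here later, a minimal sketch of how the original "at most N per topology domain" ask maps onto a topology spread constraint (the labels and numbers are illustrative; the constraint goes in the pod spec):

```yaml
# Pod topology spread: keep matching pods roughly even across nodes by
# allowing a difference (skew) of at most 1 between nodes, instead of a
# fixed per-node cap. Labels and values are illustrative.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
```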

@leoskyrocker

Relevant: #87662

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 8, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@j3j5

j3j5 commented May 21, 2020

Are there any plans to fix this? I see the issue was automatically closed (again), but it's been open for 3+ years. Does anybody know of a workaround? Thanks.

@adampl

adampl commented May 21, 2020

@j3j5 I think the solution is Pod Topology Spread Constraints and/or Horizontal Pod Autoscaler.

@j3j5

j3j5 commented May 21, 2020

Thanks @adampl, I think the Topology Spread Constraint is what I was looking for. I guess I'll have to wait until an upgrade to 1.18 is available on my clusters.
