
Add max number of replicas per node/topologyKey to pod anti-affinity #40358

Closed
mwielgus opened this issue Jan 24, 2017 · 43 comments
@mwielgus
Contributor

In the current design there are basically two choices a user can make: either allow an indefinite number of pods to be scheduled on one node, or not allow more than 1 pod at all. In many cases it would be good to have a middle-ground option: a maximum of N pods scheduled in one node/topologyKey.

My primary use case is ClusterAutoscaler. In CA, when a ReplicaSet/Deployment is created and its pods cannot be scheduled due to lack of capacity, new nodes are added. If the nodes used in the cluster are big enough, it may happen that all replicas end up on a single node, which is probably not what the user wanted when creating a workload with multiple replicas. If the user specifies hard pod anti-affinity, they will get as many nodes as replicas, which may lead to poor node utilization.

The other use case is to ensure that not all pods go to a single rack, as the rack may die at some point, causing a big outage. On the other hand, we don't want to have only ONE pod per rack, as there might not be enough racks.
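For reference, a minimal sketch of what is expressible today (shown in the field form of the API; pod names, labels, and images are illustrative): hard pod anti-affinity caps matching pods at one per topology domain, and there is no field to relax that to "at most N".

```yaml
# Today's hard pod anti-affinity: at most ONE matching pod per node.
# There is no field to say "at most N"; names and labels are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx
```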

cc: @davidopp @wojtek-t @gmarek

@mwielgus mwielgus added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 24, 2017
@jredl-va

A use case from us (ignoring the applications running within the same cluster): our Jenkins CI server is currently running on GKE with local SSD disks attached to the nodes. Our builds are generally blazing fast; however, when multiple pods start stacking up on the same machine, disk I/O kills this performance. I would love to be able to tell Kubernetes to schedule no more than 2 pods per node.

@davidopp davidopp self-assigned this Jan 24, 2017
@davidopp
Member

davidopp commented Jan 24, 2017

We considered handling this use case @mwielgus described when we were designing pod anti-affinity, but we decided to wait until people really needed it, to avoid complicating the implementation.

I think this is an important feature, but there are a couple of issues with implementing it:

(1) The easiest API change is to just add an integer field to PodAffinityTerm. This could be considered a backward compatible change since we would just be adding a new field and could ignore it when not set. But then it will be possible to use it with all four combinations of {hard, soft} x {affinity, anti-affinity} and we only want this to be usable by one of those four combinations (hard anti-affinity). So we'd have to do something hacky like having validation reject setting this integer field when you are using anything other than hard anti-affinity (and of course add a comment to the API definition).

So the "right" way to do it is to create a HardPodAntiAffinityTerm, which is the same as today's PodAffinityTerm except it also has the new integer field. Then the old PodAffinityTerm would be used by WeightedPodAffinityTerm and by PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution, and the new HardPodAntiAffinityTerm would be used only by PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution. This is obviously a non-backward-compatible change, so we'd need to make it before moving to Beta. But this feature is too complicated to finish for 1.6, so we'd need to revert #39478 which moved pod affinity to Beta for 1.6.

But wait, it gets even worse -- since there is just a single Affinity type, if we decide to delay moving pod affinity to Beta, we should probably also delay moving node affinity to Beta, which would mean also rolling back #37299. (Otherwise we'd end up in the messy situation of having an Affinity in pod annotation for pod affinity and an Affinity in PodSpec for node affinity in 1.6. It works but is pretty confusing/messy.)

(2) We'd need to think carefully about how to implement this efficiently. The current way of specifying affinity/anti-affinity lets you match against arbitrary labels so you can write things like "max of three pods that match this arbitrary label selector per rack" and I'm not sure how efficiently that can be implemented.

So we have two options:

Option A

  • revert #39478 (and probably also #37299), keeping affinity out of Beta for 1.6
  • implement this feature the "right" way, with a separate HardPodAntiAffinityTerm used only by hard pod anti-affinity, before moving pod affinity to Beta

Option B

  • don't roll back anything
  • start working on this feature whenever we have time; implement this feature using the "hacky" approach (add an integer to PodAffinityTerm, and have validation reject use of this field except with hard pod anti-affinity); this is backward-compatible so does not block us from moving affinity to beta in 1.6 (see the sketch below)
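To make the "hacky" approach concrete, a hypothetical sketch of how the extra integer might surface in a pod spec; the field name maxReplicas is an assumption, not an accepted API, and the labels are illustrative:

```yaml
# Hypothetical sketch only: "maxReplicas" is not an accepted API field.
# Under Option B it would live on PodAffinityTerm, with validation rejecting it
# anywhere other than hard pod anti-affinity.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web
      topologyKey: kubernetes.io/hostname
      maxReplicas: 2   # hypothetical: allow at most 2 matching pods per node
```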

My opinion is that we should go with Option B. Option A is going to be a ton of work for no benefit other than making the API maybe a little clearer. Though one could argue that actually Option A will obscure the similarity between the different types of affinity and thus will actually make the API more confusing -- in which case it's completely pointless.

Thoughts?

cc/ @kubernetes/sig-scheduling-misc @derekwaynecarr @rrati

@davidopp
Member

@jredl-va I think your use case can already be handled using PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution. See this page of the user guide for more details.
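For reference, a minimal sketch of that suggestion, assuming the build pods carry an app: jenkins-agent label (the label and weight are illustrative):

```yaml
# Preferred (soft) anti-affinity: the scheduler tries to spread matching pods
# across nodes, but will still co-locate them if it has no other option.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: jenkins-agent
        topologyKey: kubernetes.io/hostname
```

This does not enforce a hard cap of 2 pods per node, but it biases the scheduler toward spreading the builds out.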

@timothysc
Member

I agree this is needed, and a huge +1 to the builder use case. We have no IOPS fences, and this is probably the best knob we would have for a while.

Option A isn't really an option ;-).
+1 to Option B, when time permits.

@k82cn
Member

k82cn commented Feb 2, 2017

+1 for B; when time permits, I'd like to work on this. @timothysc, would you shepherd it? I'll draft a doc for that.

@davidopp
Member

cc/ @bsalamat

@k82cn
Member

k82cn commented Feb 19, 2017

@davidopp, @mwielgus, @bsalamat, I drafted a design doc here; would you help review it? I'm going to create PRs for it. There are two points I'd like to highlight:

  1. When checking anti-affinity, "max number of replicas" will be counted at the pod level, e.g. if a pod matches 3 anti-affinity terms, the scheduler will count it as 1 (one pod) instead of 3.
  2. The other concern is performance: in satisfiesExistingPodsAntiAffinity, we need to count replicas toward the "max number of replicas" among the existing pods, and then account for the current pod when checking anti-affinity. The counting among existing pods is the overhead (performance impact).

@davidopp
Member

I haven't read the doc yet, but we should consider a couple of additional things beyond what this issue originally proposed, namely:

  • add a number to pod affinity that means "minimum of N per topology value", similar to this proposal of adding a number to pod anti-affinity that means "maximum of N per topology value" (a hypothetical form is sketched at the end of this comment)
  • preference versions of both (with associated weights, as with preference today)

I'm not really sold on the second thing -- we already have preference versions for pod affinity and anti-affinity with N=1 and it's not clear to me that people would really need to combine N>1 with preference (OTOH combining N>1 with hard requirement makes some sense).

I think that for now we should only implement the thing described originally in this issue, but should at least consider those other cases to make sure the design could accommodate them in the future if we decided to do them.
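A hypothetical serialized form of the first bullet, purely to visualize it; the field name minReplicas, the topology key, and the labels are assumptions, not an accepted API:

```yaml
# Hypothetical sketch only: "minReplicas" is not an accepted API field.
# It mirrors the proposed anti-affinity maximum, but for pod affinity:
# require at least N matching pods per topology domain (here, per zone).
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web
      topologyKey: failure-domain.beta.kubernetes.io/zone
      minReplicas: 2   # hypothetical: want at least 2 matching pods per zone
```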

@k82cn
Member

k82cn commented Feb 19, 2017

Sure; in this issue, I'll only address the original requirements.

preference versions of both (with associated weights, as with preference today)

To me, N>1 doesn't seem necessary for preference; is there a use case for it?

@bgrant0607
Member

Original issue: #3945

@k82cn
Member

k82cn commented Mar 8, 2017

/assign @k82cn

@kevin-wangzefeng
Member

/cc @alfred-huangjian @kubernetes/huawei

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2017
@Bekt

Bekt commented Jan 3, 2018

Instead of the hard integer limit described for Option B, it would be nice to be able to do this based on a percentage as well: a max of n% of the RC/Deployment replicas.

@bgrant0607
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2018
@k82cn
Member

k82cn commented Feb 28, 2018

According to wojtek-t@'s comments at #41718 (comment) and the following discussion with bsalamat@, we'd like to describe the requirement as "I want to have my app equally spread across zones". So I closed the PR based on AntiAffinity. I'll try to find another solution to address this requirement; it'd be great if anyone has suggestions on that :)

@wackxu
Contributor

wackxu commented Jun 4, 2018

"I want to have my app equally spreaded across zones"

@k82cn Can SelectorSpreadPriority implement this?

@k82cn
Member

k82cn commented Jun 4, 2018

Something similar to SelectorSpreadPriority may be helpful. SelectorSpreadPriority depends on ReplicaSet, StatefulSet and so on, which should be avoided. I think there was also a proposal/option to use OwnerReference, but I cannot remember the comments on it :).

@adampl

adampl commented Jul 18, 2018

In my case, a deployment should have a specific number (N>1) of replicas per node, regardless of the number of nodes. Is there any way to achieve this? And (if the answer is yes) is it also possible to have a different N for each node (or a set of nodes matching some selector)?

@Huang-Wei
Member

@BrendanThompson There was a design doc approaching "max pods per topology", but after brainstorming with Bobby, it turned out that might not be a good idea, especially regarding how to set that "max value"; see the discussion here.

Recently I drafted a KEP to achieve "even pods spreading" to resolve this and similar problems. Any comments are welcome.

@zhangxiaoyu-zidif
Contributor

zhangxiaoyu-zidif commented Feb 22, 2019 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 22, 2019
@Huang-Wei
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 9, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 6, 2019
@adampl

adampl commented Nov 7, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 7, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2020
@Huang-Wei
Member

PodTopologySpread (a.k.a. EvenPodsSpread) has been implemented (alpha in 1.16, beta in 1.18) to resolve the issue described here. @mwielgus Can we close this issue?
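For anyone landing here later, a minimal sketch of how the original "at most N per topology domain" ask maps onto a topology spread constraint (the labels and numbers are illustrative; the constraint goes in the pod spec):

```yaml
# Pod topology spread: keep matching pods roughly even across nodes by
# allowing a difference (skew) of at most 1 between nodes, instead of a
# fixed per-node cap. Labels and values are illustrative.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web
```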

@leoskyrocker

Relevant: #87662

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 8, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@j3j5

j3j5 commented May 21, 2020

Are there any plans to fix this? I see the issue was automatically closed (again), but it's been open for 3+ years. Does anybody know of a workaround? Thanks.

@adampl

adampl commented May 21, 2020

@j3j5 I think the solution is Pod Topology Spread Constraints and/or Horizontal Pod Autoscaler.

@j3j5

j3j5 commented May 21, 2020

Thanks @adampl, I think the Topology Spread Constraint is what I was looking for. I guess I'll have to wait until an upgrade to 1.18 is available on my clusters.
