Rethink pod affinity/anti-affinity #72479

Open
wojtek-t opened this issue Jan 2, 2019 · 29 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@wojtek-t
Member

wojtek-t commented Jan 2, 2019

Pod Affinity/AntiAffinity is causing us a lot of pain from a performance/scalability perspective.

Although the scheduling SIG has put tremendous effort into improving it, it is still a very significant (perhaps the most significant) factor for scheduling throughput.

The reason this is so painful is that it is a cluster-wide function, i.e. we need to know the state of the whole cluster to compute the feasibility/priority of a single node. This is unique (I'm consciously ignoring aggregation in priority functions, e.g. normalization based on the scores of all nodes, which is the only other exception), and it is the root cause of our issues.

I think that in order to significantly improve throughput, we need to somehow tighten this predicate/priority.

I know it will be painful due to the deprecation process, but here I propose that the only allowed topology for pod affinity/anti-affinity be "node".
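For concreteness, a minimal sketch of the kind of spec this affects (pod and image names are illustrative); under this proposal, the node-level hostname label would be the only accepted topologyKey:

apiVersion: v1
kind: Pod
metadata:
  name: web-0                # hypothetical name
  labels:
    app: web
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        # Under this proposal only the node-level key would be allowed;
        # zone- or region-level keys would be rejected.
        topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx             # placeholder image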

What would this buy us:

  • computation of feasibility/priority for a node is local to that node only
  • we can fully parallelize all the computations across nodes

What do we lose:

  • the predicate/priority is no longer generic
    However, I claim that this function (as is) is not very useful in the generic case (the main use case is spreading across failure domains). And given that the predicate expresses "at most one per failure domain" (not "spread ~evenly"), it's not generic enough to solve many cases.

I think that if we tightened the pod affinity/anti-affinity topology to always be "node" and, in addition to that, introduced a "spread evenly" function, that would be more beneficial.

Thoughts?

/kind feature
/sig scheduling

@davidopp @bsalamat @kubernetes/sig-scheduling-feature-requests

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 2, 2019
@bsalamat
Member

bsalamat commented Jan 2, 2019

cc/ @Huang-Wei

@wojtek-t
I agree that a better and easier-to-use spreading feature would be very helpful. Many users have asked for it. We have been recently trying to add it by augmenting affinity/anti-affinity.
I also agree that restricting inter-pod affinity/anti-affinity to "node" would improve performance of the feature dramatically, but I believe it would probably make the feature less usable. Some users want to put multiple pods that communicate with one another in the same zone to avoid inter-zone traffic charges. There are most probably other important use-cases for the feature. As a result, I believe limiting topology to "node" will reduce usability of the feature significantly. (A zone-level affinity example is sketched after the list below.)
I would suggest the following as a middle-ground solution:

  1. Add a feature to spread pods evenly in an arbitrary topology domain.
  2. Limit inter-pod anti-affinity to "node" only. This helps because anti-affinity is symmetric and the scheduler won't need to check all pods in the cluster to ensure there is no anti-affinity. Other use-cases of anti-affinity that I am aware of will be addressed by the "spread evenly" feature.
  3. Limit inter-pod affinity to "node", "zone", and "region". I am on the fence about this item (despite suggesting it), because these labels are not exhaustive. For example, users may want to place pods in the same rack in an on-prem cluster. With this limitation, they won't be able to do so. On the other hand, this limitation makes it possible for the scheduler to cache nodes for these three topology domains, which in turn may help improve performance of the feature.
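To make the zone co-location use case above concrete, a hedged sketch (names are illustrative; the zone label shown is the current standard topology label, while older clusters used failure-domain.beta.kubernetes.io/zone):

apiVersion: v1
kind: Pod
metadata:
  name: frontend-0           # hypothetical name
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend
        # Co-locate with "backend" pods in the same zone to avoid
        # cross-zone traffic; a node-only restriction would rule out
        # exactly this kind of topologyKey.
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: frontend
    image: nginx             # placeholder image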

@resouer
Contributor

resouer commented Jan 3, 2019

First of all, I don't think it makes sense to restrict affinity & anti-affinity to the node level only. I would claim that using affinity at the cluster level is already a widely adopted pattern.

Add a feature to spread pods evenly in an arbitrary topology domain.

That said, I'm not so confident that we have a good even-spreading algorithm in the current scheduler's design. For instance, in the community we've seen users write a custom priority to spread pods evenly across the cluster, which is very tricky.

I believe @Huang-Wei has more input on this direction.

@k82cn
Member

k82cn commented Jan 3, 2019

Some users want to put multiple pods that communicate with one another in the same zone to avoid inter-zone traffic charges.

+1. IMO, supporting only "node"-level affinity/anti-affinity is not enough for users :)

Limit inter-pod affinity to "node", "zone", and "region".

Compared to "node" only, it's better to support some pre-defined "areas" for which we can introduce more "indexes" to speed things up.

@davidopp
Member

davidopp commented Jan 3, 2019

As a general observation, it seems like we can publish scalability numbers with the caveat that you will not get those results if you use certain scheduler features. The scheduler should be able to keep track of whether any pod is using a "slow" feature. (For example, assuming we make "node"-topology-key pod affinity/anti-affinity fast, then until the first pod with non-"node" affinity/anti-affinity is submitted there is no need to check all the nodes, so scheduling should stay fast.) We could supply an admission controller that blocks pods that use "slow" scheduling features (we already have an admission controller that blocks non-node hard pod anti-affinity, to prevent someone from blocking scheduling across a large part of a cluster, and this could be an extension of that).

So maybe we can make "node", "rack", and "zone" fast for affinity/anti-affinity and not worry about anything else? Or is it impossible to make "rack" and "zone" fast?

@kevin-wangzefeng
Member

Instead of breaking backward compatibility, I would recommend turning the decision over to users:

  1. Keep the current semantics and introduce a cache/index for pre-defined topology keys (specified when the scheduler bootstraps), for both inter-pod affinity and inter-pod anti-affinity.
  2. If a pod comes with an undefined topology key, the scheduler could fall back to the arbitrary-topology processing.
  3. Users can turn on a "topology key restriction" admission controller to reject pods with undefined topology keys in performance/scalability-sensitive cases.

@wgliang
Contributor

wgliang commented Jan 3, 2019

Instead of breaking backward compatibility, I would recommend turning the decision over to users

+1

@wojtek-t
Member Author

wojtek-t commented Jan 3, 2019

Some users want to put multiple pods that communicate with one another in the same zone to avoid inter-zone traffic charges.

This is an important use case, but I'm not convinced that pod affinity is actually the best way to do it. Using pod affinity means that scheduling the first pod in a particular topology determines all future scheduling decisions. Given that topologies may differ, I think users should instead choose an explicit topology (e.g. a given zone) and then use node selectors.
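A minimal sketch of the node-selector alternative described above (the label key and zone name are examples, not a recommendation from this thread):

apiVersion: v1
kind: Pod
metadata:
  name: worker-0             # hypothetical name
spec:
  # Pin the workload to an explicitly chosen zone instead of letting the
  # placement of the first pod decide it implicitly via pod affinity.
  nodeSelector:
    topology.kubernetes.io/zone: us-central1-a   # example zone
  containers:
  - name: worker
    image: busybox           # placeholder image
    command: ["sleep", "infinity"]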

Limit inter-pod anti-affinity to "node" only.

That also sounds reasonable to me (at least as a starting point to see where exactly we land).

Compared to "node" only, it's better to support some pre-defined "areas" for which we can introduce more "indexes" to speed things up.

I don't think this really helps. I would claim that things like "zone" or "region" would definitely have to be part of these predefined topologies, and you can easily end up with tens of thousands of pods in a single zone, which means processing them has to take time.

it seems like we can publish scalability numbers with the caveat that you will not get those results if you use certain scheduler features

The problem is that with pod affinity/anti-affinity (mostly anti-affinity) there is visible overhead even if you're not using it at all. This is what is concerning me most.

Or is it impossible to make "rack" and "zone" fast?

We don't have a good way of doing that (yet?).

Instead of breaking backward compatibility, I would recommend turning the decision over to users:

See my points above. In particular this point:

I don't think this really helps. I would claim that things like "zone" or "region" would definitely have to be part of these predefined topologies, and you can easily end up with tens of thousands of pods in a single zone, which means processing them has to take time.

So while in general I would like to avoid breaking backward compatibility, and I tried to avoid it in the past, looking at how much it costs us and how painful it is, I'm now leaning towards saying that it wasn't a good decision.

@k82cn
Member

k82cn commented Jan 3, 2019

Or is it impossible to make "rack" and "zone" fast?

We don't have a good way of doing that (yet?).

IMO, it's better to "try" this (making "rack" and "zone" faster) first, rather than changing the behaviour; WDYT?

@wojtek-t
Member Author

wojtek-t commented Jan 3, 2019

IMO, it's better to "try" this (making "rack" and "zone" faster) first, rather than changing the behaviour

How would you like to do it? I'm happy to hear your thoughts.

@xiaoxubeii
Member

The problem is that with pod affinity/anti-affinity (mostly anti-affinity) there is visible overhead even if you're not using it at all. This is what is concerning me most.

I see that the Kubernetes scheduler can take a policy to specify its own predicates and priorities:
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/

func (c *configFactory) CreateFromConfig(policy schedulerapi.Policy) (*Config, error) {
    klog.V(2).Infof("Creating scheduler from configuration: %v", policy)

    // validate the policy configuration
    if err := validation.ValidatePolicy(policy); err != nil {
        return nil, err
    }

    predicateKeys := sets.NewString()
    if policy.Predicates == nil {
        klog.V(2).Infof("Using predicates from algorithm provider '%v'", DefaultProvider)
        provider, err := GetAlgorithmProvider(DefaultProvider)
        if err != nil {
            return nil, err
        }
        predicateKeys = provider.FitPredicateKeys
    } else {
        for _, predicate := range policy.Predicates {
            klog.V(2).Infof("Registering predicate: %s", predicate.Name)
            predicateKeys.Insert(RegisterCustomFitPredicate(predicate))
        }
    }

But they are all marked as DEPRECATED. Is this one way to achieve disabling pod affinity/anti-affinity?
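If I understand the (now-deprecated) policy mechanism correctly, a policy file that simply omits the inter-pod affinity predicate and priority would skip those computations entirely. A hedged sketch, using the documented legacy names (MatchInterPodAffinity / InterPodAffinityPriority are the ones omitted), in the JSON form the --policy-config-file flag has historically accepted; exact names can differ by version:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}

Note that dropping MatchInterPodAffinity would also drop the symmetry check, i.e. existing pods' anti-affinity terms would no longer be enforced against new pods - which is exactly the overhead being discussed, but it is also a behavioural change.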

@bsalamat
Member

This is an important use case, but I'm not convinced that pod affinity is actually the best way to do it. Using pod affinity means that scheduling the first pod in a particular topology determines all future scheduling decisions. Given that topologies may differ, I think users should instead choose an explicit topology (e.g. a given zone) and then use node selectors.

Using node selectors is not portable from one cluster to another. It also has the issue that if a user pins pods to nodes in a specific zone, their pods won't be scheduled if that zone is full while other zones have capacity. Inter-pod affinity is a lot more convenient and much more portable than node selectors.

@bsalamat
Member

bsalamat commented Jan 10, 2019

So maybe we can make "node", "rack", and "zone" fast for affinity/anti-affinity and not worry about anything else? Or is it impossible to make "rack" and "zone" fast?

Rack is not a standard label in K8s, but we can make "node" and "zone" faster. We could add "region" too, but I don't think it helps a lot, as most clusters don't span multiple regions, and even if they do, there are probably not many regions per cluster. As a result, a region usually contains a large portion of the cluster's nodes, and filtering by region won't help much.

@Huang-Wei
Member

My 2 cents:

Re @wojtek-t:

And given that the predicate expresses "at most one per failure domain" (not "spread ~evenly"), it's not generic enough to solve many cases.

We're trying to come up with a KEP to solve this pain point - called "PodsEvenSpreading" or "MaxPodsPerTopology"; the scope and approach are still being discussed.

Re @bsalamat:

Add a feature to spread pods evenly in an arbitrary topology domain.

Generally I agree with this. Let's discuss the details on the proposal, and I will revise that proposal later.

Re @bsalamat

Limit inter-pod affinity to "node", "zone", and "region".

and @k82cn:

Compared to "node" only, it's better to support some pre-defined "areas" for which we can introduce more "indexes" to speed things up.

Along with the PR that augments the semantics of PodAffinity from a single pod to multiple pods, a structure called "TopologyInfo" (keyed/valued by node label pairs) is introduced. If that approach can reach consensus, then with some additional code changes (I wouldn't expect many), we basically don't need to limit it to node/zone/region.

Re @wojtek-t:

Using pod affinity means that scheduling the first pod in a particular topology determines all future scheduling decisions. Given that topologies may differ, I think users should instead choose an explicit topology (e.g. a given zone) and then use node selectors.

It's true that some use cases can be implemented using NodeAffinity with region/zone labels. But with that approach, users' YAMLs aren't portable across cloud providers, as the node labels differ.

The problem is that with pod affinity/anti-affinity (mostly anti-affinity) there is visible overhead even if you're not using it at all. This is what is concerning me most.

Do you mean a) even if a workload isn't using pod affinity/anti-affinity, there is still visible overhead, or b) if a workload is using pod affinity/anti-affinity, but it has failed all nodes in previous predicates, it still has visible overhead?

@wojtek-t
Member Author

Do you mean a) even if a workload isn't using pod affinity/anti-affinity, there is still visible overhead, or b) if a workload is using pod affinity/anti-affinity, but it has failed all nodes in previous predicates, it still has visible overhead?

(a)

@bsalamat @k82cn - I would like to try to start converging on that.
While I personally am not fully convinced, I can see your arguments about the necessity of a generic "PodAffinity" feature. So let's skip this for now.

This is less of an issue, because PodAffinity is computed only if a given pod is using that feature.
So that doesn't impact pods that don't use it, which means I'm less concerned about it.

But I think that up until now, no one has provided a real argument for why we shouldn't consider changing the "anti-affinity" semantics.
So I guess what @bsalamat proposed in #72479 (comment)
in points (1) and (2) is pretty much exactly what I had in mind.

So the proposal would be:

  • introduce a new "spread your pods evenly across arbitrary topology" feature
  • change the "pod-anti-affinity" to only accept "node" topology (for both predicate and priority) [and if we agree on that, we should do it ASAP to limit the number of use cases relying on it]

Is that something that at least from a first glance seems acceptable?

@bsalamat
Member

So the proposal would be:

  1. introduce a new "spread your pods evenly across arbitrary topology" feature
  2. change the "pod-anti-affinity" to only accept "node" topology (for both predicate and priority) [and if we agree on that, we should do it ASAP to limit the number of use cases relying on it]

Yes, this is what I proposed and I obviously agree with it; however, point 2 is already hard to do, as there are many users of inter-pod anti-affinity.
What we can do is to document that pod anti-affinity with any topology other than node is deprecated and shouldn't be used. The scheduler implements a fast path to process anti-affinity with topology=node. It keeps the existing logic of anti-affinity with arbitrary topology as a slow path alternative. If it sees a pod with anti-affinity and a topology key other than "node", it will switch to the slow path and will never switch back to the fast path as long as the scheduler is alive. In other words, deletion of the pod with anti-affinity does not change it back to the fast path.
This approach has the drawback that we need to maintain two implementations (the slow and fast paths) plus some extra logic to flip a flag once the scheduler sees a pod with anti-affinity and a non-node topology. On the positive side, it gives our customers enough time to switch their configurations, and we don't have to wait for the whole deprecation period to pass before we can optimize the scheduler.

@Huang-Wei
Member

+1 for @bsalamat's proposal. And I can rewrite the current MaxPodsPerTopology proposal to aim for the "spread your pods evenly across arbitrary topology" feature.

@wojtek-t
Member Author

Yes, this is what I proposed and I obviously agree with it; however, point 2 is already hard to do, as there are many users of inter-pod anti-affinity.
What we can do is to document that pod anti-affinity with any topology other than node is deprecated and shouldn't be used.

This feature is not in GA and there is a formal deprecation policy. We should document it, make it part of the release notes, and then we would be able to remove it after 3 releases:
https://kubernetes.io/docs/reference/using-api/deprecation-policy/
The exact way to remove it (how to handle existing pods, etc.) should be discussed in a dedicated issue/KEP.

The scheduler implements a fast path to process anti-affinity with topology=node. It keeps the existing logic of anti-affinity with arbitrary topology as a slow path alternative. If it sees a pod with anti-affinity and a topology key other than "node", it will switch to the slow path and will never switch back to the fast path as long as the scheduler is alive. In other words, deletion of the pod with anti-affinity does not change it back to the fast path.

I claim that once we eliminate pod anti-affinity with a topology other than node, we can speed up the scheduler visibly even in the fast mode.
In particular, we are trying to precompute things in predicate metadata for anti-affinity:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithm/predicates/metadata.go#L84
Once we only support node, all the computations will be local, so there is no need to precompute anything there. And this should visibly help.
Obviously this will take another 3 releases (or actually 4, including this one), but I still believe that we should make this call and do it, rather than suffering from it forever.

+1 for @bsalamat's proposal. And I can rewrite the current MaxPodsPerTopology proposal to aim for the "spread your pods evenly across arbitrary topology" feature.

That would be great - thanks @Huang-Wei

@krmayankk

krmayankk commented Jan 22, 2019

Interesting discussion. I am a user of pod anti-affinity, and we need it to spread pods across failure domains or zones (a topology key other than node). Without that feature, I can't see a good way to do that portably.

  • What size clusters are we talking about when we say pod anti-affinity is a huge performance hit?
  • One of the comments was that even if pod anti-affinity is not being used, it still eats up performance; curious why?

I think that before a feature for spreading pods across an arbitrary topology exists, we shouldn't remove the pod anti-affinity feature.

With managed K8s offerings like GKE, AKS, etc., creating multiple smaller clusters is more the norm than creating huge clusters and experiencing performance hits.

+1 on some way to disable/admission-control this if it's a huge performance issue. Do we have any discussion of this on discuss.k8s or anywhere else with users asking for this to be removed?

@Huang-Wei
Member

  • One of the comments was that even if pod anti-affinity is not being used, it still eats up performance; curious why?

One reason is the need to guarantee the symmetry of anti-affinity - i.e. even if an incoming pod isn't using pod anti-affinity, a calculation is still needed to ensure it doesn't violate the anti-affinity of existing pods.
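A small sketch of what that symmetry means in practice (names and images are illustrative):

# An existing pod that is already running and declares anti-affinity:
apiVersion: v1
kind: Pod
metadata:
  name: db-0                 # hypothetical name
  labels:
    app: db
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: db
    image: postgres          # placeholder image
---
# An incoming pod that declares no affinity at all. Because of symmetry,
# the scheduler still has to scan existing pods' anti-affinity terms and
# keep this pod out of db-0's zone, since it carries the app=web label.
apiVersion: v1
kind: Pod
metadata:
  name: web-0                # hypothetical name
  labels:
    app: web
spec:
  containers:
  - name: web
    image: nginx             # placeholder image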

@wojtek-t
Member Author

What size clusters are we talking about when we say pod anti-affinity is a huge performance hit?

Even on clusters with a few hundred nodes the impact is visible. And there are many people creating such clusters.

I think that before a feature for spreading pods across an arbitrary topology exists, we shouldn't remove the pod anti-affinity feature.

It won't be removed for at least the next 3 releases due to the deprecation policy. But that doesn't mean we can't deprecate it, IMHO.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@yastij
Member

yastij commented Apr 28, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@yastij
Member

yastij commented Apr 28, 2019

/priority important-soon
/sig scalability

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Apr 28, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2019
@wojtek-t
Member Author

/remove-lifecycle stale
/lifecycle frozen

Effectively, this is already happening.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 29, 2019
@0xmichalis
Contributor

0xmichalis commented Feb 1, 2021

@wojtek-t what is the status of this issue?

We are currently using anti-affinity to spread pods across failure domains, it would be a pity to see this go away. Eventually, we would also like to use pod affinity to group specific deployments on the same nodes, but for now spreading across failure domains is the most important feature.

@Huang-Wei
Member

@Kargakis The behavior of pod anti-affinity actually remains the same - unless you proactively enable the LimitPodHardAntiAffinityTopology admission plugin (see the default admission plugins).

On the other hand, you may use PodTopologySpread to spread pods across failure domains - but unlike PodAntiAffinity, the (N+1)th pod can still be scheduled after the existing N pods have been distributed evenly across N domains.
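For anyone landing here, a minimal PodTopologySpread example (names are illustrative); with maxSkew: 1 the pods are kept evenly spread across zones, yet an (N+1)th pod is still schedulable once every zone already has one:

apiVersion: v1
kind: Pod
metadata:
  name: web-0                # hypothetical name
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    # DoNotSchedule makes this a hard constraint; ScheduleAnyway would
    # make it a soft (scoring-only) preference.
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx             # placeholder image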

@0xmichalis
Contributor

Thanks for the response @Huang-Wei, PodTopologySpread should suffice for our needs.

@binacs
Member

binacs commented May 5, 2023

/cc @binacs
