Figure out and implement custom handling for MatchInterPodAffinity predicate #257
Comments
This will not be easy to deal with, and it looks like we can work around the performance problem with #253, so I think this is best left to the 1.9 timeframe.
@wojtek-t I remember discussing with you some time ago about not precomputing stuff in the scheduler. Has that changed in the meantime? :)
I don't understand. Can you clarify?
Sorry - when we were discussing e.g. zone-level soft inter-pod anti-affinity, from what I understood the scheduler was doing O(nodes*pods) computation when scheduling a new pod, though it seemed doable in O(pods + nodes) with precomputation. From the comment it sounded like it's precomputing now. But I think I misunderstood, as soft anti-affinity uses priorities, while hard anti-affinity uses predicates. Correct me if I said something stupid.
Yes - we were discussing some potential optimizations in the scheduler, and those are still not done. But we do some caching in the scheduler, and we don't do anything in Cluster Autoscaler.
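For illustration, a minimal sketch of the precomputation idea from the last two comments, with made-up types and a deliberately simplified matching rule (not the scheduler's actual implementation): one pass over pods to count anti-affinity conflicts per zone, then a map lookup per node, i.e. O(pods + nodes) instead of O(nodes*pods).

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct {
	Name   string
	Labels map[string]string
}

type Node struct {
	Name string
	Zone string // value of the topology key, e.g. the zone label
}

// matchesAntiAffinity is a hypothetical check: does an existing pod's
// anti-affinity term select the pod we are trying to schedule?
func matchesAntiAffinity(existing, candidate Pod) bool {
	return existing.Labels["app"] == candidate.Labels["app"]
}

// feasibleZones is the O(pods + nodes) variant: one pass over pods to
// count conflicts per zone, then one map lookup per node.
func feasibleZones(pods []Pod, podZone map[string]string, nodes []Node, candidate Pod) []Node {
	conflicts := map[string]int{}
	for _, p := range pods { // O(pods)
		if matchesAntiAffinity(p, candidate) {
			conflicts[podZone[p.Name]]++
		}
	}
	var ok []Node
	for _, n := range nodes { // O(nodes)
		if conflicts[n.Zone] == 0 {
			ok = append(ok, n)
		}
	}
	return ok
}

func main() {
	pods := []Pod{{Name: "a", Labels: map[string]string{"app": "web"}}}
	podZone := map[string]string{"a": "zone-1"}
	nodes := []Node{{Name: "n1", Zone: "zone-1"}, {Name: "n2", Zone: "zone-2"}}
	fmt.Println(feasibleZones(pods, podZone, nodes, Pod{Name: "b", Labels: map[string]string{"app": "web"}}))
}
```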
If I'm understanding your statement correctly, @bsalamat ran into almost exactly the same problem in the preemption work (the need to run scheduler predicates against cluster state minus preempted pods on some node). He wrote a function FilteredList() that is like List() but subtracts out the pods that match a filter provided as an argument. You might be able to do the opposite: define a function that does a List() but also returns some extra pods that you supply as an argument? I may be misunderstanding your issue completely, though.
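A rough sketch of that "opposite of FilteredList()" idea, using made-up minimal types rather than the scheduler's real lister interface: wrap the informer-backed List() and append the extra, simulated pods supplied by the caller.

```go
package main

import "fmt"

// Pod is a simplified stand-in for *v1.Pod.
type Pod struct{ Name string }

// PodLister is a minimal, hypothetical version of the lister the
// predicates consult for "all pods in the cluster".
type PodLister interface {
	List() ([]Pod, error)
}

// augmentedLister returns everything the underlying (informer-backed)
// lister returns, plus pods that only exist in the simulated state.
type augmentedLister struct {
	base  PodLister
	extra []Pod
}

func (l *augmentedLister) List() ([]Pod, error) {
	pods, err := l.base.List()
	if err != nil {
		return nil, err
	}
	return append(append([]Pod{}, pods...), l.extra...), nil
}

// staticLister stands in for the real informer-backed lister.
type staticLister struct{ pods []Pod }

func (s *staticLister) List() ([]Pod, error) { return s.pods, nil }

func main() {
	real := &staticLister{pods: []Pod{{Name: "existing-pod"}}}
	sim := &augmentedLister{base: real, extra: []Pod{{Name: "simulated-pod"}}}
	pods, _ := sim.List()
	fmt.Println(pods) // existing-pod plus simulated-pod
}
```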
@bskiba ran some follow-up performance tests and I've spent time experimenting with predicates. Below are the results of those investigations:
My current plan is to have PRs implementing both points 2 and 3 ready by tomorrow noon. @davidopp @bsalamat does this approach sound reasonable to you?
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
We should look more deeply into kubernetes/kubernetes#54189 and see how it impacts CA. We will get those changes with the godeps update for 1.2.0 anyway, but we should spend some time understanding them. In particular we should make sure that:
@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem with the affinity predicate taking data from informers, mentioned in the description of this issue, is still not fixed. Other optimizations were easier to implement and got us to our performance target, so we never actually started using predicateMeta when binpacking. Note that this is trickier than in the scale-down case, because:
@MaciekPytel what do you mean? Is it a soft AntiAffinity (preferredDuringSchedulingIgnoredDuringExecution)?
It's a hard Pod AntiAffinity (requiredDuringSchedulingIgnoredDuringExecution).
Cluster Autoscaler completely ignores "soft" scheduler requirements (aka scores). preferredDuringScheduling doesn't impact CA's decisions or performance in any way.
Thanks, so it is safe to use soft AntiAffinity, since CA completely ignores it. Thanks.
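For reference, a rough sketch of the two forms discussed above, built with the standard k8s.io/api types (the selector, weight, and topology key are just illustrative values): the required form is a hard predicate that CA evaluates, while the preferred form is a soft score that CA ignores.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityExamples builds a hard and a soft pod anti-affinity rule
// with the same selector and topology key, to show where they differ.
func antiAffinityExamples() (hard, soft *v1.PodAntiAffinity) {
	selector := &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}}

	// Hard form: evaluated as a predicate, so it affects CA's simulations.
	hard = &v1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
			LabelSelector: selector,
			TopologyKey:   "kubernetes.io/hostname",
		}},
	}

	// Soft form: only a scheduling score, ignored by Cluster Autoscaler.
	soft = &v1.PodAntiAffinity{
		PreferredDuringSchedulingIgnoredDuringExecution: []v1.WeightedPodAffinityTerm{{
			Weight: 100,
			PodAffinityTerm: v1.PodAffinityTerm{
				LabelSelector: selector,
				TopologyKey:   "kubernetes.io/hostname",
			},
		}},
	}
	return hard, soft
}

func main() {
	hard, soft := antiAffinityExamples()
	fmt.Println(hard.RequiredDuringSchedulingIgnoredDuringExecution[0].TopologyKey,
		soft.PreferredDuringSchedulingIgnoredDuringExecution[0].Weight)
}
```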
This seems to be doable now that the autoscaler can pass in the whole snapshot and is able to run PreFilters directly.
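A conceptual sketch of that flow, using hypothetical stand-in interfaces rather than the real scheduler framework API: run PreFilter once per pending pod against the whole snapshot to build per-pod state, then run the cheap per-node Filter reusing that state, so inter-pod affinity no longer needs a full pod scan for every node.

```go
package main

import "fmt"

// Illustrative stand-ins only; not the scheduler framework's real interfaces.
type Pod struct{ Name string }
type NodeInfo struct{ Name string }

// Snapshot is the simulated cluster state CA can now pass in as a whole.
type Snapshot struct {
	Nodes []NodeInfo
	Pods  []Pod
}

// CycleState carries whatever PreFilter precomputed for one pod.
type CycleState map[string]interface{}

// Plugin mirrors the PreFilter/Filter split: PreFilter sees the whole
// snapshot once per pod, Filter sees one node plus the precomputed state.
type Plugin interface {
	PreFilter(snap *Snapshot, pod Pod, state CycleState) error
	Filter(state CycleState, pod Pod, node NodeInfo) bool
}

// fitsOn runs PreFilter once, then the cheap per-node Filter for each node.
func fitsOn(p Plugin, snap *Snapshot, pod Pod) []string {
	state := CycleState{}
	if err := p.PreFilter(snap, pod, state); err != nil {
		return nil
	}
	var feasible []string
	for _, n := range snap.Nodes {
		if p.Filter(state, pod, n) {
			feasible = append(feasible, n.Name)
		}
	}
	return feasible
}

// acceptAll accepts every node; a real inter-pod-affinity plugin would
// precompute conflicts per topology value in PreFilter instead.
type acceptAll struct{}

func (acceptAll) PreFilter(*Snapshot, Pod, CycleState) error { return nil }
func (acceptAll) Filter(CycleState, Pod, NodeInfo) bool      { return true }

func main() {
	snap := &Snapshot{Nodes: []NodeInfo{{Name: "n1"}, {Name: "n2"}}}
	fmt.Println(fitsOn(acceptAll{}, snap, Pod{Name: "pending-pod"}))
}
```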
As part of working on CA performance we ran a large scale-up test with some additional logging, including call count and total duration spent in each predicate. The results are as follows:
It turns out that the MatchInterPodAffinity predicate is 3 orders of magnitude slower than the other predicates. This is likely because, contrary to the scheduler, we don't do any precomputation for it and we don't maintain a predicateMeta object.
After a quick glance at the predicate code this makes sense - it needs to iterate over all existing pods to check whether any of them has pod anti-affinity targeting the pod we're running predicates for. This brings up another problem - how does it get all pods and nodes? We only provide NodeInfo for a single node; the rest comes out of informers. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level anti-affinity on a simulated node, that won't prevent adding pods to other simulated nodes in the same zone.
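A simplified illustration of that blind spot (made-up types and a naive check, not CA's actual code): when the predicate only sees the informer-backed pod list, a pod that exists only in the simulated state can't block other nodes in the same zone.

```go
package main

import "fmt"

// Simplified stand-ins, not CA's real data structures.
type Pod struct {
	Name string
	App  string // label the anti-affinity term selects on
	Zone string // zone of the node the pod is (or would be) placed on
}

// fitsZoneAntiAffinity is a naive zone-level anti-affinity check for a
// candidate pod against whatever pod list it is given.
func fitsZoneAntiAffinity(pods []Pod, candidate Pod, zone string) bool {
	for _, p := range pods {
		if p.App == candidate.App && p.Zone == zone {
			return false
		}
	}
	return true
}

func main() {
	realPods := []Pod{}                                             // informer view: conflicting pod not in the cluster yet
	simulated := []Pod{{Name: "web-1", App: "web", Zone: "zone-a"}} // placed by CA in simulation only

	candidate := Pod{Name: "web-2", App: "web"}

	// Checking only informer data misses the simulated placement, so another
	// node in zone-a still looks feasible and CA overshoots.
	fmt.Println(fitsZoneAntiAffinity(realPods, candidate, "zone-a")) // true (wrong)

	// Checking real + simulated state together gives the correct answer.
	fmt.Println(fitsZoneAntiAffinity(append(realPods, simulated...), candidate, "zone-a")) // false
}
```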
The bottom line is that using zone-level anti-affinity can cause CA to "overshoot", creating some nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a pretty unlikely edge case and we will scale down the unnecessary nodes without any problem.