
Figure out and implement custom handling for MatchInterPodAffinity predicate #257

Open
MaciekPytel opened this issue Aug 24, 2017 · 19 comments
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@MaciekPytel
Contributor

As part of working on CA performance, we ran a large scale-up test with additional logging, including the call count and total time spent in each predicate. The results are as follows:

E0824 09:27:54.815425       8 predicates.go:192] Predicate statistics for MaxAzureDiskVolumeCount: called 59985 times, total time 25.328538ms, mean duration 422ns
E0824 09:27:54.815489       8 predicates.go:192] Predicate statistics for MatchInterPodAffinity: called 59985 times, total time 1m48.73252767s, mean duration 1.812661ms
E0824 09:27:54.815497       8 predicates.go:192] Predicate statistics for CheckNodeCondition: called 59985 times, total time 49.05121ms, mean duration 817ns
E0824 09:27:54.815502       8 predicates.go:192] Predicate statistics for MaxEBSVolumeCount: called 59985 times, total time 24.906838ms, mean duration 415ns
E0824 09:27:54.815508       8 predicates.go:192] Predicate statistics for GeneralPredicates: called 59985 times, total time 114.434325ms, mean duration 1.907µs
E0824 09:27:54.815534       8 predicates.go:192] Predicate statistics for NoDiskConflict: called 59985 times, total time 26.067526ms, mean duration 434ns
E0824 09:27:54.815553       8 predicates.go:192] Predicate statistics for NoVolumeNodeConflict: called 59985 times, total time 38.554035ms, mean duration 642ns
E0824 09:27:54.815559       8 predicates.go:192] Predicate statistics for CheckNodeDiskPressure: called 59985 times, total time 19.062642ms, mean duration 317ns
E0824 09:27:54.815564       8 predicates.go:192] Predicate statistics for PodToleratesNodeTaints: called 59985 times, total time 22.448605ms, mean duration 374ns
E0824 09:27:54.815568       8 predicates.go:192] Predicate statistics for MaxGCEPDVolumeCount: called 59985 times, total time 61.944698ms, mean duration 1.032µs
E0824 09:27:54.815572       8 predicates.go:192] Predicate statistics for NoVolumeZoneConflict: called 59985 times, total time 64.231254ms, mean duration 1.07µs 
E0824 09:27:54.815578       8 predicates.go:192] Predicate statistics for PodFitsResources: called 357952 times, total time 514.838808ms, mean duration 1.438µs
E0824 09:27:54.815584       8 predicates.go:192] Predicate statistics for ready: called 59985 times, total time 50.931691ms, mean duration 849ns
E0824 09:27:54.815588       8 predicates.go:192] Predicate statistics for CheckNodeMemoryPressure: called 59985 times, total time 285.070472ms, mean duration 4.752µs

It turns out that the MatchInterPodAffinity predicate is 3 orders of magnitude slower than the other predicates. This is likely because, unlike the scheduler, we don't do any precomputation for it and we don't maintain a predicateMeta object.

After a quick glance at the predicate code this makes sense: it needs to iterate over all existing pods to check whether any of them has pod anti-affinity matching the pod we're running predicates for. This brings up another problem: how does it get all the pods and nodes? We only provide the NodeInfo for a single node; the rest comes out of the informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level anti-affinity on a simulated node, that won't prevent adding pods to other simulated nodes in the same zone.

The bottom line is that using zone-level anti-affinity can cause CA to "overshoot", creating nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a fairly unlikely edge case and we will scale down the unnecessary nodes without any problem.
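
For reference, here is a minimal sketch of how such per-predicate statistics can be collected; the predicateStats helper and all names in it are hypothetical, not the actual CA instrumentation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// predicateStats is a hypothetical helper: it wraps a predicate-like function
// and accumulates call count and total duration, producing the kind of
// statistics shown in the log above.
type predicateStats struct {
	mu        sync.Mutex
	callCount int64
	totalTime time.Duration
}

// instrument wraps a predicate so every call is timed.
func (s *predicateStats) instrument(pred func() bool) func() bool {
	return func() bool {
		start := time.Now()
		ok := pred()
		s.mu.Lock()
		s.callCount++
		s.totalTime += time.Since(start)
		s.mu.Unlock()
		return ok
	}
}

func (s *predicateStats) report(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	mean := time.Duration(0)
	if s.callCount > 0 {
		mean = s.totalTime / time.Duration(s.callCount)
	}
	fmt.Printf("Predicate statistics for %s: called %d times, total time %v, mean duration %v\n",
		name, s.callCount, s.totalTime, mean)
}

func main() {
	stats := &predicateStats{}
	checkFit := stats.instrument(func() bool { return true }) // stand-in for a real predicate
	for i := 0; i < 1000; i++ {
		checkFit()
	}
	stats.report("ExamplePredicate")
}
```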

@MaciekPytel
Contributor Author

This will not be easy to deal with, and it looks like we can work around the performance problem with #253, so I think this is best left for the 1.9 timeframe.

@mwielgus
Contributor

@davidopp @bsalamat Do you have any recommendations?

@mwielgus
Contributor

cc: @shyamjvs @porridge @gmarek @wojtek-t

@shyamjvs
Member

It turns out that the MatchInterPodAffinity predicate is 3 orders of magnitude slower than the other predicates. This is likely because, unlike the scheduler, we don't do any precomputation for it

@wojtek-t I remember discussing with you some time ago that we don't precompute stuff in the scheduler. Has that changed in the meantime? :)

@wojtek-t
Member

@wojtek-t I remember discussing with you some time ago that we don't precompute stuff in the scheduler. Has that changed in the meantime? :)

I don't understand. Can you clarify?

@shyamjvs
Member

Sorry. When we were discussing, e.g., zone-level soft inter-pod anti-affinity, from what I understood the scheduler was doing O(nodes*pods) computation when scheduling a new pod, though it seemed doable in O(pods + nodes) with precomputation. From the comment it seemed like it's precomputing now. But I think I misunderstood, as soft anti-affinity uses priorities, unlike hard anti-affinity which uses predicates. Correct me if I said something stupid.

@wojtek-t
Member

Yes, we were discussing some potential optimizations in the scheduler, and those are still not done. But we do some caching in the scheduler, whereas we don't do anything in Cluster Autoscaler.

@davidopp
Member

We only provide the NodeInfo for a single node; the rest comes out of the informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level anti-affinity on a simulated node, that won't prevent adding pods to other simulated nodes in the same zone.

If I'm understanding your statement correctly, @bsalamat ran into almost exactly the same problem in the preemption work (the need to run scheduler predicates against the cluster state minus the preempted pods on some node). He wrote a function FilteredList() that is like List() but subtracts out the pods matching a filter provided as an argument. You might be able to do the opposite: define a function that does a List() but also returns some extra pods that you supply as an argument?

I may be misunderstanding your issue completely though.
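
For illustration, a rough sketch of that "opposite of FilteredList()" idea; the PodLister interface here is simplified and AugmentedPodLister is a hypothetical name, not the scheduler's actual lister API:

```go
package simulator

import v1 "k8s.io/api/core/v1"

// PodLister is a simplified stand-in for the lister interface the predicates
// consume; the real scheduler interface differs.
type PodLister interface {
	List() ([]*v1.Pod, error)
}

// AugmentedPodLister is a hypothetical wrapper: it returns everything the
// underlying (informer-backed) lister returns, plus extra simulated pods that
// CA has already "placed" on simulated nodes.
type AugmentedPodLister struct {
	real  PodLister
	extra []*v1.Pod
}

func (l *AugmentedPodLister) List() ([]*v1.Pod, error) {
	pods, err := l.real.List()
	if err != nil {
		return nil, err
	}
	// Append the simulated pods so the predicate sees CA's simulated state,
	// not just the real state of the cluster.
	return append(append([]*v1.Pod{}, pods...), l.extra...), nil
}
```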

@MaciekPytel
Contributor Author

@bskiba ran some follow-up performance tests and I've spent some time experimenting with predicates. Below are the results of those investigations:

  1. Fixing this is by far the most important performance problem we have. The longest simulation in the performance test took 2m36s, of which 2m31s was spent inside MatchInterPodAffinity (in a cluster without a single pod using affinity or anti-affinity).
  2. We already have NodeInfo objects, including for simulated nodes and pods. What we need is to be able to pass all of this in and make the predicate use it as the current state of the cluster. Looking at the code, it seems this can actually be done pretty easily by calculating predicateMetadata before running predicates. If my understanding is correct this should solve the correctness issue, though I haven't tested it yet.
    Incidentally, calculating predicateMeta improves performance significantly, reducing the maximum simulation time from 2m36s to 16.5s. In our test cluster we only tend to run a single MatchInterPodAffinity call per pre-calculation (as we only run it after the other predicates have passed and we only need to find a single node we can fit the pod onto). The performance gain should be significantly bigger if someone was actually using anti-affinity. So pre-calculating predicateMeta, at least for some of the predicates we run, seems to be the way to go.
  3. Even after applying the above, we still spend >50% of the time doing affinity/anti-affinity related calculations in a cluster without a single pod using those features.
    CA logic is executed for a given point in time and we have a list of all the pods that we consider. We can check all of them and, if none of them defines affinity or anti-affinity, disable MatchInterPodAffinity and the predicateMetadata calculation for the rest of the loop (see the sketch after this list). I haven't tested this yet but I expect significant performance gains.
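
A minimal sketch of the check from point 3; usesInterPodAffinity is a hypothetical helper name, not existing CA code:

```go
package simulator

import v1 "k8s.io/api/core/v1"

// usesInterPodAffinity reports whether any pod in the given list defines
// inter-pod affinity or anti-affinity. If none do, CA could skip
// MatchInterPodAffinity and the predicateMetadata calculation for the rest
// of the loop.
func usesInterPodAffinity(pods []*v1.Pod) bool {
	for _, pod := range pods {
		if a := pod.Spec.Affinity; a != nil && (a.PodAffinity != nil || a.PodAntiAffinity != nil) {
			return true
		}
	}
	return false
}
```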

My current plan is to have PRs implementing both points 2 and 3 ready by tomorrow noon.

@davidopp @bsalamat does this approach sound reasonable to you?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2018
@MaciekPytel
Contributor Author

/lifecycle frozen
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2018
@MaciekPytel
Contributor Author

We should look more deeply into kubernetes/kubernetes#54189 and see how it impacts CA. We will get those changes with the godeps update for 1.2.0 anyway, but we should spend some time understanding them. In particular we should make sure that:

  1. The changes play nice with our handling of the MatchInterPodAffinity predicate.
  2. We understand whether they improve our performance out of the box, and whether we can do something to get more juice out of them. Our strategy of 'tell users to never use pod affinity in large clusters and disable the predicate' may become even less attractive if kube-dns ends up using it by default (Add self anti-affinity to kube-dns pods kubernetes#57683).

@MaciekPytel
Contributor Author

@mwielgus ^

@MaciekPytel
Contributor Author

@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem of the affinity predicate taking its data from informers, mentioned in the description of this issue, is still not fixed.

Other optimizations were easier to implement and got us to the point where we met our performance target, so we actually never started using predicateMeta when binpacking. Note that this is trickier than in the case of scale-down, because:

  1. The state of the cluster is constantly changing and we need to update predicateMeta in a way that doesn't cripple our performance or introduce bugs (using stale predicateMeta would break binpacking, likely in subtle and hard-to-debug ways).
  2. Regardless of how smart we are about calculating predicateMeta, it will likely still hurt performance if there are no pods using hard pod affinity/anti-affinity. We should make sure this doesn't happen.
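
For illustration only, a rough sketch of what keeping predicateMeta consistent with the simulation might look like; all types and methods below are hypothetical stand-ins, not the scheduler's or CA's actual API:

```go
package simulator

import v1 "k8s.io/api/core/v1"

// affinityMetadata is a hypothetical stand-in for predicateMeta: whatever
// precomputed state the inter-pod affinity predicate needs.
type affinityMetadata interface {
	// AddPod updates the metadata after a pod is (simulated to be) placed on a node.
	AddPod(pod *v1.Pod, nodeName string)
}

// binpackingState sketches the two concerns above:
//  1. the metadata must be updated whenever the simulated state changes,
//     otherwise stale metadata silently breaks binpacking;
//  2. when no pod uses hard inter-pod (anti-)affinity, the metadata should not
//     be built at all, so clusters that don't use the feature pay nothing.
type binpackingState struct {
	meta affinityMetadata // nil when no pod uses inter-pod (anti-)affinity
}

func (s *binpackingState) placePod(pod *v1.Pod, nodeName string) {
	// ... record the placement in the simulated cluster snapshot ...
	if s.meta != nil {
		s.meta.AddPod(pod, nodeName) // keep predicateMeta consistent with the simulation
	}
}
```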

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/bug Categorizes issue or PR as related to a bug. and removed enhancement labels Jun 5, 2018
@zufardhiyaulhaq

@MaciekPytel what do you mean by self-pod-antiaffinity in "a deployment with self-pod-antiaffinity causes nodes to be added one by one"?

Is it a soft anti-affinity (preferredDuringSchedulingIgnoredDuringExecution)?

@TBBle

TBBle commented Nov 4, 2020

It's a hard Pod AntiAffinity (requiredDuringSchedulingIgnoredDuringExecution) against a Pod's own labels, e.g., spreading kube-dns load across nodes.
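
For example, a sketch of such a spec built with the corev1 API types; the label key/value and topology key here are only illustrative:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// selfAntiAffinity builds a hard pod anti-affinity term against the pod's own
// labels, so no two such pods can land on the same node (the kube-dns style
// spreading mentioned above).
func selfAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"k8s-app": "kube-dns"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
}
```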

@MaciekPytel
Contributor Author

Cluster Autoscaler completely ignores "soft" scheduler requirements (a.k.a. scores). preferredDuringScheduling rules don't impact CA's decisions or performance in any way.

@zufardhiyaulhaq

Thanks, so it is safe to use soft anti-affinity, since CA completely ignores it.

@ahg-g
Member

ahg-g commented Aug 17, 2021

@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem of the affinity predicate taking its data from informers, mentioned in the description of this issue, is still not fixed.

Other optimizations were easier to implement and got us to the point where we met our performance target, so we actually never started using predicateMeta when binpacking. Note that this is trickier than in the case of scale-down, because:

  1. The state of the cluster is constantly changing and we need to update predicateMeta in a way that doesn't cripple our performance or introduce bugs (using stale predicateMeta would break binpacking, likely in subtle and hard-to-debug ways).
  2. Regardless of how smart we are about calculating predicateMeta, it will likely still hurt performance if there are no pods using hard pod affinity/anti-affinity. We should make sure this doesn't happen.

This seems to be doable now that the autoscaler can pass in the whole snapshot and is able to run PreFilters directly.
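
For illustration, a rough sketch of that flow; the plugin interface below is a simplified, hypothetical stand-in for the scheduler framework's PreFilter/Filter API (whose exact signatures have changed across versions):

```go
package simulator

import v1 "k8s.io/api/core/v1"

// nodeSnapshot and affinityPlugin are hypothetical, simplified stand-ins for
// the cluster snapshot and the scheduler framework's InterPodAffinity plugin.
type nodeSnapshot struct {
	pods map[string][]*v1.Pod // node name -> pods already placed there (real + simulated)
}

type affinityPlugin interface {
	// PreFilter precomputes per-pod state from the whole snapshot (the
	// predicateMeta equivalent); run once per pending pod.
	PreFilter(pod *v1.Pod, snapshot *nodeSnapshot) (state interface{}, err error)
	// Filter checks one candidate node using the precomputed state.
	Filter(state interface{}, pod *v1.Pod, nodeName string) bool
}

// fittingNodes runs PreFilter once and then Filter per node, which is the
// shape the comment above suggests the autoscaler can now use directly.
func fittingNodes(p affinityPlugin, pod *v1.Pod, snapshot *nodeSnapshot, nodes []string) ([]string, error) {
	state, err := p.PreFilter(pod, snapshot)
	if err != nil {
		return nil, err
	}
	var fitting []string
	for _, n := range nodes {
		if p.Filter(state, pod, n) {
			fitting = append(fitting, n)
		}
	}
	return fitting, nil
}
```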

voelzmo pushed a commit to voelzmo/autoscaler that referenced this issue Jan 10, 2024
… annotations (kubernetes#257)

* Add node group specific options to NodeGroupAutoscalingOptions from machineDeployment annotations

* Updated func and variable names

* Added unit tests

* Fix unit tests after rebase
yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024
do workload.NewInfo when inserting and updating
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024