
Pod spec equivalency checks can break Cluster Autoscaler scalability #4724

Closed
towca opened this issue Mar 4, 2022 · 0 comments · Fixed by #4735 or #4742
Labels
area/cluster-autoscaler kind/bug

Comments

towca (Collaborator) commented Mar 4, 2022

The logic in buildPodEquivalenceGroups and filterOutSchedulable groups pods by their scheduling requirements, as a scalability optimization. This is done by first grouping by the controller UID, and then comparing pod specs for pods from one controller. If there's something in the pod spec that's unique to a single pod within a controller, every pod ends up in a group of its own, and the optimization breaks.
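Roughly, the grouping works like this (a simplified Go sketch, not the actual buildPodEquivalenceGroups implementation; `specsEquivalent`, `equivalenceGroup` and the other names are placeholders):

```go
// Simplified sketch of the equivalence-grouping idea. Pods are first bucketed by
// controller UID; within a bucket, a pod joins a group only if its spec matches
// the group's sample pod.
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

type equivalenceGroup struct {
	sample *apiv1.Pod   // representative pod for the group
	pods   []*apiv1.Pod // pods assumed to have identical scheduling requirements
}

// specsEquivalent stands in for CA's PodSpecSemanticallyEqual; here it is just a
// naive deep comparison, which is exactly what breaks on per-pod unique fields.
func specsEquivalent(a, b apiv1.PodSpec) bool {
	return apiequality.Semantic.DeepEqual(a, b)
}

func controllerUID(pod *apiv1.Pod) types.UID {
	if ref := metav1.GetControllerOf(pod); ref != nil {
		return ref.UID
	}
	return pod.UID // pods without a controller get singleton buckets
}

func buildGroups(pods []*apiv1.Pod) map[types.UID][]*equivalenceGroup {
	groups := map[types.UID][]*equivalenceGroup{}
	for _, pod := range pods {
		uid := controllerUID(pod)
		placed := false
		// Each new pod is compared against every existing group of its controller.
		for _, g := range groups[uid] {
			if specsEquivalent(pod.Spec, g.sample.Spec) {
				g.pods = append(g.pods, pod)
				placed = true
				break
			}
		}
		if !placed {
			// Any per-pod unique field (e.g. a unique hostname or volume name)
			// forces a new group; with N unique pods this degenerates into N
			// singleton groups and O(N^2) spec comparisons.
			groups[uid] = append(groups[uid], &equivalenceGroup{
				sample: pod,
				pods:   []*apiv1.Pod{pod},
			})
		}
	}
	return groups
}
```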

In extreme cases with a lot of such pods (a couple thousand can be enough), CA can spend so long in a single loop iteration that it fails its health checks and is killed by the kubelet. Then everything repeats once it comes back up, and CA is effectively broken until the pods are scheduled or deleted.

One trigger for pod specs being different is the BoundServiceAccountTokenVolume feature, which injects uniquely-named projected volumes into each pod's spec. This was taken into account by CA in #4441.

We've just run into another one: Jobs using completionMode: Indexed. In this mode, each pod gets a unique, indexed hostname in its spec, as documented at https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode. AFAIU the hostname shouldn't affect scheduling, so sanitizing it in PodSpecSemanticallyEqual should be enough to fix this particular issue.
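For illustration, a rough sketch of the kind of sanitization meant here (the helper names are placeholders, not CA's actual code; the "kube-api-access-" prefix is the name kubelet gives the bound token's projected volume):

```go
// Rough sketch of the sanitization idea: strip fields that are unique per pod but
// irrelevant to scheduling before comparing specs. CA's actual comparison lives
// in PodSpecSemanticallyEqual; these helpers are illustrative only.
package sketch

import (
	"strings"

	apiv1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

func sanitizeForComparison(spec apiv1.PodSpec) apiv1.PodSpec {
	s := *spec.DeepCopy()
	// Indexed Jobs set a unique per-pod hostname; it shouldn't matter for
	// scheduling, so clear it before comparing.
	s.Hostname = ""
	// BoundServiceAccountTokenVolume injects a projected volume with a unique
	// "kube-api-access-<suffix>" name into every pod (handled by CA in #4441).
	vols := make([]apiv1.Volume, 0, len(s.Volumes))
	for _, v := range s.Volumes {
		if v.Projected != nil && strings.HasPrefix(v.Name, "kube-api-access-") {
			continue
		}
		vols = append(vols, v)
	}
	s.Volumes = vols
	return s
}

// podSpecsEquivalent compares two specs only after dropping per-pod noise.
func podSpecsEquivalent(a, b apiv1.PodSpec) bool {
	return apiequality.Semantic.DeepEqual(sanitizeForComparison(a), sanitizeForComparison(b))
}
```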

However, this approach of "fixing" individual fields as issues pop up doesn't scale very well. We should come up with a more generic solution to this class of problems. One idea could be a cutoff on the number of equivalence groups within one controller, as proposed in #4441 (comment).
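Something along these lines, reusing the placeholder equivalenceGroup/specsEquivalent names from the sketch above (the constant and the fallback behavior are made up here, not a concrete proposal):

```go
// Illustrative cutoff sketch: once a single controller already has
// maxEquivalenceGroupsPerController groups, stop doing spec comparisons and just
// append a singleton group, so one badly behaved controller can no longer make
// the per-pod comparison cost grow with the number of its pods.
const maxEquivalenceGroupsPerController = 10

func addToGroups(groups []*equivalenceGroup, pod *apiv1.Pod) []*equivalenceGroup {
	if len(groups) < maxEquivalenceGroupsPerController {
		for _, g := range groups {
			if specsEquivalent(pod.Spec, g.sample.Spec) {
				g.pods = append(g.pods, pod)
				return groups
			}
		}
	}
	// Either no matching group exists or the cutoff was hit: fall back to a
	// singleton group and give up on the optimization for this pod.
	return append(groups, &equivalenceGroup{sample: pod, pods: []*apiv1.Pod{pod}})
}
```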
