
Improve cluster autoscaling #24404

Closed
fgrzadkowski opened this issue Apr 18, 2016 · 7 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@fgrzadkowski
Contributor

fgrzadkowski commented Apr 18, 2016

The current hacky version has a number of drawbacks:

  • works only on GCE
  • doesn't support a number of cases, so we may still end up with pending pods
  • very cloud-provider specific
  • very hard to update configuration
  • UX is very poor

We'd like to improve it to:

  • support all cases for pending pods
  • provide a reference implementation that could be ported to other cloud providers.

It should support both scaling up (P1) and scaling down (P2).

This is an umbrella bug used for referencing PRs etc.

@fgrzadkowski fgrzadkowski added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. team/control-plane labels Apr 18, 2016
@fgrzadkowski fgrzadkowski added this to the v1.3 milestone Apr 18, 2016
@fgrzadkowski
Contributor Author

@roberthbailey
Contributor

/cc @erictune since this is effectively a "feature issue" and might be a good use case for the new feature issue proposal.

fgrzadkowski added a commit to fgrzadkowski/kubernetes that referenced this issue Apr 20, 2016
Add pod condition PodScheduled to detect situation when scheduler tried to schedule a Pod, but failed.

Ref kubernetes#24404
fgrzadkowski added a commit to fgrzadkowski/kubernetes that referenced this issue May 12, 2016
Add pod condition PodScheduled to detect situation when scheduler tried to schedule a Pod, but failed.

Ref kubernetes#24404
k8s-github-robot pushed a commit that referenced this issue May 12, 2016
Automatic merge from submit-queue

Add pod condition PodScheduled to detect situation when scheduler tried to schedule a Pod, but failed

Set the `PodScheduled` condition to `ConditionFalse` in `scheduleOne()` if scheduling failed and to `ConditionTrue` in the `/bind` subresource.

Ref #24404

@mml (as it seems to be related to "why pending" effort)

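For illustration, here is a minimal sketch of how such a `PodScheduled` condition can be recorded on a pod's status. This is not the actual scheduler code from the PR; it uses the present-day `k8s.io/api` types, and the helper name `updateScheduledCondition` is hypothetical:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// updateScheduledCondition (hypothetical helper) records the outcome of a
// scheduling attempt on the pod's status, so that consumers such as the
// cluster autoscaler can detect pods the scheduler already tried and
// failed to place.
func updateScheduledCondition(pod *v1.Pod, scheduled bool, reason, message string) {
	status := v1.ConditionFalse
	if scheduled {
		status = v1.ConditionTrue
	}
	cond := v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             status,
		LastTransitionTime: metav1.Now(),
		Reason:             reason,
		Message:            message,
	}
	// Update an existing PodScheduled condition in place, or append one.
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == v1.PodScheduled {
			pod.Status.Conditions[i] = cond
			return
		}
	}
	pod.Status.Conditions = append(pod.Status.Conditions, cond)
}
```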
@davidopp
Member

Can you list what work remains to be done before you can move this issue to the 1.4 milestone?

@davidopp
Member

ping - can you list what work remains to be done before you can move this issue to the 1.4 milestone?

@matchstick matchstick modified the milestones: next-candidate, v1.3 Jun 15, 2016
@matchstick
Contributor

Please list what work needs to be done for next candidate.

@mwielgus
Contributor

mwielgus commented Jun 15, 2016

  • More efficient unneeded-node analysis. The current code will work OK for small to medium clusters (50-150 nodes with 1k-3k pods) but will not work for 1k nodes/30k pods.
  • Reduce latency in scale down. Right now we are super conservative when considering nodes for scale down - for 10 min we check whether it is possible to schedule all of their pods on other machines. Once we delete one node, the checks for all other machines are invalidated (to some degree) and we start counting from the beginning. This means that we can only remove 6 nodes per hour. We have to either consider deleting multiple machines at once or relax our strategy. (A simplified sketch of this feasibility check follows the list.)
  • Cleanups in the code.
  • Decide what to do with best-effort pods.
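As a reference for the scale-down point above, here is a simplified, self-contained sketch of the per-node feasibility check, assuming a greedy first-fit placement over resource requests. All types and the `canDrain` name are hypothetical illustrations, not actual Cluster Autoscaler code:

```go
package sketch

type pod struct {
	cpuMilli, memBytes int64 // resource requests
}

type node struct {
	name         string
	freeCPUMilli int64
	freeMemBytes int64
	pods         []pod
}

// canDrain reports whether every pod on the candidate node could be
// rescheduled onto the other nodes, placing each pod greedily on the
// first node with enough spare capacity for its requests.
func canDrain(candidate node, others []node) bool {
	// Simulate against copies so the inputs are not mutated.
	free := make([]node, len(others))
	copy(free, others)

	for _, p := range candidate.pods {
		placed := false
		for i := range free {
			if free[i].freeCPUMilli >= p.cpuMilli && free[i].freeMemBytes >= p.memBytes {
				free[i].freeCPUMilli -= p.cpuMilli
				free[i].freeMemBytes -= p.memBytes
				placed = true
				break
			}
		}
		if !placed {
			return false
		}
	}
	return true
}
```

Because each simulated placement consumes spare capacity, actually deleting one node changes the inputs for every other candidate - which is exactly the invalidation problem described in the comment above.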

@timothysc timothysc added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Jun 16, 2016
@wojtek-t wojtek-t removed sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/control-plane (deprecated - do not use) labels May 30, 2017
@mwielgus
Contributor

Closing the issue. All Cluster Autoscaler issues should be tracked inside https://github.com/kubernetes/autoscaler.
Moreover, all of the improvements mentioned above were done.
