
leave a buffer of underutilized nodes when scaling down #5611

Closed
grosser wants to merge 1 commit

Conversation

@grosser (Contributor) commented Mar 24, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • When scheduling pods with topologySpreadConstraints set to ScheduleAnyway, the scheduler does not evict capacity buffers and creates skew (see the sketch below for the kind of constraint involved)
  • When trying to de-schedule the skewed pods, the descheduler is a no-op when there is no empty space
    ... so allow users to opt in to keeping some empty space

This is not perfect since it will not create "new empty space" by scaling up, but I think it's a good step forward and allows us to fix some edge cases that capacity buffers do not solve.
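For context, the soft spread constraint in question looks roughly like this when built with the upstream Kubernetes API types (illustrative only, not code from this PR; the app label is made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// ScheduleAnyway is a soft constraint: the scheduler treats skew only as a
	// scoring preference, so it will not preempt capacity-buffer pods and can
	// still place pods unevenly across zones.
	c := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: corev1.ScheduleAnyway,
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "web"}, // hypothetical label
		},
	}
	fmt.Printf("%+v\n", c)
}
```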

Which issue(s) this PR fixes:

Fixes #5377

Special notes for your reviewer:

Does this PR introduce a user-facing change?

New --scale-down-buffer-ratio flag: the ratio of empty or underutilized nodes to leave as a capacity buffer per node group
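As a rough illustration of what such a ratio could mean (not the PR's actual implementation; the helper and its signature are made up), trimming scale-down candidates per node group might look like this:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// keepBuffer trims the list of scale-down candidates so that roughly
// bufferRatio * groupSize nodes per node group stay running as headroom.
// Illustrative sketch only.
func keepBuffer(candidates []string, groupSize int, bufferRatio float64) []string {
	buffer := int(math.Ceil(bufferRatio * float64(groupSize)))
	if buffer >= len(candidates) {
		return nil // spare everything: the whole candidate set becomes the buffer
	}
	sort.Strings(candidates) // deterministic choice of which nodes to spare
	return candidates[buffer:]
}

func main() {
	candidates := []string{"node-a", "node-b", "node-c", "node-d"}
	// With a 10-node group and a 0.2 buffer ratio, 2 candidates are spared.
	fmt.Println(keepBuffer(candidates, 10, 0.2)) // [node-c node-d]
}
```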

@k8s-ci-robot added labels on Mar 24, 2023: kind/feature (Categorizes issue or PR as related to a new feature), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), size/S (Denotes a PR that changes 10-29 lines, ignoring generated files)
@k8s-ci-robot requested a review from x13n on March 24, 2023 04:47
@vadasambar (Member) commented:

Appreciate the PR (though I am not sure I fully understand the problem yet) 👍

> When scheduling pods with topologySpreadConstraints set to ScheduleAnyway, the scheduler does not evict capacity buffers and creates skew

This is expected. CA doesn't support ScheduleAnyway because it is part of the Scoring phase of the scheduler. CA only simulates the Filter phase (the PreFilter and Filter extension points, to be precise), and that is by design.
You might be interested in

@grosser (Contributor, Author) commented Mar 24, 2023

I know that it's not supported; that's why I made this PR, to make it somewhat supported. It's not perfect but might be good enough. I'll have to test it in our clusters to know more, but I wanted to share the approach in case anyone else finds it useful or has input on how to make it better.

@towca (Collaborator) commented Mar 24, 2023

/assign @MaciekPytel

@grosser (Contributor, Author) commented Mar 24, 2023

A better version of this would calculate the free CPU on all nodes and then leave, say, 10% of total capacity free, instead of this crude per-node math.
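Sketched out (assumed types and names, not code from this PR), that free-capacity check might look something like this:

```go
package main

import "fmt"

// node holds a node's CPU capacity and the CPU currently requested by its
// pods, both in millicores.
type node struct {
	name           string
	capacityMilli  int64
	requestedMilli int64
}

// canRemove reports whether removing one node would still leave at least
// targetFreeFraction of the remaining cluster CPU capacity unrequested
// (its pods are assumed to be rescheduled onto the remaining nodes).
func canRemove(nodes []node, remove node, targetFreeFraction float64) bool {
	var totalCapacity, totalFree int64
	for _, n := range nodes {
		totalCapacity += n.capacityMilli
		totalFree += n.capacityMilli - n.requestedMilli
	}
	capacityAfter := totalCapacity - remove.capacityMilli
	// The removed node's free space disappears and its pods consume free
	// space elsewhere, so free capacity drops by the node's full capacity.
	freeAfter := totalFree - remove.capacityMilli
	return float64(freeAfter) >= targetFreeFraction*float64(capacityAfter)
}

func main() {
	nodes := []node{
		{"a", 4000, 1000},
		{"b", 4000, 1000},
		{"c", 4000, 200},
	}
	// Can the mostly-empty node "c" go while keeping 10% of CPU free?
	fmt.Println(canRemove(nodes, nodes[2], 0.10)) // true
}
```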

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: grosser
Once this PR has been reviewed and has the lgtm label, please ask for approval from maciekpytel. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@grosser (Contributor, Author) commented Mar 31, 2023

this worked, but it was not reliable enough (sometimes it leaves half a node empty, sometimes full nodes), and cost accounting was not great either (we cannot differentiate between bad binpacking and intentional gaps)

@grosser closed this on Mar 31, 2023
@grosser deleted the grosser/scaledownbufferpr branch on March 31, 2023 20:45
Labels
area/cluster-autoscaler, cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/feature (Categorizes issue or PR as related to a new feature), size/S (Denotes a PR that changes 10-29 lines, ignoring generated files)
Development

Successfully merging this pull request may close these issues.

support overprovisioning without pending pods
5 participants