
Add atomic scale down option for node groups #5695

Merged (10 commits) on Jun 30, 2023

Conversation

@kawych (Contributor) commented Apr 18, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a node group AtomicScaleDown option that allows for all-or-nothing scale down of a node group.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add AtomicScaleDown option that allows all-or-nothing scale down of node groups.

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/feature, and cncf-cla: yes labels on Apr 18, 2023
@k8s-ci-robot added the area/cluster-autoscaler and size/XXL labels on Apr 18, 2023
@kawych (Contributor, Author) commented Apr 18, 2023

CC @x13n

deletionStartTime := time.Now()
defer func() { metrics.UpdateDuration(metrics.ScaleDownNodeDeletion, time.Now().Sub(deletionStartTime)) }()

results, ts := a.nodeDeletionTracker.DeletionResults()
scaleDownStatus := &status.ScaleDownStatus{NodeDeleteResults: results, NodeDeleteResultsAsOf: ts}

emptyToDelete, drainToDelete := a.cropNodesToBudgets(empty, drain)
if len(emptyToDelete) == 0 && len(drainToDelete) == 0 {
emptyIndividualToDelete, drainIndividualToDelete, emptyAtomicToDelete, drainAtomicToDelete := a.cropNodesToBudgets(emptyIndividual, drainIndividual, emptyAtomic, drainAtomic)
@kawych (Contributor, Author) commented Apr 18, 2023:

@x13n
My idea of getting rid of the increasing number of pools here would be to create a dedicated struct for a bucket of nodes to scale down, for example:

type nodeBucket struct {
	nodeGroup cloudprovider.NodeGroup
	nodes     []*apiv1.Node
	atomic    bool
	drain     bool
}

This could be populated by the Planner and processed by each component in the appropriate order (e.g. cropNodesToBudgets would go through atomic buckets first, deletion would go through empty ones first). I think we could also move cropNodesToBudgets() out of actuator.go if we want to keep its size limited; the rest seems more tightly coupled with the actuation logic. Please let me know WDYT.
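
For illustration, a minimal sketch of how a consumer might order such buckets, assuming the nodeBucket struct above (orderForBudgetCropping is a hypothetical helper, not code from this PR):

func orderForBudgetCropping(buckets []*nodeBucket) []*nodeBucket {
	// Budget cropping considers atomic buckets first, so they are not starved of
	// budget by individually-scaled nodes.
	ordered := make([]*nodeBucket, 0, len(buckets))
	for _, b := range buckets {
		if b.atomic {
			ordered = append(ordered, b)
		}
	}
	for _, b := range buckets {
		if !b.atomic {
			ordered = append(ordered, b)
		}
	}
	return ordered
}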

Member:

I'm actually starting to think that cropping to budget shouldn't really be done in the actuator. The actuator should do what it is told to do: drain and delete a bunch of nodes. If we move this logic out of the actuator (perhaps to a dedicated scale down set processor), a lot of the code here will become simpler. The only remaining issue will be the batching logic. One idea to deal with it would be to have batching criteria adjusted per node group. For most node groups it would be "wait for N nodes or T time, whichever comes first". Atomic node groups would set N equal to the number of nodes (so it would have to be dynamic) and set T to +inf. WDYT?
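
A rough sketch of that per-node-group batching criteria idea (illustrative names only, not an existing Cluster Autoscaler API; assumes the standard math and time packages):

import (
	"math"
	"time"
)

// batchingCriteria captures "wait for N nodes or T time, whichever comes first".
type batchingCriteria struct {
	targetCount int           // flush the batch once this many nodes are queued
	maxWait     time.Duration // ...or once the oldest queued node has waited this long
}

func criteriaForNodeGroup(atomic bool, groupSize int, defaultWait time.Duration) batchingCriteria {
	if atomic {
		// Atomic groups: N equals the (dynamic) group size, T is effectively +inf.
		return batchingCriteria{targetCount: groupSize, maxWait: time.Duration(math.MaxInt64)}
	}
	return batchingCriteria{targetCount: 1, maxWait: defaultWait}
}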

@kawych (Contributor, Author):

I moved it out of the main actuator object; however, I wanted to limit the amount of changes this PR introduces, so I still left it in the actuator directory. Please LMK if you're OK with that.

Regarding the batching logic: after our offline discussion I ended up with a wrapper over the batcher that queues nodes for deletion within one pass of the scale-down loop, but rolls them back if any other node's deletion fails.
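
A minimal sketch of that wrapper idea, under the assumption that the underlying batcher exposes an AddNodes-style method (hypothetical names and signatures, not the PR's actual code; apiv1 is k8s.io/api/core/v1):

import (
	apiv1 "k8s.io/api/core/v1"
)

// atomicDeletionWrapper queues deletions for an atomically-scaled node group within
// one scale-down pass and only forwards them once every node queued successfully.
type atomicDeletionWrapper struct {
	batcher interface {
		AddNodes(nodes []*apiv1.Node, drain bool)
	}
	queued map[string][]*apiv1.Node // nodes queued per node group ID
	failed map[string]bool          // node groups where at least one node failed
}

func (w *atomicDeletionWrapper) ScheduleDeletion(groupID string, node *apiv1.Node, ok bool) {
	if !ok {
		w.failed[groupID] = true
		return
	}
	w.queued[groupID] = append(w.queued[groupID], node)
}

// Flush either hands the complete group to the batcher or drops (rolls back) the queue.
func (w *atomicDeletionWrapper) Flush(groupID string, groupSize int, drain bool) {
	nodes := w.queued[groupID]
	delete(w.queued, groupID)
	if w.failed[groupID] || len(nodes) < groupSize {
		// Roll back: in the real code this is where scale-down taints would be cleaned up.
		delete(w.failed, groupID)
		return
	}
	w.batcher.AddNodes(nodes, drain)
}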

@kawych force-pushed the tpu branch 2 times, most recently from 29d36ab to f14d91a on April 18, 2023 14:34
@k8s-ci-robot added the needs-rebase label on Apr 19, 2023
@k8s-ci-robot removed the needs-rebase label on Apr 21, 2023
@x13n (Member) commented Apr 24, 2023

/assign

cluster-autoscaler/core/scaledown/planner/planner.go (outdated)
@@ -154,17 +157,63 @@ func (p *Planner) NodesToDelete(_ time.Time) (empty, needDrain []*apiv1.Node) {
// downs already in progress. If we pass the empty nodes first, they will be first
// to get deleted, thus we decrease chances of hitting the limit on non-empty scale down.
append(emptyRemovable, needDrainRemovable...),
p.context.AutoscalingOptions.MaxScaleDownParallelism)
// No need to limit the number of nodes, since it will happen later, in the actuation stage.
Member:

Instead of passing math.MaxInt, this param should just be removed. It is effectively going to be unused anyway.

@kawych (Contributor, Author):

It's still needed for sequential scale-down; for now I didn't want to change the API too much.

@kawych force-pushed the tpu branch 2 times, most recently from 12fc952 to aa01b73 on May 31, 2023 15:12
@kawych marked this pull request as ready for review on May 31, 2023 15:14
@k8s-ci-robot removed the do-not-merge/work-in-progress label on May 31, 2023
@k8s-ci-robot requested a review from x13n on May 31, 2023 15:14
cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
// ScaleDownBudgetProcessor is responsible for keeping the number of nodes deleted in parallel within defined limits.
type ScaleDownBudgetProcessor struct {
ctx *context.AutoscalingContext
nodeDeletionTracker *deletiontracker.NodeDeletionTracker
Member:

NodeDeletionTracker should really be an implementation detail of the actuator. All you need here is an interface that can give you DeletionsInProgress.

@kawych (Contributor, Author):

Done: used the ActuationStatus interface.
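
For context, the narrow dependency suggested above amounts to roughly this interface (a sketch; the real scaledown.ActuationStatus interface has more methods):

type deletionsInProgressProvider interface {
	// DeletionsInProgress returns the names of nodes currently being deleted,
	// split into empty-node deletions and drain-then-delete deletions.
	DeletionsInProgress() (emptyInProgress, drainInProgress []string)
}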

cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
drainBudget := bp.ctx.MaxDrainParallelism - len(drainInProgress)

emptyToDelete = []*nodeBucket{}
for _, bucket := range emptyAtomic {
Member:

The following two loops are essentially identical; please factor them out into a function.

@kawych (Contributor, Author):

Done. I agree I should've done it right away; I just got a little overwhelmed by other changes (the ones I actually ended up reverting).
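
A sketch of what such a shared helper might look like (hypothetical name and budget semantics, not the PR's exact code), assuming the nodeBucket shape discussed earlier: atomic buckets are kept whole if they fit the remaining budget and skipped otherwise:

func cropAtomicBuckets(buckets []*nodeBucket, budget int) (kept []*nodeBucket, remaining int) {
	kept = []*nodeBucket{}
	for _, bucket := range buckets {
		if len(bucket.nodes) > budget {
			// All-or-nothing: never split an atomic bucket; skip it if it doesn't fit.
			continue
		}
		kept = append(kept, bucket)
		budget -= len(bucket.nodes)
	}
	return kept, budget
}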

cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
@kawych force-pushed the tpu branch 2 times, most recently from 0316e4f to 4a98cf6 on June 26, 2023 15:33
// ScaleDownBudgetProcessor is responsible for keeping the number of nodes deleted in parallel within defined limits.
type ScaleDownBudgetProcessor struct {
ctx *context.AutoscalingContext
actuationStatus scaledown.ActuationStatus
Member:

Why is the status part of the processor? The status will change over time, so it should be passed as a param to CropNodes.

@kawych (Contributor, Author):

I think the naming is misleading here... the actual class used there is named "NodeDeletionTracker", which clearly indicates that it holds up-to-date status, but the interface defined for its users is called "ActuationStatus", which suggests something that stays the same over time. I'm not sure what the intention behind that was.

Regarding this particular comment: sure, I can pass the actuation status to "CropNodes", although my preference would be to treat "ActuationStatus" as an active tracker of ongoing deletions (which it is), not as a representation of a single point-in-time status.

cluster-autoscaler/core/scaledown/budgets/budgets.go (outdated)
//drainIndividual, drainAtomic := bp.groupByNodeGroup(drain)

emptyInProgress, drainInProgress := bp.actuationStatus.DeletionsInProgress()
parallelismBudget := bp.ctx.MaxScaleDownParallelism - len(emptyInProgress) - len(drainInProgress)
Member:

nit: I think you could encapsulate the budget updates in a separate object, to avoid the arithmetic here and to make the budget calculations unit-testable.

@kawych (Contributor, Author):

I'm not really convinced about that... I was able to mock the budgets by setting AutoscalingOptions in the autoscaling context. I also don't think splitting this calculation out helps readability: if someone is interested in the budget calculation logic, it's easier to read it directly in this ~20-line function than to navigate to a different file.
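
For readers curious about the reviewer's suggestion, an encapsulated budget could look roughly like this (hypothetical type, not part of the PR):

// scaleDownBudget tracks the remaining parallelism budgets for one actuation pass.
type scaleDownBudget struct {
	overall int // remaining overall scale-down parallelism
	drain   int // remaining drain parallelism
}

func newScaleDownBudget(maxParallel, maxDrain, emptyInProgress, drainInProgress int) *scaleDownBudget {
	return &scaleDownBudget{
		overall: maxParallel - emptyInProgress - drainInProgress,
		drain:   maxDrain - drainInProgress,
	}
}

// spend reports whether the requested nodes fit and, if so, deducts them from the budget.
func (b *scaleDownBudget) spend(nodes int, drain bool) bool {
	if nodes > b.overall || (drain && nodes > b.drain) {
		return false
	}
	b.overall -= nodes
	if drain {
		b.drain -= nodes
	}
	return true
}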

cluster-autoscaler/core/scaledown/planner/planner.go (outdated)
Previous "CropNodes" function of ScaleDownBudgetProcessor had an
assumption that atomically-scaled node groups should be classified as
"empty" or "drain" as a whole, however Cluster Autoscaler may classify
some of the nodes from a single group as "empty" and other as "drain".
@x13n (Member) commented Jun 30, 2023

Thanks for all the changes and apologies for taking so long to review this!

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Jun 30, 2023
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jun 30, 2023
@k8s-ci-robot merged commit 4c55b17 into kubernetes:master on Jun 30, 2023
4 checks passed
@vadasambar (Member) commented Jul 4, 2023

that allows for all-or-nothing scale down of the node group.

@kawych is there any issue/doc explaining why we need this feature?

I came across this PR when I was trying to resolve merge conflicts in #5672 (which also changes actuator_test.go) and became interested in it. My guess is we want to go from 0 -> 100 (or 100 -> 0) nodes instead of 0 -> 20 -> 50 -> 90 -> 100 nodes (or the reverse). I am guessing the former is faster than the latter (which would mean faster scaling up/down).

@kawych (Contributor, Author) commented Jul 4, 2023

@vadasambar Sorry, I don't have any more detailed references to provide. But scale up/down speed is not the goal here. Instead, this change allows all-or-nothing provisioning of the nodes within a node group: a scale-up happens in one batch, scaling directly to the maximum node group size. It's useful if you want to avoid partial scale-ups (where subsequent scale-ups could fail if there is not enough capacity), so that you don't pay for infrastructure until there is enough capacity to actually run your workloads.

Note that the cloud provider also has to ensure that a scale-up will not end up in a partial state.

@vadasambar (Member):

@kawych thank you for the reply.

It's useful if you want to avoid partial scale-ups (while the subsequent scale-ups could fail if there is not enough capacity), so that you don't have to pay for infrastructure until you have enough capacity to actually run your workloads.

Sorry if this is a silly question. As a user, wouldn't you expect CA to scale up (even if partially) so that there's space for Pending pods, instead of trying to do it all at once? I am not sure I understand the problem we are trying to solve here. I am guessing this is for workloads like jobs that need to run all at once (but I don't understand why partial scaling wouldn't work there). I guess I am confused about what benefit there is in waiting to scale until enough capacity is available.

@kawych (Contributor, Author) commented Jul 5, 2023

@vadasambar Your example is correct: this is applicable mostly to batch workloads that work collectively and should be scheduled at once. Additionally, the infrastructure might allow customized placement of the VMs, but that benefits from knowing the exact number of nodes upfront.
For most workloads, though, I agree that it makes more sense to create nodes when we can. This is not the kind of feature we expect to enable by default; rather, it should be intentionally enabled for the use cases where it makes sense.

@vadasambar (Member):

@vadasambar Your example is correct, this is applicable mostly for batch workloads that work collectively and should be scheduled at once. Additionally, the infrastructure might allow for customized placement of the VMs, but that would benefit from knowing the exact number of nodes upfront. For most workloads though, I agree that it makes more sense to create nodes when we can. This is not the kind of a feature we expect to start enabling by default, it's rather one that should be intentionally enabled for the use cases where it makes sense.

Thank you for the reply and explaining the use-case @kawych.

Comment on lines +57 to +61
autoscalingOptions, err := nodeGroup.GetOptions(ctx.NodeGroupDefaults)
if err != nil {
klog.Errorf("Failed to get autoscaling options for node group %s: %v", nodeGroup.Id(), err)
continue
}
Member:

Cloud providers were previously free to return nil, cloudprovider.ErrNotImplemented from NodeGroup.GetOptions(). All 5 cloud provider NodeGroups that I spot-checked return this, including Hetzner, which I am using.

With the current master branch I get an error from these lines when cluster-autoscaler tries to scale down my node group, causing it to never scale down:

E0815 10:24:08.433379       1 post_filtering_processor.go:59] Failed to get autoscaling options for node group pool1: Not implemented

I think we can add an explicit check for cloudprovider.ErrNotImplemented here and fall back to the standard scale down for such node groups.
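
A minimal sketch of the suggested check, based on the snippet quoted above (the eventual fix in #6037/#6038 may differ in details):

autoscalingOptions, err := nodeGroup.GetOptions(ctx.NodeGroupDefaults)
if err != nil && err != cloudprovider.ErrNotImplemented {
	klog.Errorf("Failed to get autoscaling options for node group %s: %v", nodeGroup.Id(), err)
	continue
}
// A nil options struct or ErrNotImplemented simply means the node group should use
// the standard scale-down path instead of atomic scale down.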

@kawych (Contributor, Author):

Sure, makes sense. Apologies for introducing the bug; I'll follow up with a fix and a test case to catch issues like that in the future.

Member:

I fixed it in #6037 & #6038. You can check out the PRs; there are still some methods calling NodeGroup.GetOptions() that do not check for cloudprovider.ErrNotImplemented, which I did not touch.
