
Add atomic scale down option for node groups #5695

Merged (10 commits) on Jun 30, 2023

Conversation

@kawych (Contributor) commented Apr 18, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a node group AtomicScaleDown option that allows for all-or-nothing scale down of a node group.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add AtomicScaleDown option that allows all-or-nothing scale down of node groups.

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/feature, and cncf-cla: yes labels on Apr 18, 2023
@k8s-ci-robot added the area/cluster-autoscaler and size/XXL labels on Apr 18, 2023
@kawych (Contributor, Author) commented Apr 18, 2023

CC @x13n

deletionStartTime := time.Now()
defer func() { metrics.UpdateDuration(metrics.ScaleDownNodeDeletion, time.Now().Sub(deletionStartTime)) }()

results, ts := a.nodeDeletionTracker.DeletionResults()
scaleDownStatus := &status.ScaleDownStatus{NodeDeleteResults: results, NodeDeleteResultsAsOf: ts}

emptyToDelete, drainToDelete := a.cropNodesToBudgets(empty, drain)
if len(emptyToDelete) == 0 && len(drainToDelete) == 0 {
emptyIndividualToDelete, drainIndividualToDelete, emptyAtomicToDelete, drainAtomicToDelete := a.cropNodesToBudgets(emptyIndividual, drainIndividual, emptyAtomic, drainAtomic)
@kawych (Contributor, Author) commented Apr 18, 2023:

@x13n
My idea of getting rid of the increasing number of pools here would be to create a dedicated struct for a bucket of nodes to scale down, for example:

type nodeBucket struct {
	nodeGroup cloudprovider.NodeGroup
	nodes     []*apiv1.Node
	atomic    bool
	drain     bool
}

This could be populated by the Planner and processed by each component in the appropriate order (e.g. cropNodesToBudgets would go through atomic buckets first, deletion would go through empty ones first). I think we could also move cropNodesToBudgets() out of actuator.go if we want to keep its size limited; the rest seems more tightly coupled with the actuation logic. Please let me know WDYT.
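
For illustration, a minimal sketch of how a consumer might order such buckets, assuming the nodeBucket struct above (orderForBudgetCropping is a hypothetical helper, not code from this PR):

func orderForBudgetCropping(buckets []*nodeBucket) []*nodeBucket {
	// Budget cropping considers atomic buckets first, so they are not starved of
	// budget by individually-scaled nodes.
	ordered := make([]*nodeBucket, 0, len(buckets))
	for _, b := range buckets {
		if b.atomic {
			ordered = append(ordered, b)
		}
	}
	for _, b := range buckets {
		if !b.atomic {
			ordered = append(ordered, b)
		}
	}
	return ordered
}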

Member:

I'm actually starting to think that cropping to budget shouldn't really be done in the actuator. The actuator should do what it is told to do: drain and delete a bunch of nodes. If we move this logic out of the actuator (perhaps to a dedicated scale down set processor), a lot of the code here will become simpler. The only remaining issue will be the batching logic. One idea to deal with it would be to have batching criteria adjusted per node group. For most node groups it would be "wait for N nodes or T time, whichever comes first". Atomic node groups would set N equal to the number of nodes (so it would have to be dynamic) and set T to +inf. WDYT?
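
A rough sketch of that per-node-group batching criteria idea (illustrative names only, not an existing Cluster Autoscaler API; assumes the standard math and time packages):

import (
	"math"
	"time"
)

// batchingCriteria captures "wait for N nodes or T time, whichever comes first".
type batchingCriteria struct {
	targetCount int           // flush the batch once this many nodes are queued
	maxWait     time.Duration // ...or once the oldest queued node has waited this long
}

func criteriaForNodeGroup(atomic bool, groupSize int, defaultWait time.Duration) batchingCriteria {
	if atomic {
		// Atomic groups: N equals the (dynamic) group size, T is effectively +inf.
		return batchingCriteria{targetCount: groupSize, maxWait: time.Duration(math.MaxInt64)}
	}
	return batchingCriteria{targetCount: 1, maxWait: defaultWait}
}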

@kawych (Contributor, Author):

I moved it out of the main actuator object; however, I wanted to limit the amount of changes this PR introduces, so I still left it in the actuator directory. Please LMK if you're OK with that.

Regarding the batching logic: after our offline discussion I ended up with a wrapper over the batcher that queues nodes for deletion within one pass of the scale-down loop, but rolls them back if any other node's deletion fails.
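
A minimal sketch of that wrapper idea, under the assumption that the underlying batcher exposes an AddNodes-style method (hypothetical names and signatures, not the PR's actual code; apiv1 is k8s.io/api/core/v1):

import (
	apiv1 "k8s.io/api/core/v1"
)

// atomicDeletionWrapper queues deletions for an atomically-scaled node group within
// one scale-down pass and only forwards them once every node queued successfully.
type atomicDeletionWrapper struct {
	batcher interface {
		AddNodes(nodes []*apiv1.Node, drain bool)
	}
	queued map[string][]*apiv1.Node // nodes queued per node group ID
	failed map[string]bool          // node groups where at least one node failed
}

func (w *atomicDeletionWrapper) ScheduleDeletion(groupID string, node *apiv1.Node, ok bool) {
	if !ok {
		w.failed[groupID] = true
		return
	}
	w.queued[groupID] = append(w.queued[groupID], node)
}

// Flush either hands the complete group to the batcher or drops (rolls back) the queue.
func (w *atomicDeletionWrapper) Flush(groupID string, groupSize int, drain bool) {
	nodes := w.queued[groupID]
	delete(w.queued, groupID)
	if w.failed[groupID] || len(nodes) < groupSize {
		// Roll back: in the real code this is where scale-down taints would be cleaned up.
		delete(w.failed, groupID)
		return
	}
	w.batcher.AddNodes(nodes, drain)
}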

@kawych force-pushed the tpu branch 2 times, most recently from 29d36ab to f14d91a on April 18, 2023 14:34
@k8s-ci-robot added the needs-rebase label on Apr 19, 2023
@k8s-ci-robot removed the needs-rebase label on Apr 21, 2023
@x13n (Member) commented Apr 24, 2023

/assign

cluster-autoscaler/core/scaledown/planner/planner.go (outdated)
@@ -154,17 +157,63 @@ func (p *Planner) NodesToDelete(_ time.Time) (empty, needDrain []*apiv1.Node) {
// downs already in progress. If we pass the empty nodes first, they will be first
// to get deleted, thus we decrease chances of hitting the limit on non-empty scale down.
append(emptyRemovable, needDrainRemovable...),
p.context.AutoscalingOptions.MaxScaleDownParallelism)
// No need to limit the number of nodes, since it will happen later, in the actuation stage.
Member:

Instead of passing math.MaxInt, this param should just be removed. It is effectively going to be unused anyway.

@kawych (Contributor, Author):

It's still needed for sequential scale-down; for now I didn't want to change the API too much.

@kawych force-pushed the tpu branch 2 times, most recently from 12fc952 to aa01b73 on May 31, 2023 15:12
@kawych marked this pull request as ready for review on May 31, 2023 15:14
@k8s-ci-robot removed the do-not-merge/work-in-progress label on May 31, 2023
@k8s-ci-robot requested a review from x13n on May 31, 2023 15:14
cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
// ScaleDownBudgetProcessor is responsible for keeping the number of nodes deleted in parallel within defined limits.
type ScaleDownBudgetProcessor struct {
ctx *context.AutoscalingContext
nodeDeletionTracker *deletiontracker.NodeDeletionTracker
Member:

NodeDeletionTracker should really be an implementation detail of the actuator. All you need here is an interface that can give you DeletionsInProgress.

@kawych (Contributor, Author):

Done: used the ActuationStatus interface.
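
For context, the narrow dependency suggested above amounts to roughly this interface (a sketch; the real scaledown.ActuationStatus interface has more methods):

type deletionsInProgressProvider interface {
	// DeletionsInProgress returns the names of nodes currently being deleted,
	// split into empty-node deletions and drain-then-delete deletions.
	DeletionsInProgress() (emptyInProgress, drainInProgress []string)
}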

cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
drainBudget := bp.ctx.MaxDrainParallelism - len(drainInProgress)

emptyToDelete = []*nodeBucket{}
for _, bucket := range emptyAtomic {
Member:

The following two loops are essentially identical; please factor them out into a function.

@kawych (Contributor, Author):

Done. I agree I should've done it right away; I just got a little overwhelmed by other changes (the ones I actually ended up reverting).
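
A sketch of what such a shared helper might look like (hypothetical name and budget semantics, not the PR's exact code), assuming the nodeBucket shape discussed earlier: atomic buckets are kept whole if they fit the remaining budget and skipped otherwise:

func cropAtomicBuckets(buckets []*nodeBucket, budget int) (kept []*nodeBucket, remaining int) {
	kept = []*nodeBucket{}
	for _, bucket := range buckets {
		if len(bucket.nodes) > budget {
			// All-or-nothing: never split an atomic bucket; skip it if it doesn't fit.
			continue
		}
		kept = append(kept, bucket)
		budget -= len(bucket.nodes)
	}
	return kept, budget
}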

cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
cluster-autoscaler/core/scaledown/actuation/budgets.go (outdated)
@kawych force-pushed the tpu branch 2 times, most recently from 0316e4f to 4a98cf6 on June 26, 2023 15:33
// ScaleDownBudgetProcessor is responsible for keeping the number of nodes deleted in parallel within defined limits.
type ScaleDownBudgetProcessor struct {
ctx *context.AutoscalingContext
actuationStatus scaledown.ActuationStatus
Member:

Why is the status part of the processor? The status will change over time, so it should be passed as a param to CropNodes.

@kawych (Contributor, Author):

I think the naming is misleading here... the actual class used there is named "NodeDeletionTracker", which clearly indicates that it holds up-to-date status, but the interface defined for its users is called "ActuationStatus", which suggests something that stays the same over time. I'm not sure what the intention behind that was.

Regarding this particular comment: sure, I can pass the actuation status to "CropNodes", although my preference would be to treat "ActuationStatus" as an active tracker of ongoing deletions (which it is), not as a representation of a single point-in-time status.

cluster-autoscaler/core/scaledown/budgets/budgets.go (outdated)
//drainIndividual, drainAtomic := bp.groupByNodeGroup(drain)

emptyInProgress, drainInProgress := bp.actuationStatus.DeletionsInProgress()
parallelismBudget := bp.ctx.MaxScaleDownParallelism - len(emptyInProgress) - len(drainInProgress)
Member:

nit: I think you could encapsulate the budget updates in a separate object, to avoid the arithmetic here and to make the budget calculations unit-testable.

@kawych (Contributor, Author):

I'm not really convinced about that... I was able to mock the budgets by setting AutoscalingOptions in the autoscaling context. I also don't think splitting this calculation out helps readability: if someone is interested in the budget calculation logic, it's easier to read it directly in this ~20-line function than to navigate to a different file.
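
For readers curious about the reviewer's suggestion, an encapsulated budget could look roughly like this (hypothetical type, not part of the PR):

// scaleDownBudget tracks the remaining parallelism budgets for one actuation pass.
type scaleDownBudget struct {
	overall int // remaining overall scale-down parallelism
	drain   int // remaining drain parallelism
}

func newScaleDownBudget(maxParallel, maxDrain, emptyInProgress, drainInProgress int) *scaleDownBudget {
	return &scaleDownBudget{
		overall: maxParallel - emptyInProgress - drainInProgress,
		drain:   maxDrain - drainInProgress,
	}
}

// spend reports whether the requested nodes fit and, if so, deducts them from the budget.
func (b *scaleDownBudget) spend(nodes int, drain bool) bool {
	if nodes > b.overall || (drain && nodes > b.drain) {
		return false
	}
	b.overall -= nodes
	if drain {
		b.drain -= nodes
	}
	return true
}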

cluster-autoscaler/core/scaledown/planner/planner.go (outdated)
Previous "CropNodes" function of ScaleDownBudgetProcessor had an
assumption that atomically-scaled node groups should be classified as
"empty" or "drain" as a whole, however Cluster Autoscaler may classify
some of the nodes from a single group as "empty" and other as "drain".
@x13n (Member) commented Jun 30, 2023

Thanks for all the changes and apologies for taking so long to review this!

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Jun 30, 2023
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jun 30, 2023
@k8s-ci-robot merged commit 4c55b17 into kubernetes:master on Jun 30, 2023
4 checks passed
@vadasambar (Member) commented Jul 4, 2023

that allows for all-or-nothing scale down of the node group.

@kawych is there any issue/doc explaining why we need this feature?

I came across this PR when I was trying to resolve merge conflicts in #5672 (which also changes actuator_test.go) and became interested in it. My guess is we want to go from 0 -> 100 (or 100 -> 0) nodes instead of 0 -> 20 -> 50 -> 90 -> 100 nodes (or the reverse). I am guessing the former is faster than the latter (which would mean faster scaling up/down).

@kawych (Contributor, Author) commented Jul 4, 2023

@vadasambar Sorry, I don't have any more detailed references to provide. But scale up/down speed is not the goal here. Instead, this change allows all-or-nothing provisioning of the nodes within a node group: a scale-up happens in one batch, scaling directly to the maximum node group size. It's useful if you want to avoid partial scale-ups (where subsequent scale-ups could fail if there is not enough capacity), so that you don't pay for infrastructure until there is enough capacity to actually run your workloads.

Note that the cloud provider also has to ensure that a scale-up will not end up in a partial state.

@vadasambar (Member):

@kawych thank you for the reply.

It's useful if you want to avoid partial scale-ups (while the subsequent scale-ups could fail if there is not enough capacity), so that you don't have to pay for infrastructure until you have enough capacity to actually run your workloads.

Sorry if this is a silly question. As a user, wouldn't you expect CA to scale up (even if partially) so that there's space for Pending pods, instead of trying to do it all at once? I am not sure I understand the problem we are trying to solve here. I am guessing this is for workloads like jobs that need to run all at once (but I don't understand why partial scaling wouldn't work there). I guess I am confused about what benefit there is in waiting to scale until enough capacity is available.

@kawych (Contributor, Author) commented Jul 5, 2023

@vadasambar Your example is correct: this is applicable mostly to batch workloads that work collectively and should be scheduled at once. Additionally, the infrastructure might allow customized placement of the VMs, but that benefits from knowing the exact number of nodes upfront.
For most workloads, though, I agree that it makes more sense to create nodes when we can. This is not the kind of feature we expect to enable by default; rather, it should be intentionally enabled for the use cases where it makes sense.

@vadasambar (Member):

@vadasambar Your example is correct, this is applicable mostly for batch workloads that work collectively and should be scheduled at once. Additionally, the infrastructure might allow for customized placement of the VMs, but that would benefit from knowing the exact number of nodes upfront. For most workloads though, I agree that it makes more sense to create nodes when we can. This is not the kind of a feature we expect to start enabling by default, it's rather one that should be intentionally enabled for the use cases where it makes sense.

Thank you for the reply and explaining the use-case @kawych.

Comment on lines +57 to +61
autoscalingOptions, err := nodeGroup.GetOptions(ctx.NodeGroupDefaults)
if err != nil {
klog.Errorf("Failed to get autoscaling options for node group %s: %v", nodeGroup.Id(), err)
continue
}
Member:

Cloud providers were previously free to return nil, cloudprovider.ErrNotImplemented from NodeGroup.GetOptions(). All 5 cloud provider NodeGroups that I spot-checked return this, including Hetzner, which I am using.

With the current master branch I get an error from these lines when cluster-autoscaler tries to scale down my node group, causing it to never scale down:

E0815 10:24:08.433379       1 post_filtering_processor.go:59] Failed to get autoscaling options for node group pool1: Not implemented

I think we can add an explicit check for cloudprovider.ErrNotImplemented here and fall back to the standard scale down for such node groups.
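
A minimal sketch of the suggested check, based on the snippet quoted above (the eventual fix in #6037/#6038 may differ in details):

autoscalingOptions, err := nodeGroup.GetOptions(ctx.NodeGroupDefaults)
if err != nil && err != cloudprovider.ErrNotImplemented {
	klog.Errorf("Failed to get autoscaling options for node group %s: %v", nodeGroup.Id(), err)
	continue
}
// A nil options struct or ErrNotImplemented simply means the node group should use
// the standard scale-down path instead of atomic scale down.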

@kawych (Contributor, Author):

Sure, makes sense. Apologies for introducing the bug; I'll follow up with a fix and a test case to catch issues like that in the future.

Member:

I fixed it in #6037 & #6038. You can check out the PRs; there are still some methods calling NodeGroup.GetOptions() that do not check for cloudprovider.ErrNotImplemented, which I did not touch.
