
MachineSet reconciliation too eager during scale down when using Oldest or Newest deletion policies #9334

Closed
Oats87 opened this issue Aug 29, 2023 · 10 comments · Fixed by #10087
Labels
area/machineset Issues or PRs related to machinesets
kind/bug Categorizes issue or PR as related to a bug.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Oats87

Oats87 commented Aug 29, 2023

What steps did you take and what happened?

A MachineSet scale down operation can lead to too many replicas being torn down.

Given a scenario where a MachineSet exists with multiple replicas and the user desires to scale down the number of replicas, it's possible that the MachineSet controller could delete more than the desired number of machines. We specifically found this when testing the Oldest deletion policy, but a quick glance at the code shows that it's possible to hit this with both Oldest and Newest. Notably, Random does not apply the same weighting strategies as Oldest and Newest for unhealthy nodes.

The behavior we are seeing is hit through the following conditions:

  1. The node targeted for scale down is NOT the first node lexicographically in the set
  2. On scale down, the removal of the targeted node causes the other two replicas to flap through a NotReady state, which in turn causes the machine health to be considered false for these machines

In this case, we are seeing the MachineSet controller tearing down two machines instead of one: in the first reconciliation it tears down the desired node, but as the rest of the machines in the set go temporarily NotReady, on the second reconciliation attempt it finds and deletes the first machine lexicographically, as it is weighted the same as the targeted node. See:

// Snippet from the Oldest/Newest delete priority logic (Random weights these cases differently):
if !machine.DeletionTimestamp.IsZero() {
	return mustDelete
}
if _, ok := machine.ObjectMeta.Annotations[clusterv1.DeleteMachineAnnotation]; ok {
	return mustDelete
}
if !isMachineHealthy(machine) {
	return mustDelete
}

It appears that a change was made (#3434) that specifically includes machines in a deleting state in the slice of machines to be processed, which is understandable given the problem it solves. I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

What did you expect to happen?

I did not expect to have more than my desired number of machines deleted on scale down.

Cluster API version

v1.4.4

Kubernetes version

v1.26.7

Anything else you would like to add?

cc: @richardcase @furkatgofurov7

Label(s) to be applied

/kind bug
/area machineset

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/machineset Issues or PRs related to machinesets needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 29, 2023
@furkatgofurov7
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@richardcase
Member

richardcase commented Aug 30, 2023

This is an interesting one. For this to be an issue, the following has to happen (there are other variants):

  • A MachineSet is scaled down
  • There are no Machines with a deletion timestamp, deletion annotation or marked as unhealthy
  • The Machine machineb is deleted based on it having the highest priority assigned by deletePriority(float64(mustDelete) * (1.0 - math.Exp(-d.Seconds()/secondsPerTenDays))). For argument's sake, 50.34
  • The Machine machineb starts to delete, but due to the finalizer the deletion timestamp is set to a non-zero value and it's not removed until after the infra machine deletes (which may take some time)
  • Reconciliation returns because waitForMachineDeletion waits for "not found error" or a non-zero deletion timestamp
  • Another Machine (machinea) becomes unhealthy (this can be for whatever reason, but from this issue's description it's caused by the deletion of machineb)
  • The MachineSet controller reconciles again (for whatever reason); now machinea has a priority of 100.0 because it's unhealthy, and machineb has a non-zero deletion timestamp, so its priority is also set to 100.0 (up from its earlier value).
    • As both machines have the same priority, the sorting done by sortableMachine will order by the machine names, and so machinea is deleted (again, because of the finalizer this takes time and the deletion timestamp is set)
    • Now we have 2 machines deleting when only 1 should've been deleted.
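
To make the tie-break concrete, here is a small, self-contained sketch of that sequence (not the upstream code: the weights and the ten-day decay mirror the priority snippet quoted in the issue, while the name-based fallback for equal priorities and the machine ages are assumptions for illustration):

// Self-contained sketch of the sequence above (not the upstream code).
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

const (
	mustDelete        = 100.0
	couldDelete       = 20.0
	secondsPerTenDays = 864000.0
)

type machine struct {
	name      string
	created   time.Time
	deleting  bool // non-zero deletion timestamp
	unhealthy bool
}

// oldestPriority reflects the current binary weighting: deleting and
// unhealthy machines both jump straight to mustDelete, everything else
// gets an age-based score below that.
func oldestPriority(m machine, now time.Time) float64 {
	if m.deleting || m.unhealthy {
		return mustDelete
	}
	d := now.Sub(m.created)
	if d.Seconds() < 0 {
		return couldDelete
	}
	return mustDelete * (1.0 - math.Exp(-d.Seconds()/secondsPerTenDays))
}

func main() {
	now := time.Now()
	machines := []machine{
		{name: "machinea", created: now.Add(-5 * 24 * time.Hour), unhealthy: true}, // flapped NotReady
		{name: "machineb", created: now.Add(-7 * 24 * time.Hour), deleting: true},  // the targeted machine (~50.34 before it started deleting)
		{name: "machinec", created: now.Add(-3 * 24 * time.Hour)},
	}
	sort.Slice(machines, func(i, j int) bool {
		pi, pj := oldestPriority(machines[i], now), oldestPriority(machines[j], now)
		if pi != pj {
			return pi > pj // highest priority first
		}
		return machines[i].name < machines[j].name // equal priorities fall back to name order
	})
	// machinea and machineb both score 100; "machinea" sorts first, so the
	// second reconcile picks machinea for deletion as well.
	fmt.Println(machines[0].name) // prints "machinea"
}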

@enxebre
Member

enxebre commented Aug 30, 2023

the removal of the targeted node causes the other two replicas to flap through a NotReady state

Not that it's determinant for the issue, as any reason could cause another machine to go unhealthy, but can you elaborate on why that happens?

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough?
Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

@richardcase
Member

richardcase commented Aug 30, 2023

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough? Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

@enxebre that would help in the scenario where the "oldest" is deleted and then another machine becomes unhealthy whilst the first was deleting all resources. But it may require changing some of the other priorities...

I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

@Oats87 (& @jakefhyde) - do you think that the 'unhealthy' machines should always have a lower priority compared to whatever the chosen machine selection strategy is? So with oldest deletion policy, where we have:

  • an unhealthy machine with the newest creation timestamp
  • a healthy machine with the oldest creation timestamp

Then the oldest (and healthy) machine should have the highest priority and be deleted instead of the unhealthy machine?

@Oats87
Author

Oats87 commented Aug 30, 2023

the removal of the targeted node causes the other two replicas to flap through a NotReady state

Not that it's determinant for the issue, as any reason could cause another machine to go unhealthy, but can you elaborate on why that happens?

Absolutely. We have a bit of a unique architecture in which we manage controlplane/etcd nodes via MachineDeployment/MachineSet objects; while I understand this is not the standard paradigm for cluster-api controlplane providers, we have it as a remnant of architectural decisions made a while ago. We are exploring moving to a more capi-like paradigm and having our controlplane provider manage our controlplane/etcd nodes, but this is not going to happen anytime soon.

The reason our subsequent nodes in the set can flap through unhealthy is due to K3s/RKE2, wherein we must swap out the --server endpoint for the subsequent nodes if the node we are currently deleting is our elected "init node". Notably, we are unable to use the cluster-api infrastructure-provider-managed load balancer as K3s/RKE2 need to talk to an etcd-only node, but this is a discussion for another day and stems from technical limitations.

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough? Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

In this specific case, I believe that would fix the issue. I believe we should be applying a different base priority for every case, i.e. deleted, unhealthy, annotated, etc.

I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

@Oats87 (& @jakefhyde) - do you think that the 'unhealthy' machines should always have a lower priority compared to whatever the chosen machine selection strategy is? So with oldest deletion policy, where we have:

* an unhealthy machine with the newest creation timestamp

* a healthy machine with the oldest creation timestamp

Then the oldest (and healthy) machine should have the highest priority and be deleted instead of the unhealthy machine?

Yes, I believe that if a policy is stated as Oldest, we should prioritize the oldest machine first. I don't believe unhealthy machines should be given special priority here, as I would expect Oldest to indicate that the oldest machine in my set is deleted regardless of health -- I believe there is nuance here where new policies with more granularity could be introduced, e.g. UnhealthyFirstThenOldest.

@jakefhyde

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough? Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

@enxebre that would help in the scenario where the "oldest" is deleted and then another machine becomes unhealthy whilst the first was deleting all resources. But it may require changing some of the other priorities...

I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

@Oats87 (& @jakefhyde) - do you think that the 'unhealthy' machines should always have a lower priority compared to whatever the chosen machine selection strategy is? So with oldest deletion policy, where we have:

  • an unhealthy machine with the newest creation timestamp
  • a healthy machine with the oldest creation timestamp

Then the oldest (and healthy) machine should have the highest priority and be deleted instead of the unhealthy machine?

I mulled this over and I think that when scaling down, each case should be weighted with its own non-overlapping priority. I do agree with @Oats87 that there is nuance here, and more granularity may be the best option. Although deleting the oldest healthy node as opposed to a newer unhealthy one could potentially cause the worker plane to be resource constrained, I think that is ultimately best left to the cluster administrator to decide, as an unhealthy node does not explicitly indicate desired deletion of said machine under the current deletion policy.

@enxebre
Member

enxebre commented Aug 31, 2023

Although deleting the oldest healthy node as opposed to a newer unhealthy one could potentially cause the worker plane to be resource constrained, I think that is ultimately best left to the cluster administrator to decide, as an unhealthy node does not explicitly indicate desired deletion of said machine under the current deletion policy.

I think both behaviours are fair and consumers might lean towards one approach or the other depending on their specific use case. So far this is the first report I know of where the current opinionated behaviour is not fully satisfactory. Because of that, my preference would be to keep the scope of this ticket to the minimum, i.e. fix the "scaling down scales more replicas than expected" issue; to that end, I think aligning with the Random strategy's weights for the special cases would probably be enough.

	if !machine.DeletionTimestamp.IsZero() {
		return mustDelete
	}
	if _, ok := machine.ObjectMeta.Annotations[clusterv1.DeleteMachineAnnotation]; ok {
		return betterDelete
	}
	if !isMachineHealthy(machine) {
		return betterDelete
	}

Then separately we can discuss how to satisfy additional use cases, how strong they are, and how flexible we want to go; e.g. if we wanted to, this granularity could even be configurable.

@Oats87
Author

Oats87 commented Sep 1, 2023

I think both behaviours are fair and consumers might lean towards one approach or the other depending on their specific use case. So far this is the first report I know of where the current opinionated behaviour is not fully satisfactory. Because of that, my preference would be to keep the scope of this ticket to the minimum, i.e. fix the "scaling down scales more replicas than expected" issue; to that end, I think aligning with the Random strategy's weights for the special cases would probably be enough.

The issue I see with this approach (identical priorities for unhealthy and delete-annotated machines) is that repeated scaling operations can still run into the same problem (albeit it's a significantly harder edge case to hit).

Assuming I have a set of machines named machine-a, machine-b, and machine-c, where machine-c is the oldest machine: if I scale down from a desired quantity of 3 to 2, the MachineSet controller will delete machine-c. If the same type of behavior occurs, i.e. machine-a and machine-b go unhealthy (NotReady) for a period of time during machine-c's scale down, AND I annotate machine-b with cluster.x-k8s.io/delete-machine and scale down once again, then machine-a (unhealthy) and machine-b (annotated) carry the same priority, and it is possible for machine-a to be deleted, which is not actually the machine that should be deleted. We can capture this in a separate issue if you'd like, or just mutate the original issue to cover this case.

Per the documentation here: https://cluster-api.sigs.k8s.io/reference/labels_and_annotations the annotation is described as indicating that its existence will cause the machine to be "given top priority on all delete policies" which is not actually the case as of now.

As such, I am still proposing that the solution to this issue be giving a specific priority to every "case", with the general priority in descending order being

  1. if a machine is currently deleting
  2. if the annotation is set
  3. if the machine is not healthy
  4. all other machines in the set

This would likely require a new constant to be added (not sure what to name it) that would probably be 75. I can open a PR to introduce this change for the policies, but I wanted to unify on the agreed-upon approach before opening the PR.
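
For illustration, a minimal sketch of what those non-overlapping tiers could look like (this is not the upstream implementation; the shouldDelete name, the 75 value, and the helper types are placeholders):

// Hedged sketch, not upstream code: distinct tiers so a deleting machine
// always outranks an annotated one, which in turn outranks an unhealthy one.
package deletepolicy

type deletePriority float64

const (
	mustDelete   deletePriority = 100 // non-zero deletion timestamp
	shouldDelete deletePriority = 75  // cluster.x-k8s.io/delete-machine annotation set (placeholder name/value)
	betterDelete deletePriority = 50  // failed the health check
	couldDelete  deletePriority = 20  // everything else
)

type machine struct {
	deleting  bool
	annotated bool
	healthy   bool
}

func oldestStylePriority(m machine) deletePriority {
	switch {
	case m.deleting:
		return mustDelete
	case m.annotated:
		return shouldDelete
	case !m.healthy:
		return betterDelete
	default:
		// the age-based weighting within this tier is elided here
		return couldDelete
	}
}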

@enxebre
Member

enxebre commented Sep 4, 2023

Right, different weights make sense to me.

ctrox added a commit to ninech/cluster-api that referenced this issue Feb 1, 2024
This introduces an additional deletePriority between betterDelete and
mustDelete to avoid a race where multiple machines might be deleted at
the same time as described in kubernetes-sigs#9334.
@fabriziopandini
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024