
MachineSet reconciliation too eager during scale down when using Oldest or Newest deletion policies #9334

Closed
Oats87 opened this issue Aug 29, 2023 · 10 comments · Fixed by #10087
Labels
area/machineset Issues or PRs related to machinesets
kind/bug Categorizes issue or PR as related to a bug.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Oats87

Oats87 commented Aug 29, 2023

What steps did you take and what happened?

A MachineSet scale down operation can lead to too many replicas being torn down.

Given a scenario where a MachineSet exists with multiple replicas and the user desires to scale down the number of replicas, it's possible that the MachineSet controller could delete more than the desired number of machines. We specifically found this when testing the Oldest deletion policy, but a quick glance at the code shows that it's possible to hit this with both Oldest and Newest. Notably, Random does not apply the same weighting strategies as Oldest and Newest for unhealthy nodes.

The behavior we are seeing is hit through the following conditions:

  1. The node targeted for scale down is NOT the first node lexicographically in the set
  2. On scale down, the removal of the targeted node causes the other two replicas to flap through a NotReady state, which in turn causes the machine health to be considered false for these machines

In this case, we are seeing the MachineSet controller tearing down two machines instead of one: in the first reconciliation it tears down the desired node, but as the rest of the machines in the set go temporarily NotReady, on the second reconciliation attempt it finds and deletes the first machine lexicographically, as it is weighted the same as the targeted node. See:

// Snippet from the Oldest/Newest delete priority logic (Random weights these cases differently):
if !machine.DeletionTimestamp.IsZero() {
	return mustDelete
}
if _, ok := machine.ObjectMeta.Annotations[clusterv1.DeleteMachineAnnotation]; ok {
	return mustDelete
}
if !isMachineHealthy(machine) {
	return mustDelete
}

It appears that a change was made (#3434) that specifically includes machines in a deleting state in the slice of machines to be processed, which is understandable given the problem it solves. I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

What did you expect to happen?

I did not expect to have more than my desired number of machines deleted on scale down.

Cluster API version

v1.4.4

Kubernetes version

v1.26.7

Anything else you would like to add?

cc: @richardcase @furkatgofurov7

Label(s) to be applied

/kind bug
/area machineset

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/machineset Issues or PRs related to machinesets needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 29, 2023
@furkatgofurov7
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@richardcase
Member

richardcase commented Aug 30, 2023

This is an interesting one. For this to be an issue, the following has to happen (there are other variants):

  • A MachineSet is scaled down
  • There are no Machines with a deletion timestamp, deletion annotation or marked as unhealthy
  • The Machine machineb is deleted based on it having the highest priority assigned by deletePriority(float64(mustDelete) * (1.0 - math.Exp(-d.Seconds()/secondsPerTenDays))). For argument's sake, 50.34
  • The Machine machineb starts to delete, but due to the finalizer the deletion timestamp is set to a non-zero value and it's not removed until after the infra machine deletes (which may take some time)
  • Reconciliation returns because waitForMachineDeletion waits for "not found error" or a non-zero deletion timestamp
  • Another Machine (machinea) becomes unhealthy (this can be for whatever reason, but from this issue's description it's caused by the deletion of machineb)
  • The MachineSet controller reconciles again (for whatever reason); now machinea has a priority of 100.0 because it's unhealthy, and machineb has a non-zero deletion timestamp, so its priority is also set to 100.0 (up from its earlier value).
    • As both machines have the same priority, the sorting done by sortableMachine will order by the machine names, and so machinea is deleted (again, because of the finalizer this takes time and the deletion timestamp is set)
    • Now we have 2 machines deleting when only 1 should've been deleted.
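
To make the tie-break concrete, here is a small, self-contained sketch of that sequence (not the upstream code: the weights and the ten-day decay mirror the priority snippet quoted in the issue, while the name-based fallback for equal priorities and the machine ages are assumptions for illustration):

// Self-contained sketch of the sequence above (not the upstream code).
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

const (
	mustDelete        = 100.0
	couldDelete       = 20.0
	secondsPerTenDays = 864000.0
)

type machine struct {
	name      string
	created   time.Time
	deleting  bool // non-zero deletion timestamp
	unhealthy bool
}

// oldestPriority reflects the current binary weighting: deleting and
// unhealthy machines both jump straight to mustDelete, everything else
// gets an age-based score below that.
func oldestPriority(m machine, now time.Time) float64 {
	if m.deleting || m.unhealthy {
		return mustDelete
	}
	d := now.Sub(m.created)
	if d.Seconds() < 0 {
		return couldDelete
	}
	return mustDelete * (1.0 - math.Exp(-d.Seconds()/secondsPerTenDays))
}

func main() {
	now := time.Now()
	machines := []machine{
		{name: "machinea", created: now.Add(-5 * 24 * time.Hour), unhealthy: true}, // flapped NotReady
		{name: "machineb", created: now.Add(-7 * 24 * time.Hour), deleting: true},  // the targeted machine (~50.34 before it started deleting)
		{name: "machinec", created: now.Add(-3 * 24 * time.Hour)},
	}
	sort.Slice(machines, func(i, j int) bool {
		pi, pj := oldestPriority(machines[i], now), oldestPriority(machines[j], now)
		if pi != pj {
			return pi > pj // highest priority first
		}
		return machines[i].name < machines[j].name // equal priorities fall back to name order
	})
	// machinea and machineb both score 100; "machinea" sorts first, so the
	// second reconcile picks machinea for deletion as well.
	fmt.Println(machines[0].name) // prints "machinea"
}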

@enxebre
Member

enxebre commented Aug 30, 2023

the removal of the targeted node causes the other two replicas to flap through a NotReady state

Not that it's determinant for the issue, as any reason could cause another machine to go unhealthy, but can you elaborate on why that happens?

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough?
Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

@richardcase
Member

richardcase commented Aug 30, 2023

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough? Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

@enxebre that would help in the scenario where the "oldest" is deleted and then another machine becomes unhealthy whilst the first was deleting all resources. But it may require changing some of the other priorities...

I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

@Oats87 (& @jakefhyde) - do you think that the 'unhealthy' machines should always have a lower priority compared to whatever the chosen machine selection strategy is? So with oldest deletion policy, where we have:

  • an unhealthy machine with the newest creation timestamp
  • a healthy machine with the oldest creation timestamp

Then the oldest (and healthy) machine should have the highest priority and be deleted instead of the unhealthy machine?

@Oats87
Author

Oats87 commented Aug 30, 2023

the removal of the targeted node causes the other two replicas to flap through a NotReady state

Not that it's determinant for the issue, as any reason could cause another machine to go unhealthy, but can you elaborate on why that happens?

Absolutely. We have a bit of a unique architecture in which we manage controlplane/etcd nodes via MachineDeployment/MachineSet objects; while I understand this is not the standard paradigm for cluster-api controlplane providers, we have it as a remnant of architectural decisions made a while ago. We are exploring moving to a more capi-like paradigm and having our controlplane provider manage our controlplane/etcd nodes, but this is not going to happen anytime soon.

The reason our subsequent nodes in the set can flap through unhealthy is due to K3s/RKE2, wherein we must swap out the --server endpoint for the subsequent nodes if the node we are currently deleting is our elected "init node". Notably, we are unable to use the cluster-api infrastructure-provider-managed load balancer as K3s/RKE2 need to talk to an etcd-only node, but this is a discussion for another day and stems from technical limitations.

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough? Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

In this specific case, I believe that would fix the issue. I believe we should be applying a different base priority for every case, i.e. deleted, unhealthy, annotated, etc.

I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

@Oats87 (& @jakefhyde) - do you think that the 'unhealthy' machines should always have a lower priority compared to whatever the chosen machine selection strategy is? So with oldest deletion policy, where we have:

* an unhealthy machine with the newest creation timestamp

* a healthy machine with the oldest creation timestamp

Then the oldest (and healthy) machine should have the highest priority and be deleted instead of the unhealthy machine?

Yes, I believe that if a policy is stated as Oldest, we should prioritize the oldest machine first. I don't believe unhealthy machines should be given special priority here, as I would expect Oldest to indicate that the oldest machine in my set is deleted regardless of health -- I believe there is nuance here where new policies with more granularity could be introduced, e.g. UnhealthyFirstThenOldest.

@jakefhyde

Would giving !machine.DeletionTimestamp.IsZero() a higher priority be enough? Also, the same race might happen against a machine that gets the DeleteMachineAnnotation right before a second reconciliation.

@enxebre that would help in the scenario where the "oldest" is deleted and then another machine becomes unhealthy whilst the first was deleting all resources. But it may require changing some of the other priorities...

I believe we are best off simply fixing the Oldest and Newest policies to apply varying priorities rather than being binary.

@Oats87 (& @jakefhyde) - do you think that the 'unhealthy' machines should always have a lower priority compared to whatever the chosen machine selection strategy is? So with oldest deletion policy, where we have:

  • an unhealthy machine with the newest creation timestamp
  • a healthy machine with the oldest creation timestamp

Then the oldest (and healthy) machine should have the highest priority and be deleted instead of the unhealthy machine?

I mulled this over and I think that when scaling down, each case should be weighted with its own non-overlapping priority. I do agree with @Oats87 that there is nuance here, and more granularity may be the best option. Although deleting the oldest healthy node as opposed to a newer unhealthy one could potentially cause the worker plane to be resource constrained, I think that is ultimately best left to the cluster administrator to decide, as an unhealthy node does not explicitly indicate desired deletion of said machine under the current deletion policy.

@enxebre
Member

enxebre commented Aug 31, 2023

Although deleting the oldest healthy node as opposed to a newer unhealthy one could potentially cause the worker plane to be resource constrained, I think that is ultimately best left to the cluster administrator to decide, as an unhealthy node does not explicitly indicate desired deletion of said machine under the current deletion policy.

I think both behaviours are fair and consumers might lean towards one approach or the other depending on their specific use case. So far this is the first report I know of where the current opinionated behaviour is not fully satisfactory. Because of that, my preference would be to keep the scope of this ticket to the minimum, i.e. fix the "scaling down scales more replicas than expected" issue; to that end, I think aligning with the Random strategy's weights for the special cases would probably be enough.

	if !machine.DeletionTimestamp.IsZero() {
		return mustDelete
	}
	if _, ok := machine.ObjectMeta.Annotations[clusterv1.DeleteMachineAnnotation]; ok {
		return betterDelete
	}
	if !isMachineHealthy(machine) {
		return betterDelete
	}

Then separately we can discuss how to satisfy additional use cases, how strong they are, and how flexible we want to go; e.g. if we wanted to, this granularity could even be configurable.

@Oats87
Author

Oats87 commented Sep 1, 2023

I think both behaviours are fair and consumers might lean towards one approach or the other depending on their specific use case. So far this is the first report I know of where the current opinionated behaviour is not fully satisfactory. Because of that, my preference would be to keep the scope of this ticket to the minimum, i.e. fix the "scaling down scales more replicas than expected" issue; to that end, I think aligning with the Random strategy's weights for the special cases would probably be enough.

The issue I see with this approach (identical priorities for unhealthy and delete-annotated machines) is that repeated scaling operations can still run into the same problem (albeit it's a significantly harder edge case to hit).

Assuming I have a set of machines named machine-a, machine-b, and machine-c, where machine-c is the oldest machine: if I scale down from a desired quantity of 3 to 2, the MachineSet controller will delete machine-c. If the same type of behavior occurs, i.e. machine-a and machine-b go unhealthy (NotReady) for a period of time during machine-c's scale down, AND I annotate machine-b with cluster.x-k8s.io/delete-machine and scale down once again, then machine-a (unhealthy) and machine-b (annotated) carry the same priority, and it is possible for machine-a to be deleted, which is not actually the machine that should be deleted. We can capture this in a separate issue if you'd like, or just mutate the original issue to cover this case.

Per the documentation here: https://cluster-api.sigs.k8s.io/reference/labels_and_annotations the annotation is described as indicating that its existence will cause the machine to be "given top priority on all delete policies" which is not actually the case as of now.

As such, I am still proposing that the solution to this issue be giving a specific priority to every "case", with the general priority in descending order being

  1. if a machine is currently deleting
  2. if the annotation is set
  3. if the machine is not healthy
  4. all other machines in the set

This would likely require a new constant to be added (not sure what to name it) that would probably be 75. I can open a PR to introduce this change for the policies, but I wanted to unify on the agreed-upon approach before opening the PR.
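
For illustration, a minimal sketch of what those non-overlapping tiers could look like (this is not the upstream implementation; the shouldDelete name, the 75 value, and the helper types are placeholders):

// Hedged sketch, not upstream code: distinct tiers so a deleting machine
// always outranks an annotated one, which in turn outranks an unhealthy one.
package deletepolicy

type deletePriority float64

const (
	mustDelete   deletePriority = 100 // non-zero deletion timestamp
	shouldDelete deletePriority = 75  // cluster.x-k8s.io/delete-machine annotation set (placeholder name/value)
	betterDelete deletePriority = 50  // failed the health check
	couldDelete  deletePriority = 20  // everything else
)

type machine struct {
	deleting  bool
	annotated bool
	healthy   bool
}

func oldestStylePriority(m machine) deletePriority {
	switch {
	case m.deleting:
		return mustDelete
	case m.annotated:
		return shouldDelete
	case !m.healthy:
		return betterDelete
	default:
		// the age-based weighting within this tier is elided here
		return couldDelete
	}
}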

@enxebre
Member

enxebre commented Sep 4, 2023

Right, different weights make sense to me.

ctrox added a commit to ninech/cluster-api that referenced this issue Feb 1, 2024
This introduces an additional deletePriority between betterDelete and
mustDelete to avoid a race where multiple machines might be deleted at
the same time as described in kubernetes-sigs#9334.
@fabriziopandini
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024