Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort machines deterministically before scaling down #512

Conversation

ingvagabund
Copy link
Contributor

@ingvagabund ingvagabund commented Sep 25, 2018

What this PR does / why we need it:

Sort the machines deterministically to delete machines in the same order. In general the machines are sequenced randomly depending on Machines().List() result. When number of replicas goes down, the reconcilication loop is called multiple times, each time taking a different subset of machines to be deleted. Which may result in deletion of more machines than requested (followed by recreation of new machines). Given each machine represents a node, deleting more nodes than expected can be very disruptive.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ingvagabund
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mkjelland

If they are not already assigned, you can assign the PR to them by writing /assign @mkjelland in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 25, 2018
// Which may result in deletion of more machines than requested (followed
// by recreation of new machines). Given each machine represents a node,
// deleting more nodes than expected can be very disruptive.
sort.Strings(machineNames)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the machines were listed randomly. Given a machine name is generated randomly, the original deletion policy (deleting first machines in a list) is preserved.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given a machine name is generated randomly, the original deletion policy (deleting first machines in a list) is preserved.

How does the final sort preserve the original deletion policy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original deletion policy takes first items from a list where whose order is generated randomly. Reordering the items based on a name that is generated randomly as well changes order if items from one random sequence to another one.

@ingvagabund
Copy link
Contributor Author

@roberthbailey roberthbailey added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 5, 2018
@alvaroaleman
Copy link
Member

// When number of replicas goes down, the reconcilication loop is called multiple
// times, each time taking a different subset of machines to be deleted.
// Which may result in deletion of more machines than requested (followed
 // by recreation of new machines

Why is it called multiple times? It gets called once for the change of .Spec.Replicas and then once again every time a machine gets deleted or am I missing something here?

WTR to the issue that too many machines may be deleted: Wouldn't just filtering out machines in shouldExcludeMachine that have a non-nil deletionTimestamp fix that?

The only possible issue I see here is a racecondition in scaling down when a machine gets deleted by one Reconsile, that deletion is not propagated back to the lister yet while another Reconsile gets called, resulting in that other Reconsile to delete more machines. However this is an issue that can only prevented by using some locking mechanism that blocks further deletions until the first deletions arrived in the lister (And filtering properly, as mentioned above).

Does that make sense or did I get something wrong?

@roberthbailey
Copy link
Contributor

Closing as per #558 (comment).

/close

@k8s-ci-robot
Copy link
Contributor

@roberthbailey: Closing this PR.

In response to this:

Closing as per #558 (comment).

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ingvagabund ingvagabund deleted the sort-machines-before-scaling-down branch October 30, 2018 09:06
jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants