
Remove quorum guard pod after the pod is removed from the cluster #923

Merged
3 commits merged into openshift:master on Sep 16, 2022

Conversation

JoelSpeed
Contributor

Opening this to start a discussion.

I've been working on the ControlPlaneMachineSet and trying to automate replacement of the Control Plane Machines using our on-delete strategy.

One of the things we expect to be able to do is to create multiple replacement machines simultaneously (when a user deletes multiple machines simultaneously) and have the etcd operator handle moving etcd onto the new machines in quick succession.

This seems to work well!

However, what doesn't work currently is that the quorum guard pods on the old machines end up failing: members that have been removed from the etcd cluster fail the health check, so you end up with multiple failing quorum guard pods.
The PDB then blocks these failing pods from being evicted, which blocks the drains.

To avoid this, we can remove the quorum guard pod from a machine that we know has been removed from the quorum.
This prevents the quorum guard pod from blocking the machine from being drained and allows the cluster to remove the old machines.
If we remove the quorum guard pod while the node is marked unschedulable (because it's being drained), it won't be recreated.
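To make this concrete, here's a rough sketch of the removal step. The namespace, label selector, and helper name are illustrative rather than the exact code in this PR:

```go
package guardremoval

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeGuardPodForNode is a sketch of the idea only: once the etcd member on
// nodeName has been removed from the cluster, delete the (now permanently
// unready) quorum guard pod on that node so the PDB no longer blocks the drain.
// Because the node is cordoned by the drain, the pod won't be recreated.
func removeGuardPodForNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// "openshift-etcd" and "app=guard" are assumptions for illustration.
	pods, err := client.CoreV1().Pods("openshift-etcd").List(ctx, metav1.ListOptions{
		LabelSelector: "app=guard",
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := pods.Items[i]
		if pod.Spec.NodeName != nodeName {
			continue
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("failed to delete guard pod %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}
```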

I've done some manual testing of this and it seems like things are working now.

I'd be interested to hear what others think, in case there are edge cases I haven't considered, but I believe this should be safe to add.

@Elbehery
Contributor

Elbehery commented Sep 6, 2022

/assign @tjungblu

@Elbehery
Contributor

Elbehery commented Sep 6, 2022

@JoelSpeed Thanks a lot for this nice work.

Is it possible to add test cases though?

@JoelSpeed
Contributor Author

I've got an open discussion topic on the control plane arch call today; I want to settle on the best outcome there and also test this with and without whatever comes out of that conversation.

I'll add tests once we conclude this is definitely the way to go here; I'll mark this as WIP for now.

@JoelSpeed JoelSpeed changed the title Remove quorum guard pod after the pod is removed from the cluster [WIP] Remove quorum guard pod after the pod is removed from the cluster Sep 6, 2022
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 6, 2022
@JoelSpeed
Contributor Author

The conclusion from the arch call is that there are a few things we should do here: firstly, only drain one control plane machine at a time, but also teach our operators to be smarter about this. Having tested "one drain only" without this patch, it isn't enough on its own to get this working.

So I think we need this patch as part of the minimum viable solution.

The only caveat is that there seems to be a bit of a race: when the Machine API hasn't yet cordoned the node, the etcd operator is both creating and deleting the guard pod. I could update the logic to only remove the guard pod if the node is cordoned, which I think will solve the issue.

So I'll fix that up and add tests tomorrow; let me know if you have any other questions in the meantime.
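For illustration, the cordon gate could be as simple as the following check before deleting the guard pod (a sketch with assumed names, reusing the client from the earlier snippet):

```go
// nodeIsCordoned reports whether the node has already been marked
// unschedulable by the drain. Only removing the guard pod once this is true
// avoids racing the operator's own guard pod creation.
func nodeIsCordoned(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return node.Spec.Unschedulable, nil
}
```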

```go
@@ -83,6 +90,13 @@ func (c *machineDeletionHooksController) sync(ctx context.Context, syncCtx facto
	if err := c.attemptToDeleteMachineDeletionHook(ctx, syncCtx.Recorder()); err != nil {
		errs = append(errs, err)
	}

	// attempt to remove quorum guard pods from machines that are pending deletion and haven't got a deletion hook.
	// This prevents a deadlock when multiple machines are pending deletion simultaneously.
```

Just a little more detail on the deadlock so it's more apparent why we remove the guard pod.

Suggested change

```diff
-// This prevents a deadlock when multiple machines are pending deletion simultaneously.
+// This prevents a deadlock when multiple machines are pending deletion simultaneously, but the nodes cannot be drained
+// because the guard pod on each node is unready (due to non-member etcd pod) and violates the PDB.
```
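For orientation, the call these comments sit above presumably follows the same pattern as the existing deletion hook call; a hypothetical sketch (attemptToRemoveQuorumGuardPods is a stand-in name, not necessarily what the PR adds):

```go
	// Hypothetical wiring, mirroring attemptToDeleteMachineDeletionHook above.
	if err := c.attemptToRemoveQuorumGuardPods(ctx, syncCtx.Recorder()); err != nil {
		errs = append(errs, err)
	}
```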

@hasbro17
Contributor

hasbro17 commented Sep 6, 2022

@JoelSpeed Thanks for putting up the PR.
While this might be better suited to a separate quorum guard cleanup controller, I'm okay with this being in the deletion hooks controller for now.
The approach and implementation look good to me, so we can move this along and add the unit tests.

@JoelSpeed JoelSpeed changed the title [WIP] Remove quorum guard pod after the pod is removed from the cluster Remove quorum guard pod after the pod is removed from the cluster Sep 7, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2022
@JoelSpeed
Contributor Author

@hasbro17 I've updated that comment and added some unit tests for this code; let me know what you think.
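As a rough idea of what such a test can look like, here's a sketch against the hypothetical helper from the earlier snippet, using a fake clientset (not the actual tests in the PR):

```go
package guardremoval

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestRemoveGuardPodForNode(t *testing.T) {
	// Seed the fake clientset with a guard pod on node-a (name, namespace and labels assumed).
	guardPod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "etcd-guard-node-a",
			Namespace: "openshift-etcd",
			Labels:    map[string]string{"app": "guard"},
		},
		Spec: corev1.PodSpec{NodeName: "node-a"},
	}
	client := fake.NewSimpleClientset(guardPod)

	if err := removeGuardPodForNode(context.Background(), client, "node-a"); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}

	// The guard pod on the drained node should be gone afterwards.
	pods, err := client.CoreV1().Pods("openshift-etcd").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		t.Fatalf("unexpected error listing pods: %v", err)
	}
	if len(pods.Items) != 0 {
		t.Fatalf("expected guard pod to be deleted, found %d pods", len(pods.Items))
	}
}
```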

@Elbehery
Contributor

Elbehery commented Sep 7, 2022

@JoelSpeed great work, especially the testing 👍🏽 🎉

Let's wait for the CI. I'd be interested in testing this manually; are there any docs/guides?

@JoelSpeed
Contributor Author

Let's wait for the CI. I'd be interested in testing this manually; are there any docs/guides?

It's not super easy to test right now, as I've been testing this alongside some other WIP PRs. But what you could do is create a cluster from this branch, manually create three additional control plane machines, and then delete the original three control plane machines.

This would trigger the etcd operator to start the migration and, eventually, MAPI to start draining. You may see that things get stuck (this is where the other WIP work comes in), but you should see the etcd quorum guard pod being removed correctly for each of the deleted machines.

Once I update the drain controller in MAPI to only drain one control plane machine at a time, the whole process works and things resolve themselves correctly, but that's just on a branch right now; I haven't written unit tests for it or had it reviewed.

@Elbehery
Contributor

Elbehery commented Sep 7, 2022

/retest-required

@tjungblu
Contributor

tjungblu commented Sep 7, 2022

Yep, great tests; the rest also LGTM. It would be great if you could add a couple more log statements to make debugging a bit easier later on.

Thanks for your help :)

@hasbro17
Contributor

hasbro17 commented Sep 7, 2022

Tests look good, thanks for adding them so fast 👍
/approve

Deferring lgtm to @tjungblu

@hasbro17
Contributor

hasbro17 commented Sep 7, 2022

/retest-required

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 7, 2022
@JoelSpeed
Contributor Author

@tjungblu I've added some additional log lines at level 4; let me know what you think.
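For instance, the kind of level-4 line meant here would look roughly like this (klog is assumed, and the message text is illustrative rather than copied from the PR):

```go
klog.V(4).Infof("Machine %s is pending deletion and its member has been removed, deleting guard pod %s/%s",
	machineName, podNamespace, podName)
```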

@tjungblu
Contributor

tjungblu commented Sep 8, 2022

looks great!

@tjungblu
Contributor

tjungblu commented Sep 9, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, JoelSpeed, tjungblu


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD ee2247d and 2 for PR HEAD 9d08b11 in total

@Elbehery
Contributor

Elbehery commented Sep 9, 2022

/retest-required

@tjungblu
Contributor

/retest

@JoelSpeed
Contributor Author

/retest-required

@JoelSpeed
Contributor Author

Still getting alerts about OVN; not sure whether that's a persistent failure or not.
/retest-required

@JoelSpeed
Contributor Author

A different flake this time
/retest-required

@JoelSpeed
Contributor Author

/retest-required

@Elbehery
Contributor

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a2a14fa and 1 for PR HEAD 9d08b11 in total

@JoelSpeed
Contributor Author

/retest-required

@JoelSpeed
Contributor Author

/retest-required

@hasbro17
Contributor

/override ci/prow/configmap-scale
/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Sep 15, 2022

@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/configmap-scale

In response to this:

/override ci/prow/configmap-scale
/retest-required


@JoelSpeed
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2022

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-disruptive | 9d08b11 | link | false | /test e2e-gcp-disruptive |
| ci/prow/e2e-gcp-disruptive-five-control-plane-replicas | 9d08b11 | link | false | /test e2e-gcp-disruptive-five-control-plane-replicas |
| ci/prow/e2e-gcp-upgrade-five-control-plane-replicas | 9d08b11 | link | false | /test e2e-gcp-upgrade-five-control-plane-replicas |
| ci/prow/e2e-aws-disruptive-ovn | 9d08b11 | link | false | /test e2e-aws-disruptive-ovn |
| ci/prow/e2e-aws-disruptive | 9d08b11 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-serial | 9d08b11 | link | true | /test e2e-aws-serial |


@JoelSpeed
Contributor Author

/retest-required

@openshift-merge-robot openshift-merge-robot merged commit 45a07ee into openshift:master Sep 16, 2022
@JoelSpeed JoelSpeed deleted the remove-guard-pod branch September 18, 2022 10:59