Remove quorum guard pod after the pod is removed from the cluster #923
Conversation
/assign @tjungblu
@JoelSpeed Thanks a lot for this nice work. Is it possible to add test cases though?
I've got an open discussion topic on the control plane arch call today; I want to reach the best conclusion there and also test this with and without whatever the outcome of that conversation is. I'll add tests once we conclude this is definitely the way to go. I'll mark as WIP for now.
The conclusion from the arch call is that there are a few things we should do here: firstly, only drain one control plane machine at a time, but also, teach our operators to be smarter about this. Having tested "one drain only" without this patch, it isn't enough to get things working, so I think we need this patch as part of the minimum viable solution. The only caveat is that there seems to be a bit of a race: when the Machine API hasn't yet cordoned the node, the etcd operator is both creating and deleting the guard pod. I could update the logic to only remove the guard pod if the current node is cordoned, which I think will solve the issue. So I'll fix that up and add tests tomorrow; let me know if you have any other questions in the meantime.
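For concreteness, here is a minimal sketch of the cordon check described above, assuming client-go access to the node; the helper name and wiring are hypothetical and not the code in this PR:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// nodeIsCordoned reports whether the node backing a machine pending deletion has
// been marked unschedulable by the drain controller. Gating the guard pod removal
// on this avoids the race where the operator is recreating the pod on one side
// while deleting it on the other.
func nodeIsCordoned(ctx context.Context, nodes corev1client.NodeInterface, nodeName string) (bool, error) {
	node, err := nodes.Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return node.Spec.Unschedulable, nil
}
```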
```go
@@ -83,6 +90,13 @@ func (c *machineDeletionHooksController) sync(ctx context.Context, syncCtx facto
	if err := c.attemptToDeleteMachineDeletionHook(ctx, syncCtx.Recorder()); err != nil {
		errs = append(errs, err)
	}

	// attempt to remove quorum guard pods from machines that are pending deletion and haven't got a deletion hook.
	// This prevents a deadlock when multiple machines are pending deletion simultaneously.
```
Just a little more detail on the deadlock so it's more apparent why we remove the guard pod.
Suggested change:
- // This prevents a deadlock when multiple machines are pending deletion simultaneously.
+ // This prevents a deadlock when multiple machines are pending deletion simultaneously, but the nodes cannot be drained
+ // because the guard pod on each node is unready (due to non-member etcd pod) and violates the PDB.
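To make the deadlock concrete, here is a toy illustration (not code from this PR) of how the disruption budget dries up once guard pods go unready; the pod counts below assume a mid-replacement cluster with six guard pods and a PDB that tolerates a single disruption:

```go
package main

import "fmt"

// disruptionsAllowed mirrors the PodDisruptionBudget status calculation:
// evictions are only permitted while currentHealthy exceeds desiredHealthy.
func disruptionsAllowed(currentHealthy, desiredHealthy int) int {
	if d := currentHealthy - desiredHealthy; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Six guard pods mid-replacement and a budget of one disruption means
	// desiredHealthy is 5. The three old nodes' guard pods are unready because
	// their etcd members are already gone, so currentHealthy is 3.
	fmt.Println(disruptionsAllowed(3, 5)) // 0 -> no guard pod can be evicted, every drain stalls
}
```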
@JoelSpeed Thanks for putting up the PR.
@hasbro17 I've updated that comment and added some unit tests for this code, let me know what you think.
@JoelSpeed Great work, especially the testing 👍🏽 🎉 Let's wait for the CI. I would be interested to test this manually; are there any docs / guides?
It's not super easy to test right now as I've been testing this alongside some other WIP PRs. But what you could do is create a cluster from this branch, then manually create three additional control plane machines, then delete the original three control plane machines. This would trigger the etcd operator to start the migration and eventually MAPI to start draining. You may see that things get stuck (this is where the other WIP comes in), but what you should see is that the etcd quorum guard pod is removed correctly for each of the deleted machines. Once I update the drain controller in MAPI to only drain one control plane machine at a time, the whole process works and things resolve themselves correctly, but that's just on a branch right now and I haven't written unit tests or had review of that.
/retest-required
Yep, great tests, the rest also lgtm. It would be great if you could add a couple more log statements to make debugging a bit easier later on. Thanks for your help :)
Tests look good, thanks for adding them so fast 👍 Deferring lgtm to @tjungblu
/retest-required
@tjungblu Added some additional log lines at level 4, let me know what you think.
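(For readers unfamiliar with the verbosity levels: level-4 lines are only emitted when the operator runs with `--v=4` or higher. A trivial, made-up example of the style, not the actual lines added in this PR:)

```go
package main

import "k8s.io/klog/v2"

func main() {
	machineName, podName := "master-0", "etcd-guard-master-0"
	// Hidden at the default verbosity; shown when the operator runs with --v=4 or above.
	klog.V(4).Infof("machine %q is pending deletion, removing quorum guard pod %q", machineName, podName)
}
```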
looks great!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hasbro17, JoelSpeed, tjungblu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required
/retest
/retest-required
Still getting alerts about OVN, not sure if that's a persistent failure or not.
A different flake this time.
/retest-required
1 similar comment
/retest-required
/retest-required
1 similar comment
/retest-required
/override ci/prow/configmap-scale
@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/configmap-scale. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest-required
Opening this to start a discussion.
I've been working on the ControlPlaneMachineSet and trying to automate replacement of the Control Plane Machines using our on-delete strategy.
One of the things we expect to be able to do is to create multiple replacement machines simultaneously (when a user deletes multiple machines simultaneously) and have the etcd operator handle moving etcd onto the new machines in quick succession.
This seems to work well!
However, what doesn't work currently is that the quorum guard on the old machines ends up failing: members that have been removed from the etcd cluster fail the health check. This means you now have multiple failing quorum guard pods.
The PDB blocks these from being removed.
To avoid this, we can remove the quorum guard pod from a machine that we know has been removed from the quorum.
This prevents the quorum guard pod from blocking the machine being drained and allows the cluster to remove the old machines.
If we remove the quorum guard pod and the node is marked unschedulable (because it's being drained), it won't be recreated.
I've done some manual testing of this and it seems like things are working now.
Would be interested to see what others think and whether there are edge cases I haven't considered, but I believe this should be safe to add.
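For anyone trying to picture the shape of the change, here is a rough sketch of the flow described above, assuming client-go and a plain pod delete; the function name, namespace, and label selector are illustrative guesses, not the actual code in this PR:

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeGuardPodIfSafe deletes the etcd quorum guard pod on the given node, but
// only once the node has been cordoned for drain, so we don't race with the
// operator recreating the pod on a still-schedulable node.
func removeGuardPodIfSafe(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !node.Spec.Unschedulable {
		// Not cordoned yet: the drain hasn't started, leave the guard pod alone.
		return nil
	}

	// Find guard pods on this node; the namespace and label selector are guesses.
	pods, err := client.CoreV1().Pods("openshift-etcd").List(ctx, metav1.ListOptions{
		LabelSelector: "app=guard",
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// A plain delete rather than an eviction: an eviction would be rejected while
		// the PDB has no disruptions left, which is exactly the deadlock being fixed.
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
		fmt.Printf("removed quorum guard pod %s/%s from drained node %s\n", pod.Namespace, pod.Name, nodeName)
	}
	return nil
}
```

The design point is that the pod is deleted directly rather than evicted, so the already-exhausted PDB never gets a say, and because the node is cordoned the guard pod won't simply be recreated there.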