Remove quorum guard pod after the pod is removed from the cluster #923
Conversation
/assign @tjungblu
@JoelSpeed Thanks a lot for this nice work. Is it possible to add test cases though?
I've got an open discussion topic on the control plane arch call today; I want to reach the best conclusion there and also test this with and without whatever the outcome of that conversation is. I'll add tests once we conclude this is definitely the way to go. I'll mark as WIP for now.
The conclusion from the arch call is that there are a few things we should do here: firstly, only drain one control plane machine at a time, but also, teach our operators to be smarter about this. Having tested "one drain only" without this patch, it isn't enough to get things working, so I think we need this patch as part of the minimum viable solution. The only caveat is that there seems to be a bit of a race: when the Machine API hasn't yet cordoned the node, the etcd operator is both creating and deleting the guard pod. I could update the logic to only remove the guard pod if the current node is cordoned, which I think will solve the issue. So I'll fix that up and add tests tomorrow; let me know if you have any other questions in the meantime.
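For concreteness, here is a minimal sketch of the cordon check described above, assuming client-go access to the node; the helper name and wiring are hypothetical and not the code in this PR:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// nodeIsCordoned reports whether the node backing a machine pending deletion has
// been marked unschedulable by the drain controller. Gating the guard pod removal
// on this avoids the race where the operator is recreating the pod on one side
// while deleting it on the other.
func nodeIsCordoned(ctx context.Context, nodes corev1client.NodeInterface, nodeName string) (bool, error) {
	node, err := nodes.Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return node.Spec.Unschedulable, nil
}
```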
```go
@@ -83,6 +90,13 @@ func (c *machineDeletionHooksController) sync(ctx context.Context, syncCtx facto
	if err := c.attemptToDeleteMachineDeletionHook(ctx, syncCtx.Recorder()); err != nil {
		errs = append(errs, err)
	}

	// attempt to remove quorum guard pods from machines that are pending deletion and haven't got a deletion hook.
	// This prevents a deadlock when multiple machines are pending deletion simultaneously.
```
Just a little more detail on the deadlock so it's more apparent why we remove the guard pod.
Suggested change:
- // This prevents a deadlock when multiple machines are pending deletion simultaneously.
+ // This prevents a deadlock when multiple machines are pending deletion simultaneously, but the nodes cannot be drained
+ // because the guard pod on each node is unready (due to non-member etcd pod) and violates the PDB.
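To make the deadlock concrete, here is a toy illustration (not code from this PR) of how the disruption budget dries up once guard pods go unready; the pod counts below assume a mid-replacement cluster with six guard pods and a PDB that tolerates a single disruption:

```go
package main

import "fmt"

// disruptionsAllowed mirrors the PodDisruptionBudget status calculation:
// evictions are only permitted while currentHealthy exceeds desiredHealthy.
func disruptionsAllowed(currentHealthy, desiredHealthy int) int {
	if d := currentHealthy - desiredHealthy; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Six guard pods mid-replacement and a budget of one disruption means
	// desiredHealthy is 5. The three old nodes' guard pods are unready because
	// their etcd members are already gone, so currentHealthy is 3.
	fmt.Println(disruptionsAllowed(3, 5)) // 0 -> no guard pod can be evicted, every drain stalls
}
```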
@JoelSpeed Thanks for putting up the PR.
@hasbro17 I've updated that comment and added some unit tests for this code, let me know what you think.
@JoelSpeed Great work, especially the testing 👍🏽 🎉 Let's wait for the CI. I would be interested to test this manually; are there any docs / guides?
It's not super easy to test right now as I've been testing this alongside some other WIP PRs. But what you could do is create a cluster from this branch, then manually create three additional control plane machines, then delete the original three control plane machines. This would trigger the etcd operator to start the migration and eventually MAPI to start draining. You may see that things get stuck (this is where the other WIP comes in), but what you should see is that the etcd quorum guard pod is removed correctly for each of the deleted machines. Once I update the drain controller in MAPI to only drain one control plane machine at a time, the whole process works and things resolve themselves correctly, but that's just on a branch right now and I haven't written unit tests or had review of that.
/retest-required
Yep, great tests, the rest also lgtm. It would be great if you could add a couple more log statements to make debugging a bit easier later on. Thanks for your help :)
Tests look good, thanks for adding them so fast 👍 Deferring lgtm to @tjungblu
/retest-required
@tjungblu Added some additional log lines at level 4, let me know what you think.
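(For readers unfamiliar with the verbosity levels: level-4 lines are only emitted when the operator runs with `--v=4` or higher. A trivial, made-up example of the style, not the actual lines added in this PR:)

```go
package main

import "k8s.io/klog/v2"

func main() {
	machineName, podName := "master-0", "etcd-guard-master-0"
	// Hidden at the default verbosity; shown when the operator runs with --v=4 or above.
	klog.V(4).Infof("machine %q is pending deletion, removing quorum guard pod %q", machineName, podName)
}
```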
looks great!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hasbro17, JoelSpeed, tjungblu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required
/retest
/retest-required
Still getting alerts about OVN, not sure if that's a persistent failure or not.
A different flake this time.
/retest-required
1 similar comment
/retest-required
/retest-required
1 similar comment
/retest-required
/override ci/prow/configmap-scale
@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/configmap-scale. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest-required
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest-required
Opening this to start a discussion.
I've been working on the ControlPlaneMachineSet and trying to automate replacement of the Control Plane Machines using our on-delete strategy.
One of the things we expect to be able to do is to create multiple replacement machines simultaneously (when a user deletes multiple machines simultaneously) and have the etcd operator handle moving etcd onto the new machines in quick succession.
This seems to work well!
However, what doesn't work currently is that the quorum guard on the old machines ends up failing: members that have been removed from the etcd cluster fail the health check. This means you now have multiple failing quorum guard pods.
The PDB blocks these from being removed.
To avoid this, we can remove the quorum guard pod from a machine that we know has been removed from the quorum.
This prevents the quorum guard pod from blocking the machine being drained and allows the cluster to remove the old machines.
If we remove the quorum guard pod and the node is marked unschedulable (because it's being drained), it won't be recreated.
I've done some manual testing of this and it seems like things are working now.
Would be interested to see what others think and whether there are edge cases I haven't considered, but I believe this should be safe to add.
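For anyone trying to picture the shape of the change, here is a rough sketch of the flow described above, assuming client-go and a plain pod delete; the function name, namespace, and label selector are illustrative guesses, not the actual code in this PR:

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeGuardPodIfSafe deletes the etcd quorum guard pod on the given node, but
// only once the node has been cordoned for drain, so we don't race with the
// operator recreating the pod on a still-schedulable node.
func removeGuardPodIfSafe(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !node.Spec.Unschedulable {
		// Not cordoned yet: the drain hasn't started, leave the guard pod alone.
		return nil
	}

	// Find guard pods on this node; the namespace and label selector are guesses.
	pods, err := client.CoreV1().Pods("openshift-etcd").List(ctx, metav1.ListOptions{
		LabelSelector: "app=guard",
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// A plain delete rather than an eviction: an eviction would be rejected while
		// the PDB has no disruptions left, which is exactly the deadlock being fixed.
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
		fmt.Printf("removed quorum guard pod %s/%s from drained node %s\n", pod.Namespace, pod.Name, nodeName)
	}
	return nil
}
```

The design point is that the pod is deleted directly rather than evicted, so the already-exhausted PDB never gets a say, and because the node is cordoned the guard pod won't simply be recreated there.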