Bug 1840358: etcd-quorum-guard remove toleration timeouts #426
This commit removes the timeouts from the NoExecute and NoSchedule tolerations. As a result, when a kubelet becomes unreachable, the pods are neither marked for deletion nor rescheduled.
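To illustrate the description above, here is a minimal sketch of how dropping the timeout changes eviction behavior. The manifest diff is not shown in this conversation, so the taint keys and the 120-second value below are assumptions, and `seconds_until_eviction` is a hypothetical helper that models only a simplified form of toleration matching (exact key and effect), not the full Kubernetes rules.

```python
import math

# Hypothetical tolerations in the shape of a pod spec's `tolerations` list.
# The keys and the 120-second timeout are assumptions for illustration only.
BEFORE = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 120},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 120},
]

# Same tolerations with the timeout removed: the taint is tolerated forever.
AFTER = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute"},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute"},
]


def seconds_until_eviction(tolerations, taint_key, taint_effect="NoExecute"):
    """Simplified model: how long a pod survives a taint on its node.

    Returns 0 if no toleration matches (evicted immediately), the
    tolerationSeconds value if one is set, or math.inf if the matching
    toleration has no timeout.
    """
    for toleration in tolerations:
        if (toleration.get("key") == taint_key
                and toleration.get("effect") == taint_effect):
            return toleration.get("tolerationSeconds", math.inf)
    return 0


if __name__ == "__main__":
    key = "node.kubernetes.io/unreachable"
    print("before:", seconds_until_eviction(BEFORE, key))
    print("after: ", seconds_until_eviction(AFTER, key))
```

Under this model, a pod with the timed toleration is evicted once the timeout elapses, while a pod with the untimed toleration stays bound to the unreachable node until someone intervenes.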
@michaelgugino: This pull request references Bugzilla bug 1840358, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
/retest |
/test e2e-gcp-upgrade |
/approve |
Based on my review of BZ 1840358[1] and a conversation I had with @michaelgugino, I think this change is warranted. Ideally, we would test this in origin e2e. @michaelgugino, does your team have an e2e suite where we could add a test for this premise? |
/retest |
If I've understood correctly, when a kubelet goes unreachable, the etcd quorum guard will now block the drain indefinitely, requiring manual intervention? |
@hexfusion We don't have any tests that disrupt masters, or at least the cloud team doesn't. I'm not sure how we could test this in the e2e suite. It would require having 2 masters go unreachable, and there's not a great mechanism to do that. |
@michaelgugino I guess what I am saying is what is the exposure to regression here? Since we do not test it I am just trying to wrap my head around how we validate it works. |
I think the exposure is somewhat small. In the medium term, we want to put control plane hosts into machinesets, and once that happens, we can do some more sophisticated tests, as we'll get replacements. If we regress this behavior after this patch merges, then we're right back where we are today. The impact should be small: this is a mechanism to prevent someone from doing something they really shouldn't do anyway, and it will be called out in the docs. Validating it by hand is easy. Hop onto two masters, stop the kubelet on each, wait for the nodes to go unready, then delete each corresponding master machine object. Zero should be successfully deleted. Today, both will be deleted, and that's not what we want. |
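The manual validation described in the comment above could be written down as a command plan along these lines. This is only a sketch that builds the command strings without running them. The `core` SSH user, the `openshift-machine-api` namespace, and the node and machine names are assumptions that are not stated in this conversation, and `validation_plan` is a hypothetical helper.

```python
def validation_plan(masters, namespace="openshift-machine-api"):
    """Build the manual validation steps as shell command strings.

    `masters` is a list of (node_name, machine_name) pairs for the two
    masters to disrupt. Nothing is executed here; the caller decides
    whether and how to run each command.
    """
    steps = []
    # 1. Stop the kubelet on each chosen master.
    for node, _machine in masters:
        steps.append(f"ssh core@{node} sudo systemctl stop kubelet")
    # 2. Check node status until the stopped masters report NotReady.
    steps.append("oc get nodes")
    # 3. Request deletion of each corresponding machine object.
    for _node, machine in masters:
        steps.append(f"oc delete machine {machine} -n {namespace} --wait=false")
    # 4. List machines: with this change, both should still be present.
    steps.append(f"oc get machines -n {namespace}")
    return steps


if __name__ == "__main__":
    # Hypothetical node and machine names.
    for command in validation_plan([("master-0", "cluster-master-0"),
                                    ("master-1", "cluster-master-1")]):
        print(command)
```

The expected result, per the comment, is that neither machine finishes deleting once the change is in place.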
Based on #426 (comment), I think it makes sense to move this forward. |
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hexfusion, michaelgugino The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/refresh |
/skip |
/test all |
@michaelgugino: The following tests failed, say
/retest Please review the full test history for this PR and help us cut down flakes. |
@michaelgugino: All pull requests linked via external trackers have merged: Bugzilla bug 1840358 has been moved to the MODIFIED state. In response to this:
/cherry-pick release-4.5 |
@hexfusion: #426 failed to apply on top of branch "release-4.5":