Bug 1840358: etcd-quorum-guard remove toleration timeouts #426

Conversation

michaelgugino
Contributor

This commit removes timeouts for NoExecute
and NoSchedule tolerations. This results in the
pods not being marked deleted and rescheduled in
case of a kubelet going unreachable.

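The change itself is small: the tolerations in the etcd-quorum-guard deployment manifest drop their timeout fields, so the pods tolerate an unreachable node indefinitely instead of being evicted and rescheduled after the timeout expires. A hedged sketch of what the resulting toleration block might look like (the taint keys and the removed timeout value are illustrative assumptions, not copied from the actual etcdquorumguard_deployment.yaml):

```yaml
# Illustrative sketch only -- taint keys and the removed timeout value
# are assumptions, not taken from the real manifest in this PR.
tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  # tolerationSeconds: 120   <-- removed by this PR: without a timeout,
  # the pod is never marked deleted when its node becomes unreachable
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
```

With no `tolerationSeconds`, the quorum-guard pod stays bound to the unreachable node, so the pod disruption budget keeps counting it and blocks further master drains.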
@michaelgugino michaelgugino changed the title etcd-quorum-guard remove toleration timeouts Bug 1840358: etcd-quorum-guard remove toleration timeouts Aug 21, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Aug 21, 2020
@openshift-ci-robot

@michaelgugino: This pull request references Bugzilla bug 1840358, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1840358: etcd-quorum-guard remove toleration timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 21, 2020
@hexfusion
Contributor

/retest

@michaelgugino
Contributor Author

/retest

@hexfusion
Contributor

/test e2e-gcp-upgrade

@hexfusion
Contributor

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2020
@hexfusion
Contributor

based on my review of BZ 1840358[1] and a conversation I had with @michaelgugino I think this change is warranted. Ideally, we would test this in origin e2e. @michaelgugino does your team have any e2e suite that we could append this to WRT the premise.

Possible to delete 2 masters simultaneously if kubelet unreachable

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1840358

@hexfusion
Contributor

/retest

@JoelSpeed
Contributor

If I've understood correctly, if a kubelet goes unreachable, the etcd quorum guard will now block the drain indefinitely? Requiring manual intervention?

@michaelgugino
Contributor Author

based on my review of BZ 1840358[1] and a conversation I had with @michaelgugino I think this change is warranted. Ideally, we would test this in origin e2e. @michaelgugino does your team have any e2e suite that we could append this to WRT the premise.

Possible to delete 2 masters simultaneously if kubelet unreachable

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1840358

@hexfusion We don't have any tests that disrupt masters, or at least the cloud team doesn't. I'm not sure how we could test this in the e2e suite; there's not a great mechanism for making 2 masters go unreachable.

@hexfusion
Contributor

@michaelgugino I guess what I am saying is what is the exposure to regression here? Since we do not test it I am just trying to wrap my head around how we validate it works.

@michaelgugino
Contributor Author

@michaelgugino I guess what I am saying is what is the exposure to regression here? Since we do not test it I am just trying to wrap my head around how we validate it works.

I think the exposure is somewhat small. In the medium term, we want to put control plane hosts into machinesets, and once that happens, we can do some more sophisticated tests as we'll get replacements. If we regress this behavior after this patch merges, then we're right back where we are today. The impact should be small: this is a mechanism to prevent someone from doing something they really shouldn't do anyway, and it will be called out in the docs.

Validating it by hand is easy. Hop onto two masters, stop the kubelet, wait for them to go unready, and delete each corresponding master machine object. Zero should be successfully deleted. Today, both will be deleted, and that's not what we want.
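The manual validation above can be sketched as a command sequence. This is a hypothetical illustration, not a script from the PR: the node and machine names are placeholders, and it assumes the masters are backed by Machine objects in the usual openshift-machine-api namespace.

```shell
# Hypothetical sketch of the manual validation; "master-0"/"master-1"
# are placeholders for your cluster's actual node and machine names.

# Stop the kubelet on two masters so they go unreachable.
oc debug node/master-0 -- chroot /host systemctl stop kubelet
oc debug node/master-1 -- chroot /host systemctl stop kubelet

# Watch until both nodes report NotReady.
oc get nodes -l node-role.kubernetes.io/master -w

# Try to delete the corresponding Machine objects. With this patch,
# neither deletion should complete, because the quorum-guard pods no
# longer time out their tolerations and keep blocking the drain.
oc delete machine -n openshift-machine-api master-0 master-1
```

These commands require a live cluster with cluster-admin access, so they are not something an e2e suite can easily exercise, which matches the discussion above.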

@hexfusion
Contributor

based on #426 (comment) I think it makes sense to move this forward.
/lgtm

@hexfusion
Contributor

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, michaelgugino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Contributor

/refresh

@hexfusion
Contributor

/skip

@hexfusion
Contributor

/test all

@openshift-ci-robot

@michaelgugino: The following tests failed, say /retest to rerun all failed tests:

Test name                    Commit   Rerun command
ci/prow/e2e-aws-disruptive   86dd146  /test e2e-aws-disruptive
ci/prow/e2e-disruptive       86dd146  /test e2e-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 0e59345 into openshift:master Sep 19, 2020
@openshift-ci-robot

@michaelgugino: All pull requests linked via external trackers have merged:

Bugzilla bug 1840358 has been moved to the MODIFIED state.

In response to this:

Bug 1840358: etcd-quorum-guard remove toleration timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion
Contributor

/cherry-pick release-4.5

@openshift-cherrypick-robot

@hexfusion: #426 failed to apply on top of branch "release-4.5":

Applying: etcd-quorum-guard remove toleration timeouts
Using index info to reconstruct a base tree...
A	manifests/0000_12_etcd-operator_08_etcdquorumguard_deployment.yaml
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): manifests/0000_12_etcd-operator_08_etcdquorumguard_deployment.yaml deleted in HEAD and modified in etcd-quorum-guard remove toleration timeouts. Version etcd-quorum-guard remove toleration timeouts of manifests/0000_12_etcd-operator_08_etcdquorumguard_deployment.yaml left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 etcd-quorum-guard remove toleration timeouts
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


7 participants