Bug 1840358: etcd-quorum-guard remove toleration timeouts #426

Conversation

michaelgugino
Contributor

This commit removes timeouts for NoExecute
and NoSchedule tolerations. This results in the
pods not being marked deleted and rescheduled in
case of a kubelet going unreachable.

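The change itself is small: the tolerations in the etcd-quorum-guard deployment manifest drop their timeout fields, so the pods tolerate an unreachable node indefinitely instead of being evicted and rescheduled after the timeout expires. A hedged sketch of what the resulting toleration block might look like (the taint keys and the removed timeout value are illustrative assumptions, not copied from the actual etcdquorumguard_deployment.yaml):

```yaml
# Illustrative sketch only -- taint keys and the removed timeout value
# are assumptions, not taken from the real manifest in this PR.
tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  # tolerationSeconds: 120   <-- removed by this PR: without a timeout,
  # the pod is never marked deleted when its node becomes unreachable
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
```

With no `tolerationSeconds`, the quorum-guard pod stays bound to the unreachable node, so the pod disruption budget keeps counting it and blocks further master drains.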
@michaelgugino michaelgugino changed the title etcd-quorum-guard remove toleration timeouts Bug 1840358: etcd-quorum-guard remove toleration timeouts Aug 21, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Aug 21, 2020
@openshift-ci-robot

@michaelgugino: This pull request references Bugzilla bug 1840358, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1840358: etcd-quorum-guard remove toleration timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 21, 2020
@hexfusion
Contributor

/retest

@michaelgugino
Contributor Author

/retest

@hexfusion
Contributor

/test e2e-gcp-upgrade

@hexfusion
Contributor

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2020
@hexfusion
Contributor

based on my review of BZ 1840358[1] and a conversation I had with @michaelgugino I think this change is warranted. Ideally, we would test this in origin e2e. @michaelgugino does your team have any e2e suite that we could append this to WRT the premise.

Possible to delete 2 masters simultaneously if kubelet unreachable

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1840358

@hexfusion
Contributor

/retest

@JoelSpeed
Contributor

If I've understood correctly, if a kubelet goes unreachable, the etcd quorum guard will now block the drain indefinitely? Requiring manual intervention?

@michaelgugino
Contributor Author

based on my review of BZ 1840358[1] and a conversation I had with @michaelgugino I think this change is warranted. Ideally, we would test this in origin e2e. @michaelgugino does your team have any e2e suite that we could append this to WRT the premise.

Possible to delete 2 masters simultaneously if kubelet unreachable

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1840358

@hexfusion We don't have any tests that disrupt masters, or at least the cloud team doesn't. I'm not sure how we could test this in the e2e suite; there's not a great mechanism for making 2 masters go unreachable.

@hexfusion
Contributor

@michaelgugino I guess what I am saying is what is the exposure to regression here? Since we do not test it I am just trying to wrap my head around how we validate it works.

@michaelgugino
Contributor Author

@michaelgugino I guess what I am saying is what is the exposure to regression here? Since we do not test it I am just trying to wrap my head around how we validate it works.

I think the exposure is somewhat small. In the medium term, we want to put control plane hosts into machinesets, and once that happens, we can do some more sophisticated tests as we'll get replacements. If we regress this behavior after this patch merges, then we're right back where we are today. The impact should be small: this is a mechanism to prevent someone from doing something they really shouldn't do anyway, and it will be called out in the docs.

Validating it by hand is easy. Hop onto two masters, stop the kubelet, wait for them to go unready, and delete each corresponding master machine object. Zero should be successfully deleted. Today, both will be deleted, and that's not what we want.
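The manual validation above can be sketched as a command sequence. This is a hypothetical illustration, not a script from the PR: the node and machine names are placeholders, and it assumes the masters are backed by Machine objects in the usual openshift-machine-api namespace.

```shell
# Hypothetical sketch of the manual validation; "master-0"/"master-1"
# are placeholders for your cluster's actual node and machine names.

# Stop the kubelet on two masters so they go unreachable.
oc debug node/master-0 -- chroot /host systemctl stop kubelet
oc debug node/master-1 -- chroot /host systemctl stop kubelet

# Watch until both nodes report NotReady.
oc get nodes -l node-role.kubernetes.io/master -w

# Try to delete the corresponding Machine objects. With this patch,
# neither deletion should complete, because the quorum-guard pods no
# longer time out their tolerations and keep blocking the drain.
oc delete machine -n openshift-machine-api master-0 master-1
```

These commands require a live cluster with cluster-admin access, so they are not something an e2e suite can easily exercise, which matches the discussion above.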

@hexfusion
Contributor

based on #426 (comment) I think it makes sense to move this forward.
/lgtm

@hexfusion
Contributor

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, michaelgugino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Contributor

/refresh

@hexfusion
Contributor

/skip

@hexfusion
Contributor

/test all

@openshift-ci-robot

@michaelgugino: The following tests failed, say /retest to rerun all failed tests:

Test name                    Commit   Rerun command
ci/prow/e2e-aws-disruptive   86dd146  /test e2e-aws-disruptive
ci/prow/e2e-disruptive       86dd146  /test e2e-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 0e59345 into openshift:master Sep 19, 2020
@openshift-ci-robot

@michaelgugino: All pull requests linked via external trackers have merged:

Bugzilla bug 1840358 has been moved to the MODIFIED state.

In response to this:

Bug 1840358: etcd-quorum-guard remove toleration timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion
Contributor

/cherry-pick release-4.5

@openshift-cherrypick-robot

@hexfusion: #426 failed to apply on top of branch "release-4.5":

Applying: etcd-quorum-guard remove toleration timeouts
Using index info to reconstruct a base tree...
A	manifests/0000_12_etcd-operator_08_etcdquorumguard_deployment.yaml
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): manifests/0000_12_etcd-operator_08_etcdquorumguard_deployment.yaml deleted in HEAD and modified in etcd-quorum-guard remove toleration timeouts. Version etcd-quorum-guard remove toleration timeouts of manifests/0000_12_etcd-operator_08_etcdquorumguard_deployment.yaml left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 etcd-quorum-guard remove toleration timeouts
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


7 participants