Bug 1975296: Respect MaxUnhealthy limit for external remediation #902

slintes · 2021-08-10T10:11:48Z

Replaces #898 because Zane is on PTO
Applied suggested refactoring, added unit test

From the original PR:
Conventional remediation consists of simply deleting the Machine object.
In consequence, it was safe to consider that any Machines that do not
need remediation, have a Node, and are not in the process of being
deleted, are 'healthy'.

However, external remediation takes place not by deleting a Machine but
by adding an annotation to it. While the Machine continues to exist (and
may be associated with a Node for part of the time), it will not be in a
working state throughout the remediation (generally because it is being
rebooted).

Because these Machines were considered 'healthy', additional Machines
could be remediated during this process in violation of the MaxUnhealthy
limit. If the process of acting on the external remediation annotation
was delayed, potentially the whole cluster could be remediated
simultaneously, thus taking it out of service.

To prevent this, treat Machines with the external remediation annotation
as unhealthy so that the MaxUnhealthy limit is respected.

Note that when a RemediationTemplate (as added in 338eab5) is provided,
it will *not* be taken into account in determining whether a Machine is
healthy (unless it also results in the external remediation annotation
being applied to the Machine), so the same issue still exists in that
case.
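
The rule described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from this PR: the `machine` struct, the helper names, and the annotation key `example.com/external-remediation` are all placeholders invented for the example, and the `unhealthy <= maxUnhealthy` comparison is an assumption about how the limit is applied.

```go
package main

import "fmt"

// externalRemediationAnnotation is a placeholder key for this sketch;
// it is NOT claimed to be the annotation the operator actually uses.
const externalRemediationAnnotation = "example.com/external-remediation"

// machine is a simplified, hypothetical stand-in for a Machine object.
type machine struct {
	Name             string
	Annotations      map[string]string
	HasNode          bool
	BeingDeleted     bool
	NeedsRemediation bool
}

// isHealthy applies the conventional checks (no remediation needed, has a
// Node, not being deleted) and additionally treats any Machine carrying the
// external remediation annotation as unhealthy.
func isHealthy(m machine) bool {
	if m.NeedsRemediation || !m.HasNode || m.BeingDeleted {
		return false
	}
	_, underRemediation := m.Annotations[externalRemediationAnnotation]
	return !underRemediation
}

// countUnhealthy returns how many of the given Machines are not healthy.
func countUnhealthy(machines []machine) int {
	n := 0
	for _, m := range machines {
		if !isHealthy(m) {
			n++
		}
	}
	return n
}

// remediationAllowed reports whether the unhealthy count is within the
// limit (assumed here to be an absolute number, compared with <=).
func remediationAllowed(machines []machine, maxUnhealthy int) bool {
	return countUnhealthy(machines) <= maxUnhealthy
}

func main() {
	machines := []machine{
		// Still has a Node, but is in the middle of external remediation.
		{Name: "m1", HasNode: true, Annotations: map[string]string{externalRemediationAnnotation: ""}},
		// Newly detected as needing remediation.
		{Name: "m2", HasNode: true, NeedsRemediation: true},
		{Name: "m3", HasNode: true},
	}
	fmt.Printf("unhealthy=%d allowed=%v\n",
		countUnhealthy(machines), remediationAllowed(machines, 1))
	// prints "unhealthy=2 allowed=false"
}
```

Without the annotation check, `m1` would count as healthy, the unhealthy count would be 1, and remediating `m2` would be allowed even though `m1` is still out of service.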

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

openshift-ci · 2021-08-10T10:11:53Z

@slintes: An error was encountered searching for bug 1975296 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message:

response code 503 not 200

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

Bug 1975296: Respect MaxUnhealthy limit for external remediation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

JoelSpeed

/lgtm

openshift-ci · 2021-08-10T12:58:30Z

@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-vsphere-upgrade | `9b85699` | link | `/test e2e-vsphere-upgrade` |
| ci/prow/e2e-vsphere | `9b85699` | link | `/test e2e-vsphere` |
| ci/prow/e2e-libvirt | `9b85699` | link | `/test e2e-libvirt` |
| ci/prow/e2e-gcp-operator | `9b85699` | link | `/test e2e-gcp-operator` |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | `9b85699` | link | `/test e2e-metal-ipi-ovn-ipv6` |
| ci/prow/e2e-aws-disruptive | `9b85699` | link | `/test e2e-aws-disruptive` |

elmiko

/approve

openshift-ci · 2021-08-10T13:13:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/machinehealthcheck/OWNERS~~ [elmiko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

slintes · 2021-08-10T13:21:12Z

/retest-required
/bugzilla refresh

openshift-ci · 2021-08-10T13:21:18Z

@slintes: This pull request references Bugzilla bug 1975296, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/retest-required
/bugzilla refresh

openshift-bot · 2021-08-10T14:37:47Z

/retest-required