Bug 1975296: Respect MaxUnhealthy limit for external remediation #902
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@slintes: An error was encountered searching for bug 1975296 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.
Full error message: response code 503 not 200
Please contact an administrator to resolve this issue, then request a bug refresh.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
@slintes: The following tests failed, say
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: elmiko
/retest-required
@slintes: This pull request references Bugzilla bug 1975296, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact:
/retest-required
Please review the full test history for this PR and help us cut down flakes.
@slintes: An error was encountered searching for external tracker bugs for bug 1975296 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.
Full error message: could not unmarshal response body: invalid character '<' looking for beginning of value
Please contact an administrator to resolve this issue, then request a bug refresh.
/cherry-pick release-4.8
@slintes: new pull request created: #910
Replaces #898 because Zane is on PTO
Applied suggested refactoring, added unit test
From the original PR:
Conventional remediation consists of simply deleting the Machine object. Consequently, it was safe to consider any Machine that does not need remediation, has a Node, and is not in the process of being deleted to be 'healthy'.

However, external remediation takes place not by deleting a Machine but by adding an annotation to it. While the Machine continues to exist (and may be associated with a Node for part of the time), it will not be in a working state throughout the remediation (generally because it is being rebooted).

Because these Machines were considered 'healthy', additional Machines could be remediated during this process in violation of the MaxUnhealthy limit. If the process of acting on the external remediation annotation was delayed, potentially the whole cluster could be remediated simultaneously, thus taking it out of service.

To prevent this, treat Machines with the external remediation annotation as unhealthy so that the MaxUnhealthy limit is respected.

Note that when a RemediationTemplate (as added in 338eab5) is provided, it will not be taken into account in determining whether a Machine is healthy (unless it also results in the external remediation annotation being applied to the Machine), so the same issue still exists in that case.
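
For readers skimming the thread, the health accounting described above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the `Machine` struct, the function names, and the plain integer `maxUnhealthy` are hypothetical stand-ins, and the annotation key is an assumption that should be checked against the operator source.

```go
package main

import (
	"fmt"
	"time"
)

// externalRemediationAnnotation is an assumed key, used here for illustration
// only; check the operator source for the key it actually uses.
const externalRemediationAnnotation = "host.metal3.io/external-remediation"

// Machine is a simplified, hypothetical stand-in for the real Machine type.
type Machine struct {
	Name              string
	Annotations       map[string]string
	NodeName          string     // empty when no Node is associated
	DeletionTimestamp *time.Time // non-nil while deletion is in progress
	NeedsRemediation  bool       // outcome of the health check conditions
}

// isHealthy reports whether m should count toward the healthy total.
func isHealthy(m Machine) bool {
	if m.NeedsRemediation || m.NodeName == "" || m.DeletionTimestamp != nil {
		return false
	}
	// The change described above: a Machine under external remediation still
	// exists, but counts as unhealthy while the annotation is present.
	_, underRemediation := m.Annotations[externalRemediationAnnotation]
	return !underRemediation
}

// countHealthy returns the number of Machines considered healthy.
func countHealthy(machines []Machine) int {
	n := 0
	for _, m := range machines {
		if isHealthy(m) {
			n++
		}
	}
	return n
}

// remediationAllowed reports whether the current number of unhealthy Machines
// is within maxUnhealthy (simplified here to an absolute count).
func remediationAllowed(machines []Machine, maxUnhealthy int) bool {
	unhealthy := len(machines) - countHealthy(machines)
	return unhealthy <= maxUnhealthy
}

func main() {
	machines := []Machine{
		{Name: "a", NodeName: "node-a"},
		{Name: "b", NodeName: "node-b", Annotations: map[string]string{externalRemediationAnnotation: ""}},
		{Name: "c", NodeName: "node-c", NeedsRemediation: true},
	}
	fmt.Println("healthy:", countHealthy(machines))
	fmt.Println("remediation allowed:", remediationAllowed(machines, 1))
}
```

In this example, Machine `b` is already under external remediation, so with a limit of 1 the newly failing Machine `c` is not remediated. Without the annotation check, `b` would count as healthy and `c` would be remediated while `b` is still rebooting.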