Bug 1975296: Respect MaxUnhealthy limit for external remediation #902

Merged

Conversation

Member

@slintes slintes commented Aug 10, 2021

Replaces #898 because Zane is on PTO
Applied suggested refactoring, added unit test

From the original PR:
Conventional remediation consists of simply deleting the Machine object.
In consequence, it was safe to consider that any Machines that do not
need remediation, have a Node, and are not in the process of being
deleted, are 'healthy'.

However, external remediation takes place not by deleting a Machine but
by adding an annotation to it. While the Machine continues to exist (and
may be associated with a Node for part of the time), it will not be in a
working state throughout the remediation (generally because it is
being rebooted).

Because these Machines were considered 'healthy', additional Machines
could be remediated during this process in violation of the MaxUnhealthy
limit. If the process of acting on the external remediation annotation
was delayed, potentially the whole cluster could be remediated
simultaneously, thus taking it out of service.

To prevent this, treat Machines with the external remediation annotation
as unhealthy so that the MaxUnhealthy limit is respected.
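
A minimal sketch of the rule described above, not the actual controller code. The `machine` struct, the annotation key, and the helper names are all hypothetical stand-ins, and MaxUnhealthy is simplified to an absolute integer (an assumption; this description does not say how the limit is expressed).

```go
package main

import "fmt"

// Hypothetical annotation key, for illustration only; the real key is not
// quoted in this PR description.
const externalRemediationAnnotation = "example.test/external-remediation"

// machine is a simplified stand-in for the Machine API object.
type machine struct {
	name             string
	annotations      map[string]string
	hasNode          bool
	beingDeleted     bool
	needsRemediation bool
}

// isHealthy treats a Machine as healthy only if it does not need
// remediation, has a Node, is not being deleted, and does not carry the
// external remediation annotation.
func isHealthy(m machine) bool {
	if _, annotated := m.annotations[externalRemediationAnnotation]; annotated {
		return false
	}
	return !m.needsRemediation && m.hasNode && !m.beingDeleted
}

// countHealthy returns how many of the given Machines are healthy.
func countHealthy(machines []machine) int {
	healthy := 0
	for _, m := range machines {
		if isHealthy(m) {
			healthy++
		}
	}
	return healthy
}

// remediationAllowed reports whether the number of unhealthy Machines is
// within the limit. maxUnhealthy is an absolute count in this sketch.
func remediationAllowed(machines []machine, maxUnhealthy int) bool {
	unhealthy := len(machines) - countHealthy(machines)
	return unhealthy <= maxUnhealthy
}

func main() {
	machines := []machine{
		// Already under external remediation: still exists and has a Node.
		{name: "worker-0", hasNode: true,
			annotations: map[string]string{externalRemediationAnnotation: ""}},
		// Newly failing health checks.
		{name: "worker-1", hasNode: true, needsRemediation: true},
		// Working normally.
		{name: "worker-2", hasNode: true},
	}
	fmt.Println("healthy:", countHealthy(machines))
	fmt.Println("remediation allowed:", remediationAllowed(machines, 1))
}
```

With a limit of 1, `worker-0` now counts as unhealthy alongside `worker-1`, so further remediation is blocked. If the annotation were ignored, `worker-0` would count as healthy and `worker-1` could be remediated while `worker-0` is still rebooting.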

Note that when a RemediationTemplate (as added in
338eab5) is provided, it will not be
taken into account in determining whether a Machine is healthy (unless
it also results in the external remediation annotation being applied to
the Machine), so the same issue still exists in that case.

zaneb and others added 3 commits August 5, 2021 13:25
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

@slintes: An error was encountered searching for bug 1975296 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. response code 503 not 200

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

Bug 1975296: Respect MaxUnhealthy limit for external remediation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@JoelSpeed JoelSpeed left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-vsphere-upgrade | 9b85699 | link | /test e2e-vsphere-upgrade |
| ci/prow/e2e-vsphere | 9b85699 | link | /test e2e-vsphere |
| ci/prow/e2e-libvirt | 9b85699 | link | /test e2e-libvirt |
| ci/prow/e2e-gcp-operator | 9b85699 | link | /test e2e-gcp-operator |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 9b85699 | link | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-disruptive | 9b85699 | link | /test e2e-aws-disruptive |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

@elmiko elmiko left a comment

/approve

@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2021
@slintes
Member Author

slintes commented Aug 10, 2021

/retest-required
/bugzilla refresh

@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

@slintes: This pull request references Bugzilla bug 1975296, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/retest-required
/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 10, 2021
@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-ci openshift-ci bot merged commit abd3c0e into openshift:master Aug 10, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

@slintes: An error was encountered searching for external tracker bugs for bug 1975296 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. could not unmarshal response body: invalid character '<' looking for beginning of value

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

Bug 1975296: Respect MaxUnhealthy limit for external remediation

@slintes
Member Author

slintes commented Sep 1, 2021

/cherry-pick release-4.8

@openshift-cherrypick-robot

@slintes: new pull request created: #910

In response to this:

/cherry-pick release-4.8

@slintes slintes deleted the external-remdiation-max-unhealthy branch May 2, 2023 17:17
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
6 participants