Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement explicit reboot mode options #795

Merged
merged 3 commits into from Feb 25, 2021
Merged

Implement explicit reboot mode options #795

merged 3 commits into from Feb 25, 2021

Conversation

rdoxenham
Copy link
Contributor

@rdoxenham rdoxenham commented Feb 15, 2021

The default reboot-interface behaviour is to attempt a soft power
off, and if this fails, revert to a hard power off (PR #294). For high
availability use-cases we require the ability to immediately power-off
a node. This PR attempts to address that requirement and is a WIP given
that other changes are going to be required to the provisioner API and
the CAPBM actuator to actually enact these changes.

Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
And design doc: metal3-io/metal3-docs#164

@metal3-io-bot metal3-io-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 15, 2021
@metal3-io-bot
Copy link
Contributor

Hi @rdoxenham. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 15, 2021
@rdoxenham rdoxenham changed the title Add isHardReboot() code to check if hard reboot is required [WIP] Add isHardReboot() code to check if hard reboot is required Feb 15, 2021
@metal3-io-bot metal3-io-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 15, 2021
Copy link
Member

@dhellmann dhellmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start. I have a few comments inline, and there's the TODO item for making a public hard power off API in the provisioner.

We should also make sure metal3-io/metal3-docs#164 is fully up to date with the new API and that it's approved before this PR.

controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
@metal3-io-bot metal3-io-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 15, 2021
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Feb 15, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Feb 16, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Feb 16, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Feb 16, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
pkg/provisioner/provisioner.go Outdated Show resolved Hide resolved
@rdoxenham rdoxenham changed the title [WIP] Add isHardReboot() code to check if hard reboot is required Add isHardReboot() code to check if hard reboot is required Feb 16, 2021
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 16, 2021
@dhellmann
Copy link
Member

This looks really close. The commit messages don't match the code (there's no isHardReboot now) so fixing that up should make this ready to go, as soon as the design is approved.

rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Feb 17, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
@dhellmann
Copy link
Member

/test-integration

@dhellmann
Copy link
Member

/ok-to-test

@metal3-io-bot metal3-io-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 17, 2021
@dhellmann
Copy link
Member

This looks good, let's see what the tests say.

I don't see this annotation documented in this repo anywhere at all, so I won't hold up the code change over adding documentation. Maybe we can add that in another PR.

In this commit we add further integration for the RebootMode type
and no longer rely on a boolean for understanding whether the reboot
request was for a hardPowerOff() or softPowerOff(). This will allow
us to expand the modes we support later down the line if required
without any significant modifications required to the provisioner
API.
@dhellmann
Copy link
Member

/retitle support hard reboot mode via annotation

@metal3-io-bot metal3-io-bot changed the title Add isHardReboot() code to check if hard reboot is required support hard reboot mode via annotation Feb 17, 2021
@dhellmann
Copy link
Member

/test-integration

@dhellmann
Copy link
Member

/retitle Implement explicit reboot mode options

That should make the title match the design doc PR in metal3-io/metal3-docs#164 a little better.

@metal3-io-bot metal3-io-bot changed the title support hard reboot mode via annotation Implement explicit reboot mode options Feb 17, 2021
@dhellmann
Copy link
Member

I'm happy with this version, so after the design doc merges next week if there are no other negative comments on this PR it should be able to merge quickly.

/approve

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, rdoxenham

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2021
@hardys
Copy link
Member

hardys commented Feb 25, 2021

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2021
@metal3-io-bot metal3-io-bot merged commit ccbab4b into metal3-io:master Feb 25, 2021
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 1, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 1, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 1, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 3, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 3, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 4, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 4, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 4, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 8, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-api-provider-baremetal that referenced this pull request Mar 9, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
rdoxenham added a commit to rdoxenham/cluster-api-provider-baremetal that referenced this pull request Mar 9, 2021
This change adds an additional mode to the reboot annotation that
forces all nodes sent for remediation, e.g. via a MachineHealthCheck,
to be forcefully rebooted rather than defaulting to a soft reboot
first, as it is today. The primary drive behind this change is to
enable quicker recovery of workloads, e.g. for high-availability
use cases, and by defaulting to forced hard reboot we can enable
functionality very close to fencing. This change shouldn't impact
any other non-remediation reboot requests, as the hard reboot
functionality only takes place when the mode=hard annotation is
applied to the node.

All of the work on the BMO can be found in the link below. Whilst
we depend on this PR to have a complete solution, we don't have a
hard dependency on them merging together.

BMO PR: metal3-io/baremetal-operator#795
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants