Default to use soft power off instead of hard power off #294
Conversation
Can one of the admins verify this patch?
Is this the call that is used for fencing? If so, should it remain a hard power off?
pkg/provisioner/ironic/ironic.go
Outdated
result, err = p.changePower(ironicNode, nodes.SoftPowerOff)
if err != nil {
	// Soft power off is not supported by the vendor driver; fall back to PowerOff()
	if strings.HasPrefix(err.Error(), "driver does not support target power state") {
Is there some way for us to detect whether the driver supports soft power off before we get here and try to use it? Could we store a setting in the Status section of the host, so the user knows what to expect when they ask for the host to be powered off?
I've checked the Ironic API; currently there is no API for retrieving supported power states. We could consider adding a function to bmcAccessDetail, say softPowerOffSupported(), as an alternative solution.
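As a rough illustration of that alternative, a capability flag on the BMC access-details type would let the controller decide up front instead of parsing Ironic error strings. This is only a sketch: the type and method names (`AccessDetails`, `softPowerOffSupported`) and the driver table are hypothetical, not the actual baremetal-operator API.

```go
package main

import "fmt"

// AccessDetails is a stand-in for the operator's BMC access-details type.
type AccessDetails struct {
	driver string
}

// softPowerOffSupported reports whether the vendor driver is known to
// implement the "soft power off" target state. The table below is an
// example only; real support depends on the Ironic driver in use.
func (a AccessDetails) softPowerOffSupported() bool {
	switch a.driver {
	case "ipmi", "redfish":
		return true
	default:
		return false
	}
}

func main() {
	for _, d := range []string{"ipmi", "fake"} {
		fmt.Printf("%s: %v\n", d, AccessDetails{driver: d}.softPowerOffSupported())
	}
}
```

With such a flag, the provisioner could skip the soft attempt entirely on known-unsupported drivers and record that in the host Status.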
A soft power off is the default, but if that fails we still yank the power.
@tiendc thank you for working on this!
ack, thanks
docs/api.md
Outdated
Value is one of the following:
* *<empty string>* -- Soft power off is not used on the node.
* *unsupported* -- Soft power off is not supported on the node.
* *triggered* -- Soft power off is triggered on the node but
Do we really need to track the soft power off status separately?
If we do, I think instead of reflecting it in a new status field, we should see if we can combine it with the provisioning status and make that a top level field on the status structure.
We would never have something in "provisioning" with a soft power off status of "triggered", for example, right?
@@ -1325,6 +1336,19 @@ func (p *ironicProvisioner) PowerOn() (result provisioner.Result, err error) {
func (p *ironicProvisioner) PowerOff() (result provisioner.Result, err error) {
	p.log.Info("ensuring host is powered off")

	// Try soft power off first; if it fails, perform a hard power off
	result, err = p.softPowerOff()
	if err != nil {
We only want to switch to the hard power off mode if we get the very specific 400 error. If we get a 409 we want to pause and try the soft power off again, for example.
I think we want to define a new error type so that we can convert the 400 error from line 1291 to our custom type, and then check for that type here instead of just checking against nil.
In my test environment, I have a Fujitsu BM server. It requires that a specific agent be installed in the OS to allow soft power off. So if the agent is not installed, any attempt at a soft power off will fail regardless of support from Ironic and the Fujitsu driver for Ironic. I think if we retry the action on failure, we should limit the number of retries, say 3 times. Do you have any suggestions?
The 409 error just means that ironic itself is too busy to handle our request (or more likely that it is already doing something with the host and cannot send multiple instructions). But your point about only retrying a few times makes a lot of sense, for other types of errors.
It seems I did something wrong with git, so the pull request contains a commit that is not mine. I will try to fix it.
It would be great if we could eliminate having to store state in the Host CR. It looks to me like this should be theoretically possible.
@zaneb @dhellmann
@dhellmann PTAL
@zaneb @dhellmann PTAL |
@zaneb @dhellmann PTAL |
Signed-off-by: Dao Cong Tien <tiendc@vn.fujitsu.com> Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>
@zaneb @dhellmann @maelk Can someone help review this patch? Thanks a lot!
/approve
/test-integration
}
// If the target state is unset while the last error is set,
// then the last execution of soft power off has failed.
if targetState == "" && ironicNode.LastError != "" {
There's a possibility that the node could already have an error set before we attempt to power it off. However, even in that worst-case scenario, all that happens is that we will go straight to a hard power off. So I think this is fine.
/test-integration
@zaneb Why has this PR been stuck in this state? Is it currently waiting for the result of test-integration (almost a week has passed)? Is there anything I can do to move this PR forward?
I'm not sure why the integration test didn't run. @maelk any idea?
/test-integration
Perhaps only org members can trigger the test job?
I also tried unsuccessfully to trigger it last week, but on reflection that may have been before the regex was changed to allow it not to be the only line in the comment.
@zaneb @dhellmann test-integration has passed. Please continue to review, thanks.
@zaneb PTAL, thanks.
@zaneb @dhellmann @maelk At this point the PR has an approval and test-integration has passed; it currently needs the lgtm label. Is there anything else I can do to advance this PR?
I tested this locally with some VMs and it works well.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dhellmann, tiendc, zaneb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
The default reboot-interface behaviour is to attempt a soft power off, and if this fails, revert to a hard power off (PR openshift#294). For high availability use-cases we require the ability to immediately power-off a node. This PR attempts to address that requirement and is part of a wider solution requiring the CAPBM to set the annotation that we have detailed and implemented in this commit. The baseline provisioner API changes have been provided in an earlier commit. CAPBM PR: openshift/cluster-api-provider-baremetal#138 Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678