Default to use soft power off instead of hard power off #294

Merged
merged 1 commit into from Jul 2, 2020
Conversation

@tiendc tiendc commented Aug 29, 2019

#273
Signed-off-by: Dao Cong Tien <tiendc@vn.fujitsu.com>

@nordixinfra

Can one of the admins verify this patch?

@derekhiggins
Member

Is this the call that is used for fencing? If so should it remain a hard power off?

pkg/provisioner/provisioner.go Outdated
result, err = p.changePower(ironicNode, nodes.SoftPowerOff)
if err != nil {
// Soft power off is not supported by vendor driver, uses PowerOff()
if strings.HasPrefix(err.Error(), "driver does not support target power state") {
Member

Is there some way for us to detect whether the driver supports soft power off before we get here and try to use it? Could we store a setting in the Status section of the host, so the user knows what to expect when they ask for the host to be powered off?

Author

@tiendc tiendc Sep 26, 2019

I've checked the Ironic API; currently there is no API for retrieving supported power states. As an alternative, we may consider adding a function to bmcAccessDetail, say softPowerOffSupported().
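
A minimal sketch of that alternative, assuming a capability method on the BMC access details. Only the name softPowerOffSupported() comes from the comment above; the interface, the stand-in type, the helper, and the target-state strings are hypothetical and are not the operator's actual code.

```go
package main

// Target power state strings, used here only for illustration.
const (
	targetSoftPowerOff = "soft power off"
	targetPowerOff     = "power off"
)

// powerCapabilities is a hypothetical slice of the BMC access details.
// Only softPowerOffSupported() is named in the discussion above.
type powerCapabilities interface {
	softPowerOffSupported() bool
}

// staticBMC is a stand-in implementation that returns a fixed answer.
type staticBMC struct {
	soft bool
}

func (b staticBMC) softPowerOffSupported() bool { return b.soft }

// choosePowerOffTarget picks the power state to request up front,
// instead of trying soft power off and inspecting the error afterwards.
func choosePowerOffTarget(bmc powerCapabilities) string {
	if bmc.softPowerOffSupported() {
		return targetSoftPowerOff
	}
	return targetPowerOff
}
```

With something like this, the caller would know before sending any request whether a soft power off is worth attempting, and could surface the answer to the user.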

@dhellmann
Member

> Is this the call that is used for fencing? If so should it remain a hard power off?

A soft power off is the default, but if that fails we still yank the power.

@dhellmann
Member

@tiendc thank you for working on this!

@derekhiggins
Member

> Is this the call that is used for fencing? If so should it remain a hard power off?
>
> A soft power off is the default, but if that fails we still yank the power.

ack, thanks

@tiendc tiendc closed this Sep 26, 2019
@tiendc tiendc reopened this Sep 26, 2019
docs/api.md Outdated
Value is one of the following:
* *<empty string>* -- Soft power off is not used on the node.
* *unsupported* -- Soft power off is not supported on the node.
* *triggered* -- Soft power off is triggered on the node but
Member

Do we really need to track the soft power off status separately?

If we do, I think instead of reflecting it in a new status field, we should see if we can combine it with the provisioning status and make that a top level field on the status structure.

We would never have something in "provisioning" with a soft power off status of "triggered", for example, right?

@@ -1325,6 +1336,19 @@ func (p *ironicProvisioner) PowerOn() (result provisioner.Result, err error) {
func (p *ironicProvisioner) PowerOff() (result provisioner.Result, err error) {
p.log.Info("ensuring host is powered off")

// Tries soft power off first, if it fails, performs hard power off
result, err = p.softPowerOff()
if err != nil {
Member

We only want to switch to the hard power off mode if we get the very specific 400 error. If we get a 409 we want to pause and try the soft power off again, for example.

I think we want to define a new error type so that we can convert the 400 error from line 1291 to our custom type, and then check for that type here instead of just checking against nil.
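
The suggestion above could look roughly like the following sketch. Every name in it is an assumption made for illustration: statusError stands in for whatever error the HTTP client actually returns, and SoftPowerOffUnsupportedError, softPowerOff taking a request function, and decide are hypothetical. It only shows the shape of converting the 400 into a dedicated type and branching on that type rather than on `err != nil`.

```go
package main

import (
	"errors"
	"fmt"
)

// statusError is a stand-in for the error the HTTP client returns;
// the real client's error type is not shown in this thread.
type statusError struct {
	code int
}

func (e statusError) Error() string {
	return fmt.Sprintf("unexpected HTTP status %d", e.code)
}

// SoftPowerOffUnsupportedError is a hypothetical custom error type that
// marks the one case where falling back to a hard power off is wanted.
type SoftPowerOffUnsupportedError struct {
	cause error
}

func (e SoftPowerOffUnsupportedError) Error() string {
	return "soft power off is not supported: " + e.cause.Error()
}

// softPowerOff runs the request and converts only a 400 response into
// the custom type; every other error is passed through unchanged.
func softPowerOff(request func() error) error {
	err := request()
	var se statusError
	if errors.As(err, &se) && se.code == 400 {
		return SoftPowerOffUnsupportedError{cause: err}
	}
	return err
}

// nextStep describes what the caller should do after the attempt.
type nextStep int

const (
	stepDone nextStep = iota
	stepHardPowerOff
	stepRetrySoft
	stepFail
)

// decide checks the error type instead of comparing against nil only.
func decide(request func() error) nextStep {
	err := softPowerOff(request)
	var unsupported SoftPowerOffUnsupportedError
	var se statusError
	switch {
	case err == nil:
		return stepDone
	case errors.As(err, &unsupported):
		return stepHardPowerOff // the specific 400: driver cannot do it
	case errors.As(err, &se) && se.code == 409:
		return stepRetrySoft // busy: pause and try soft power off again
	default:
		return stepFail
	}
}
```

Under these assumptions a 409 never triggers the hard power off path; only the converted 400 does.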

Author

@tiendc tiendc Oct 14, 2019

In my test environment, I have a Fujitsu BM server. It requires a specific agent to be installed in the OS to allow soft power off. So if the agent is not installed, any attempt to perform a soft power off will fail, regardless of support from Ironic and the Fujitsu driver for Ironic. I think if we retry the action when it fails, we should limit the number of retries, say to 3. Do you have any suggestions?

Member

The 409 error just means that ironic itself is too busy to handle our request (or more likely that it is already doing something with the host and cannot send multiple instructions). But your point about only retrying a few times makes a lot of sense, for other types of errors.
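
The bounded retry discussed in this thread can be sketched as below. This is a simplification under stated assumptions: the limit of 3 is the figure floated above, powerOffWithLimit and errSoftFailed are hypothetical names, and the loop retries immediately, whereas a controller would more likely spread the attempts across reconcile passes.

```go
package main

import "errors"

// maxSoftAttempts follows the "say 3" suggestion in the thread.
const maxSoftAttempts = 3

// errSoftFailed is an illustrative sentinel for a failed soft power off.
var errSoftFailed = errors.New("soft power off failed")

// powerOffWithLimit tries the soft path up to maxSoftAttempts times and
// then falls back to the hard path. It reports how many soft attempts
// were made and whether the hard path was used.
func powerOffWithLimit(soft, hard func() error) (attempts int, usedHard bool, err error) {
	for attempts < maxSoftAttempts {
		attempts++
		if soft() == nil {
			return attempts, false, nil
		}
	}
	return attempts, true, hard()
}
```

A host whose OS lacks the required agent would then cost three failed soft attempts before the hard power off, rather than retrying indefinitely.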

@metal3-io-bot metal3-io-bot added size/XXL and size/L and removed size/XXL labels Dec 26, 2019
@tiendc
Author

tiendc commented Jan 7, 2020

It seems I did something wrong with git, so the pull request contains a commit that is not mine. I will try to fix it.

@tiendc tiendc closed this Jan 13, 2020
@tiendc tiendc deleted the soft_power_off branch January 13, 2020 05:01
@tiendc tiendc restored the soft_power_off branch January 13, 2020 05:03
@tiendc tiendc reopened this Jan 13, 2020
Member

@zaneb zaneb left a comment

It would be great if we could eliminate having to store state in the Host CR. It looks to me like this should be theoretically possible.

@zhouhao3
Member

zhouhao3 commented Apr 7, 2020

@zaneb @dhellmann
Hi, I will take over from @tiendc to continue this work, and I made some changes to the code based on your previous comments. PTAL.

@zhouhao3
Member

@dhellmann PTAL

@zhouhao3
Member

zhouhao3 commented May 9, 2020

@zaneb @dhellmann PTAL

@zhouhao3
Member

@zaneb @dhellmann PTAL

Signed-off-by: Dao Cong Tien <tiendc@vn.fujitsu.com>
Signed-off-by: Zhou Hao <zhouhao@cn.fujitsu.com>
@zhouhao3
Member

@zaneb @dhellmann @maelk Can someone help review this patch? Thanks a lot!

Member

@zaneb zaneb left a comment

/approve
/test-integration

}
// If the target state is unset while the last error is set,
// then the last execution of soft power off has failed.
if targetState == "" && ironicNode.LastError != "" {
Member

There's a possibility that the node could already have an error set before we attempt to power it off. However, even in that worst-case scenario, all that happens is that we will go straight to a hard power off. So I think this is fine.
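
The check and the edge case described here can be illustrated with a small sketch. The nodeState struct and its TargetPowerState field name are assumptions (the diff excerpt only shows a local targetState variable and ironicNode.LastError), and lastSoftPowerOffFailed is a hypothetical helper.

```go
package main

// nodeState holds the two values the check reads. The struct and the
// TargetPowerState field name are hypothetical; LastError mirrors the
// field referenced in the diff excerpt.
type nodeState struct {
	TargetPowerState string
	LastError        string
}

// lastSoftPowerOffFailed applies the heuristic from the diff: no pending
// target state plus a recorded error is read as a failed soft power off.
func lastSoftPowerOffFailed(n nodeState) bool {
	return n.TargetPowerState == "" && n.LastError != ""
}
```

A node that already carried an unrelated error before any power-off attempt also returns true here, which is the false positive mentioned above; its only effect is that the hard power off is used right away.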

@metal3-io-bot metal3-io-bot added the approved label Jun 10, 2020
@zhouhao3
Member

/test-integration

@zhouhao3
Member

@zaneb Why has this PR been stuck in this state? Is it currently waiting for the result of test-integration (almost a week has passed)? Is there anything I can do to move this PR forward?

@zaneb
Member

zaneb commented Jun 16, 2020

I'm not sure why the integration test didn't run. @maelk any idea?

@dhellmann
Member

/test-integration

@dhellmann
Member

> I'm not sure why the integration test didn't run. @maelk any idea?

Perhaps only org members can trigger the test job?

@zaneb
Member

zaneb commented Jun 16, 2020

> Perhaps only org members can trigger the test job?

I also tried unsuccessfully to trigger it last week, but on reflection that may have been before the regex was changed to allow it not to be the only line in the comment.

@zhouhao3
Member

@zaneb @dhellmann test-integration has passed. Please continue to review, thanks.

@zhouhao3
Member

@zaneb PTAL, thanks.

@zhouhao3
Member

@zaneb @dhellmann @maelk At present, this PR has been approved and test-integration has passed. It currently requires the lgtm label. Is there anything else I can do to advance this PR?

Member

@dhellmann dhellmann left a comment

I tested this locally with some VMs and it works well.

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm label Jul 2, 2020
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, tiendc, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot merged commit da5b8a8 into metal3-io:master Jul 2, 2020
rdoxenham referenced this pull request in rdoxenham/baremetal-operator Mar 2, 2021
The default reboot-interface behaviour is to attempt a soft power
off, and if this fails, revert to a hard power off (PR openshift#294). For high
availability use-cases we require the ability to immediately power-off
a node. This PR attempts to address that requirement and is part of a
wider solution requiring the CAPBM to set the annotation that we have
detailed and implemented in this commit. The baseline provisioner API
changes have been provided in an earlier commit.

CAPBM PR: openshift/cluster-api-provider-baremetal#138
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1927678
rdoxenham referenced this pull request in rdoxenham/baremetal-operator Mar 8, 2021
rdoxenham referenced this pull request in rdoxenham/baremetal-operator Mar 9, 2021
rdoxenham referenced this pull request in rdoxenham/baremetal-operator Mar 9, 2021
elfosardo pushed a commit to elfosardo/baremetal-operator that referenced this pull request Oct 16, 2023
…rsion

[release-4.13] OCPBUGS-17229: Set minimum TLS version for webhook to 1.2