Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Power off nodes upon deletion #1176

Merged
merged 1 commit into from Jul 6, 2023

Conversation

honza
Copy link
Member

@honza honza commented Oct 20, 2022

This is a continuation of #816 which in turn tries to fix #410.

Co-authored-by: Sandhya Dasu sadasu@redhat.com @sadasu

@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 20, 2022
@elfosardo
Copy link
Member

/test-centos-integration-main
/test-ubuntu-integration-main

@elfosardo
Copy link
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2022
@furkatgofurov7
Copy link
Member

Heads up: centOS failure is unrelated to these changes, we are facing issues with CI.

xref: https://kubernetes.slack.com/archives/CHD49TLE7/p1666273231196429

@Cellebyte
Copy link

what is the progress here?

@metal3-io-bot metal3-io-bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2022
@elfosardo
Copy link
Member

/test-centos-integration-main
/test-ubuntu-integration-main

@elfosardo
Copy link
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2022
@dtantsur
Copy link
Member

dtantsur commented Dec 5, 2022

/lgtm cancel

One issue inline, otherwise looking good.

@metal3-io-bot metal3-io-bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2022
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
pkg/provisioner/ironic/ironic.go Outdated Show resolved Hide resolved
@metal3-io-bot metal3-io-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 19, 2022
Copy link
Member

@zaneb zaneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refamiliarising myself with how all of this works 😅

controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved

if err != nil {
if info.host.Status.ErrorCount < maxPowerOffRetryCount {
return actionError{errors.Wrap(err, "failed to power off")}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing anything addressing my earlier comment about an infinite error loop in cases where the node is missing from ironic and we don't have credentials to re-register it, which we handle in deprovisioning here and also now need to handle in deleting.

I wonder if this would all be made simpler by putting the new code into a separate actionPowerOffBeforeDeleting() method, so that the state machine code can easily distinguish between errors coming from the power off vs. the delete.

}
}

info.host.Status.ErrorCount = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the last part of https://github.com/metal3-io/baremetal-operator/pull/1176/files#r1053444986, iff the error count wasn't already 0 we'll want to return actionUpdate wrapping actionContinue on line 533.

}
return result
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably also set Status.PoweredOn to false so that we are reporting what we've actually done.

@metal3-io-bot metal3-io-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 23, 2023
@honza
Copy link
Member Author

honza commented Mar 23, 2023

This might be overkill but the suggested changes led me to create a new step in the state machine. When a delete is requested, instead of StateDeleting, we now enter StatePoweringOffBeforeDelete. This separates the two operations nicely, and allows us to handle errors in the usual way (i.e. as was said above regarding skipping to delete when too many errors happen, or we don't have host creds).

When we clear ErrorCount, we do it as part of the actionComplete result, and thus don't need an explicit update (as is idiomatic in the existing code).

@elfosardo
Copy link
Member

/test-centos-integration-main
/test-ubuntu-integration-main

@elfosardo
Copy link
Member

/test-centos-e2e-integration-main

@@ -561,6 +566,37 @@ func (hsm *hostStateMachine) handleDeprovisioning(info *reconcileInfo) actionRes
return actResult
}

func (hsm *hostStateMachine) handlePoweringOffBeforeDelete(info *reconcileInfo) actionResult {
actResult := hsm.Reconciler.actionPowerOffBeforeDeleting(hsm.Provisioner, info)
skipToDelete := func() actionResult {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: most of this function is repeating handleDeprovisioning. It would be great to refactor them.

@dtantsur
Copy link
Member

/test-centos-e2e-integration-main
/test-ubuntu-integration-main
/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label May 25, 2023
@honza
Copy link
Member Author

honza commented May 31, 2023

/assign @zaneb

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2023
We introduce a new step in the state machine where the node goes through
a power off stage before it's deleted.  We attempt to power it off 3
times before giving up, and proceeding to the delete.
@metal3-io-bot metal3-io-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 9, 2023
@elfosardo
Copy link
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2023
@dtantsur
Copy link
Member

dtantsur commented Jul 6, 2023

/test-centos-e2e-integration-main
/test-ubuntu-integration-main

@dtantsur
Copy link
Member

dtantsur commented Jul 6, 2023

/approve

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtantsur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2023
@metal3-io-bot metal3-io-bot merged commit 7654cd7 into metal3-io:main Jul 6, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Host should be powered down once deleted
8 participants