Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Prevent hosts from getting into infinite power-off loops #1356

Merged
merged 3 commits into from Oct 19, 2023

Conversation

honza
Copy link
Member

@honza honza commented Sep 14, 2023

This is to cover the case where we need registration, and we have credentials, and yet we can't power the node off.

Also note that this is currently rebased on top of #1312, and only the latest commit is relevant.

@metal3-io-bot metal3-io-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2023
Copy link
Member

@zaneb zaneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that I steered you wrong by suggesting that we need to count the NeedsRegistration errors.

The purpose of NeedsRegistration is not to note that we failed to register; rather it is to retrigger a reconcile so that registration can run again on the new reconcile. This is why it's actionError rather than actionFailure. Any actual failures to register are recorded by registerHost() on the next reconcile.

Once in StatePoweringOffBeforeDelete, however, we will never call registerHost() because we bail out from ensureRegistered() (at line 332). This is a bug - the comment there says "In the deleting state the whole idea is to de-register the host" and that is true for the Deleting state, but not true for PoweringOffBeforeDelete where the host very much needs to be registered in order for us to power off. As things stand, once we need to re-register we can never get out of this loop because registration will never happen again.

So we need to fix that, but we also need to find a way to bail out once the registration errors start to pile up, otherwise we will be stuck trying to re-register. This makes me wonder how repeated registration errors are handled in Deprovisioning, which is in a similar boat (i.e. a thing you have to do before deleting that requires registration)... as far as I can see they're not. Maybe we could check the registration error count in checkInitiateDelete and move the state on to Deleting?

Looking at that, I note one more problem: if the Host is created with incorrect credentials, it will be stuck in the Registering state with failures. When the user then deletes the host, it should immediately go to the deleting state where it can be deleted despite the lack of credentials. However, it now goes into PoweringOffBeforeDelete where it will be stuck forever. I guess that'll also be resolved if we fix the known bug, but I think for defensive reasons prior to the initial registration succeeding (i.e. the states "registering", "unmanaged", and ""), checkInitiateDelete() should send a host with a deletion timestamp directly to the Deleting state.

(aside: I wonder if ExternallyProvisioned should also go straight to Deleting? There's no danger of us leaving a running IPA image when the host is externally provisioned - the reason for this feature - and if we don't own provisioning then maybe we shouldn't mess with the power state except when explicitly asked to by the user? Like, shouldn't it be possible to add an ExternallyProvisioned host and then delete it again without Metal³ ever doing anything to it?)

if hsm.Host.Status.ErrorCount > 3 {
info.log.Info("Giving up on host power off after 3 attempts.")
return skipToDelete()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if we did something like

return recordActionFailure(info, metal3api.RegistrationError, r.err.Error())

here (after the if block) then we wouldn't need to change the behaviour of every error return point (most of which are likely transient errors).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but see main comment - on reflection I don't think we need to count at this point at all)

@metal3-io-bot metal3-io-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2023
@honza honza changed the title [WIP] 🐛 Give up powering off [WIP] 🐛 Prevent hosts from getting into infinite power-off loops Sep 14, 2023
@honza
Copy link
Member Author

honza commented Sep 14, 2023

  • I removed the original error-counting code.
  • I rebased the branch onto main.
  • I removed the code that prevents registerHost() from running during PoweringOffBeforeDelete.
  • I added the defensive code that sends a host straight to Deleting (bypassing PoweringOffBeforeDelete if it's Registering, Unmanaged, or None.

@honza honza changed the title [WIP] 🐛 Prevent hosts from getting into infinite power-off loops 🐛 Prevent hosts from getting into infinite power-off loops Sep 14, 2023
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 14, 2023
@zaneb
Copy link
Member

zaneb commented Sep 14, 2023

This looks good, but we also need to deal with:

we also need to find a way to bail out once the registration errors start to pile up, otherwise we will be stuck trying to re-register. This makes me wonder how repeated registration errors are handled in Deprovisioning, which is in a similar boat (i.e. a thing you have to do before deleting that requires registration)... as far as I can see they're not. Maybe we could check the registration error count in checkInitiateDelete and move the state on to Deleting?

because if registration fails (in the PoweringOffBeforeDelete state) then (I think!) it will just keep retrying in ensureRegistered() and never get to any code that would allow it to bail out (not that any exists).

You should be able to test this by changing the credentials to invalid values and then deleting the host.

@zaneb
Copy link
Member

zaneb commented Sep 14, 2023

Once #1312 merges we'll be able to update the state machine diagram here too. Just leaving a note to remind myself.

@metal3-io-bot metal3-io-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 18, 2023
The host needs to be registered if we are going to power it off.
Namely registering, unmanaged, and none.  This is to prevent a stuck
case where incorrect credentials have been provided.
@honza
Copy link
Member Author

honza commented Sep 18, 2023

  • We now skip to delete when a registration error happens more than 3 times during deprovisioning, and powering off
  • I updated the diagram

@dtantsur
Copy link
Member

/test-centos-e2e-integration-main
/test-ubuntu-integration-main
/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2023
// Allow state machine to run to continue deprovisioning.
return false
case metal3api.StateDeleting:
// Already in deleting state. Allow state machine to run.
return false
case metal3api.StatePoweringOffBeforeDelete:
if hsm.Host.Status.ErrorType == metal3api.RegistrationError && hsm.Host.Status.ErrorCount > 3 {
hsm.NextState = metal3api.StateDeleting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't return true here, then we'll continue to process the state machine as if the current state were already Deleting, but without updating it in the Status unless handleDeleting() happens to return actionUpdate or actionComplete. Functionally this is OK - ensureRegistered() will look at the next state and not bother trying to re-register. But for the sake of consistency with all the other cases of changing hsm.NextState in this function, I think we should return true when we set it so that the user can explicitly see what is happening at each step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return statement added here, and also at the deprovisioning state

@metal3-io-bot metal3-io-bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 10, 2023
Copy link
Member

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I think we should also change

if r.NeedsRegistration() && !hsm.haveCreds {

My reasoning is that we know the BMH is marked for deletion already since we are in handlePoweringOffBeforeDelete, therefore we should skip to delete if it needs registration OR if we lack credentials to power it off. I don't see a case where we should ever register the host when it is already marked for deletion.
(The reason for deleting could be that registration failed.)

@dtantsur
Copy link
Member

/test-centos-e2e-integration-main
/test-ubuntu-integration-main

@dtantsur
Copy link
Member

/approve

Please do consider the suggestion above.

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtantsur

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2023
@elfosardo
Copy link
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2023
@metal3-io-bot metal3-io-bot merged commit 7559468 into metal3-io:main Oct 19, 2023
10 checks passed
@zaneb
Copy link
Member

zaneb commented Oct 24, 2023

I don't see a case where we should ever register the host when it is already marked for deletion.

The case is when the ironic DB is dropped in mid-operation while we were trying to turn the power off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants