
🐛 Always retry failed cleaning on deprovisioning (fixes #1182) #1184

Merged
merged 1 commit into metal3-io:main from the always-clean branch on Feb 28, 2023

Conversation

dtantsur
Member

@dtantsur dtantsur commented Nov 23, 2022

Not retrying cleaning immediately after a failure may:

  1. cause side effects when the machine powers back on into the old
     operating system (and e.g. rejoins the cluster as a worker);
  2. confuse users with a prior Ironic background, since in Ironic a node
     cannot enter available without going through cleaning (unless it is disabled).

This change moves hosts manageable -> available immediately. Users who want
to opt out have to set automatedCleanMode to disabled or detach the host.

Even with this change, we still give up cleaning after 3 attempts, so it is
still possible to end up with an unclean host causing issues.

@metal3-io-bot metal3-io-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 23, 2022
@dtantsur dtantsur requested a review from zaneb November 23, 2022 15:08
@dtantsur
Member Author

/test-centos-integration-main
/test-ubuntu-integration-main

@dtantsur
Member Author

/test-centos-integration-main

@elfosardo
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 25, 2022
Member

@kashifest kashifest left a comment


lgtm. I really like this change.
/cc @zaneb @honza

Member

@zaneb zaneb left a comment


The issue with this (which I should have mentioned here instead of just on the issue) is that if there is some systematic failure, it will just sit in deprovisioning forever; there's no point where we ever report the cause of the problem to the user, nor can we increase the delay between retries.

pkg/provisioner/ironic/ironic.go (inline review comment, resolved)
@dtantsur
Member Author

/hold

Needs careful thinking; I'll get back to it after the holidays.

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 20, 2022
@zaneb
Member

zaneb commented Dec 20, 2022

One reason this matters is that when we are trying to delete a host, we give up after 3 attempts at deprovisioning so that if the host is just gone, you won't be stuck with the BMH forever: https://github.com/metal3-io/baremetal-operator/blob/main/controllers/metal3.io/host_state_machine.go#L517-L521

I guess in the case where it is just gone we'll fail at Error rather than CleanFailed, so that will still work. But if we retry cleaning forever then in a situation where it can never work (e.g. BMC works but network card in the host is fried, or something) you will be stuck with a BMH that cannot be deleted.
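
For context, the check linked above implements the behaviour zaneb describes: once a host is marked for deletion and deprovisioning keeps failing, the operator gives up so the BMH can still be removed. The following is a minimal, self-contained sketch of that shape, using simplified stand-in types and a made-up constant name rather than the actual host_state_machine.go code:

    package main

    import "fmt"

    // Simplified stand-ins for the relevant BareMetalHost fields; the real
    // types live in the metal3.io v1alpha1 API package.
    type hostStatus struct {
        ErrorCount int
    }

    type host struct {
        DeletionRequested bool // true once the BMH has a deletion timestamp
        Status            hostStatus
    }

    // maxDeprovisionAttempts is an illustrative name for the limit discussed
    // above: when deleting a host, stop retrying deprovisioning after a few
    // failures so that hardware that is simply gone does not block deletion.
    const maxDeprovisionAttempts = 3

    func shouldGiveUpDeprovisioning(h host) bool {
        return h.DeletionRequested && h.Status.ErrorCount >= maxDeprovisionAttempts
    }

    func main() {
        h := host{DeletionRequested: true, Status: hostStatus{ErrorCount: 3}}
        fmt.Println(shouldGiveUpDeprovisioning(h)) // true: let deletion proceed
    }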

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2023
@dtantsur
Member Author

I guess in the case where it is just gone we'll fail at Error rather than CleanFailed, so that will still work. But if we retry cleaning forever then in a situation where it can never work (e.g. BMC works but network card in the host is fried, or something) you will be stuck with a BMH that cannot be deleted.

You can delete such a node by disabling cleaning or detaching it. I'm very much against giving up on cleaning if it's requested by a user.

@metal3-io-bot metal3-io-bot removed the lgtm (Indicates that a PR is ready to be merged.) and needs-rebase (Indicates that a PR cannot be merged because it has merge conflicts with HEAD.) labels Feb 16, 2023
@dtantsur
Member Author

/test-centos-integration-main

@dtantsur
Member Author

/hold cancel
/test-ubuntu-integration-main

This seems to work well in my testing, including stopping retries if automated cleaning mode is changed to disabled.
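
The opt-out being tested here relies on the host's automated cleaning mode, which accepts the values metadata (clean) and disabled (skip cleaning). A rough sketch of that decision, assuming simplified stand-in types rather than the operator's actual API:

    package main

    import "fmt"

    // cleaningMode is a simplified stand-in for the BareMetalHost
    // automatedCleaningMode field, which accepts "metadata" or "disabled".
    type cleaningMode string

    const (
        cleaningModeMetadata cleaningMode = "metadata"
        cleaningModeDisabled cleaningMode = "disabled"
    )

    // shouldRetryCleaning captures the behaviour described above: keep
    // retrying a failed clean unless the user has opted out by switching
    // the host to "disabled".
    func shouldRetryCleaning(mode cleaningMode, cleanFailed bool) bool {
        if mode == cleaningModeDisabled {
            return false // user opted out; stop retrying and move on
        }
        return cleanFailed
    }

    func main() {
        fmt.Println(shouldRetryCleaning(cleaningModeMetadata, true)) // true: retry
        fmt.Println(shouldRetryCleaning(cleaningModeDisabled, true)) // false: stop
    }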

@metal3-io-bot metal3-io-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2023
@elfosardo
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2023
@zaneb
Member

zaneb commented Feb 20, 2023

You can delete such a node by disabling cleaning or detaching it.

I could maybe buy that, but we're still not reporting the error so the user has no idea what is going on. It will just sit in 'deprovisioning' forever.

@dtantsur
Member Author

I could maybe buy that, but we're still not reporting the error so the user has no idea what is going on. It will just sit in 'deprovisioning' forever.

This applies to everything, no? It's the same for deployment and inspecting: we briefly show the error on the BMH, then retry. Why should cleaning be different? How else can we report an error?

@zaneb
Member

zaneb commented Feb 22, 2023

I could maybe buy that, but we're still not reporting the error so the user has no idea what is going on. It will just sit in 'deprovisioning' forever.

This applies to everything, no?

Not everything.

It's the same for deployment and inspecting: we briefly show the error on the BMH, then retry.

For deployment I think that's true, because the state changes to deprovisioning and we don't keep track of how many times we go through that cycle. We should fix that.

For inspecting, I think (from memory) we do the exponential backoff correctly.

Why should cleaning be different?

Everything should be different.

How else can we report an error?

In principle: return a failure that will record the reason in the status and bump the errorCount, thus increasing the delay until the next retry.
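
The mechanism described here (record the reason, bump errorCount, back off before the next retry) can be pictured with the following self-contained sketch; the field names and the backoff formula are illustrative assumptions, not the operator's actual implementation:

    package main

    import (
        "fmt"
        "time"
    )

    // hostStatus is a simplified stand-in for the status fields discussed in
    // this thread (compare the errorCount/errorMessage snippets shown later).
    type hostStatus struct {
        ErrorCount   int
        ErrorMessage string
    }

    // recordFailure stores the reason in the status and bumps ErrorCount,
    // which is what makes the delay grow on each retry.
    func recordFailure(st *hostStatus, reason string) {
        st.ErrorMessage = reason
        st.ErrorCount++
    }

    // retryDelay grows exponentially with the error count and is capped so
    // the host is still retried eventually (constants chosen for illustration).
    func retryDelay(errorCount int) time.Duration {
        const maxDelay = 10 * time.Minute
        if errorCount > 10 { // avoid overflow for very large counts
            return maxDelay
        }
        d := time.Duration(1<<errorCount) * 10 * time.Second
        if d > maxDelay {
            return maxDelay
        }
        return d
    }

    func main() {
        var st hostStatus
        for i := 0; i < 4; i++ {
            recordFailure(&st, "Cleaning failed: aborted by request")
            fmt.Printf("attempt %d: requeue after %s\n", st.ErrorCount, retryDelay(st.ErrorCount))
        }
    }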

@dtantsur
Member Author

In principle: return a failure that will record the reason in the status and bump the errorCount, thus increasing the delay until the next retry.

That's what my patch does, at least as far as I understand (and as far as I can see in my testing). Could you point out what I'm missing? Is there an explicit step needed to make the delay exponential?

@dtantsur
Member Author

I tested by starting cleaning, then aborting it with the Ironic CLI once the node enters clean wait. This is the outcome:

status:
  errorCount: 1
  errorMessage: 'Cleaning failed: By request, the clean operation was aborted'
  errorType: provisioning error

On the 2nd attempt, the error message is buggy, although Ironic reports the same failure:

status:
  errorCount: 2
  errorMessage: 'Cleaning failed: '
  errorType: provisioning error

I have a feeling that more time has passed after the 2nd attempt than after the 1st one, but I'm not 100% sure.

@dtantsur
Member Author

The empty error message is probably this Ironic issue: https://storyboard.openstack.org/#!/story/2010603. We'll fix it separately from this PR.

@zaneb
Member

zaneb commented Feb 22, 2023

That's what my patch does

Sorry, my bad, you are correct. I thought we were still discussing the previous version of the patch and I missed that you already fixed this.

@metal3-io-bot metal3-io-bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 23, 2023
@dtantsur dtantsur changed the title 🐛 Always retry failed cleaning (fixes #1182) 🐛 Always retry failed cleaning on deprovisioning (fixes #1182) Feb 23, 2023
@dtantsur
Member Author

/test-centos-integration-e2e-main
/test-ubuntu-integration-main

Member

@zaneb zaneb left a comment


/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2023
@honza
Member

honza commented Feb 28, 2023

/approve

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: honza


@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2023
@metal3-io-bot metal3-io-bot merged commit cf8c50a into metal3-io:main Feb 28, 2023
@dtantsur dtantsur deleted the always-clean branch March 1, 2023 09:09