🐛 Always retry failed cleaning on deprovisioning (fixes #1182) #1184
Conversation
/test-centos-integration-main
/test-centos-integration-main
/lgtm
The issue with this (which I should have mentioned here instead of just on the issue) is that if there is some systematic failure, it will just sit in deprovisioning forever; there's no point where we ever report the cause of the problem to the user, nor can we increase the delay between retries.
/hold
Needs careful thinking; will get back to it after the holidays.
One reason this matters is that when we are trying to delete a host, we give up after 3 attempts at deprovisioning, so that if the host is just gone you won't be stuck with the BMH forever: https://github.com/metal3-io/baremetal-operator/blob/main/controllers/metal3.io/host_state_machine.go#L517-L521 I guess in the case where it is just gone we'll fail at …
You can delete such a node by disabling cleaning or detaching it. I'm very much against giving up on cleaning if it's requested by a user.
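(Editor's aside: a minimal sketch of the two opt-outs mentioned above, using the metal3 v1alpha1 Go types as I understand them. The field name `AutomatedCleaningMode`, the `CleaningModeDisabled` constant, and the `baremetalhost.metal3.io/detached` annotation key are taken from the metal3 API docs, not from this PR; verify them against your vendored `apis` module.)

```go
package main

import (
	"fmt"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

func main() {
	host := &metal3v1alpha1.BareMetalHost{}

	// Option 1: keep the host managed, but ask the operator to skip
	// automated cleaning on deprovisioning.
	host.Spec.AutomatedCleaningMode = metal3v1alpha1.CleaningModeDisabled

	// Option 2: detach the host entirely so the operator stops
	// reconciling it (annotation key assumed from the metal3 docs;
	// the value is not significant).
	host.SetAnnotations(map[string]string{
		"baremetalhost.metal3.io/detached": "",
	})

	fmt.Printf("cleaning mode: %s, annotations: %v\n",
		host.Spec.AutomatedCleaningMode, host.GetAnnotations())
}
```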
Force-pushed from f5bd570 to 030de25.
/test-centos-integration-main
/hold cancel
This seems to work well in my testing, including stopping retries if automated cleaning mode is changed to disabled.
/lgtm
I could maybe buy that, but we're still not reporting the error, so the user has no idea what is going on. It will just sit in 'deprovisioning' forever.
This applies to everything, no? It's the same for deployment and inspecting: we briefly show the error on the BMH, then retry. Why should cleaning be different? How else can we report an error?
> This applies to everything, no?

Not everything. For deployment I think that's true, because the state changes to deprovisioning and we don't keep track of how many times we go through that cycle. We should fix that. For inspecting, I think from memory we do the exponential backoff correctly.

> Why should cleaning be different?

Everything should be different.

> How else can we report an error?

In principle: return a failure that will record the reason in the status and bump the errorCount, thus increasing the delay until the next retry.
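(Editor's aside: to make the mechanism concrete, here is a minimal illustrative Go sketch, not the operator's actual code. The `errorCount` recorded in the status drives an exponentially growing requeue delay; the helper name, base delay, and cap are assumptions for the example.)

```go
package main

import (
	"fmt"
	"time"
)

// nextRetryDelay is a hypothetical helper: each recorded failure doubles
// the wait before the next attempt, capped so retries never stop entirely.
func nextRetryDelay(errorCount int, base, max time.Duration) time.Duration {
	delay := base << uint(errorCount) // base * 2^errorCount
	if delay <= 0 || delay > max {    // handle shift overflow and the cap
		return max
	}
	return delay
}

func main() {
	// With an assumed 10s base and 10m cap: 10s, 20s, 40s, 80s, ...
	for count := 0; count < 8; count++ {
		fmt.Printf("errorCount=%d -> retry in %s\n",
			count, nextRetryDelay(count, 10*time.Second, 10*time.Minute))
	}
}
```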
That's what my patch does, at least as far as I understand (and see in my testing). Could you point out what I'm missing? Is there any explicit step needed to make the delay exponential?
I tested by starting cleaning, then aborting it after the node enters cleaning:

```yaml
status:
  errorCount: 1
  errorMessage: 'Cleaning failed: By request, the clean operation was aborted'
  errorType: provisioning error
```

On the 2nd attempt, the error message is buggy, although Ironic reports the same failure:

```yaml
status:
  errorCount: 2
  errorMessage: 'Cleaning failed: '
  errorType: provisioning error
```

I have a feeling that more time passed after the 2nd attempt than after the 1st one, but I'm not 100% sure.
The empty error message is probably this Ironic issue: https://storyboard.openstack.org/#!/story/2010603. We'll fix it separately from this PR.
Sorry, my bad, you are correct. I thought we were still discussing the previous version of the patch and I missed that you already fixed this.
Not running cleaning immediately after a failure may:

1. cause side effects when the machine powers back on into the old operating system (and e.g. rejoins the cluster as a worker);
2. confuse users with a previous Ironic background, since in Ironic a node cannot enter available without going through cleaning (unless disabled).

This change moves hosts manageable -> available immediately. Users who want to opt out have to set automatedCleanMode to disabled or detach the host.

As of this change, we still give up cleaning after 3 attempts, so it's still possible to end up with an unclean host causing issues.
Force-pushed from 030de25 to d78fb4d.
/test-centos-integration-e2e-main
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: honza