🐛 Always retry failed cleaning on deprovisioning (fixes #1182) #1184
Conversation
/test-centos-integration-main
/test-centos-integration-main
/lgtm
The issue with this (which I should have mentioned here instead of just on the issue) is that if there is some systematic failure, it will just sit in deprovisioning forever; there's no point where we ever report the cause of the problem to the user, nor can we increase the delay between retries.
/hold
Needs careful thinking; will get back to it after the holidays.
One reason this matters is that when we are trying to delete a host, we give up after 3 attempts at deprovisioning, so that if the host is just gone you won't be stuck with the BMH forever: https://github.com/metal3-io/baremetal-operator/blob/main/controllers/metal3.io/host_state_machine.go#L517-L521 I guess in the case where it is just gone we'll fail at …
You can delete such a node by disabling cleaning or detaching it. I'm very much against giving up on cleaning if it's requested by a user.
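(Editor's aside: a minimal sketch of the two opt-outs mentioned above, using the metal3 v1alpha1 Go types as I understand them. The field name `AutomatedCleaningMode`, the `CleaningModeDisabled` constant, and the `baremetalhost.metal3.io/detached` annotation key are taken from the metal3 API docs, not from this PR; verify them against your vendored `apis` module.)

```go
package main

import (
	"fmt"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

func main() {
	host := &metal3v1alpha1.BareMetalHost{}

	// Option 1: keep the host managed, but ask the operator to skip
	// automated cleaning on deprovisioning.
	host.Spec.AutomatedCleaningMode = metal3v1alpha1.CleaningModeDisabled

	// Option 2: detach the host entirely so the operator stops
	// reconciling it (annotation key assumed from the metal3 docs;
	// the value is not significant).
	host.SetAnnotations(map[string]string{
		"baremetalhost.metal3.io/detached": "",
	})

	fmt.Printf("cleaning mode: %s, annotations: %v\n",
		host.Spec.AutomatedCleaningMode, host.GetAnnotations())
}
```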
Force-pushed from f5bd570 to 030de25.
/test-centos-integration-main
/hold cancel
This seems to work well in my testing, including stopping retries if automated cleaning mode is changed to disabled.
/lgtm
I could maybe buy that, but we're still not reporting the error, so the user has no idea what is going on. It will just sit in 'deprovisioning' forever.
This applies to everything, no? It's the same for deployment and inspecting: we briefly show the error on the BMH, then retry. Why should cleaning be different? How else can we report an error?
> This applies to everything, no?

Not everything. For deployment I think that's true, because the state changes to deprovisioning and we don't keep track of how many times we go through that cycle. We should fix that. For inspecting, I think from memory we do the exponential backoff correctly.

> Why should cleaning be different?

Everything should be different.

> How else can we report an error?

In principle: return a failure that will record the reason in the status and bump the errorCount, thus increasing the delay until the next retry.
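(Editor's aside: to make the mechanism concrete, here is a minimal illustrative Go sketch, not the operator's actual code. The `errorCount` recorded in the status drives an exponentially growing requeue delay; the helper name, base delay, and cap are assumptions for the example.)

```go
package main

import (
	"fmt"
	"time"
)

// nextRetryDelay is a hypothetical helper: each recorded failure doubles
// the wait before the next attempt, capped so retries never stop entirely.
func nextRetryDelay(errorCount int, base, max time.Duration) time.Duration {
	delay := base << uint(errorCount) // base * 2^errorCount
	if delay <= 0 || delay > max {    // handle shift overflow and the cap
		return max
	}
	return delay
}

func main() {
	// With an assumed 10s base and 10m cap: 10s, 20s, 40s, 80s, ...
	for count := 0; count < 8; count++ {
		fmt.Printf("errorCount=%d -> retry in %s\n",
			count, nextRetryDelay(count, 10*time.Second, 10*time.Minute))
	}
}
```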
That's what my patch does, at least as far as I understand (and see in my testing). Could you point out what I'm missing? Is there any explicit step needed to make the delay exponential?
I tested by starting cleaning, then aborting it after the node enters cleaning:

```yaml
status:
  errorCount: 1
  errorMessage: 'Cleaning failed: By request, the clean operation was aborted'
  errorType: provisioning error
```

On the 2nd attempt, the error message is buggy, although Ironic reports the same failure:

```yaml
status:
  errorCount: 2
  errorMessage: 'Cleaning failed: '
  errorType: provisioning error
```

I have a feeling that more time passed after the 2nd attempt than after the 1st one, but I'm not 100% sure.
The empty error message is probably this Ironic issue: https://storyboard.openstack.org/#!/story/2010603. We'll fix it separately from this PR.
Sorry, my bad, you are correct. I thought we were still discussing the previous version of the patch and I missed that you already fixed this.
Not running cleaning immediately after a failure may:

1. cause side effects when the machine powers back on into the old operating system (and e.g. rejoins the cluster as a worker);
2. confuse users with a previous Ironic background, since in Ironic a node cannot enter available without going through cleaning (unless disabled).

This change moves hosts manageable -> available immediately. Users who want to opt out have to set automatedCleanMode to disabled or detach the host.

As of this change, we still give up cleaning after 3 attempts, so it's still possible to end up with an unclean host causing issues.
Force-pushed from 030de25 to d78fb4d.
/test-centos-integration-e2e-main
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: honza