
Don't use external remediation on failed Machines #688

Merged

Conversation

@zaneb (Member) commented Aug 27, 2020

When Machines go into the Failed phase, this is permanent. Their actuators' Update() methods are never called again, and the actuator doesn't get another chance to run until the Machine is deleted. Therefore Machines in this phase cannot be externally remediated by the Machine actuator.

Currently baremetal Machines are the only ones using external remediation, and they never go into the Failed phase, but in the future we would like to use this phase to handle unrecoverable errors: invalid config, inability to find a Host to provision, deletion of the underlying Host, &c.

In these cases, rebooting will not help, and the actuator never gets called to let it try anyway, so always remediate such Machines by deleting them.
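In outline, the remediation decision this change argues for looks something like the sketch below. This is illustrative only, not the actual MachineHealthCheck code: the helper names, the annotation key, and the use of controller-runtime's client are assumptions.

```go
package remediation

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isFailed reports whether the Machine is in the terminal Failed phase.
func isFailed(m *machinev1.Machine) bool {
	return m.Status.Phase != nil && *m.Status.Phase == "Failed"
}

// remediate sketches the decision this PR argues for: a Failed Machine is
// always deleted, even when external remediation is configured, because the
// actuator's Update() will never run again to act on an annotation.
func remediate(ctx context.Context, c client.Client, m *machinev1.Machine, useExternalRemediation bool) error {
	if isFailed(m) {
		// Failed is permanent; deletion is the only remediation that works.
		return c.Delete(ctx, m)
	}
	if useExternalRemediation {
		// Hand off to the external remediation controller.
		return requestExternalRemediation(ctx, c, m)
	}
	// Default strategy: delete so the MachineSet provisions a replacement.
	return c.Delete(ctx, m)
}

// requestExternalRemediation is a hypothetical stand-in for the
// annotate-and-wait path; the annotation key is an assumption.
func requestExternalRemediation(ctx context.Context, c client.Client, m *machinev1.Machine) error {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations["host.metal3.io/external-remediation"] = ""
	return c.Update(ctx, m)
}
```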
@zaneb (Member, Author) commented Aug 27, 2020

/cc @n1r1
/cc @beekhof

@beekhof (Contributor) commented Aug 27, 2020

We have a PR in the works that will automatically escalate from reboots to deletion as needed. So I'm not keen on handling this as a special case in MHC.

@zaneb (Member, Author) commented Sep 9, 2020

> We have a PR in the works that will automatically escalate from reboots to deletion as needed. So I'm not keen on handling this as a special case in MHC.

I just remembered that there's a problem with that approach: the escalation happens within the actuator's Update() method, but once the Machine is in the Failed phase, the actuator is never called. The only thing we can rely on the actuator to do with a failed Machine is delete it.

MHC has a special case for detecting a Machine in the Failed phase, so there's no point in then handling it in a way that can't work.
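To make the constraint concrete, the machine controller's reconcile loop effectively short-circuits on the Failed phase, so only deletion still reaches the actuator. A rough sketch of that behaviour (illustrative names and a pared-down interface, not the literal controller code):

```go
package machinecontroller

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
)

// Actuator is a pared-down stand-in for the machine-api actuator interface.
type Actuator interface {
	Update(ctx context.Context, m *machinev1.Machine) error
	Delete(ctx context.Context, m *machinev1.Machine) error
}

// reconcile sketches why escalation inside Update() can never fire for a
// Failed Machine: the phase check returns before Update() is reached.
func reconcile(ctx context.Context, a Actuator, m *machinev1.Machine) error {
	// Deletion is still honoured, so deleting the Machine works.
	if !m.DeletionTimestamp.IsZero() {
		return a.Delete(ctx, m)
	}
	// Failed is terminal: reconciliation stops here, and the actuator's
	// Update() is never called again for this Machine.
	if m.Status.Phase != nil && *m.Status.Phase == "Failed" {
		return nil
	}
	return a.Update(ctx, m)
}
```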

/retest

@zaneb (Member, Author) commented Sep 9, 2020

In fact I'd go so far as to suggest that you quite possibly want to handle the escalation, when reboot is insufficient to recover the node, by putting the Machine into the Failed phase and letting the MHC do the deletion rather than having the Machine actuator delete the Machine itself.

@zaneb (Member, Author) commented Sep 10, 2020

/retest

@n1r1 (Contributor) commented Sep 10, 2020

> In fact I'd go so far as to suggest that you quite possibly want to handle the escalation, when reboot is insufficient to recover the node, by putting the Machine into the Failed phase and letting the MHC do the deletion rather than having the Machine actuator delete the Machine itself.

I don't think this can work, at least not with the existing implementation.
It's true that the MHC will mark a Failed Machine as unhealthy, but if it has the external remediation annotation, it will never delete the Machine.
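For contrast with the sketch in the description above, the pre-patch ordering being described here would look roughly like this (again illustrative, with an assumed annotation key, not the actual MHC code):

```go
package remediation

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// remediatePrePatch sketches the old ordering: the external remediation
// annotation is checked first, so a Failed Machine carrying it is handed
// off and never deleted, even though external remediation cannot help it.
func remediatePrePatch(ctx context.Context, c client.Client, m *machinev1.Machine) error {
	if _, ok := m.Annotations["host.metal3.io/external-remediation"]; ok {
		// Wait for the external controller to act; Delete() is never
		// reached for this Machine.
		return nil
	}
	return c.Delete(ctx, m)
}
```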

@zaneb (Member, Author) commented Sep 10, 2020

> if it has the external remediation annotation, it will never delete the Machine.

Have you read this patch? ;)

@n1r1 (Contributor) commented Sep 10, 2020

15 days ago 😆

So yes, both can work.
I still think that if CAPBM knows that it wants to reprovision, it should delete the Machine; that's the API for doing so. That sounds more straightforward than implicitly assuming that putting it into a Failed state will cause reprovisioning by something else.

I also think it's more aligned with what we are trying to achieve upstream in CAPI (letting an external component do all remediation actions).

@zaneb (Member, Author) commented Sep 14, 2020

/retest

@zaneb (Member, Author) commented Oct 14, 2020

@beekhof not sure if you saw my last comment, but I think this is still required even once fallback to deletion is implemented. Now that 4.7 is open, I'd like to get this reviewed, as it's blocking other work.

@beekhof (Contributor) commented Nov 19, 2020

Based on a conversation with Zane, “Failed” means:

- the provider went away (e.g. the instance was deleted on AWS, the BareMetalHost was deprovisioned or deleted, etc.), or
- we weren't able to create an instance because the config was missing some essential data (e.g. cloud credentials).

Never just “I can’t poke the BMC anymore”.

Based on that, I'm good with this patch.

/approve

@beekhof (Contributor) commented Nov 19, 2020

/lgtm

@openshift-ci-robot added the lgtm label on Nov 19, 2020
@zaneb (Member, Author) commented Nov 19, 2020

/assign @enxebre

@JoelSpeed (Contributor) left a review comment

/approve

@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: beekhof, JoelSpeed

@openshift-ci-robot added the approved label on Dec 3, 2020
@openshift-merge-robot merged commit 4935eb6 into openshift:master on Dec 3, 2020