
Don't use external remediation on failed Machines #688

Merged

Conversation

@zaneb (Member) commented Aug 27, 2020

When Machines go into the Failed phase, this is permanent. Their actuators' Update() methods are never called again, and the actuator doesn't get another chance to run until the Machine is deleted. Therefore Machines in this phase cannot be externally remediated by the Machine actuator.

Currently baremetal Machines are the only ones using external remediation, and they never go into the Failed phase, but in the future we would like to use this phase to handle unrecoverable errors: invalid config, inability to find a Host to provision, deletion of the underlying Host, &c.

In these cases, rebooting will not help, and the actuator never gets called to let it try anyway, so always remediate such Machines by deleting them.
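In outline, the remediation decision this change argues for looks something like the sketch below. This is illustrative only, not the actual MachineHealthCheck code: the helper names, the annotation key, and the use of controller-runtime's client are assumptions.

```go
package remediation

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isFailed reports whether the Machine is in the terminal Failed phase.
func isFailed(m *machinev1.Machine) bool {
	return m.Status.Phase != nil && *m.Status.Phase == "Failed"
}

// remediate sketches the decision this PR argues for: a Failed Machine is
// always deleted, even when external remediation is configured, because the
// actuator's Update() will never run again to act on an annotation.
func remediate(ctx context.Context, c client.Client, m *machinev1.Machine, useExternalRemediation bool) error {
	if isFailed(m) {
		// Failed is permanent; deletion is the only remediation that works.
		return c.Delete(ctx, m)
	}
	if useExternalRemediation {
		// Hand off to the external remediation controller.
		return requestExternalRemediation(ctx, c, m)
	}
	// Default strategy: delete so the MachineSet provisions a replacement.
	return c.Delete(ctx, m)
}

// requestExternalRemediation is a hypothetical stand-in for the
// annotate-and-wait path; the annotation key is an assumption.
func requestExternalRemediation(ctx context.Context, c client.Client, m *machinev1.Machine) error {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations["host.metal3.io/external-remediation"] = ""
	return c.Update(ctx, m)
}
```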
@zaneb (Member, Author) commented Aug 27, 2020

/cc @n1r1
/cc @beekhof

@beekhof (Contributor) commented Aug 27, 2020

We have a PR in the works that will automatically escalate from reboots to deletion as needed. So I'm not keen on handling this as a special case in MHC.

@zaneb (Member, Author) commented Sep 9, 2020

> We have a PR in the works that will automatically escalate from reboots to deletion as needed. So I'm not keen on handling this as a special case in MHC.

I just remembered that there's a problem with that approach: the escalation happens within the actuator's Update() method, but once the Machine is in the Failed phase, the actuator is never called. The only thing we can rely on the actuator to do with a failed Machine is delete it.

MHC has a special case for detecting a Machine in the Failed phase, so there's no point in then handling it in a way that can't work.
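To make the constraint concrete, the machine controller's reconcile loop effectively short-circuits on the Failed phase, so only deletion still reaches the actuator. A rough sketch of that behaviour (illustrative names and a pared-down interface, not the literal controller code):

```go
package machinecontroller

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
)

// Actuator is a pared-down stand-in for the machine-api actuator interface.
type Actuator interface {
	Update(ctx context.Context, m *machinev1.Machine) error
	Delete(ctx context.Context, m *machinev1.Machine) error
}

// reconcile sketches why escalation inside Update() can never fire for a
// Failed Machine: the phase check returns before Update() is reached.
func reconcile(ctx context.Context, a Actuator, m *machinev1.Machine) error {
	// Deletion is still honoured, so deleting the Machine works.
	if !m.DeletionTimestamp.IsZero() {
		return a.Delete(ctx, m)
	}
	// Failed is terminal: reconciliation stops here, and the actuator's
	// Update() is never called again for this Machine.
	if m.Status.Phase != nil && *m.Status.Phase == "Failed" {
		return nil
	}
	return a.Update(ctx, m)
}
```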

/retest

@zaneb (Member, Author) commented Sep 9, 2020

In fact I'd go so far as to suggest that you quite possibly want to handle the escalation, when reboot is insufficient to recover the node, by putting the Machine into the Failed phase and letting the MHC do the deletion rather than having the Machine actuator delete the Machine itself.

@zaneb (Member, Author) commented Sep 10, 2020

/retest

@n1r1 (Contributor) commented Sep 10, 2020

> In fact I'd go so far as to suggest that you quite possibly want to handle the escalation, when reboot is insufficient to recover the node, by putting the Machine into the Failed phase and letting the MHC do the deletion rather than having the Machine actuator delete the Machine itself.

I don't think this can work, at least not with the existing implementation.
It's true that the MHC will mark a Failed Machine as unhealthy, but if it has the external remediation annotation, it will never delete the Machine.
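For contrast with the sketch in the description above, the pre-patch ordering being described here would look roughly like this (again illustrative, with an assumed annotation key, not the actual MHC code):

```go
package remediation

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// remediatePrePatch sketches the old ordering: the external remediation
// annotation is checked first, so a Failed Machine carrying it is handed
// off and never deleted, even though external remediation cannot help it.
func remediatePrePatch(ctx context.Context, c client.Client, m *machinev1.Machine) error {
	if _, ok := m.Annotations["host.metal3.io/external-remediation"]; ok {
		// Wait for the external controller to act; Delete() is never
		// reached for this Machine.
		return nil
	}
	return c.Delete(ctx, m)
}
```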

@zaneb (Member, Author) commented Sep 10, 2020

> if it has the external remediation annotation, it will never delete the Machine.

Have you read this patch? ;)

@n1r1 (Contributor) commented Sep 10, 2020

15 days ago 😆

So yes, both can work.
I still think that if CAPBM knows that it wants to reprovision, it should delete the Machine; that's the API for doing so. That sounds more straightforward than implicitly assuming that putting it into a Failed state will cause reprovisioning by something else.

I also think it's more aligned with what we are trying to achieve upstream in CAPI (letting an external component do all remediation actions).

@zaneb (Member, Author) commented Sep 14, 2020

/retest

@zaneb (Member, Author) commented Oct 14, 2020

@beekhof not sure if you saw my last comment, but I think this is still required even once fallback to deletion is implemented. Now that 4.7 is open, I'd like to get this reviewed, as it's blocking other work.

@beekhof (Contributor) commented Nov 19, 2020

Based on a conversation with Zane, “Failed” means:

- the provider went away (e.g. the instance was deleted on AWS, the BareMetalHost was deprovisioned or deleted, etc.), or
- we weren't able to create an instance because the config was missing some essential data (e.g. cloud credentials).

Never just “I can’t poke the BMC anymore”.

Based on that, I'm good with this patch.

/approve

@beekhof (Contributor) commented Nov 19, 2020

/lgtm

@openshift-ci-robot added the lgtm label on Nov 19, 2020
@zaneb (Member, Author) commented Nov 19, 2020

/assign @enxebre

@JoelSpeed (Contributor) left a review comment

/approve

@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: beekhof, JoelSpeed

@openshift-ci-robot added the approved label on Dec 3, 2020
@openshift-merge-robot merged commit 4935eb6 into openshift:master on Dec 3, 2020