Don't use external remediation on failed Machines #688
Currently baremetal Machines never go into the Failed phase, but in the future when they do it will be because of an error that is unrecoverable: invalid config, unable to find a Host to provision, underlying Host was deleted, &c. In these cases, rebooting will not help, so always remediate them by deleting the Machine.
We have a PR in the works that will automatically escalate from reboots to deletion as needed, so I'm not keen on handling this as a special case in MHC.
I just remembered that there's a problem with that approach: the escalation happens within the actuator's Update() method, which is never called again once a Machine is in the Failed phase. MHC has a special case for detecting a Machine in the Failed phase, so there's no point in then handling it in a way that can't work. /retest
In fact I'd go so far as to suggest that you quite possibly want to handle the escalation, when reboot is insufficient to recover the node, by putting the Machine into the Failed phase and letting the MHC do the deletion rather than having the Machine actuator delete the Machine itself.
/retest
I don't think this can work, at least not with the existing implementation.
Have you read this patch? ;)
15 days ago 😆 So yes, both can work. I also think it's more aligned with what we are trying to achieve upstream in CAPI (letting an external component do all remediation actions).
/retest
@beekhof not sure if you saw my last comment, but I think this is still required even once fallback to deletion is implemented. Now that 4.7 is open I'd like to get this reviewed as it's blocking other work.
Based on a conversation with Zane, “Failed” means
Never just “I can’t poke the BMC anymore”. /approve
/lgtm
/assign @enxebre
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: beekhof, JoelSpeed The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
When Machines go into the Failed phase, this is permanent. Their actuators' `Update()` methods are never called again, and the actuator doesn't get another chance to run until the Machine is deleted. Therefore Machines in this phase cannot be externally remediated by the Machine actuator.
Currently baremetal Machines are the only ones using external remediation, and currently they never go into the Failed phase, but in the future we would like to use this phase to handle unrecoverable errors (invalid config, underlying Host was deleted, &c.)
In these cases, rebooting will not help, and the actuator does not get called to enable it to try anyway, so always remediate them by deleting the Machine.
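The decision described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code in this patch: the `Machine` struct, the `Action` constants, and `chooseRemediation` are hypothetical names invented for the example, and the fallback to deletion when external remediation is not enabled is an assumption about MHC's default behaviour.

```go
package main

import "fmt"

// Machine is a hypothetical, stripped-down stand-in for the real Machine
// resource; only the fields needed for the illustration are kept.
type Machine struct {
	Name  string
	Phase string
}

// phaseFailed mirrors the "Failed" phase discussed in this PR.
const phaseFailed = "Failed"

// Action is a hypothetical enum naming what the health check does next.
type Action string

const (
	ActionDelete   Action = "delete"
	ActionExternal Action = "external-remediation"
)

// chooseRemediation picks an action for an unhealthy Machine.
// A Failed Machine is always deleted: its actuator's Update() is never
// called again, so asking the actuator to reboot the Host cannot work.
func chooseRemediation(m Machine, externalEnabled bool) Action {
	if m.Phase == phaseFailed {
		return ActionDelete
	}
	if externalEnabled {
		return ActionExternal
	}
	// Assumption: without external remediation, the Machine is deleted.
	return ActionDelete
}

func main() {
	machines := []Machine{
		{Name: "worker-0", Phase: phaseFailed},
		{Name: "worker-1", Phase: "Running"},
	}
	for _, m := range machines {
		fmt.Printf("%s (%s): %s\n", m.Name, m.Phase, chooseRemediation(m, true))
	}
}
```

The Failed check comes first on purpose, so that enabling external remediation can never route a Failed Machine to an actuator that will not run.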