Bug 1838430: Add Machine Remediation #59
Conversation
Hi @n1r1. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
High-level question: This project already includes a machine controller. It is fairly generic and utilizes the "actuator" to implement most of the work. You can see it being imported and used in main.go. Is there a reason to add a second controller that watches the same resources? Or can the reconcile logic in this PR be added as a feature of that existing controller, by adding it to the actuator?
@mhrivnak, thanks for the comment. It watches for […]. In addition, CAPI is platform-agnostic, and I guess it wasn't acceptable to add baremetal-specific code there (machine remediation is quite different between cloud providers and bare metal). Does that make sense?
The CAPI Machine controller is an odd beast. You're generally right, but keep in mind that it is extensible by importing it and passing it a customized "actuator". That is how we do baremetal-specific work: it calls our actuator from its reconcile function to do things like create or update the real infrastructure that backs a Machine. Each cloud provider has its own actuator, imports the generic CAPI Machine controller, and instantiates it with that custom actuator. Our implementation adds a BareMetalHost watch to their controller. It took some funny coding, because the pattern with controller-runtime and kubebuilder doesn't provide a way to interact with the Controller object. We had to make a "manager-in-the-middle" Manager that intercepts the controller, adds a watch, then passes it on to the real manager. I can explain more perhaps on video chat if you are interested.
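For readers unfamiliar with that pattern, here is a minimal sketch of the delegation being described. The names are simplified stand-ins, not the actual machine-api/CAPI or controller-runtime APIs; only the shape of the pattern (a generic reconcile loop calling a provider-supplied actuator) is taken from the comment above.

```go
// Sketch only: Machine and Actuator stand in for the real machine-api types.
package main

import (
	"context"
	"fmt"
)

// Machine is a simplified stand-in for the CAPI/machine-api Machine resource.
type Machine struct {
	Name string
}

// Actuator is the provider-specific hook the generic Machine controller calls
// from its reconcile function.
type Actuator interface {
	Exists(ctx context.Context, m *Machine) (bool, error)
	Create(ctx context.Context, m *Machine) error
	Update(ctx context.Context, m *Machine) error
	Delete(ctx context.Context, m *Machine) error
}

// bareMetalActuator would create, update, and delete the BareMetalHost-backed
// infrastructure; here it only logs, to keep the sketch self-contained.
type bareMetalActuator struct{}

func (a *bareMetalActuator) Exists(ctx context.Context, m *Machine) (bool, error) {
	return false, nil
}
func (a *bareMetalActuator) Create(ctx context.Context, m *Machine) error {
	fmt.Println("provisioning host for", m.Name)
	return nil
}
func (a *bareMetalActuator) Update(ctx context.Context, m *Machine) error {
	fmt.Println("updating host for", m.Name)
	return nil
}
func (a *bareMetalActuator) Delete(ctx context.Context, m *Machine) error {
	fmt.Println("deprovisioning host for", m.Name)
	return nil
}

// reconcile stands in for the generic Machine controller's loop: it decides
// when to act, and the injected actuator decides how.
func reconcile(ctx context.Context, a Actuator, m *Machine) error {
	exists, err := a.Exists(ctx, m)
	if err != nil {
		return err
	}
	if !exists {
		return a.Create(ctx, m)
	}
	return a.Update(ctx, m)
}

func main() {
	_ = reconcile(context.Background(), &bareMetalActuator{}, &Machine{Name: "worker-0"})
}
```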
… node deletion Signed-off-by: Nir <niry@redhat.com>
…chine, resulting in reboot loops This reverts commit 562a1b4. Signed-off-by: Nir <niry@redhat.com>
…onflicts Signed-off-by: Nir <niry@redhat.com>
@mhrivnak I've updated the implementation to be part of the actuator, per your suggestion.
I feel like there is something we should do in Delete(): remove all the annotations, maybe?
Otherwise all I could do is nit-pick the logging and tests :-)
My comments are mostly about readability and naming. It took me a lot of head-scratching to finally understand how the logic flows, but I think some minor tweaks might make that a lot easier next time.
In addition, I think we need at least a brief summary in normal text (perhaps in the README) explaining what this feature is, how to use it, what the annotations mean, and, step by step, what happens to a host, machine, and node during the workflow.
baremetalhost.Annotations = make(map[string]string)
}

baremetalhost.Annotations[rebootAnnotation] = ""
I wonder if this would all be easier to follow if this annotation was called requestPowerOffAnnotation. Or if this function was called requestReboot. Either might help readability of the remediateIfNeeded function.
I agree that requestPowerOffAnnotation is better than rebootAnnotation; I'll change this.
requestReboot is not a proper name for this function, since it actually requests a power off, not a reboot. The power on will happen when this annotation is removed.
Changed to requestPowerOffAnnotation.
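As a side note for readers, a rough sketch of what the renamed helper could look like is below. The annotation key string, the BareMetalHost stand-in, and the helper names are illustrative assumptions; only the idea that adding the annotation requests a power-off and removing it allows the host to power back on comes from the thread above.

```go
// Sketch only: BareMetalHost stands in for the baremetal-operator type, and
// the annotation key is a placeholder, not the project's actual constant.
const requestPowerOffAnnotation = "example.metal3.io/request-power-off"

type BareMetalHost struct {
	Annotations map[string]string
}

// requestPowerOff asks for the host to be powered down by adding the
// annotation; the host stays off for as long as the annotation is present.
func requestPowerOff(host *BareMetalHost) {
	if host.Annotations == nil {
		host.Annotations = make(map[string]string)
	}
	host.Annotations[requestPowerOffAnnotation] = ""
}

// allowPowerOn removes the annotation, signalling that the host may be
// powered back on.
func allowPowerOn(host *BareMetalHost) {
	delete(host.Annotations, requestPowerOffAnnotation)
}
```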
// we need this annotation to differentiate between an unhealthy machine that
// needs remediation and an unhealthy machine that was just remediated
return a.addRemediationInProgressAnnotation(ctx, machine)
I think this workflow is starting to make sense to me. Does this annotation really mean that "We requested power off, and then observed that power transitioned from on to off"? I could argue that remediation is "in progress" as soon as the power-off is requested, so if I'm understanding correctly, I think this name could throw people off (and it did throw me off). Maybe it would be easier to understand this workflow if this annotation was named something more directly representative of what it's measuring. Maybe HostPoweredOffAnnotation?
If it were HostPoweredOffAnnotation, we could argue that it's not always true, since we remove this annotation after the host is powered on.
The main motivation for this annotation is described in the comments (two lines before this one).
I think it's a good name, but I'm probably biased; my head was in this code for too long, and I'm blind to such issues.
@beekhof, what do you think?
I wonder why, if we can tell that we need to add this annotation, we can't use the same conditions to tell that we have a machine in the middle of remediation elsewhere. Why do we need to store a separate piece of state information?
We can tell that we need to add this annotation while we're in a specific state. In a later state, when we encounter a powered-on host with the unhealthy annotation, we can't tell whether that is the result of a successful remediation or a genuinely unhealthy machine that still needs remediation.
In addition, we can't remove the unhealthy annotation at that point, as the MHC might annotate it again and again (since the Machine is still not healthy).
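To make that flow easier to follow, here is a rough sketch of the state handling being described. The field and function names are illustrative stand-ins (booleans instead of the real annotations, no API client), and the exact ordering may differ from the PR; it only tries to show why the extra "remediation in progress" marker is needed.

```go
// Sketch only: booleans stand in for the real annotations on the Machine and
// for the power state / power-off request on the BareMetalHost.
type hostState struct {
	poweredOn       bool
	powerOffRequest bool // stands in for the power-off request annotation
}

type machineState struct {
	unhealthy             bool // set by the MachineHealthCheck
	remediationInProgress bool // stands in for the remediation-in-progress annotation
}

// remediateIfNeeded walks one unhealthy machine through remediation across
// several reconcile passes.
func remediateIfNeeded(h *hostState, m *machineState, deleteNode func()) {
	if !m.unhealthy {
		return
	}
	switch {
	case !m.remediationInProgress:
		// First pass over an unhealthy machine: request power-off and record
		// that remediation started. Without this marker, a later pass could not
		// tell a just-remediated (still annotated) machine from one that still
		// needs remediation.
		h.powerOffRequest = true
		m.remediationInProgress = true
	case h.powerOffRequest && !h.poweredOn:
		// Host is down: the node can be deleted safely, and withdrawing the
		// power-off request lets the host come back up.
		deleteNode()
		h.powerOffRequest = false
	case !h.powerOffRequest && h.poweredOn:
		// Host is powered on again after remediation: clear our marker and let
		// the MachineHealthCheck decide whether the machine is healthy now.
		m.remediationInProgress = false
	default:
		// Waiting for the host to finish powering off or back on.
	}
}
```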
Not sure if I have permission, but let's try... /ok-to-test
Just one typo, but otherwise looking good.
/test e2e-metal-ipi
/test e2e-metal-ipi
/test e2e-metal-ipi
@mhrivnak can we have your lgtm please?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mhrivnak, n1r1, stbenjam. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retitle Bug 1838430: Add Machine Remediation
@n1r1: Bugzilla bug 1838430 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.
/bugzilla refresh
@n1r1: All pull requests linked via external trackers have merged. Bugzilla bug 1838430 has been moved to the MODIFIED state.
/cherry-pick release-4.4
@n1r1: only openshift org members may request cherry picks. You can still do the cherry-pick manually.
/cherry-pick release-4.4
@stbenjam: #59 failed to apply on top of branch "release-4.4":
Implementation of the Machine Remediation Controller, as described in metal3-io/metal3-docs#80.