Bug 1838430: Add Machine Remediation #59

Merged

Conversation

n1r1

@n1r1 n1r1 commented Mar 22, 2020

Implementation of Machine Remediation Controller as described in metal3-io/metal3-docs#80

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 22, 2020
@openshift-ci-robot

Hi @n1r1. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mhrivnak
Member

High-level question: This project already includes a machine controller. It is fairly generic and utilizes the "actuator" to implement most of the work. You can see it being imported and used in main.go.

Is there a reason to add a second controller that watches the same resources? Or can the reconcile logic in this PR be added as a feature of that existing controller, by adding it to the actuator?

@n1r1
Author

n1r1 commented Mar 22, 2020

@mhrivnak, thanks for the comment.
Are you referring to CAPI machine controller?

It watches Machines, while this controller (MRC) watches both Machines and BareMetalHosts.

In addition, CAPI is platform-agnostic, and I assume it wouldn't be acceptable to add baremetal-specific code there (machine remediation differs considerably between cloud providers and bare metal).

Does that make sense?
Thanks

@mhrivnak
Member

The CAPI Machine controller is an odd beast. You're generally right, but keep in mind that it is extensible by importing it and passing it a customized "actuator". That is how we do baremetal-specific work in its reconcile function. It calls our actuator from its reconcile function to do things like create or update the real infrastructure that backs a Machine. Each cloud provider has its own actuator, imports the generic CAPI Machine controller, and instantiates it with its custom actuator.
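The actuator pattern described above can be sketched roughly as follows. This is a hypothetical, simplified illustration: `Machine`, `Actuator`, `genericReconcile`, and `baremetalActuator` are stand-ins for the real CAPI types (the actual v1alpha1 `Actuator` interface also takes a context and a cluster argument), not the project's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a simplified stand-in for the CAPI Machine object.
type Machine struct{ Name string }

// Actuator is the extension point: the generic machine controller calls
// these methods, and each provider (AWS, baremetal, ...) supplies its own
// implementation. Simplified from CAPI's v1alpha1 Actuator interface.
type Actuator interface {
	Create(m *Machine) error
	Update(m *Machine) error
	Delete(m *Machine) error
	Exists(m *Machine) (bool, error)
}

// genericReconcile mimics what the shared CAPI controller does, roughly:
// check existence, then create or update via the provider's actuator.
func genericReconcile(a Actuator, m *Machine) error {
	exists, err := a.Exists(m)
	if err != nil {
		return err
	}
	if !exists {
		return a.Create(m)
	}
	return a.Update(m)
}

// baremetalActuator is a toy provider implementation for illustration.
type baremetalActuator struct{ provisioned map[string]bool }

func (b *baremetalActuator) Create(m *Machine) error {
	b.provisioned[m.Name] = true
	return nil
}

func (b *baremetalActuator) Update(m *Machine) error {
	if !b.provisioned[m.Name] {
		return errors.New("not provisioned")
	}
	return nil
}

func (b *baremetalActuator) Delete(m *Machine) error {
	delete(b.provisioned, m.Name)
	return nil
}

func (b *baremetalActuator) Exists(m *Machine) (bool, error) {
	return b.provisioned[m.Name], nil
}

func main() {
	a := &baremetalActuator{provisioned: map[string]bool{}}
	m := &Machine{Name: "worker-0"}
	genericReconcile(a, m) // first pass: machine doesn't exist, so Create runs
	ok, _ := a.Exists(m)
	fmt.Println(ok) // true
}
```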

Our implementation adds a BareMetalHost watch to their controller. It took some funny coding, because the pattern with controller-runtime and kubebuilder doesn't provide a way to interact with the Controller object. We had to make a "manager-in-the-middle" Manager that intercepts the controller, adds a watch, then passes it on to the real manager. I can explain more perhaps on video chat if you are interested.
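The "manager-in-the-middle" trick can be sketched like this. The interfaces below are hypothetical, heavily simplified stand-ins for controller-runtime's `manager.Manager` and `controller.Controller` (names like `watchInjector` are invented for illustration), not the actual cluster-api-provider-baremetal code:

```go
package main

import "fmt"

// Simplified stand-ins for controller-runtime types (hypothetical).
type Runnable interface{ Start() error }

type Controller interface {
	Runnable
	Watch(resource string) error
}

type Manager interface{ Add(r Runnable) error }

// base is the real manager that actually runs controllers.
type baseManager struct{ runnables []Runnable }

func (m *baseManager) Add(r Runnable) error {
	m.runnables = append(m.runnables, r)
	return nil
}

// watchInjector sits between the generic CAPI machine controller and the
// real manager: when the controller registers itself, the injector adds
// an extra BareMetalHost watch before passing it through.
type watchInjector struct {
	inner      Manager
	extraWatch string
}

func (w *watchInjector) Add(r Runnable) error {
	if c, ok := r.(Controller); ok {
		if err := c.Watch(w.extraWatch); err != nil {
			return err
		}
	}
	return w.inner.Add(r)
}

// machineController is a toy controller for demonstration.
type machineController struct{ watches []string }

func (c *machineController) Start() error { return nil }

func (c *machineController) Watch(res string) error {
	c.watches = append(c.watches, res)
	return nil
}

func main() {
	base := &baseManager{}
	mgr := &watchInjector{inner: base, extraWatch: "BareMetalHost"}

	// The generic controller only knows it was handed "a Manager".
	ctrl := &machineController{watches: []string{"Machine"}}
	mgr.Add(ctrl)

	fmt.Println(ctrl.watches) // [Machine BareMetalHost]
}
```

The point of the indirection is that the generic controller's registration code never changes; only the manager it is handed decides whether extra watches get attached.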

n1r1 added 9 commits April 1, 2020 09:41
… node deletion

Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
…chine, resulting in reboot loops

This reverts commit 562a1b4.

Signed-off-by: Nir <niry@redhat.com>
…onflicts

Signed-off-by: Nir <niry@redhat.com>
@n1r1 n1r1 force-pushed the machine-remediation-controller branch from 3769770 to 6ae2e49 Compare April 3, 2020 06:34
@n1r1
Author

n1r1 commented Apr 3, 2020

@mhrivnak I've updated the implementation to be part of the actuator, per your suggestion.
Thanks

@n1r1
Author

n1r1 commented Apr 3, 2020

/assign @mhrivnak
/cc @beekhof

@n1r1 n1r1 changed the title Add Machine Remediation Controller Add Machine Remediation Apr 5, 2020
@beekhof beekhof left a comment

I feel like there is something we should do in Delete(), remove all the annotations maybe?

Otherwise all I could do is nit-pick the logging and tests :-)

Review comments on pkg/cloud/baremetal/actuators/machine/actuator.go and actuator_test.go (outdated, resolved).
Member

@mhrivnak mhrivnak left a comment

My comments are mostly about readability and naming. It took me a lot of head-scratching to finally understand how the logic flows, but I think some minor tweaks might make that a lot easier next time.

In addition, I think we need at least a brief summary in normal text (perhaps in the README) explaining what this feature is, how to use it, what the annotations mean, and what the workflow is step-by-step for what happens to a host, machine and node.

Review comments on pkg/cloud/baremetal/actuators/machine/actuator.go (outdated, resolved).
if baremetalhost.Annotations == nil {
	baremetalhost.Annotations = make(map[string]string)
}

baremetalhost.Annotations[rebootAnnotation] = ""
Member

I wonder if this would all be easier to follow if this annotation was called requestPowerOffAnnotation. Or if this function was called requestReboot. Either might help readability of the remediateIfNeeded function.

Author

I agree that requestPowerOffAnnotation is better than rebootAnnotation; I'll change it.
requestReboot is not a proper name for this function, since it actually requests a power-off, not a reboot.
The power-on will happen when this annotation is removed.
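The semantics described here — adding the annotation requests a power-off, and removing it triggers the power-on — could be sketched like this. All names and the annotation string value are hypothetical illustrations (the thread only establishes the Go constant name `requestPowerOffAnnotation`); `reconcilePower` stands in for whatever the host-side controller actually does:

```go
package main

import "fmt"

// Hypothetical annotation key for illustration; the PR settles only on the
// constant name requestPowerOffAnnotation, not this string value.
const requestPowerOffAnnotation = "example.io/request-power-off"

type BareMetalHost struct {
	Annotations map[string]string
	PoweredOn   bool
}

// requestPowerOff asks the host controller to power the host down.
func requestPowerOff(h *BareMetalHost) {
	if h.Annotations == nil {
		h.Annotations = make(map[string]string)
	}
	h.Annotations[requestPowerOffAnnotation] = ""
}

// requestPowerOn removes the annotation; the host controller reacts by
// powering the host back on. Note there is no single "reboot" call: the
// cycle is power-off (add annotation) then power-on (remove annotation).
func requestPowerOn(h *BareMetalHost) {
	delete(h.Annotations, requestPowerOffAnnotation)
}

// reconcilePower is roughly what the host-side controller would do:
// the desired power state is derived from the annotation's presence.
func reconcilePower(h *BareMetalHost) {
	_, wantOff := h.Annotations[requestPowerOffAnnotation]
	h.PoweredOn = !wantOff
}

func main() {
	h := &BareMetalHost{PoweredOn: true}

	requestPowerOff(h)
	reconcilePower(h)
	fmt.Println(h.PoweredOn) // false

	requestPowerOn(h)
	reconcilePower(h)
	fmt.Println(h.PoweredOn) // true
}
```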

Author

Changed to requestPowerOffAnnotation


// we need this annotation to differentiate between an unhealthy machine that
// needs remediation and an unhealthy machine that was just remediated
return a.addRemediationInProgressAnnotation(ctx, machine)
Member

I think this workflow is starting to make sense to me. Does this annotation really mean that "We requested power off, and then observed that power transitioned from on to off"? I could argue that remediation is "in progress" as soon as the power-off is requested, so if I'm understanding correctly, I think this name could throw people off (and it did throw me off). Maybe it would be easier to understand this workflow if this annotation was named something more directly representative of what it's measuring. Maybe HostPoweredOffAnnotation?

Author

If it were HostPoweredOffAnnotation, one could argue that it's not always true either, since we remove this annotation after the host is powered back on.
The main motivation for this annotation is described in the code comment (two lines before this one).
I think it's a good name, but I'm probably biased; my head has been in this code for too long, and I may be blind to such issues.
@beekhof - what do you think?


I wonder why, if we can tell that we need to add this annotation, we can't use the same conditions to tell that we have a machine in the middle of remediation elsewhere. Why do we need to store a separate piece of state information?

Author

We can tell that we need to add this annotation only while we're in a specific state. In a later state, when we encounter a powered-on host with the unhealthy annotation, we can't tell whether that's the result of a successful remediation or a genuinely unhealthy machine that still needs remediation.
In addition, we can't remove the unhealthy annotation at that point, since the MHC might annotate it again and again (the Machine is still not healthy).
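The role of the extra annotation can be sketched as a small state machine. This is a hypothetical simplification for illustration only: the annotation string values, the `host` struct, and the one-step-per-reconcile shape are invented stand-ins for the PR's actual Machine/BareMetalHost handling, and the node-deletion step is reduced to a comment:

```go
package main

import "fmt"

// Hypothetical annotation keys, for illustration only.
const (
	unhealthyAnnotation             = "example.io/unhealthy"
	remediationInProgressAnnotation = "example.io/remediation-in-progress"
	requestPowerOffAnnotation       = "example.io/request-power-off"
)

type host struct {
	annotations map[string]string
	poweredOn   bool
}

func has(h *host, key string) bool { _, ok := h.annotations[key]; return ok }

// remediateIfNeeded advances remediation one step per reconcile.
// Without remediationInProgressAnnotation, the last step would be
// ambiguous: a powered-on host carrying the unhealthy annotation could
// either be freshly unhealthy or just successfully remediated.
func remediateIfNeeded(h *host) string {
	if !has(h, unhealthyAnnotation) {
		return "healthy, nothing to do"
	}
	switch {
	case !has(h, remediationInProgressAnnotation):
		// Freshly unhealthy: request power-off and mark remediation started.
		h.annotations[requestPowerOffAnnotation] = ""
		h.annotations[remediationInProgressAnnotation] = ""
		return "power-off requested"
	case !h.poweredOn:
		// Host is down (the node would be deleted here); power it back on.
		delete(h.annotations, requestPowerOffAnnotation)
		return "power-on requested"
	default:
		// Powered on again while still marked in-progress: remediation is
		// done, so clear the marker. The unhealthy annotation itself is
		// left for the MHC to reconcile away.
		delete(h.annotations, remediationInProgressAnnotation)
		return "remediation finished"
	}
}

func main() {
	h := &host{annotations: map[string]string{unhealthyAnnotation: ""}, poweredOn: true}
	fmt.Println(remediateIfNeeded(h)) // power-off requested
	h.poweredOn = false               // host controller powers it down
	fmt.Println(remediateIfNeeded(h)) // power-on requested
	h.poweredOn = true                // host controller powers it back on
	fmt.Println(remediateIfNeeded(h)) // remediation finished
}
```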

Review comments on pkg/cloud/baremetal/actuators/machine/actuator.go (outdated, resolved).
@beekhof

beekhof commented Apr 7, 2020

Not sure if I have permission, but let's try...

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 7, 2020
n1r1 added 2 commits April 7, 2020 09:31
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
@n1r1 n1r1 requested a review from mhrivnak April 13, 2020 19:58
Member

@mhrivnak mhrivnak left a comment

Just one typo, but otherwise looking good.

Review comment on README.md (outdated, resolved).
Signed-off-by: Nir <niry@redhat.com>
@n1r1 n1r1 requested a review from mhrivnak April 13, 2020 20:19
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2020
@stbenjam
Member

/test e2e-metal-ipi

2 similar comments
@n1r1
Author

n1r1 commented Apr 14, 2020

/test e2e-metal-ipi

@n1r1
Author

n1r1 commented Apr 14, 2020

/test e2e-metal-ipi

@n1r1
Author

n1r1 commented Apr 14, 2020

@mhrivnak can we have your lgtm please?

@stbenjam
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mhrivnak, n1r1, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@beekhof

beekhof commented May 29, 2020

/retitle Bug 1838430: Add Machine Remediation

@openshift-ci-robot openshift-ci-robot changed the title Add Machine Remediation Bug 1838430: Add Machine Remediation May 29, 2020
@openshift-ci-robot

@n1r1: Bugzilla bug 1838430 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

In response to this:

Bug 1838430: Add Machine Remediation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@n1r1
Author

n1r1 commented May 31, 2020

/bugzilla refresh

@openshift-ci-robot

@n1r1: All pull requests linked via external trackers have merged. Bugzilla bug 1838430 has been moved to the MODIFIED state.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@n1r1
Author

n1r1 commented Jun 1, 2020

/cherry-pick release-4.4

@openshift-cherrypick-robot

@n1r1: only openshift org members may request cherry picks. You can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam
Member

stbenjam commented Jun 1, 2020

/cherry-pick release-4.4

@openshift-cherrypick-robot

@stbenjam: #59 failed to apply on top of branch "release-4.4":

Applying: add RBAC rules for machines
Using index info to reconstruct a base tree...
M	config/rbac/rbac_role.yaml
Falling back to patching base and 3-way merge...
Auto-merging config/rbac/rbac_role.yaml
Applying: Changed MAO unhealthy annotation to the new one and added logs
Applying: Update code to match enhancement design, without extra annotation
Applying: Add machine remediation controller to main.go
error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	cmd/manager/main.go
Falling back to patching base and 3-way merge...
Auto-merging cmd/manager/main.go
CONFLICT (content): Merge conflict in cmd/manager/main.go
Patch failed at 0005 Add machine remediation controller to main.go

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
