Bug 1838430: Add Machine Remediation #59

Merged

Conversation

n1r1

@n1r1 n1r1 commented Mar 22, 2020

Implementation of Machine Remediation Controller as described in metal3-io/metal3-docs#80

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 22, 2020
@openshift-ci-robot

Hi @n1r1. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mhrivnak
Member

High-level question: This project already includes a machine controller. It is fairly generic and utilizes the "actuator" to implement most of the work. You can see it being imported and used in main.go.

Is there a reason to add a second controller that watches the same resources? Or can the reconcile logic in this PR be added as a feature of that existing controller, by adding it to the actuator?

@n1r1
Author

n1r1 commented Mar 22, 2020

@mhrivnak, thanks for the comment.
Are you referring to CAPI machine controller?

It watches Machines, while this controller (MRC) watches both Machines and BareMetalHosts.

In addition, CAPI is platform-agnostic, and I assume it wouldn't be acceptable to add baremetal-specific code there (machine remediation differs considerably between cloud providers and bare metal).

Does that make sense?
Thanks

@mhrivnak
Member

The CAPI Machine controller is an odd beast. You're generally right, but keep in mind that it is extensible by importing it and passing it a customized "actuator". That is how we do baremetal-specific work in its reconcile function. It calls our actuator from its reconcile function to do things like create or update the real infrastructure that backs a Machine. Each cloud provider has its own actuator, imports the generic CAPI Machine controller, and instantiates it with its custom actuator.
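The actuator pattern described above can be sketched roughly as follows. This is a hypothetical, simplified illustration: `Machine`, `Actuator`, `genericReconcile`, and `baremetalActuator` are stand-ins for the real CAPI types (the actual v1alpha1 `Actuator` interface also takes a context and a cluster argument), not the project's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a simplified stand-in for the CAPI Machine object.
type Machine struct{ Name string }

// Actuator is the extension point: the generic machine controller calls
// these methods, and each provider (AWS, baremetal, ...) supplies its own
// implementation. Simplified from CAPI's v1alpha1 Actuator interface.
type Actuator interface {
	Create(m *Machine) error
	Update(m *Machine) error
	Delete(m *Machine) error
	Exists(m *Machine) (bool, error)
}

// genericReconcile mimics what the shared CAPI controller does, roughly:
// check existence, then create or update via the provider's actuator.
func genericReconcile(a Actuator, m *Machine) error {
	exists, err := a.Exists(m)
	if err != nil {
		return err
	}
	if !exists {
		return a.Create(m)
	}
	return a.Update(m)
}

// baremetalActuator is a toy provider implementation for illustration.
type baremetalActuator struct{ provisioned map[string]bool }

func (b *baremetalActuator) Create(m *Machine) error {
	b.provisioned[m.Name] = true
	return nil
}

func (b *baremetalActuator) Update(m *Machine) error {
	if !b.provisioned[m.Name] {
		return errors.New("not provisioned")
	}
	return nil
}

func (b *baremetalActuator) Delete(m *Machine) error {
	delete(b.provisioned, m.Name)
	return nil
}

func (b *baremetalActuator) Exists(m *Machine) (bool, error) {
	return b.provisioned[m.Name], nil
}

func main() {
	a := &baremetalActuator{provisioned: map[string]bool{}}
	m := &Machine{Name: "worker-0"}
	genericReconcile(a, m) // first pass: machine doesn't exist, so Create runs
	ok, _ := a.Exists(m)
	fmt.Println(ok) // true
}
```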

Our implementation adds a BareMetalHost watch to their controller. It took some funny coding, because the pattern with controller-runtime and kubebuilder doesn't provide a way to interact with the Controller object. We had to make a "manager-in-the-middle" Manager that intercepts the controller, adds a watch, then passes it on to the real manager. I can explain more perhaps on video chat if you are interested.
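The "manager-in-the-middle" trick can be sketched like this. The interfaces below are hypothetical, heavily simplified stand-ins for controller-runtime's `manager.Manager` and `controller.Controller` (names like `watchInjector` are invented for illustration), not the actual cluster-api-provider-baremetal code:

```go
package main

import "fmt"

// Simplified stand-ins for controller-runtime types (hypothetical).
type Runnable interface{ Start() error }

type Controller interface {
	Runnable
	Watch(resource string) error
}

type Manager interface{ Add(r Runnable) error }

// base is the real manager that actually runs controllers.
type baseManager struct{ runnables []Runnable }

func (m *baseManager) Add(r Runnable) error {
	m.runnables = append(m.runnables, r)
	return nil
}

// watchInjector sits between the generic CAPI machine controller and the
// real manager: when the controller registers itself, the injector adds
// an extra BareMetalHost watch before passing it through.
type watchInjector struct {
	inner      Manager
	extraWatch string
}

func (w *watchInjector) Add(r Runnable) error {
	if c, ok := r.(Controller); ok {
		if err := c.Watch(w.extraWatch); err != nil {
			return err
		}
	}
	return w.inner.Add(r)
}

// machineController is a toy controller for demonstration.
type machineController struct{ watches []string }

func (c *machineController) Start() error { return nil }

func (c *machineController) Watch(res string) error {
	c.watches = append(c.watches, res)
	return nil
}

func main() {
	base := &baseManager{}
	mgr := &watchInjector{inner: base, extraWatch: "BareMetalHost"}

	// The generic controller only knows it was handed "a Manager".
	ctrl := &machineController{watches: []string{"Machine"}}
	mgr.Add(ctrl)

	fmt.Println(ctrl.watches) // [Machine BareMetalHost]
}
```

The point of the indirection is that the generic controller's registration code never changes; only the manager it is handed decides whether extra watches get attached.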

n1r1 added 9 commits April 1, 2020 09:41
… node deletion

Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
…chine, resulting in reboot loops

This reverts commit 562a1b4.

Signed-off-by: Nir <niry@redhat.com>
…onflicts

Signed-off-by: Nir <niry@redhat.com>
@n1r1 n1r1 force-pushed the machine-remediation-controller branch from 3769770 to 6ae2e49 Compare April 3, 2020 06:34
@n1r1
Author

n1r1 commented Apr 3, 2020

@mhrivnak I've updated the implementation to be part of the actuator, per your suggestion.
Thanks

@n1r1
Author

n1r1 commented Apr 3, 2020

/assign @mhrivnak
/cc @beekhof

@n1r1 n1r1 changed the title Add Machine Remediation Controller Add Machine Remediation Apr 5, 2020
@beekhof beekhof left a comment

I feel like there is something we should do in Delete(), remove all the annotations maybe?

Otherwise all I could do is nit-pick the logging and tests :-)

Review comments on pkg/cloud/baremetal/actuators/machine/actuator.go and actuator_test.go (outdated, resolved).
Member

@mhrivnak mhrivnak left a comment

My comments are mostly about readability and naming. It took me a lot of head-scratching to finally understand how the logic flows, but I think some minor tweaks might make that a lot easier next time.

In addition, I think we need at least a brief summary in normal text (perhaps in the README) explaining what this feature is, how to use it, what the annotations mean, and what the workflow is step-by-step for what happens to a host, machine and node.

Review comments on pkg/cloud/baremetal/actuators/machine/actuator.go (outdated, resolved).
if baremetalhost.Annotations == nil {
	baremetalhost.Annotations = make(map[string]string)
}

baremetalhost.Annotations[rebootAnnotation] = ""
Member

I wonder if this would all be easier to follow if this annotation was called requestPowerOffAnnotation. Or if this function was called requestReboot. Either might help readability of the remediateIfNeeded function.

Author

I agree that requestPowerOffAnnotation is better than rebootAnnotation; I'll change it.
requestReboot is not a proper name for this function, since it actually requests a power-off, not a reboot.
The power-on will happen when this annotation is removed.
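The semantics described here — adding the annotation requests a power-off, and removing it triggers the power-on — could be sketched like this. All names and the annotation string value are hypothetical illustrations (the thread only establishes the Go constant name `requestPowerOffAnnotation`); `reconcilePower` stands in for whatever the host-side controller actually does:

```go
package main

import "fmt"

// Hypothetical annotation key for illustration; the PR settles only on the
// constant name requestPowerOffAnnotation, not this string value.
const requestPowerOffAnnotation = "example.io/request-power-off"

type BareMetalHost struct {
	Annotations map[string]string
	PoweredOn   bool
}

// requestPowerOff asks the host controller to power the host down.
func requestPowerOff(h *BareMetalHost) {
	if h.Annotations == nil {
		h.Annotations = make(map[string]string)
	}
	h.Annotations[requestPowerOffAnnotation] = ""
}

// requestPowerOn removes the annotation; the host controller reacts by
// powering the host back on. Note there is no single "reboot" call: the
// cycle is power-off (add annotation) then power-on (remove annotation).
func requestPowerOn(h *BareMetalHost) {
	delete(h.Annotations, requestPowerOffAnnotation)
}

// reconcilePower is roughly what the host-side controller would do:
// the desired power state is derived from the annotation's presence.
func reconcilePower(h *BareMetalHost) {
	_, wantOff := h.Annotations[requestPowerOffAnnotation]
	h.PoweredOn = !wantOff
}

func main() {
	h := &BareMetalHost{PoweredOn: true}

	requestPowerOff(h)
	reconcilePower(h)
	fmt.Println(h.PoweredOn) // false

	requestPowerOn(h)
	reconcilePower(h)
	fmt.Println(h.PoweredOn) // true
}
```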

Author

Changed to requestPowerOffAnnotation


// we need this annotation to differentiate between an unhealthy machine that
// needs remediation and an unhealthy machine that was just remediated
return a.addRemediationInProgressAnnotation(ctx, machine)
Member

I think this workflow is starting to make sense to me. Does this annotation really mean that "We requested power off, and then observed that power transitioned from on to off"? I could argue that remediation is "in progress" as soon as the power-off is requested, so if I'm understanding correctly, I think this name could throw people off (and it did throw me off). Maybe it would be easier to understand this workflow if this annotation was named something more directly representative of what it's measuring. Maybe HostPoweredOffAnnotation?

Author

If it were HostPoweredOffAnnotation, one could argue that it's not always true either, since we remove this annotation after the host is powered back on.
The main motivation for this annotation is described in the code comment (two lines before this one).
I think it's a good name, but I'm probably biased; my head has been in this code for too long, and I may be blind to such issues.
@beekhof - what do you think?


I wonder why, if we can tell that we need to add this annotation, we can't use the same conditions to tell that we have a machine in the middle of remediation elsewhere. Why do we need to store a separate piece of state information?

Author

We can tell that we need to add this annotation only while we're in a specific state. In a later state, when we encounter a powered-on host with the unhealthy annotation, we can't tell whether that's the result of a successful remediation or a genuinely unhealthy machine that still needs remediation.
In addition, we can't remove the unhealthy annotation at that point, since the MHC might annotate it again and again (the Machine is still not healthy).
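The role of the extra annotation can be sketched as a small state machine. This is a hypothetical simplification for illustration only: the annotation string values, the `host` struct, and the one-step-per-reconcile shape are invented stand-ins for the PR's actual Machine/BareMetalHost handling, and the node-deletion step is reduced to a comment:

```go
package main

import "fmt"

// Hypothetical annotation keys, for illustration only.
const (
	unhealthyAnnotation             = "example.io/unhealthy"
	remediationInProgressAnnotation = "example.io/remediation-in-progress"
	requestPowerOffAnnotation       = "example.io/request-power-off"
)

type host struct {
	annotations map[string]string
	poweredOn   bool
}

func has(h *host, key string) bool { _, ok := h.annotations[key]; return ok }

// remediateIfNeeded advances remediation one step per reconcile.
// Without remediationInProgressAnnotation, the last step would be
// ambiguous: a powered-on host carrying the unhealthy annotation could
// either be freshly unhealthy or just successfully remediated.
func remediateIfNeeded(h *host) string {
	if !has(h, unhealthyAnnotation) {
		return "healthy, nothing to do"
	}
	switch {
	case !has(h, remediationInProgressAnnotation):
		// Freshly unhealthy: request power-off and mark remediation started.
		h.annotations[requestPowerOffAnnotation] = ""
		h.annotations[remediationInProgressAnnotation] = ""
		return "power-off requested"
	case !h.poweredOn:
		// Host is down (the node would be deleted here); power it back on.
		delete(h.annotations, requestPowerOffAnnotation)
		return "power-on requested"
	default:
		// Powered on again while still marked in-progress: remediation is
		// done, so clear the marker. The unhealthy annotation itself is
		// left for the MHC to reconcile away.
		delete(h.annotations, remediationInProgressAnnotation)
		return "remediation finished"
	}
}

func main() {
	h := &host{annotations: map[string]string{unhealthyAnnotation: ""}, poweredOn: true}
	fmt.Println(remediateIfNeeded(h)) // power-off requested
	h.poweredOn = false               // host controller powers it down
	fmt.Println(remediateIfNeeded(h)) // power-on requested
	h.poweredOn = true                // host controller powers it back on
	fmt.Println(remediateIfNeeded(h)) // remediation finished
}
```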

Review comments on pkg/cloud/baremetal/actuators/machine/actuator.go (outdated, resolved).
@beekhof

beekhof commented Apr 7, 2020

Not sure if I have permission, but let's try...

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 7, 2020
n1r1 added 2 commits April 7, 2020 09:31
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
Signed-off-by: Nir <niry@redhat.com>
@n1r1 n1r1 requested a review from mhrivnak April 13, 2020 19:58
Member

@mhrivnak mhrivnak left a comment

Just one typo, but otherwise looking good.

Review comment on README.md (outdated, resolved).
Signed-off-by: Nir <niry@redhat.com>
@n1r1 n1r1 requested a review from mhrivnak April 13, 2020 20:19
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2020
@stbenjam
Member

/test e2e-metal-ipi

2 similar comments
@n1r1
Author

n1r1 commented Apr 14, 2020

/test e2e-metal-ipi

@n1r1
Author

n1r1 commented Apr 14, 2020

/test e2e-metal-ipi

@n1r1
Author

n1r1 commented Apr 14, 2020

@mhrivnak can we have your lgtm please?

@stbenjam
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mhrivnak, n1r1, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@beekhof

beekhof commented May 29, 2020

/retitle Bug 1838430: Add Machine Remediation

@openshift-ci-robot openshift-ci-robot changed the title Add Machine Remediation Bug 1838430: Add Machine Remediation May 29, 2020
@openshift-ci-robot

@n1r1: Bugzilla bug 1838430 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

In response to this:

Bug 1838430: Add Machine Remediation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@n1r1
Author

n1r1 commented May 31, 2020

/bugzilla refresh

@openshift-ci-robot

@n1r1: All pull requests linked via external trackers have merged. Bugzilla bug 1838430 has been moved to the MODIFIED state.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@n1r1
Author

n1r1 commented Jun 1, 2020

/cherry-pick release-4.4

@openshift-cherrypick-robot

@n1r1: only openshift org members may request cherry picks. You can still do the cherry-pick manually.

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam
Member

stbenjam commented Jun 1, 2020

/cherry-pick release-4.4

@openshift-cherrypick-robot

@stbenjam: #59 failed to apply on top of branch "release-4.4":

Applying: add RBAC rules for machines
Using index info to reconstruct a base tree...
M	config/rbac/rbac_role.yaml
Falling back to patching base and 3-way merge...
Auto-merging config/rbac/rbac_role.yaml
Applying: Changed MAO unhealthy annotation to the new one and added logs
Applying: Update code to match enhancement design, without extra annotation
Applying: Add machine remediation controller to main.go
error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	cmd/manager/main.go
Falling back to patching base and 3-way merge...
Auto-merging cmd/manager/main.go
CONFLICT (content): Merge conflict in cmd/manager/main.go
Patch failed at 0005 Add machine remediation controller to main.go

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
