Propose to backport the "external remediation template" feature #551

Merged

openshift-merge-robot merged 6 commits into openshift:master from slintes:external-remediation-template

Apr 19, 2021

Member

slintes commented Nov 30, 2020

With this enhancement we propose to backport the "external remediation template" feature.

See


          Added external remediation template proposal

625b35d

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

openshift-ci-robot requested review from jwforres and sttts

November 30, 2020 18:43

Member Author

slintes commented Nov 30, 2020

/cc @beekhof @n1r1

a 1st round of review is appreciated before spreading this, thanks!

openshift-ci-robot requested review from beekhof and n1r1

November 30, 2020 18:43

jwforres removed their request for review

November 30, 2020 20:49

beekhof reviewed

enhancements/baremetal/external-remediations.md Outdated

+              ##### Removing a deprecated feature
+              - The annotation based external remediation needs to be deprecated
+              - Open question: for how long do we need to support both mechanisms in parallel (if at all)?

Contributor

beekhof Dec 1, 2020

The annotation could just be a syntactic shortcut for an equivalent externalRemediationTemplate if no other one is provided.
Wouldn't be too burdensome to support.
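
A minimal sketch of the fallback suggested here, assuming a hypothetical annotation key and a hypothetical default template reference (neither name is taken from this thread). An explicit `externalRemediationTemplate` wins, and the annotation is translated into an equivalent template reference only when no template is set.

```python
from typing import Optional

# Hypothetical names for illustration only; the real annotation key and
# template reference are not specified in this discussion.
LEGACY_ANNOTATION = "example.io/remediation-strategy"
LEGACY_TEMPLATE_REF = {
    "apiVersion": "example.io/v1alpha1",
    "kind": "ExampleRemediationTemplate",
    "name": "legacy-annotation-equivalent",
}


def effective_template(mhc: dict) -> Optional[dict]:
    """Return the template reference to use, or None for default remediation."""
    explicit = mhc.get("spec", {}).get("externalRemediationTemplate")
    if explicit is not None:
        # An explicitly configured template always takes precedence.
        return explicit
    annotations = mhc.get("metadata", {}).get("annotations") or {}
    if LEGACY_ANNOTATION in annotations:
        # Treat the annotation as shorthand for an equivalent template reference.
        ref = dict(LEGACY_TEMPLATE_REF)
        ref["namespace"] = mhc.get("metadata", {}).get("namespace", "default")
        return ref
    return None
```

With this shape, existing annotated MachineHealthChecks would keep working unchanged, while new ones could set the field directly.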

beekhof reviewed

enhancements/baremetal/external-remediations.md Outdated


		### Upgrade / Downgrade Strategy

		- Open question: do we need an automatic MHC conversion from the existing annotation based mechanism to the new one?

Contributor

beekhof Dec 1, 2020

Yes. Fencing must not break due to an upgrade

n1r1 Dec 1, 2020

same goes for downgrade?
i.e. should we convert specific external remediation template to annotation in existing MHC on downgrade? is this possible?

Member Author

slintes Dec 1, 2020

I'm wondering when and how a downgrade will ever happen...?

n1r1 Dec 1, 2020

The enhancement template contains "Downgrade Strategy" and I remember Clayton saying this is an important one and a core platform requirement, so I guess this is a supported option.

as for "when", maybe to roll back the version if you're having issues with the new one.
as for "how", no idea :)

n1r1 reviewed

enhancements/baremetal/external-remediations.md Outdated

+              create a new one. This isn't the best remediation strategy in all environments.
+              There is already a mechanism to provide an alternative, external remediation strategy, by adding an annotation to the
+              `MachineHealthCheck` and then to `Machine`s. However, this isn't very maintainable.

n1r1 Dec 1, 2020

I suggest elaborating more on the downsides of using an annotation instead of a CR.

enhancements/baremetal/external-remediations.md Outdated


		### User Stories

		#### Story 1

n1r1 Dec 1, 2020

Maybe add a story for non-BM case?

n1r1 commented Dec 1, 2020

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

slintes and others added 2 commits

December 1, 2020 13:06


          Update enhancements/baremetal/external-remediations.md

7ca934a

Co-authored-by: Andrew Beekhof <andrew@beekhof.net>


          Update enhancements/baremetal/external-remediations.md

c5b121e

Co-authored-by: Andrew Beekhof <andrew@beekhof.net>

Member Author

slintes commented Dec 1, 2020

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

Interesting idea, I guess that would be a follow up though?
Are there similar plans upstream already?


          Moved to machine-api and adressed feedback

99d7e56

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

n1r1 commented Dec 1, 2020

Forward looking, maybe we'll want all remediation strategies (including the default one) to rely on a CR.
This will allow separation of detection (MHC) and remediation.
So if a user didn't specify externalRemediationTemplate, MHC will create a CR that the default remediation controller will consume.

Interesting idea, I guess that would be a follow up though?

yeah. just something to keep in mind.

Are there similar plans upstream already?

I remember we discussed this upstream, but I'm not aware of a concrete plan to do this.


          Added approvers

8976c2d

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

Member Author

slintes commented Dec 2, 2020

/cc @JoelSpeed @michaelgugino @enxebre

Hi, it was suggested to add you as approvers to this. Do you mind giving a review? Thanks!

openshift-ci-robot requested review from enxebre, JoelSpeed and michaelgugino

December 2, 2020 17:18

openshift-bot commented Mar 2, 2021

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label

Contributor

JoelSpeed commented Mar 3, 2021

/remove-lifecycle stale.

Contributor

elmiko commented Mar 30, 2021

this reads well and the implementation generally makes sense to me. i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)?

i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?

Contributor

elmiko commented Mar 30, 2021

/remove-lifecycle stale

openshift-ci-robot removed the lifecycle/stale label

Member Author

slintes commented Apr 1, 2021

n1r1 commented Apr 1, 2021

i do have a question about the interaction, or lack thereof, between the MHC and ERC. is there any consideration about the notion that the MHC could create an EMR which never gets reconciled (maybe the ERC is down or something)?

i'm just curious if we would want the MHC to create an alert if an EMR hasn't been removed in like 24-48 hours?

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

Creating an alert makes sense to me. No matter if it's an ERC that is down or a machine that couldn't be remediated.

Contributor

elmiko commented Apr 1, 2021

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

that makes sense to me, this is why i was thinking a really long timer on the alert.
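
A minimal sketch of the check being discussed, assuming an EMR becomes alert-worthy once its `metadata.creationTimestamp` is older than a configurable threshold. The 48-hour default mirrors the upper end of the range floated above; the function name and the plain dictionaries standing in for API objects are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_emrs(emrs, now, max_age=timedelta(hours=48)):
    """Return the names of EMRs that have existed longer than max_age."""
    stale = []
    for emr in emrs:
        # Kubernetes serializes timestamps as RFC 3339 in UTC,
        # for example "2021-04-01T12:00:00Z".
        created = datetime.strptime(
            emr["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(emr["metadata"]["name"])
    return stale
```

Because a long-lived EMR can be legitimate (a failed remediation or a backoff), a check like this would only raise an alert and leave the EMR in place.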

Contributor

mshitrit commented Apr 2, 2021

Adding to this - it's a valid case to keep the EMR. e.g. if the ERC has failed to remediate or if it is doing some backoff.

I agree as well - IMO this will be a good improvement to this feature.
However I don't think the lack of it should block us from merging to the current release.
/cc @beekhof

openshift-ci-robot requested a review from beekhof

April 2, 2021 05:01

JoelSpeed reviewed

enhancements/machine-api/external-remediations.md Outdated

Comment on lines 40 to 44

+              This proposal is a backport of parts of the upstream machine healthcheck proposal [0], which
+              also is already implemented [1].
+              - [0] https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20191030-machine-health-checking.md
+              - [1] https://github.com/kubernetes-sigs/cluster-api/pull/3606

Contributor

JoelSpeed Apr 7, 2021

Nit, any reason not to inline these links?

enhancements/machine-api/external-remediations.md Outdated


		## Proposal

		We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to

Contributor

JoelSpeed Apr 7, 2021

Nit, I think this would be better if it were slightly more specific

Suggested change

-We propose modifying the MachineHealthCheck CRD to support a externalRemediationTemplate, an ObjectReference to
+We propose modifying the MachineHealthCheck CRD to add a new field, `externalRemediationTemplate`, an ObjectReference to

enhancements/machine-api/external-remediations.md

Comment on lines +78 to +79

		As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect
		non-transient issues faster.

Contributor

JoelSpeed Apr 7, 2021

Not sure this really makes all that much sense. Doesn't power cycling effectively reset the node and prevent you from diagnosing the error? I don't see how this proposal helps detect the issues faster.

n1r1 Apr 7, 2021

If automatic power-cycles don't resolve the issue it helps you to rule out transient issues like software bugs, etc.

If an admin didn't have these automatic power-cycles, they might have to reboot the node first to see whether the problem persists.
Once they have the automatic reboots, they can skip that stage.

Perhaps we need to rephrase this.

Contributor

mshitrit Apr 8, 2021

Thanks, I've rephrased 👍

enhancements/machine-api/external-remediations.md Outdated

Comment on lines 83 to 84

		As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes,
		so that they are automatically added back to the cluster when I fix the underlying problem.

Contributor

JoelSpeed Apr 7, 2021

Does attempting to power cycle while you are remediating the issue not actually make this problem worse? This sounds undesirable to me, if I'm working trying to fix a hardware issue, I don't want the machine to magically come back on mid way through the hardware change.

Perhaps this story can be clarified a bit. I'm not huge on baremetal these days, so I assume there are some nuances I'm not seeing here.

n1r1 Apr 7, 2021

I believe the intention here is to cover external issues, such as network problems (e.g. a host that can't reach the api-server).
TBH I'd expect the system to be able to recover by itself in such cases even without a power-cycle, so maybe this user story is not very compelling.

Contributor

JoelSpeed Apr 14, 2021

@mshitrit Do you have any thoughts on this one?

Contributor

mshitrit Apr 18, 2021

I agree - removed

enhancements/machine-api/external-remediations.md Outdated

Comment on lines 98 to 102

+              When a Machine enters an unhealthy state, the MHC will:
+              * Look up the referenced template
+              * Instantiate the template (for simplicity, we will refer to this as a External Machine Remediation CR, or EMR)
+              * Force the name and namespace to match the unhealthy Machine
+              * Save the new object in etcd

Contributor

JoelSpeed Apr 7, 2021

This seems to duplicate what is said in the paragraphs above, do we need it twice?
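
The steps quoted in this thread can be sketched roughly as follows. This is a simplified illustration with plain dictionaries standing in for API objects. The `spec.template` layout, the rule of stripping a `Template` suffix from the kind, and the in-memory `store` that replaces the API server are assumptions made for the example, not details confirmed in this conversation.

```python
import copy


def instantiate_emr(template: dict, machine: dict) -> dict:
    """Build an EMR object from a template for one unhealthy Machine."""
    # Assumed layout: the template carries the EMR body under spec.template.
    emr = copy.deepcopy(template["spec"]["template"])
    emr["apiVersion"] = template["apiVersion"]
    # Assumed convention: "FooRemediationTemplate" yields kind "FooRemediation".
    kind = template["kind"]
    emr["kind"] = kind[: -len("Template")] if kind.endswith("Template") else kind
    # Force name and namespace to match the unhealthy Machine.
    meta = emr.setdefault("metadata", {})
    meta["name"] = machine["metadata"]["name"]
    meta["namespace"] = machine["metadata"]["namespace"]
    return emr


def remediate(machine: dict, templates: dict, template_ref: dict, store: dict) -> dict:
    """Look up the referenced template, instantiate it, and save the result."""
    template = templates[(template_ref["namespace"], template_ref["name"])]
    emr = instantiate_emr(template, machine)
    key = (emr["kind"], emr["metadata"]["namespace"], emr["metadata"]["name"])
    # setdefault keeps an existing EMR, so repeated calls are idempotent.
    return store.setdefault(key, emr)
```

Matching the EMR's name and namespace to the Machine means each unhealthy Machine maps to at most one EMR per kind, which is what makes the repeated call in the sketch harmless.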

enhancements/machine-api/external-remediations.md Outdated

Comment on lines 254 to 260

+              ## Infrastructure Needed [optional]
+              Use this section if you need things from the project. Examples include a new
+              subproject, repos requested, github details, and/or testing infrastructure.
+              Listing these here allows the community to get the process for these resources
+              started right away.

Contributor

JoelSpeed Apr 7, 2021

I think we can drop this heading

mshitrit force-pushed the external-remediation-template branch from 7050b35 to 34e3d68 Compare

April 8, 2021 06:23

JoelSpeed reviewed

Contributor

JoelSpeed left a comment

@mshitrit I'm pretty happy to give my approval, just wanted your input on one thread before we do, seems maybe a redundant user story?


          - Inlining Links

817c3d9

- Improve phrasing
- Remove redundant parts
- Remove trailing spaces

Signed-off-by: Michael Shitrit <mshitrit@redhat.com>

mshitrit force-pushed the external-remediation-template branch from 34e3d68 to 817c3d9 Compare

April 18, 2021 06:26

Contributor

JoelSpeed commented Apr 19, 2021

/approve

openshift-ci-robot commented Apr 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label

Contributor

mshitrit commented Apr 19, 2021

/lgtm

openshift-ci-robot assigned mshitrit

openshift-ci-robot added the lgtm label

openshift-merge-robot merged commit a658e5a into openshift:master

mshitrit mentioned this pull request

WIP ✨ Alert old emr kubernetes-sigs/cluster-api#4571

Closed

Reviewers

JoelSpeed left review comments

n1r1 left review comments

mshitrit left review comments

sttts: awaiting requested review

enxebre: awaiting requested review

michaelgugino: awaiting requested review

beekhof: awaiting requested review