Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow external remediation even if there's no controller owner #581

Merged
merged 1 commit into from Jun 15, 2020

Conversation

n1r1
Copy link
Contributor

@n1r1 n1r1 commented May 4, 2020

We would like to have external remediation for Machine without controller owner, to allow masters remediation.

Signed-off-by: Nir niry@redhat.com

@openshift-ci-robot
Copy link
Contributor

Hi @n1r1. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 4, 2020
@n1r1
Copy link
Contributor Author

n1r1 commented May 4, 2020

cc @JoelSpeed @beekhof

@JoelSpeed
Copy link
Contributor

@n1r1 Does your external remediation only reboot machines presently or does it ever delete the Machine?

@n1r1
Copy link
Contributor Author

n1r1 commented May 4, 2020

@JoelSpeed Yep. Only reboots. Never deletes.

@JoelSpeed
Copy link
Contributor

Ok, in that case, I think this is safe to do, cc @enxebre do you see any scenarios that this might cause issues?

@markmc
Copy link
Contributor

markmc commented May 4, 2020

We would like to have external remediation for Machine without controller owner, to allow masters remediation.

I assume there's a whole lot of background context which you understand about this change that would be super helpful to share (in your commit message) for anyone trying to understand this change in the future.

For example:

  • Why masters don't have a controller owner? (not part of a machine set?)
  • What is external remediation? (a way to signal another controller to - for example - reboot the machine, instead of deleting it)
  • Why is default/delete remediation not suitable for masters? (not part of machine set, so won't be replaced?)
  • Why was external remediation not suitable for masters before this change, but is ok now?? (reboot wasn't safe? helped by etcd operator?)

I'm not sure I have all the details correct, but if I do, then it could be summarized as simply as this:

The external-baremetal remediation strategy is a way for the machine healthcheck controller to signal to an external controller that the machine should be rebooted, as an alternative to deletion based remediation for bare metal machines.

Master machines are not part of a machine set (because they are individually created by the installer), and so are not suitable for deletion based remediation because they will not be automatically recreated.

Previously, masters were also not flagged for rebooting by machine healthcheck (because of an oversight? or because it wasn't safe before?), but rebooting masters works just fine so this can be enabled now.

HTH. Thanks.

@markmc
Copy link
Contributor

markmc commented May 4, 2020

We would like to have external remediation for Machine without controller owner, to allow masters remediation.

I assume there's a whole lot of background context which you understand about this change that would be super helpful to share (in your commit message)

A reference to #543 would help greatly!

@n1r1
Copy link
Contributor Author

n1r1 commented May 5, 2020

Thanks for the feedback @markmc

It was always wanted that MHC will have a minimal knowledge on external remediation and from MHC's perspective, all it does is to annotate unhealthy Machine with machine.openshift.io/remediation-strategy: external-baremetal.
I would love to see this changes some day such that MHC treats baremetal remediation as first class citizen.
We're working to change that upstream in CAPI.
When an admin creates MHC CR, he can decide if he wants external remediation or default remediation (Machine deletion).
This PR intention is just to trigger external remediation, without any conditions/limitations. We'd like that the external remediator will have it's own set of rules to decide if and how to remediate.
This will not be possible if MHC won't annotate the Machine if it has no controller owner.

Why masters don't have a controller owner? (not part of a machine set?)

To my understanding, your explanation is correct ("Master machines are not part of a machine set... ")

What is external remediation? (a way to signal another controller to - for example - reboot the machine, instead of deleting it)

This is an existing feature in MHC, and is part of the MHC proposal , note that the annotation has changed since then in this PR.

As you can see in the proposal, the interface is quite simple (MHC sets annotation and forgets about it). If you're interested in more details on how we actually use it you can take a look at openshift/cluster-api-provider-baremetal#59.
In short, baremetal Machine deletion means reprovision of the node, which takes a lot of time (as opposed to cloud environment). Hence, we'd like to power-cycle the server, which is quicker and might resolve the health issue (let me know if you want me to share more details on this)

Why is default/delete remediation not suitable for masters? (not part of machine set, so won't be replaced?)

If this question refers to cloud environment, this is out my scope and I guess that others can answer this better.
If this refers to external baremetal remediation, then delete is currently not something we do. Note that this PR only let the external remediation controller to do whatever it wants to.

Why was external remediation not suitable for masters before this change, but is ok now?? (reboot wasn't safe? helped by etcd operator?)

Answering from external remediation perspective only - we just missed that this condition exists and could prevent external remediation for masters.

The external-baremetal remediation strategy is a way for the machine healthcheck controller to signal to an external controller that the machine should be rebooted, as an alternative to deletion based remediation for bare metal machines.

I'd prefer to avoid assuming that external remediation is a reboot. External remediation strategy is just letting someone else do the remediation as he thinks fits. I think I covered the reasons for that above.

Previously, masters were also not flagged for rebooting by machine healthcheck (because of an oversight? or because it wasn't safe before?), but rebooting masters works just fine so this can be enabled now.

While this is correct, I still think that it's out of MHC's scope. MHC currently has two roles - health checks, and remediation (Machine deletion). It makes sense that MHC can decide which Machines to remediate and how, but if it's external remdiation, it just needs to singal that it's unhealthy and let the external remediator decide what and how to remediate.

I hope this makes sense and clarifies the PR intention.
Let me know if you want me to share more details.

Thanks.

@markmc
Copy link
Contributor

markmc commented May 5, 2020

I hope this makes sense and clarifies the PR intention.
Let me know if you want me to share more details.

I would like you to condense this background into a summary in the commit message - imagine you are someone that is familiar with the machine-api-operator codebase, but not closely following the reboot remediation topic, and you looking at the output of git log ... what would you like to see as an explanation for why this change was made?

(I'd like to see something like the 3 short sentences I drafted and a reference back to #543 or commit f5099cb)

@n1r1
Copy link
Contributor Author

n1r1 commented May 5, 2020

@markmc, I'm not sure that #543 is strongly related to this.
Even before #543 we had the same problem (no external remediation if t.hasMachineSetOwner() == false ).

If a machine is unhealthy, and external remediation is configured, it should be triggered. That's all this PR is trying to do. As I see it, it's a bug fix.

I'll amend the commit message to include the reasons for the change.

@markmc
Copy link
Contributor

markmc commented May 5, 2020

@markmc, I'm not sure that #543 is strongly related to this.
Even before #543 we had the same problem (no external remediation if t.hasMachineSetOwner() == false ).

If a machine is unhealthy, and external remediation is configured, it should be triggered. That's all this PR is trying to do. As I see it, it's a bug fix.

Does it relate to bug #1816398 ? (AFAICT it is closely related)

The feedback is simple, and applies to all commits/PRs - small scraps of information (like a link to related PRs or bzs or a commit id) can be hugely helpful to anyone in future trying to understand why this change was made.

@enxebre
Copy link
Member

enxebre commented May 18, 2020

hey @n1r1 thanks for addressing all concerns. Is there any doc ref on metal3 that can be included on the commit desc to back the statement:
"More generally - external remediation should be handled externally, any conditions and limitations should be applied by the external remediation controller, not by MHC."

E.g:
"Baremetal external experimental remediation plugs into MHC with an annotation as per https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-health-checking.md#out-of-tree-experimental-remediation-controller-eg-baremetal-reboot.

We want it to apply even before checking the machines have an owner controller so it covers more scenarios and delegate all the responsibility on the remediaton system a per link to doc"

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2020
@enxebre
Copy link
Member

enxebre commented May 18, 2020

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 18, 2020
… owner.

Baremetal external experimental remediation plugs into MHC with an annotation as per https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-health-checking.md#out-of-tree-experimental-remediation-controller-eg-baremetal-reboot

We want it to apply even before checking the machines have an owner controller so it covers more scenarios and delegate all the responsibility on the remediaton system per https://github.com/openshift/cluster-api-provider-baremetal/blob/master/docs/remediation.md#assumptions

This will allow external remediation controller to remediate baremetal Masters which currently
don't have any controller owner.

Signed-off-by: Nir <niry@redhat.com>
@n1r1
Copy link
Contributor Author

n1r1 commented Jun 11, 2020

@enxebre , I've changed the commit message per your request, and added a linked to the documentation.

Let me know if anything else is needed.
thanks

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2020
@n1r1
Copy link
Contributor Author

n1r1 commented Jun 15, 2020

/test e2e-aws-scaleup-rhel7
/test e2e-azure-operator
/test e2e-azure-operator

@n1r1
Copy link
Contributor Author

n1r1 commented Jun 15, 2020

/test e2e-azure

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@n1r1
Copy link
Contributor Author

n1r1 commented Jun 15, 2020

/test e2e-aws

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jun 15, 2020

@n1r1: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure-operator 8e2963e link /test e2e-azure-operator
ci/prow/e2e-azure 8e2963e link /test e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit fe84abd into openshift:master Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants