Allow external remediation even if there's no controller owner #581
Hi @n1r1. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@n1r1 Does your external remediation only reboot machines presently or does it ever delete the Machine? |
@JoelSpeed Yep. Only reboots. Never deletes. |
Ok, in that case, I think this is safe to do, cc @enxebre do you see any scenarios that this might cause issues? |
I assume there's a whole lot of background context which you understand about this change that would be super helpful to share (in your commit message) for anyone trying to understand this change in the future. For example:
I'm not sure I have all the details correct, but if I do, then it could be summarized as simply as this:
HTH. Thanks. |
A reference to #543 would help greatly! |
Thanks for the feedback, @markmc. The intent was always for MHC to have minimal knowledge of external remediation: from MHC's perspective, all it does is annotate the unhealthy Machine with
To my understanding, your explanation is correct ("Master machines are not part of a machine set... ")
This is an existing feature in MHC and is part of the MHC proposal. Note that the annotation has changed since then in this PR. As you can see in the proposal, the interface is quite simple (MHC sets the annotation and forgets about it). If you're interested in more details on how we actually use it, take a look at openshift/cluster-api-provider-baremetal#59.
If this question refers to cloud environments, that is out of my scope, and I expect others can answer it better.
Answering from external remediation perspective only - we just missed that this condition exists and could prevent external remediation for masters.
I'd prefer to avoid assuming that external remediation is a reboot. The external remediation strategy simply lets someone else do the remediation as they see fit. I think I covered the reasons for that above.
While this is correct, I still think that it's out of MHC's scope. MHC currently has two roles: health checks and remediation (Machine deletion). It makes sense that MHC can decide which Machines to remediate and how, but with external remediation it only needs to signal that the Machine is unhealthy and let the external remediator decide what to remediate and how. I hope this makes sense and clarifies the PR's intention. Thanks. |
I would like you to condense this background into a summary in the commit message. Imagine you are someone who is familiar with the machine-api-operator codebase but not closely following the reboot remediation topic, and you are looking at the output of (I'd like to see something like the 3 short sentences I drafted and a reference back to #543 or commit f5099cb) |
@markmc, I'm not sure that #543 is strongly related to this. If a machine is unhealthy, and external remediation is configured, it should be triggered. That's all this PR is trying to do. As I see it, it's a bug fix. I'll amend the commit message to include the reasons for the change. |
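The ordering this PR argues for can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the machine-api-operator code: the `Machine` struct, the `remediate` function, the `Action` values, and the annotation key `example.invalid/external-remediation` are all hypothetical names chosen for the sketch (the real annotation name is not given in this thread). The only point it shows is that the external-remediation branch runs before the controller-owner check, so an unhealthy Machine without an owner (such as a master) still gets signalled.

```go
package main

import "fmt"

// externalRemediationAnnotation is a HYPOTHETICAL placeholder key.
// The real annotation used by MHC is not stated in this thread.
const externalRemediationAnnotation = "example.invalid/external-remediation"

// Machine is a simplified stand-in for the real Machine API type (assumption).
type Machine struct {
	Name               string
	Annotations        map[string]string
	HasControllerOwner bool
}

// Action names what the health checker decided to do (hypothetical type).
type Action string

const (
	ActionSignalExternal Action = "signal-external" // annotate and forget
	ActionDelete         Action = "delete"          // built-in remediation
	ActionSkip           Action = "skip"            // no owner, nothing recreates it
)

// remediate decides how to handle a machine already known to be unhealthy.
// The external branch is evaluated first, before any ownership check.
func remediate(m *Machine, externalEnabled bool) Action {
	if externalEnabled {
		if m.Annotations == nil {
			m.Annotations = map[string]string{}
		}
		// Signal only; the external remediator decides what to do and how.
		m.Annotations[externalRemediationAnnotation] = ""
		return ActionSignalExternal
	}
	// Built-in remediation deletes the Machine, so it is only attempted
	// when a controller owner exists.
	if !m.HasControllerOwner {
		return ActionSkip
	}
	return ActionDelete
}

func main() {
	master := &Machine{Name: "master-0"} // no controller owner
	worker := &Machine{Name: "worker-0", HasControllerOwner: true}

	fmt.Println(master.Name, remediate(master, true))
	fmt.Println(master.Name, remediate(&Machine{Name: "master-0"}, false))
	fmt.Println(worker.Name, remediate(worker, false))
}
```

With the two checks in the opposite order, the ownerless master would be skipped even when external remediation is enabled, which is the behavior described above as a bug.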
Does it relate to bug #1816398? (AFAICT it is closely related.) The feedback is simple and applies to all commits/PRs: small scraps of information (like a link to related PRs or BZs, or a commit ID) can be hugely helpful to anyone in the future trying to understand why this change was made. |
Hey @n1r1, thanks for addressing all concerns. Is there any doc ref on metal3 that can be included in the commit description to back the statement? E.g.: We want it to apply even before checking that the machines have an owner controller, so it covers more scenarios and delegates all responsibility to the remediation system as per /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test |
… owner.

Baremetal external experimental remediation plugs into MHC with an annotation as per https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-health-checking.md#out-of-tree-experimental-remediation-controller-eg-baremetal-reboot

We want it to apply even before checking that the machines have an owner controller, so that it covers more scenarios and delegates all responsibility to the remediation system, per https://github.com/openshift/cluster-api-provider-baremetal/blob/master/docs/remediation.md#assumptions

This will allow the external remediation controller to remediate baremetal Masters, which currently don't have any controller owner.

Signed-off-by: Nir <niry@redhat.com>
9d67298 to 8e2963e
@enxebre, I've changed the commit message per your request and added a link to the documentation. Let me know if anything else is needed. |
/lgtm
/test e2e-aws-scaleup-rhel7 |
/test e2e-azure |
/retest Please review the full test history for this PR and help us cut down flakes. |
/test e2e-aws |
@n1r1: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
We would like external remediation to apply to Machines without a controller owner, to allow remediation of masters.
Signed-off-by: Nir <niry@redhat.com>