🐛 Fix reboot remediation by adding node deletion #367
Conversation
Hi @slintes. Thanks for your PR. I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from d4414d7 to 93df2db
Hey there. A couple of thoughts on the PR:
@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.
@beekhof: GitHub didn't allow me to request PR reviews from the following users. Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.
I'm not sure I've been understood here? I don't want to "re-litigate" any directions. I see there are advantages to the remediation strategy proposed in this PR. I'd just like more of a write-up on these advantages so that I can judge if I'd like to use this remediation strategy myself, and to have the option of continuing to use the existing remediation behavior of rebooting if I choose. Thanks.
I respectfully disagree here. In my opinion the lack of Node removal is a bug, and forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

However, your request to have this documented so that everyone can understand why the current behaviour is buggy is entirely reasonable. A design proposal to describe the change and its benefits would be very welcome.
Isn't it so that the Node removal workflow is part of the CAPI machine health check workflow, and it will remove the Machine, which in turn will remove the Node as well? That is why node deletion was not included here in the first place. I might be wrong, but it is worth checking before we add that feature here. And I am also not sure how CAPI would behave in case we delete the Node here in CAPM3 instead of CAPI doing it. It might actually just delete the Machine, but as I said, it is worth looking into the CAPI machine health check workflow.
I am +1 on this, since this is changing the existing behavior of the controller and that's the norm we follow in metal3 generally. @beekhof, if you think you won't have time to propose changes to the initial proposal, we can arrange a joint session and someone else can also take care of that part.
The standard remediation workflow in CAPI is indeed to delete the Machine, which has the effect of deleting the Node. Remediation controllers are an alternative to the standard remediation workflow. The purpose of the 'reboot' remediation strategy is to allow us to not delete the Machine (which would mean reprovisioning the Host) and instead just reboot the Host. Because the Machine itself remains, CAPI will not delete the Node for us, which is an essential part of remediation, and thus the remediation controller (all remediation controllers) must do it. That's why it's a bug that we currently don't.
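To make the above concrete, below is a minimal sketch (not the actual code in this PR) of the Node deletion step a reboot-style remediation controller has to perform itself, since the Machine is kept and CAPI therefore never removes the stale Node. The function and parameter names are illustrative; it only assumes a controller-runtime client pointed at the workload cluster and a node name taken from the Machine's status.nodeRef.

```go
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteNode removes the Node backing the unhealthy host from the workload
// cluster so that its workloads are rescheduled and the Node object is
// recreated cleanly once the rebooted host rejoins. This is an illustrative
// sketch, not the PR's implementation.
func deleteNode(ctx context.Context, workloadClient client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		if apierrors.IsNotFound(err) {
			// The Node is already gone; nothing left to do.
			return nil
		}
		return err
	}
	return workloadClient.Delete(ctx, node)
}
```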
Hey @zaneb. Is there an issue describing the defect? I couldn't find an issue mentioning the remediation controller in the CAPM3 repo. You will hopefully forgive me for thinking that a draft PR named ":sparkles: Add ..." with no linked issue is for a new feature rather than a bugfix. (Sadly, I do need these details to be made clear for me, I'm not a mind reader - although if I were, it sure would help me write other people's change proposals @beekhof hehe)

From the perspective that this is a new feature PR, hopefully you'll also understand why I thought an open-closed-principle-be-damned approach, with +300, -50 into the heart of a controller from an implemented proposal, may be best behind a configuration setting.

There may still be advantages to putting a significant bugfix behind a feature flag. It means we can merge to
If there's something that surely does nobody any favours, it's users stuck on old releases because

You seem convinced that this change is an unalloyed benefit and that the previous approach specified in the remediation controller proposal should be eliminated. This is entirely possible and I would really like to understand more details. In particular, I would appreciate:
Cheers!
I don't care that much whether we call it a bugfix or a feature, but I tend towards bugfix, because of Nir's comment on the old PR and Zane's comments here. The thing I still need to know, however, is whether this should be a new strategy or not, no matter what we call it in the end...
Actually, I don't see any details there about how remediation will happen, it just says "RC uses BMO APIs to get hosts back into a healthy or manageable state" 🤷🏼 But I'm happy to add the details.
Or I create an issue. Whatever you prefer... or both.
Happy to fix the title and make this a bugfix. I just can't find the docs explaining the emojis anymore... where are they? 🤔 I expected them to be in https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/.github/PULL_REQUEST_TEMPLATE.md...? Edit: found it, fixed the title.
Line count, seriously? I'm just putting node deletion between the power off and power on of the reboot. Sorry that this isn't possible with fewer lines of code 🤷🏼‍♂️ And I explained why in the description.
Tests will be part of this PR; it's WIP.
Sorry this comment came across like this. I really do appreciate the improvements that this change will bring to remediation. It can only be a good thing to have node deletion at least as an option. Having this as the only remediation behavior and replacing the old one may be far better too; I just need to understand the reasons for that. Thanks for raising the issue. Having the issue up as a place for discussion is appreciated.

@zaneb and I had a discussion in the community meeting, and all's well. We agreed that having more documentation around remediation would be a good thing, since remediation has some history in the project which I, being new to this area, am also unaware of. We spoke about a 'change proposal', which I think may be a good thing as well, since consensus based on what someone said on Slack means we don't have all the technical background documented.
Great, thanks!! I was also planning on writing some unit tests for the old behavior since I noticed there aren't any. Hopefully I'll have a PR up before too long and we can compare notes. There's also an open PR in the metal3-dev-env repo where some remediation behavior was being tested: metal3-io/metal3-dev-env#828. I was going to take a closer look at that PR and see if it's something we want to merge, and add any useful tests there. Also, welcome to metal3! Sorry - the discussions on PRs are not usually this robust, this PR has been interesting 😆
just to be sure: does this mean you are also fine with adding node deletion to the existing remediation strategy?
sure... where and how do I file a change proposal? 🤔
Thanks for the heads up. I think having a basic test over there sounds good, but tests of implementation details should stay in this repo? Btw, I'm still busy with the e2e tests here, having a hard time actually running them... I might reach out on Slack for help.
🙂
PR to the metal3-docs repo. The original design was https://github.com/metal3-io/metal3-docs/blob/master/design/cluster-api-provider-metal3/capm3-remediation-controller-proposal.md - we could edit that one but since it is marked as 'implemented' already it might be better to create a new proposal.
I overestimated the maturity of the old code and implementation. After reading the comments on the issue, I changed my mind and I think it would be better to change the old strategy in place and fix the issue directly. Just so I could get my head around what the remediation controller does at the moment, I raised a draft PR to add a few unit tests covering the remediation steps: #404. Let me know what you think!
Sure! In #cluster-api-baremetal on Kubernetes Slack there are several people who know a good deal about the e2e tests and would be happy to help.
As @slintes said, the old proposal doesn't include details on how the remediation controller actually remediates the machine. The proposal has been implemented nonetheless, so I think a new proposal is easiest. Not all the template questions may be relevant. Given my own ignorance here, documentation covering the steps of what remediation should actually do in CAPM3 would be very beneficial. Thank you!
Hi @slintes, this PR has been sitting as a draft since the end of November 2021. Do you have an update on this PR and any plan to add e2e tests?
Hi @smoshiur1237, we are currently finalizing our plans for the coming weeks and months. I can probably continue with this soon. If not, I will let you know. Thanks.
Force-pushed from 93df2db to 483fb39
- Refactor actual existing remediation test into own file
- Call new metal3remediation test from remediation_based_feature_test.go

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Can happen with clusters which were created with an older version.

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Force-pushed from 5119cfa to f79eb51
done
/test-ubuntu-integration-main
I don't know what's wrong with this PR. No test is getting triggered.
/ok-to-test
/test-ubuntu-integration-main
Thx! Though there still seems to be a problem with CI. /test-ubuntu-integration-main
/test-ubuntu-integration-main
/test-ubuntu-integration-main
/test all
/test-centos-e2e-parallel-main
@slintes: The following test failed.
/test-ubuntu-integration-main
@slintes hi! Triggering the integration/e2e tests in this specific PR does not work for some unknown reason we are trying to find out, but it does work in any other PR in this repo. What I would suggest, for now, is that you open a new PR with the same changes on top of the main branch (sorry for the extra work 🙂), because we have seen a similar issue in the past and the simple solution was creating a new one (no 100% guarantee it will work, but we can give it a try). Meanwhile I will continue investigating the root cause of the issue. Thanks.

Edit: To keep a history of the discussions here, we could link this PR to the new one and indicate in the new PR that it replaces this one, for future reference if someone is interested in checking the old discussions.
Thanks, seems to work now 😄
nice 🙂
/hold
/approve cancel
/close
@furkatgofurov7: Closed this PR.
What this PR does / why we need it:
This PR adds node deletion to the reboot remediation strategy, because deleting the Node ensures that unhealthy nodes are fenced and workloads are rescheduled to other nodes.
This was proposed on Slack, and it was agreed to add this to the existing reboot remediation strategy, instead of adding a new strategy.
Unfortunately this increases the complexity of remediation: instead of having only one step (setting the reboot annotation), we now have several steps, plus waiting for certain cluster conditions, with node deletion happening between the power off and power on of the host.
Fixes #392
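For illustration only, here is a rough sketch (not the PR's actual code) of how the previously single-step strategy grows into a small state machine once Node deletion is wired in between power off and power on. The phase names and the boolean condition arguments are assumptions made for this sketch, not the PR's actual API; a real reconciler would derive these conditions from the BareMetalHost and the workload cluster and requeue between steps.

```go
package remediation

// remediationPhase is an illustrative phase enum; the real field names and
// phases in the PR may differ.
type remediationPhase string

const (
	phasePowerOff   remediationPhase = "PoweringOff"    // request power off (reboot annotation)
	phaseDeleteNode remediationPhase = "DeletingNode"   // host is off; delete the stale Node object
	phasePowerOn    remediationPhase = "PoweringOn"     // bring the host back up
	phaseWaiting    remediationPhase = "WaitingForNode" // wait for the Node to re-register
	phaseDone       remediationPhase = "Done"
)

// nextPhase advances the remediation by one step once the relevant cluster
// condition holds; otherwise it stays in the current phase so the controller
// can requeue and check again later.
func nextPhase(phase remediationPhase, hostPoweredOff, nodeDeleted, nodeRejoined bool) remediationPhase {
	switch phase {
	case phasePowerOff:
		if hostPoweredOff {
			return phaseDeleteNode
		}
	case phaseDeleteNode:
		if nodeDeleted {
			return phasePowerOn
		}
	case phasePowerOn:
		return phaseWaiting
	case phaseWaiting:
		if nodeRejoined {
			return phaseDone
		}
	}
	return phase
}
```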