🐛 Fix reboot remediation by adding node deletion #367

Closed
wants to merge 5 commits

Conversation

@slintes (Contributor) commented Nov 6, 2021

What this PR does / why we need it:

This PR adds node deletion to the reboot remediation strategy. Deleting the Node ensures that unhealthy nodes are fenced and that their workloads are rescheduled to other nodes.
This was proposed on Slack, and it was agreed to add this to the existing reboot remediation strategy instead of adding a new one.

Unfortunately this increases the complexity of remediation: instead of a single step (setting the reboot annotation), we now have several steps, plus waiting for certain cluster conditions (an illustrative sketch of some of these steps follows the list below):

  1. Request power off and wait until done
  2. Back up node annotations and labels
  3. Delete the node and wait until it is gone
  4. Power on and wait until done
  5. Restore node annotations and labels
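
For illustration only, here is a minimal Go sketch of what the metadata backup (step 2) and restore (step 5) around the Node deletion could look like with a controller-runtime client. The package, type, and function names are hypothetical and are not taken from this PR's code:

```go
// Hypothetical sketch, not the PR's actual implementation.
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeBackup captures the Node metadata that must survive deletion (step 2)
// so it can be re-applied after the Node re-registers (step 5).
type nodeBackup struct {
	Annotations map[string]string
	Labels      map[string]string
}

// backupNode copies the annotations and labels of the Node about to be deleted.
func backupNode(node *corev1.Node) nodeBackup {
	return nodeBackup{
		Annotations: node.GetAnnotations(),
		Labels:      node.GetLabels(),
	}
}

// restoreNode merges the saved metadata onto the freshly registered Node,
// keeping whatever the kubelet set on registration.
func restoreNode(ctx context.Context, c client.Client, node *corev1.Node, b nodeBackup) error {
	patch := client.MergeFrom(node.DeepCopy())

	annotations := node.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	for k, v := range b.Annotations {
		annotations[k] = v
	}
	node.SetAnnotations(annotations)

	labels := node.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	for k, v := range b.Labels {
		labels[k] = v
	}
	node.SetLabels(labels)

	return c.Patch(ctx, node, patch)
}
```

In practice the backup would have to be persisted somewhere that survives reconciles, for example on the Metal3Remediation resource itself, since the Node object is gone by the time the restore step runs, and the controller would requeue between steps until the host reports the requested power state and the old Node is confirmed gone.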

Fixes #392

@metal3-io-bot metal3-io-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2021
@metal3-io-bot (Contributor)

Hi @slintes. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 6, 2021
@macaptain (Member)

Hey there.

A couple of thoughts on the PR:

@beekhof commented Nov 12, 2021

@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.

/cc @zaneb and @jan-est

@metal3-io-bot (Contributor)

@beekhof: GitHub didn't allow me to request PR reviews from the following users: and.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.

/cc @zaneb and @jan-est

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@macaptain (Member)

@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.

/cc @zaneb and @jan-est

I'm not sure I've been understood here? I don't want to "re-litigate" any directions. I see there are advantages to the remediation strategy proposed in this PR. I'd just like more of a write-up on these advantages so that I can judge if I'd like to use this remediation strategy myself, and to have the option of continuing to use the existing remediation behavior of rebooting if I choose.

Thanks.

@zaneb (Member) commented Nov 12, 2021

We should implement this as an alternative remediation strategy. I know there was a message on Slack suggesting that modifying the existing strategy would be OK (https://kubernetes.slack.com/archives/CHD49TLE7/p1634217997210700?thread_ts=1634199536.209500&cid=CHD49TLE7), but I advise we extend the behavior as a new strategy instead of modifying in place. Firstly, it will minimize any disruption as we trial the new remediation strategy if we can continue to use the reboot strategy while any issues are ironed out. Secondly, it will let us test both implementations side-by-side. Thirdly, it opens the possibility of new ways of extending the behavior in future.

I respectfully disagree here. In my opinion the lack of Node removal is a bug, and forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

However, your request to have this documented so that everyone can understand why the current behaviour is buggy is entirely reasonable. A design proposal to describe the change and its benefits would be very welcome.

@kashifest (Member) commented Nov 12, 2021

I respectfully disagree here. In my opinion the lack of Node removal is a bug, and forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

Isn't Node removal part of the CAPI Machine health check workflow? It removes the Machine, which in turn removes the Node as well, which is why node deletion was not included here in the first place. I might be wrong, but it is worth checking before we add that feature here. I am also not sure how CAPI would behave if we delete the Node here in CAPM3 instead of CAPI doing it; it might just delete the Machine anyway, but as I said, it is worth looking into the CAPI machine health check workflow.

However, your request to have this documented so that everyone can understand why the current behaviour is buggy is entirely reasonable. A design proposal to describe the change and its benefits would be very welcome.

I am +1 on this, since it changes the existing behavior of the controller and that's the norm we generally follow in metal3. @beekhof, if you think you won't have time to propose changes to the initial proposal, we can arrange a joint session and someone else can take care of that part.

@zaneb (Member) commented Nov 12, 2021

Isn't Node removal part of the CAPI Machine health check workflow? It removes the Machine, which in turn removes the Node as well, which is why node deletion was not included here in the first place. I might be wrong, but it is worth checking before we add that feature here. I am also not sure how CAPI would behave if we delete the Node here in CAPM3 instead of CAPI doing it; it might just delete the Machine anyway, but as I said, it is worth looking into the CAPI machine health check workflow.

The standard remediation workflow in CAPI is indeed to delete the Machine, which has the effect of deleting the Node.

Remediation controllers are an alternative to the standard remediation workflow. The purpose of the 'reboot' remediation strategy is to allow us not to delete the Machine (which would mean reprovisioning the Host) and instead just reboot the Host. Because the Machine itself remains, CAPI will not delete the Node for us; deleting the Node is an essential part of remediation, and thus the remediation controller (all remediation controllers) must do it. That's why it's a bug that we currently don't.
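
Purely to illustrate the point above, deleting the Node that a retained Machine points to could look roughly like the sketch below. The function name, the clusterv1 import version, and the assumption that a client for the workload cluster is available are illustrative, not taken from this PR:

```go
// Hypothetical sketch, not the PR's actual implementation.
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteNodeForMachine removes the Node referenced by the (still existing)
// Machine from the workload cluster, since CAPI will not do it while the
// Machine is kept. workloadClient must target the workload cluster, not the
// management cluster.
func deleteNodeForMachine(ctx context.Context, workloadClient client.Client, machine *clusterv1.Machine) error {
	if machine.Status.NodeRef == nil {
		// Nothing to delete: the Node never registered or is already gone.
		return nil
	}
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: machine.Status.NodeRef.Name},
	}
	// Ignoring NotFound keeps the step idempotent across reconcile retries.
	return client.IgnoreNotFound(workloadClient.Delete(ctx, node))
}
```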

@macaptain (Member)

I respectfully disagree here. In my opinion the lack of Node removal is a bug

Hey @zaneb. Is there an issue describing the defect? I couldn't find an issue mentioning the remediation controller in the CAPM3 repo. You will hopefully forgive me for thinking that a draft PR named ":sparkles: Add ..." with no linked issue is for a new feature rather than a bugfix. (Sadly, I do need these details to be made clear for me; I'm not a mind reader - although if I were, it sure would help me write other people's change proposals @beekhof hehe)

From the perspective that this is a new feature PR, hopefully you'll also understand why I thought an open-closed-principle-be-damned approach, with +300/-50 lines into the heart of a controller from an already implemented proposal, may be best kept behind a configuration setting.

There may still be advantages to putting a significant bugfix behind a feature flag. It means we can merge to main and then follow up on getting any e2e tests right while iterating on the fix, so I'll put that idea forward for consideration again. The idea is we can deprecate the old behaviour once we're sure that the new way of doing things is better in every way.
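
As a rough sketch of the kind of gate being suggested here (the flag name and wiring are invented for illustration and are not part of this PR or of CAPM3's actual flags):

```go
// Hypothetical sketch: gating node deletion behind a controller flag so the
// legacy reboot-only behaviour stays available while the new flow is trialled.
package main

import (
	"flag"
	"fmt"
)

var enableNodeDeletion = flag.Bool(
	"enable-node-deletion",
	false,
	"If set, reboot remediation also deletes and later restores the unhealthy Node.",
)

func main() {
	flag.Parse()

	if *enableNodeDeletion {
		// New flow: power off -> back up metadata -> delete Node -> power on -> restore metadata.
		fmt.Println("reboot remediation with node deletion enabled")
	} else {
		// Legacy flow: only set the reboot annotation on the host.
		fmt.Println("legacy reboot-only remediation")
	}
}
```

Once the new flow had proven itself, such a flag could default to true and eventually be removed, matching the deprecation path described above.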

forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

If there's something that surely does nobody any favours it's users stuck on old releases because main doesn't do what they want anymore. Let's not end up there!

You seem convinced that this change is an unalloyed benefit and that the previous approach specified in the remediation controller proposal should be eliminated. That may well be true, and I would really like to understand the details. In particular, I would appreciate:

  1. An issue so that I can explain to my customers that they have a defect in their environment in the remediation controller.
  2. A change proposal so that I can explain to my customers all the benefits of the new way of doing things.

Cheers!

@slintes (Contributor, Author) commented Nov 18, 2021

I don't care that much whether we call it a bugfix or a feature, but I lean towards bugfix because of Nir's comment on the old PR and Zane's comments here. The thing I still need to know, however, is whether this should be a new strategy or not, no matter what we call it in the end...

The proposal detailing the existing strategy...

Actually, I don't see any details there about how remediation will happen; it just says "RC uses BMO APIs to get hosts back into a healthy or manageable state" 🤷🏼 But I'm happy to add the details.

I couldn't find an issue

Or I create an issue. Whatever you prefer... or both.

a draft PR named ":sparkles: Add ..."

Happy to fix the title and make this a bugfix. I just can't find the docs explaining the emojis anymore... where are they? 🤔 I expected them to be in https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/.github/PULL_REQUEST_TEMPLATE.md...?

edit: found it, fixed title

with +300, -50 into the heart of a controller

Line count, seriously? I'm just putting node deletion between the power off and power on of the reboot. Sorry that this isn't possible with fewer lines of code 🤷🏼‍♂️ And I explained why in the description.

follow up on getting any e2e tests right

Tests will be part of this PR, it's WIP.

@slintes slintes changed the title ✨ Add node deletion to reboot remediation :🐛: Fix reboot remediation by adding node deletion Nov 18, 2021
@slintes slintes changed the title :🐛: Fix reboot remediation by adding node deletion 🐛 Fix reboot remediation by adding node deletion Nov 18, 2021
@slintes (Contributor, Author) commented Nov 18, 2021

#392

@macaptain (Member)

Line count, seriously? I'm just putting node deletion between the power off and power on of the reboot. Sorry that this isn't possible with fewer lines of code 🤷🏼‍♂️ And I explained why in the description.

Sorry that this comment came across like that. I really do appreciate the improvements this change will bring to remediation. It can only be a good thing to have node deletion at least as an option. Having this as the only remediation behavior and replacing the old one may well be even better; I just need to check the reasons for that.

#392

Thanks for raising the issue. Having the issue up as a place for discussion is appreciated. @zaneb and I had a discussion in the community meeting, and all's well. We agreed that more documentation around remediation would be a good thing, since remediation has some history in the project that I, being new to this area, am unaware of. We also spoke about a 'change proposal', which I think may be a good thing as well, since a consensus based on what someone said on Slack means we don't have all the technical background documented.

Tests will be part of this PR, it's WIP.

Great, thanks! I was also planning on writing some unit tests for the old behavior, since I noticed there aren't any. Hopefully I'll have a PR up before too long and we can compare notes.

There's also an open PR in the metal3-dev-env repo where some remediation behavior was being tested: metal3-io/metal3-dev-env#828. I was going to take a closer look at that PR and see if it's something we want to merge and add any useful tests there.

Also, welcome to metal3! Sorry, the discussions on PRs are not usually this robust; this PR has been interesting 😆

@slintes (Contributor, Author) commented Nov 24, 2021

#392

Thanks for raising the issue. Having the issue up as a place for discussion is appreciated. @zaneb and I had a discussion in the community meeting, and all's well.

Just to be sure: does this mean you are also fine with adding node deletion to the existing remediation strategy?

We agreed that more documentation around remediation would be a good thing, since remediation has some history in the project that I, being new to this area, am unaware of. We also spoke about a 'change proposal', which I think may be a good thing as well.

sure... where and how do I file a change proposal? 🤔

There's also an open PR in the metal3-dev-env repo where some remediation behavior was being tested: metal3-io/metal3-dev-env#828. I was going to take a closer look at that PR and see if it's something we want to merge and add any useful tests there.

Thanks for the heads up. I think having a basic test over there sounds good, but tests of implementation details should stay in this repo? Btw, I'm still busy with the e2e tests here and having a hard time actually running them... I might reach out on Slack for help.

Also, welcome to metal3! Sorry, the discussions on PRs are not usually this robust; this PR has been interesting

🙂

@zaneb (Member) commented Nov 24, 2021

sure... where and how do I file a change proposal? 🤔

A PR to the metal3-docs repo. The original design was https://github.com/metal3-io/metal3-docs/blob/master/design/cluster-api-provider-metal3/capm3-remediation-controller-proposal.md; we could edit that one, but since it is already marked as 'implemented', it might be better to create a new proposal.

@macaptain (Member) commented Nov 25, 2021

Just to be sure: does this mean you are also fine with adding node deletion to the existing remediation strategy?

I overestimated the maturity of the old code and implementation. After reading the comments on the issue, I changed my mind and I think it would be better to change the old strategy in place and fix the issue directly.

Just so I could get my head around what the remediation controller does at the moment, I raised a draft PR to add a few unit tests covering the remediation steps: #404. Let me know what you think!

Btw, I'm still busy with the e2e tests here, having a hard time to actually run them... I might reach out on Slack for help.

Sure! In the #cluster-api-baremetal channel on Kubernetes Slack there are several people who know a good deal about the e2e tests and would be happy to help.

it might be better to create a new proposal.

As @slintes said, the old proposal doesn't include details on how the remediation controller actually remediates the machine. The proposal has been implemented nonetheless, so I think a new proposal is easiest. Not all the template questions may be relevant. For someone like me who is new to this area, documentation covering the steps of what remediation should actually do in CAPM3 would be very beneficial. Thank you!

@smoshiur1237 (Member) commented Feb 10, 2022

Hi @slintes, this PR has been sitting as a draft since the end of November 2021. Do you have an update on it, and any plan to add e2e tests?

@slintes (Contributor, Author) commented Feb 16, 2022

Hi @smoshiur1237, we are currently finalizing our planning for the coming weeks and months. I can probably continue with this soon. If not, I will let you know. Thanks.

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2022
- Refactor the existing remediation test into its own file
- Call the new metal3remediation test from remediation_based_feature_test.go
- Can happen with clusters which were created with an older version

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@slintes slintes force-pushed the remediation-node-deletion branch from 5119cfa to f79eb51 Compare July 8, 2022 14:04
@slintes (Contributor, Author) commented Jul 8, 2022

@slintes can you please rebase on top of the main branch, we are seeing some problems running the tests, thanks!

done

@mboukhalfa (Member)

/test-ubuntu-integration-main

@kashifest (Member)

I don't know what's wrong with this PR. No test is getting triggered.
/test-ubuntu-e2e-parallel-main

@mboukhalfa (Member)

/ok-to-test

@mboukhalfa (Member)

/test-ubuntu-integration-main

@furkatgofurov7 (Member)

done

Thanks! Though there still seems to be a problem with CI.

/test-ubuntu-integration-main

@kashifest (Member)

/test-ubuntu-integration-main

@Rozzii (Member) commented Jul 11, 2022

/test-ubuntu-integration-main

@furkatgofurov7 (Member)

/test all
/test-ubuntu-integration-main

@furkatgofurov7 (Member)

/test-centos-e2e-parallel-main
/test-centos-e2e-main

@metal3-io-bot (Contributor)

@slintes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit  | Details | Required | Rerun command
unit      | f79eb51 | link    | true     | /test unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@furkatgofurov7 (Member)

/test-ubuntu-integration-main

@furkatgofurov7 (Member) commented Jul 11, 2022

Hi @slintes! Triggering the integration/e2e tests in this specific PR does not work, for some reason we are still trying to find out, but it does work in any other PR in this repo. What I would suggest for now: can you please open a new PR with the same changes on top of the main branch (sorry for the extra work 🙂)? We have seen a similar issue in the past and the simple solution was creating a new PR (no 100% guarantee it will work, but we can give it a try). Meanwhile I will continue investigating the root cause of the issue. Thanks.

Edit: To keep a history of the discussions here, we could link this PR to the new one and indicate in the new PR that it replaces this one, for future reference in case someone wants to check the old discussions.

@slintes (Contributor, Author) commented Jul 11, 2022

#668

@furkatgofurov7 (Member)

#668

Thanks, seems to work now 😄

@slintes (Contributor, Author) commented Jul 11, 2022

#668

Thanks, seems to work now 😄

nice 🙂
I'll leave closing this one up to you, in case you want to investigate why CI doesn't work here.

@furkatgofurov7 (Member)

/hold
CI needs to be investigated

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 11, 2022
@furkatgofurov7 (Member)

/approve cancel

@furkatgofurov7 furkatgofurov7 removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2022
@furkatgofurov7 (Member)

/close

@metal3-io-bot (Contributor)

@furkatgofurov7: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Reboot remediation is incomplete