🐛 Fix reboot remediation by adding node deletion #367

Closed
wants to merge 5 commits

Conversation

@slintes (Contributor) commented Nov 6, 2021

What this PR does / why we need it:

This PR adds node deletion to the reboot remediation strategy. Deleting the Node ensures that unhealthy nodes are fenced and that their workloads are rescheduled to other nodes.
This was proposed on Slack, and it was agreed to add this to the existing reboot remediation strategy instead of adding a new one.

Unfortunately this increases the complexity of remediation: instead of a single step (setting the reboot annotation), we now have several steps, plus waiting for certain cluster conditions (an illustrative sketch of some of these steps follows the list below):

  1. Request power off and wait until done
  2. Back up node annotations and labels
  3. Delete the node and wait until it is gone
  4. Power on and wait until done
  5. Restore node annotations and labels
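
For illustration only, here is a minimal Go sketch of what the metadata backup (step 2) and restore (step 5) around the Node deletion could look like with a controller-runtime client. The package, type, and function names are hypothetical and are not taken from this PR's code:

```go
// Hypothetical sketch, not the PR's actual implementation.
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeBackup captures the Node metadata that must survive deletion (step 2)
// so it can be re-applied after the Node re-registers (step 5).
type nodeBackup struct {
	Annotations map[string]string
	Labels      map[string]string
}

// backupNode copies the annotations and labels of the Node about to be deleted.
func backupNode(node *corev1.Node) nodeBackup {
	return nodeBackup{
		Annotations: node.GetAnnotations(),
		Labels:      node.GetLabels(),
	}
}

// restoreNode merges the saved metadata onto the freshly registered Node,
// keeping whatever the kubelet set on registration.
func restoreNode(ctx context.Context, c client.Client, node *corev1.Node, b nodeBackup) error {
	patch := client.MergeFrom(node.DeepCopy())

	annotations := node.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	for k, v := range b.Annotations {
		annotations[k] = v
	}
	node.SetAnnotations(annotations)

	labels := node.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	for k, v := range b.Labels {
		labels[k] = v
	}
	node.SetLabels(labels)

	return c.Patch(ctx, node, patch)
}
```

In practice the backup would have to be persisted somewhere that survives reconciles, for example on the Metal3Remediation resource itself, since the Node object is gone by the time the restore step runs, and the controller would requeue between steps until the host reports the requested power state and the old Node is confirmed gone.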

Fixes #392

@metal3-io-bot metal3-io-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2021
@metal3-io-bot (Contributor)

Hi @slintes. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 6, 2021
@macaptain (Member)

Hey there.

A couple of thoughts on the PR:

@beekhof commented Nov 12, 2021

@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.

/cc @zaneb and @jan-est

@metal3-io-bot (Contributor)

@beekhof: GitHub didn't allow me to request PR reviews from the following users: and.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.

/cc @zaneb and @jan-est

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@macaptain (Member)

@macaptain We originally proposed much the same thing, but respected the community's decision to take a different path. Since you are the one wanting to re-litigate the direction, perhaps you can be the one to write up an alternative proposal and convince the community.

/cc @zaneb and @jan-est

I'm not sure I've been understood here? I don't want to "re-litigate" any directions. I see there are advantages to the remediation strategy proposed in this PR. I'd just like more of a write-up on these advantages so that I can judge if I'd like to use this remediation strategy myself, and to have the option of continuing to use the existing remediation behavior of rebooting if I choose.

Thanks.

@zaneb (Member) commented Nov 12, 2021

We should implement this as an alternative remediation strategy. I know there was a message on Slack suggesting that modifying the existing strategy would be OK (https://kubernetes.slack.com/archives/CHD49TLE7/p1634217997210700?thread_ts=1634199536.209500&cid=CHD49TLE7), but I advise we extend the behavior as a new strategy instead of modifying in place. Firstly, it will minimize any disruption as we trial the new remediation strategy if we can continue to use the reboot strategy while any issues are ironed out. Secondly, it will let us test both implementations side-by-side. Thirdly, it opens the possibility of new ways of extending the behavior in future.

I respectfully disagree here. In my opinion the lack of Node removal is a bug, and forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

However, your request to have this documented so that everyone can understand why the current behaviour is buggy is entirely reasonable. A design proposal to describe the change and its benefits would be very welcome.

@kashifest (Member) commented Nov 12, 2021

I respectfully disagree here. In my opinion the lack of Node removal is a bug, and forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

Isn't Node removal part of the CAPI Machine health check workflow? It removes the Machine, which in turn removes the Node as well, which is why node deletion was not included here in the first place. I might be wrong, but it is worth checking before we add that feature here. I am also not sure how CAPI would behave if we delete the Node here in CAPM3 instead of CAPI doing it; it might just delete the Machine anyway, but as I said, it is worth looking into the CAPI machine health check workflow.

However, your request to have this documented so that everyone can understand why the current behaviour is buggy is entirely reasonable. A design proposal to describe the change and its benefits would be very welcome.

I am +1 on this, since it changes the existing behavior of the controller and that's the norm we generally follow in metal3. @beekhof, if you think you won't have time to propose changes to the initial proposal, we can arrange a joint session and someone else can take care of that part.

@zaneb (Member) commented Nov 12, 2021

Isn't Node removal part of the CAPI Machine health check workflow? It removes the Machine, which in turn removes the Node as well, which is why node deletion was not included here in the first place. I might be wrong, but it is worth checking before we add that feature here. I am also not sure how CAPI would behave if we delete the Node here in CAPM3 instead of CAPI doing it; it might just delete the Machine anyway, but as I said, it is worth looking into the CAPI machine health check workflow.

The standard remediation workflow in CAPI is indeed to delete the Machine, which has the effect of deleting the Node.

Remediation controllers are an alternative to the standard remediation workflow. The purpose of the 'reboot' remediation strategy is to allow us not to delete the Machine (which would mean reprovisioning the Host) and instead just reboot the Host. Because the Machine itself remains, CAPI will not delete the Node for us; deleting the Node is an essential part of remediation, and thus the remediation controller (all remediation controllers) must do it. That's why it's a bug that we currently don't.
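
Purely to illustrate the point above, deleting the Node that a retained Machine points to could look roughly like the sketch below. The function name, the clusterv1 import version, and the assumption that a client for the workload cluster is available are illustrative, not taken from this PR:

```go
// Hypothetical sketch, not the PR's actual implementation.
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteNodeForMachine removes the Node referenced by the (still existing)
// Machine from the workload cluster, since CAPI will not do it while the
// Machine is kept. workloadClient must target the workload cluster, not the
// management cluster.
func deleteNodeForMachine(ctx context.Context, workloadClient client.Client, machine *clusterv1.Machine) error {
	if machine.Status.NodeRef == nil {
		// Nothing to delete: the Node never registered or is already gone.
		return nil
	}
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: machine.Status.NodeRef.Name},
	}
	// Ignoring NotFound keeps the step idempotent across reconcile retries.
	return client.IgnoreNotFound(workloadClient.Delete(ctx, node))
}
```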

@macaptain (Member)

I respectfully disagree here. In my opinion the lack of Node removal is a bug

Hey @zaneb. Is there an issue describing the defect? I couldn't find an issue mentioning the remediation controller in the CAPM3 repo. You will hopefully forgive me for thinking that a draft PR named ":sparkles: Add ..." with no linked issue is for a new feature rather than a bugfix. (Sadly, I do need these details to be made clear for me; I'm not a mind reader - although if I were, it sure would help me write other people's change proposals @beekhof hehe)

From the perspective that this is a new feature PR, hopefully you'll also understand why I thought an open-closed-principle-be-damned approach, with +300/-50 lines into the heart of a controller from an already implemented proposal, may be best kept behind a configuration setting.

There may still be advantages to putting a significant bugfix behind a feature flag. It means we can merge to main and then follow up on getting any e2e tests right while iterating on the fix, so I'll put that idea forward for consideration again. The idea is we can deprecate the old behaviour once we're sure that the new way of doing things is better in every way.
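
As a rough sketch of the kind of gate being suggested here (the flag name and wiring are invented for illustration and are not part of this PR or of CAPM3's actual flags):

```go
// Hypothetical sketch: gating node deletion behind a controller flag so the
// legacy reboot-only behaviour stays available while the new flow is trialled.
package main

import (
	"flag"
	"fmt"
)

var enableNodeDeletion = flag.Bool(
	"enable-node-deletion",
	false,
	"If set, reboot remediation also deletes and later restores the unhealthy Node.",
)

func main() {
	flag.Parse()

	if *enableNodeDeletion {
		// New flow: power off -> back up metadata -> delete Node -> power on -> restore metadata.
		fmt.Println("reboot remediation with node deletion enabled")
	} else {
		// Legacy flow: only set the reboot annotation on the host.
		fmt.Println("legacy reboot-only remediation")
	}
}
```

Once the new flow had proven itself, such a flag could default to true and eventually be removed, matching the deprecation path described above.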

forcing users to choose between the buggy strategy and the non-buggy strategy is not doing anybody any favours. As far as disruption goes, CAPM3 has releases and nobody is obliged to chase main.

If there's something that surely does nobody any favours it's users stuck on old releases because main doesn't do what they want anymore. Let's not end up there!

You seem convinced that this change is an unalloyed benefit and that the previous approach specified in the remediation controller proposal should be eliminated. That may well be true, and I would really like to understand the details. In particular, I would appreciate:

  1. An issue so that I can explain to my customers that they have a defect in their environment in the remediation controller.
  2. A change proposal so that I can explain to my customers all the benefits of the new way of doing things.

Cheers!

@slintes (Contributor, Author) commented Nov 18, 2021

I don't care that much whether we call it a bugfix or a feature, but I lean towards bugfix because of Nir's comment on the old PR and Zane's comments here. The thing I still need to know, however, is whether this should be a new strategy or not, no matter what we call it in the end...

The proposal detailing the existing strategy...

Actually, I don't see any details there about how remediation will happen; it just says "RC uses BMO APIs to get hosts back into a healthy or manageable state" 🤷🏼 But I'm happy to add the details.

I couldn't find an issue

Or I create an issue. Whatever you prefer... or both.

a draft PR named ":sparkles: Add ..."

Happy to fix the title and make this a bugfix. I just can't find the docs explaining the emojis anymore... where are they? 🤔 I expected them to be in https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/.github/PULL_REQUEST_TEMPLATE.md...?

edit: found it, fixed title

with +300, -50 into the heart of a controller

Line count, seriously? I'm just putting node deletion between the power off and power on of the reboot. Sorry that this isn't possible with fewer lines of code 🤷🏼‍♂️ And I explained why in the description.

follow up on getting any e2e tests right

Tests will be part of this PR, it's WIP.

@slintes slintes changed the title ✨ Add node deletion to reboot remediation :🐛: Fix reboot remediation by adding node deletion Nov 18, 2021
@slintes slintes changed the title :🐛: Fix reboot remediation by adding node deletion 🐛 Fix reboot remediation by adding node deletion Nov 18, 2021
@slintes (Contributor, Author) commented Nov 18, 2021

#392

@macaptain (Member)

Line count, seriously? I'm just putting node deletion between the power off and power on of the reboot. Sorry that this isn't possible with fewer lines of code 🤷🏼‍♂️ And I explained why in the description.

Sorry that this comment came across like that. I really do appreciate the improvements this change will bring to remediation. It can only be a good thing to have node deletion at least as an option. Having this as the only remediation behavior and replacing the old one may well be even better; I just need to check the reasons for that.

#392

Thanks for raising the issue. Having the issue up as a place for discussion is appreciated. @zaneb and I had a discussion in the community meeting, and all's well. We agreed that more documentation around remediation would be a good thing, since remediation has some history in the project that I, being new to this area, am unaware of. We also spoke about a 'change proposal', which I think may be a good thing as well, since a consensus based on what someone said on Slack means we don't have all the technical background documented.

Tests will be part of this PR, it's WIP.

Great, thanks! I was also planning on writing some unit tests for the old behavior, since I noticed there aren't any. Hopefully I'll have a PR up before too long and we can compare notes.

There's also an open PR in the metal3-dev-env repo where some remediation behavior was being tested: metal3-io/metal3-dev-env#828. I was going to take a closer look at that PR and see if it's something we want to merge and add any useful tests there.

Also, welcome to metal3! Sorry, the discussions on PRs are not usually this robust; this PR has been interesting 😆

@slintes (Contributor, Author) commented Nov 24, 2021

#392

Thanks for raising the issue. Having the issue up as a place for discussion is appreciated. @zaneb and I had a discussion in the community meeting, and all's well.

Just to be sure: does this mean you are also fine with adding node deletion to the existing remediation strategy?

We agreed that more documentation around remediation would be a good thing, since remediation has some history in the project that I, being new to this area, am unaware of. We also spoke about a 'change proposal', which I think may be a good thing as well.

sure... where and how do I file a change proposal? 🤔

There's also an open PR in the metal3-dev-env repo where some remediation behavior was being tested: metal3-io/metal3-dev-env#828. I was going to take a closer look at that PR and see if it's something we want to merge and add any useful tests there.

Thanks for the heads up. I think having a basic test over there sounds good, but tests of implementation details should stay in this repo? Btw, I'm still busy with the e2e tests here and having a hard time actually running them... I might reach out on Slack for help.

Also, welcome to metal3! Sorry, the discussions on PRs are not usually this robust; this PR has been interesting

🙂

@zaneb (Member) commented Nov 24, 2021

sure... where and how do I file a change proposal? 🤔

A PR to the metal3-docs repo. The original design was https://github.com/metal3-io/metal3-docs/blob/master/design/cluster-api-provider-metal3/capm3-remediation-controller-proposal.md; we could edit that one, but since it is already marked as 'implemented', it might be better to create a new proposal.

@macaptain (Member) commented Nov 25, 2021

Just to be sure: does this mean you are also fine with adding node deletion to the existing remediation strategy?

I overestimated the maturity of the old code and implementation. After reading the comments on the issue, I changed my mind and I think it would be better to change the old strategy in place and fix the issue directly.

Just so I could get my head around what the remediation controller does at the moment, I raised a draft PR to add a few unit tests covering the remediation steps: #404. Let me know what you think!

Btw, I'm still busy with the e2e tests here, having a hard time to actually run them... I might reach out on Slack for help.

Sure! In the #cluster-api-baremetal channel on Kubernetes Slack there are several people who know a good deal about the e2e tests and would be happy to help.

it might be better to create a new proposal.

As @slintes said, the old proposal doesn't include details on how the remediation controller actually remediates the machine. The proposal has been implemented nonetheless, so I think a new proposal is easiest. Not all the template questions may be relevant. For someone like me who is new to this area, documentation covering the steps of what remediation should actually do in CAPM3 would be very beneficial. Thank you!

@smoshiur1237 (Member) commented Feb 10, 2022

Hi @slintes, this PR has been sitting as a draft since the end of November 2021. Do you have an update on it, and any plan to add e2e tests?

@slintes (Contributor, Author) commented Feb 16, 2022

Hi @smoshiur1237, we are currently finalizing our planning for the coming weeks and months. I can probably continue with this soon. If not, I will let you know. Thanks.

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2022
- Refactor the existing remediation test into its own file
- Call the new metal3remediation test from remediation_based_feature_test.go
- Can happen with clusters which were created with an older version

Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@slintes slintes force-pushed the remediation-node-deletion branch from 5119cfa to f79eb51 Compare July 8, 2022 14:04
@slintes (Contributor, Author) commented Jul 8, 2022

@slintes can you please rebase on top of the main branch, we are seeing some problems running the tests, thanks!

done

@mboukhalfa (Member)

/test-ubuntu-integration-main

@kashifest (Member)

I don't know what's wrong with this PR. No test is getting triggered.
/test-ubuntu-e2e-parallel-main

@mboukhalfa (Member)

/ok-to-test

@mboukhalfa (Member)

/test-ubuntu-integration-main

@furkatgofurov7 (Member)

done

Thanks! Though there still seems to be a problem with CI.

/test-ubuntu-integration-main

@kashifest (Member)

/test-ubuntu-integration-main

@Rozzii (Member) commented Jul 11, 2022

/test-ubuntu-integration-main

@furkatgofurov7 (Member)

/test all
/test-ubuntu-integration-main

@furkatgofurov7 (Member)

/test-centos-e2e-parallel-main
/test-centos-e2e-main

@metal3-io-bot (Contributor)

@slintes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit  | Details | Required | Rerun command
unit      | f79eb51 | link    | true     | /test unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@furkatgofurov7 (Member)

/test-ubuntu-integration-main

@furkatgofurov7 (Member) commented Jul 11, 2022

Hi @slintes! Triggering the integration/e2e tests in this specific PR does not work, for some reason we are still trying to find out, but it does work in any other PR in this repo. What I would suggest for now: can you please open a new PR with the same changes on top of the main branch (sorry for the extra work 🙂)? We have seen a similar issue in the past and the simple solution was creating a new PR (no 100% guarantee it will work, but we can give it a try). Meanwhile I will continue investigating the root cause of the issue. Thanks.

Edit: To keep a history of the discussions here, we could link this PR to the new one and indicate in the new PR that it replaces this one, for future reference in case someone wants to check the old discussions.

@slintes (Contributor, Author) commented Jul 11, 2022

#668

@furkatgofurov7 (Member)

#668

Thanks, seems to work now 😄

@slintes (Contributor, Author) commented Jul 11, 2022

#668

Thanks, seems to work now 😄

nice 🙂
I'll leave closing this one up to you, in case you want to investigate why CI doesn't work here.

@furkatgofurov7 (Member)

/hold
CI needs to be investigated

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 11, 2022
@furkatgofurov7 (Member)

/approve cancel

@furkatgofurov7 furkatgofurov7 removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2022
@furkatgofurov7 (Member)

/close

@metal3-io-bot (Contributor)

@furkatgofurov7: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Reboot remediation is incomplete