New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OCPCLOUD-937] Update resources to allow termination handler to apply conditions #627
[OCPCLOUD-937] Update resources to allow termination handler to apply conditions #627
Conversation
Skipping CI for Draft Pull Request. |
/test all |
3a41e69
to
13acb9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the approach as described makes sense to me, i'd like to understand some of the risks more though. could you expand on this a little bit, i'm not following:
There will be a period between merging this and the provider PRs that termination will be broken, this is unavoidable as far as I can tell
edit:
is this like a chicken/egg situation, we can't do the provider PRs until the main one is done, but the providers won't work until they are all there?
selector: | ||
matchLabels: | ||
machine.openshift.io/interruptible-instance: "" | ||
maxUnhealthy: 100% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean that it would tolerate 100% of the worker nodes being unhealthy, or is this just for spot instances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is 100% so that all spot instances that are marked with the terminating conditions (ie ones that will go away within 2 minutes or 30 seconds depending on provider), are deleted and drained immediately, there's no point trying to not interrupt the state of the cluster anymore, as it's going to be interrupted anyway
Precisely this problem, if we update the providers first, they won't be able to apply the conditions that they need to, because they won't have the permissions. Even if they did, the MHC wouldn't be deployed that causes the Machine deletion. Secondly, if we merge this first, then the old termination handlers won't have permissions to delete the Machine anymore, so won't work that way either. |
makes perfect sense, thanks Joel! one more question, i feel like turning on an automated MHC is something we need to document. will there be an addition to the documentation here, is this already captured in an enhancement? i would like to make sure we capture some user facing documentation is my main point. |
We should update the enhancements repo at a minimum, it wasn't done yet, but this is mentioned there as an alternative already, we just need to switch the implementation and the alternative |
/test all |
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alexander-demichev The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
13acb9a
to
0f807a8
Compare
0f807a8
to
7f78f57
Compare
I've updated this so that I think we can merge this, then merge the provider implementations, then merge a second PR which will remove some of the stuff that is no longer needed post the migration |
/retest |
7f78f57
to
217d908
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes. |
3 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@JoelSpeed: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This PR switches the spot termination handler to use conditions instead of deleting machines directly.
This will require counter PRs in each provider repository to update the termination handler there as well.
Considerations:
privileged
SCC but this gives us far more permission that we need