Skip drain when node unready #61
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
When deleting a previously stopped machine, deletion hangs forever on draining:
we call the eviction API, which succeeds (the PDB check passes because there is no PDB, and the pod is already scheduled for deletion, i.e. it has a deletionTimestamp), but then waitForDeletion never succeeds because the pod is stateful (has local storage) and such pods are not allowed to be removed from the API server when the node is unreachable.
  // by cloud controller manager. In that case some machines would never get
  // deleted without a manual intervention.
- if _, exists := m.ObjectMeta.Annotations[ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil {
+ if _, exists := m.ObjectMeta.Annotations[ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil && r.isNodeReady(ctx, m.Status.NodeRef.Name) {
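The body of isNodeReady is not shown in this excerpt. A minimal sketch of what such a check could look like, assuming a controller-runtime client on the reconciler (the struct name, field, and error handling below are illustrative, not the PR's actual implementation):

package machine

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ReconcileMachine is a stand-in for the machine controller's reconciler type.
type ReconcileMachine struct {
	Client client.Client
}

// isNodeReady returns true only if the node exists and reports a Ready condition of True.
func (r *ReconcileMachine) isNodeReady(ctx context.Context, name string) bool {
	node := &corev1.Node{}
	if err := r.Client.Get(ctx, types.NamespacedName{Name: name}, node); err != nil {
		// If the node object cannot be fetched, treat it as not ready so the drain is skipped.
		return false
	}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

Treating a fetch error as "not ready" errs on the side of skipping the drain, which is the behavior this change is after.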
If anything, I think we should probably check that the node is unreachable; still not sure we want to hard delete.
Some proposal could be: Option 2:
/hold
Force-pushed from 1df2e4c to 12df9b9
I think drain needs to be configurable at the machine-controller level. At a minimum we should respect some kind of [...]. The question then becomes what the default value should be. I think it makes sense to match [...].
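As a rough illustration of the kind of controller-level knob being discussed (a sketch only; the flag names, defaults, and wiring below are hypothetical and not part of this PR or the machine controller):

package main

import (
	"flag"
	"fmt"
)

func main() {
	// A controller-level default could preserve today's behavior (always drain, wait
	// indefinitely) while still letting an operator opt out or bound the drain.
	skipDrain := flag.Bool("skip-drain", false, "never drain nodes before deleting machines")
	drainTimeout := flag.Duration("drain-timeout", 0, "how long to retry draining; 0 means wait forever")
	flag.Parse()

	fmt.Printf("skip-drain=%v drain-timeout=%s\n", *skipDrain, *drainTimeout)
}

Defaulting skip-drain to false and drain-timeout to 0 (wait forever) would match the current behavior, which is the open question about defaults raised above.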
Never force delete a machine, never skip draining. We have a lot of assumptions built into multiple places around draining behavior.
If draining is broken, an admin needs to intervene. This would be a real edge case, and exactly the thing we should be alerting on with metrics: a "machine with a really old delete timestamp" kind of thing.
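For illustration, the kind of metric such an alert could be built on might look like the sketch below; the metric name, label, and package are hypothetical, not something this repository is known to expose:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// machineDeletionAgeSeconds tracks how long a machine has had a deletionTimestamp
// while still existing; an alert can fire when this stays above a threshold.
var machineDeletionAgeSeconds = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machine_deletion_age_seconds",
		Help: "Seconds since a machine's deletionTimestamp was set while the machine still exists.",
	},
	[]string{"machine"},
)

func init() {
	prometheus.MustRegister(machineDeletionAgeSeconds)
}

// RecordDeletionAge would be called from the reconcile loop for any machine that
// has a deletionTimestamp but has not yet been removed.
func RecordDeletionAge(machineName string, deletionTimestamp time.Time) {
	machineDeletionAgeSeconds.WithLabelValues(machineName).Set(time.Since(deletionTimestamp).Seconds())
}

An alert on machine_deletion_age_seconds exceeding some generous threshold would catch the "drain is stuck" case without changing deletion behavior.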
Thanks a lot @bison and @michaelgugino for sharing your thoughts. @michaelgugino, I agree with what you said regarding manual intervention. I see option 1 as "your suggestion" plus a configuration option at the MHC which, if enabled, would remediate by skipping the drain in situations where the drain cannot possibly succeed, such as when the kubelet is down.
/test unit
1 similar comment
/test unit
@vikaschoudhary16: The following test failed, say /retest to rerun it:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/bugzilla refresh
@openshift-bot: No Bugzilla bug is referenced in the title of this pull request. In response to this:
/bugzilla refresh
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@openshift-bot: No Bugzilla bug is referenced in the title of this pull request. In response to this:
/bugzilla refresh
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Do we want to pursue this in 4.2? Did we do it in 4.3? Should we just close this and look toward future releases?
Yes, please; let's re-open against master if relevant. FWIW, a more generic approach is being proposed here: https://groups.google.com/forum/#!topic/kubernetes-sig-cli/f4lLTdg0LsE