
Conversation

vikaschoudhary16 commented:

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:


openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Aug 8, 2019.
openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vikaschoudhary16
You can assign the PR to them by writing /assign @vikaschoudhary16 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enxebre (Member) commented Aug 9, 2019:

When deleting a previously stopped machine, deletion hangs forever on draining.

We call the eviction API, which succeeds (the eviction is granted because there is no PDB, and the pod is already scheduled for deletion: it actually has a deletionTimestamp), but then waitForDeletion never succeeds, because the pod is stateful (has local storage) and such pods are not allowed to be removed from the API server when the node is unreachable.
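To make the failure mode concrete, here is a minimal sketch of those two drain phases, assuming 2019-era client-go (pre-context method signatures); the package, function names, and timeout are illustrative, not the controller's actual code:

```go
package drainsketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// evictAndWait mirrors the two phases described above.
func evictAndWait(client kubernetes.Interface, pod corev1.Pod) error {
	// Phase 1: the eviction call. This succeeds even though the pod already
	// has a deletionTimestamp, because no PodDisruptionBudget blocks it.
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	}
	if err := client.CoreV1().Pods(pod.Namespace).Evict(eviction); err != nil {
		return fmt.Errorf("eviction failed: %v", err)
	}

	// Phase 2: wait for the pod to disappear from the API server. With an
	// unreachable kubelet, a pod with local storage is never finalized, so
	// this condition never returns true and the drain hangs.
	return wait.PollImmediate(5*time.Second, 20*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // pod is gone; drain can proceed
		}
		return false, nil // still present; keep waiting
	})
}
```

The change proposed in this PR instead guards the drain path on node readiness: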

```diff
 // by cloud controller manager. In that case some machines would never get
 // deleted without a manual intervention.
-if _, exists := m.ObjectMeta.Annotations[ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil {
+if _, exists := m.ObjectMeta.Annotations[ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil && r.isNodeReady(ctx, m.Status.NodeRef.Name) {
```

If anything, I think we should probably check that the node is unreachable; still not sure we want to hard delete.
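
As a sketch of what that check could look like, assuming the convention that an unreachable node reports its NodeReady condition as Unknown (the helper name is illustrative):

```go
package drainsketch

import corev1 "k8s.io/api/core/v1"

// isNodeUnreachable reports whether the node controller has lost contact
// with the node's kubelet. Unknown (rather than False) is what the node
// controller sets when heartbeats stop; it also taints the node with
// node.kubernetes.io/unreachable at that point.
func isNodeUnreachable(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionUnknown
		}
	}
	// No NodeReady condition reported at all: treat as unreachable.
	return true
}
```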

enxebre (Member) commented Aug 9, 2019:

Some proposals could be:

Option 1:
- The machine controller never hard deletes if there's no annotation; drain hangs if it's not able to succeed, and manual intervention is required.
- As a user you have MHC, which has some opinions, makes automatic decisions for you, and will skip drain if required.

Option 2:
- We account for this use case at the machine controller level and consider skipping drain, letting deletion move forward when the node is unreachable. This is a sensitive scenario, as the stateful pod could hold critical user application data.

/hold

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Aug 9, 2019.
vikaschoudhary16 force-pushed the skip-drain-when-node-unready branch from 1df2e4c to 12df9b9 on August 13, 2019 06:54.
bison commented Aug 13, 2019:

I think drain needs to be configurable at the machine-controller level. At a minimum we should respect some kind of drain-timeout annotation. If a drain hasn't completed within that time, proceed with the deletion. This matches the behavior of the kubectl drain command.

The question then becomes what the default value should be. I think it makes sense to match kubectl drain again and default to zero, which would mean no timeout. Possibly MachineHealthCheck could set a custom timeout.
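
A minimal sketch of such an annotation, under the assumption of a hypothetical key and helper (neither is an agreed API), with zero meaning no timeout to match kubectl drain:

```go
package drainsketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation key; not part of any released API.
const drainTimeoutAnnotation = "machine.openshift.io/drain-timeout"

// drainTimeout returns how long the machine controller should wait for a
// drain to finish. Zero means no timeout, matching kubectl drain's default.
func drainTimeout(meta metav1.ObjectMeta) time.Duration {
	raw, ok := meta.Annotations[drainTimeoutAnnotation]
	if !ok {
		return 0 // no annotation: wait forever, like kubectl drain --timeout=0
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d < 0 {
		return 0 // unparseable or negative values fall back to the default
	}
	return d // e.g. "10m", which MachineHealthCheck could set for remediation
}
```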

michaelgugino left a comment:

Never force delete a machine; never skip draining. We have a lot of assumptions built in multiple places around draining behavior.

If draining is broken, an admin needs to intervene. This would be a very rare edge case, and exactly the thing we should be alerting on with metrics, a "machine with a really old deletion timestamp" kind of alarm.
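
As a sketch of that alarm, assuming prometheus/client_golang and an illustrative metric name, the controller could export the age of each pending delete:

```go
package drainsketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// machineDeletionAge exposes how long each machine's delete has been
// pending, so an alert can fire on "deletion pending for too long".
var machineDeletionAge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machine_deletion_pending_seconds",
		Help: "Seconds since a machine's deletionTimestamp was set.",
	},
	[]string{"machine"},
)

func init() {
	prometheus.MustRegister(machineDeletionAge)
}

// recordDeletionAge would be called from every reconcile of a machine.
func recordDeletionAge(name string, deletionTimestamp *metav1.Time) {
	if deletionTimestamp == nil {
		return // machine is not being deleted
	}
	machineDeletionAge.WithLabelValues(name).Set(
		time.Since(deletionTimestamp.Time).Seconds(),
	)
}
```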

vikaschoudhary16 (Author) commented:

Thanks a lot @bison and @michaelgugino for sharing your thoughts.
If I am not mistaken, Brad is suggesting something along the lines of option 1.

@michaelgugino I agree with what you said regarding manual intervention. I see option 1 as "your suggestion" + "a configuration option at MHC that, if enabled, would remediate by skipping drain in situations where drain cannot succeed, such as when the kubelet is down".

vikaschoudhary16 (Author) commented:

/test unit

vikaschoudhary16 (Author) commented:

/test unit

openshift-ci-robot commented:

@vikaschoudhary16: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/goimports | 12df9b9 | link | /test goimports |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot commented:

/bugzilla refresh

openshift-ci-robot commented:

@openshift-bot: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

> /bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

eparis (Member) commented Oct 26, 2019:

Do we want to pursue this in 4.2? Did we do it in 4.3? Should we just close this and look toward future releases?

enxebre (Member) commented Oct 28, 2019:

> Do we want to pursue this in 4.2? Did we do it in 4.3? Should we just close this and look toward future releases?

Yes, please let's re-open against master if relevant. FWIW, a more generic approach is being proposed here: https://groups.google.com/forum/#!topic/kubernetes-sig-cli/f4lLTdg0LsE

enxebre closed this on Oct 28, 2019.
