
Conversation

vikaschoudhary16 commented:

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:


openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Aug 8, 2019.
openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vikaschoudhary16
You can assign the PR to them by writing /assign @vikaschoudhary16 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enxebre (Member) commented Aug 9, 2019:

When deleting a previously stopped machine, deletion hangs forever on draining.

We call the eviction API, which succeeds (the eviction is granted because there is no PDB, and the pod is already scheduled for deletion: it actually has a deletionTimestamp), but then waitForDeletion never succeeds, because the pod is stateful (has local storage) and such pods are not allowed to be removed from the API server when the node is unreachable.
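To make the failure mode concrete, here is a minimal sketch of those two drain phases, assuming 2019-era client-go (pre-context method signatures); the package, function names, and timeout are illustrative, not the controller's actual code:

```go
package drainsketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// evictAndWait mirrors the two phases described above.
func evictAndWait(client kubernetes.Interface, pod corev1.Pod) error {
	// Phase 1: the eviction call. This succeeds even though the pod already
	// has a deletionTimestamp, because no PodDisruptionBudget blocks it.
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	}
	if err := client.CoreV1().Pods(pod.Namespace).Evict(eviction); err != nil {
		return fmt.Errorf("eviction failed: %v", err)
	}

	// Phase 2: wait for the pod to disappear from the API server. With an
	// unreachable kubelet, a pod with local storage is never finalized, so
	// this condition never returns true and the drain hangs.
	return wait.PollImmediate(5*time.Second, 20*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // pod is gone; drain can proceed
		}
		return false, nil // still present; keep waiting
	})
}
```

The change proposed in this PR instead guards the drain path on node readiness: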

```diff
 // by cloud controller manager. In that case some machines would never get
 // deleted without a manual intervention.
-if _, exists := m.ObjectMeta.Annotations[ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil {
+if _, exists := m.ObjectMeta.Annotations[ExcludeNodeDrainingAnnotation]; !exists && m.Status.NodeRef != nil && r.isNodeReady(ctx, m.Status.NodeRef.Name) {
```

If anything, I think we should probably check that the node is unreachable; still not sure we want to hard delete.
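
As a sketch of what that check could look like, assuming the convention that an unreachable node reports its NodeReady condition as Unknown (the helper name is illustrative):

```go
package drainsketch

import corev1 "k8s.io/api/core/v1"

// isNodeUnreachable reports whether the node controller has lost contact
// with the node's kubelet. Unknown (rather than False) is what the node
// controller sets when heartbeats stop; it also taints the node with
// node.kubernetes.io/unreachable at that point.
func isNodeUnreachable(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionUnknown
		}
	}
	// No NodeReady condition reported at all: treat as unreachable.
	return true
}
```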

enxebre (Member) commented Aug 9, 2019:

Some proposals could be:

Option 1:
- The machine controller never hard deletes if there's no annotation; drain hangs if it's not able to succeed, and manual intervention is required.
- As a user you have MHC, which has some opinions, makes automatic decisions for you, and will skip drain if required.

Option 2:
- We account for this use case at the machine controller level and consider skipping drain, letting deletion move forward when the node is unreachable. This is a sensitive scenario, as the stateful pod could hold critical user application data.

/hold

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Aug 9, 2019.
vikaschoudhary16 force-pushed the skip-drain-when-node-unready branch from 1df2e4c to 12df9b9 on August 13, 2019 06:54.
bison commented Aug 13, 2019:

I think drain needs to be configurable at the machine-controller level. At a minimum we should respect some kind of drain-timeout annotation. If a drain hasn't completed within that time, proceed with the deletion. This matches the behavior of the kubectl drain command.

The question then becomes what the default value should be. I think it makes sense to match kubectl drain again and default to zero, which would mean no timeout. Possibly MachineHealthCheck could set a custom timeout.
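
A minimal sketch of such an annotation, under the assumption of a hypothetical key and helper (neither is an agreed API), with zero meaning no timeout to match kubectl drain:

```go
package drainsketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation key; not part of any released API.
const drainTimeoutAnnotation = "machine.openshift.io/drain-timeout"

// drainTimeout returns how long the machine controller should wait for a
// drain to finish. Zero means no timeout, matching kubectl drain's default.
func drainTimeout(meta metav1.ObjectMeta) time.Duration {
	raw, ok := meta.Annotations[drainTimeoutAnnotation]
	if !ok {
		return 0 // no annotation: wait forever, like kubectl drain --timeout=0
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d < 0 {
		return 0 // unparseable or negative values fall back to the default
	}
	return d // e.g. "10m", which MachineHealthCheck could set for remediation
}
```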

michaelgugino left a comment:

Never force delete a machine; never skip draining. We have a lot of assumptions built in multiple places around draining behavior.

If draining is broken, an admin needs to intervene. This would be a very rare edge case, and exactly the thing we should be alerting on with metrics, a "machine with a really old deletion timestamp" kind of alarm.
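
As a sketch of that alarm, assuming prometheus/client_golang and an illustrative metric name, the controller could export the age of each pending delete:

```go
package drainsketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// machineDeletionAge exposes how long each machine's delete has been
// pending, so an alert can fire on "deletion pending for too long".
var machineDeletionAge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machine_deletion_pending_seconds",
		Help: "Seconds since a machine's deletionTimestamp was set.",
	},
	[]string{"machine"},
)

func init() {
	prometheus.MustRegister(machineDeletionAge)
}

// recordDeletionAge would be called from every reconcile of a machine.
func recordDeletionAge(name string, deletionTimestamp *metav1.Time) {
	if deletionTimestamp == nil {
		return // machine is not being deleted
	}
	machineDeletionAge.WithLabelValues(name).Set(
		time.Since(deletionTimestamp.Time).Seconds(),
	)
}
```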

vikaschoudhary16 (Author) commented:

Thanks a lot @bison and @michaelgugino for sharing your thoughts.
If I am not mistaken, Brad is suggesting something along the lines of option 1.

@michaelgugino I agree with what you said regarding manual intervention. I see option 1 as "your suggestion" + "a configuration option at MHC that, if enabled, would remediate by skipping drain in situations where drain cannot succeed, such as when the kubelet is down".

vikaschoudhary16 (Author) commented:

/test unit

vikaschoudhary16 (Author) commented:

/test unit

openshift-ci-robot commented:

@vikaschoudhary16: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/goimports | 12df9b9 | link | /test goimports |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot commented:

/bugzilla refresh

openshift-ci-robot commented:

@openshift-bot: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

> /bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

eparis (Member) commented Oct 26, 2019:

Do we want to pursue this in 4.2? Did we do it in 4.3? Should we just close this and look toward future releases?

enxebre (Member) commented Oct 28, 2019:

> Do we want to pursue this in 4.2? Did we do it in 4.3? Should we just close this and look toward future releases?

Yes, please let's re-open against master if relevant. FWIW, a more generic approach is being proposed here: https://groups.google.com/forum/#!topic/kubernetes-sig-cli/f4lLTdg0LsE

enxebre closed this on Oct 28, 2019.
