
Fix VMWare VM freezing bug by reverting #51066 #67825

Merged - 1 commit merged on Aug 29, 2018

Conversation

@nikopen (Contributor) commented Aug 24, 2018

What this PR does / why we need it: kube-controller-manager, vSphere specific: when the controller tries to attach a volume to Node A while it is still attached to Node B, Node A freezes until the attach completes. Kubernetes keeps retrying the attach because it treats the volume as multi-attachable when it is not. #51066 is the culprit.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes vmware-archive#500 / vmware-archive#502 (same issue)

Special notes for your reviewer:

  • Repro:

vSphere installation, any k8s version from 1.8 and above, pod with an attached PV/PVC/VMDK (see the command sketch after these notes):

  1. Cordon the node the pod is running on.
  2. kubectl delete po/[pod] --force --grace-period=0
  3. The pod is immediately rescheduled to a new node. Grab the new node from kubectl describe [pod] and attempt to ping it or SSH into it.
  4. Pings/SSH fail to reach the new node, and kubectl get node shows it as 'NotReady'. The new node stays frozen until the volume is attached - usually a 1-minute freeze for 1 volume in a low-load cluster, and many minutes more with higher loads and more volumes involved.
  • Patch verification:

Tested a custom-patched 1.9.10 kube-controller-manager with #51066 reverted and the above bug is resolved - can't repro it anymore. The new node doesn't freeze at all, and attaching happens quite quickly, in a few seconds.
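
For reference, a minimal command sketch of the repro above, in the same style as the outputs later in this thread; the pod and node names are placeholders:

# kubectl cordon [old-node]                          <- stop scheduling onto the node the pod is on
# kubectl delete po/[pod] --force --grace-period=0   <- force-delete so the pod reschedules immediately
# kubectl describe po/[pod] | grep Node:             <- find the node the replacement pod landed on
# ping [new-node-ip]                                 <- pings/SSH time out while the new node is frozen
# kubectl get nodes                                  <- the new node shows NotReady until the attach completes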

Release note:

Fix VSphere VM Freezing bug by reverting #51066 

@k8s-ci-robot added the release-note, size/XS, and needs-ok-to-test labels on Aug 24, 2018
@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no label on Aug 24, 2018
@gnufied (Member) commented Aug 24, 2018

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label on Aug 24, 2018
@gnufied (Member) commented Aug 24, 2018

/assign @divyenpatel @BaluDontu

@frapposelli (Member)

/cc @dougm

PTAL

@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on Aug 24, 2018
@dims (Member) commented Aug 24, 2018

/sig storage

@k8s-ci-robot added the sig/storage label on Aug 24, 2018
@divyenpatel (Member)

@nikopen Can you verify the workflow and behavior when the node the pod is running on is powered off?

@SandeepPissay (Contributor)

@divyenpatel @BaluDontu and I discussed this issue. vSphere allows multi-attach so that, on node power-off, the pods on the failed node can be quickly scheduled onto another worker node, reducing pod downtime. Since the node is powered off, the attach of the disk to the new worker node succeeds instantly and the pods can come up sooner. But in the non-powered-off scenario this looks like it is causing an issue. Note that vSphere volumes do not support ReadWriteMany.
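
As a quick cross-check (a sketch; PV names will differ per cluster), the access modes the bound PVs actually report can be listed with kubectl custom-columns - a vSphere VMDK-backed PV reports ReadWriteOnce, not ReadWriteMany:

# kubectl get pv -o custom-columns=NAME:.metadata.name,ACCESS:.spec.accessModes,CLASS:.spec.storageClassName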

We have the following questions:

  1. What is the Kubernetes recommendation for pod downtime when a worker node is powered off?
  2. How do other cloud providers like AWS, Azure, GCE, etc. handle pod downtime in the case of node power-off?

It's OK to revert this change, but it would be better to have clarity on the above questions.

@gnufied (Member) commented Aug 25, 2018

@SandeepPissay I noticed another variant of this bug today. The problem is that the vSphere cloud provider is not detaching volumes from old nodes at all. Say I have a pod with a volume on node X and node X is powered off; the vSphere cloud provider then removes the node API object from etcd. This causes DetachDisk to fail, and detach is never even attempted from node X.

But because multi-attach is enabled for vSphere, the disk in question gets attached to node Y. When I then tried to restart node X, it refused to boot with:

"Cannot open the disk '/vmfs/volumes/5b0f026b-31c5ecfd-1537-4ccc6ab9973d/kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk' or one of the snapshot disks it depends on."

So it appears that using multi-attach to get around some of the detach issues is problematic and needs more testing.

To answer your questions:

a. When a worker node is powered off, a pod on it will eventually be evicted and started on a healthy node. Only when the pod is evicted should a detach be performed, and if the underlying volume type does not support multi-attach, the volume should not be allowed to attach to another node. GCE, AWS, etc. don't allow a volume to be attached to a different node.

b. Currently all cloud providers are moving towards a model where the node API object is not deleted when the node is shut down. There is a lengthy discussion of this in #45986, so vSphere will be the only cloud provider left that deletes the node object when a node is shut down.

The upshot of that decision is that on AWS/GCE, when a node is shut down and pods get evicted from the stopped node, volumes get "force detached" from the powered-off node six minutes after that event. There is an ongoing feature to taint the node with a "Shutdown" taint, so that the attach/detach controller can know the node has been shut down and detach volumes sooner.

Another side effect of the decision is that pods running on a shutdown node don't necessarily get deleted on AWS/GCE/OpenStack; they remain in the "Unknown" state. One win I see from this approach for vSphere is the bug I mentioned above: because vSphere deletes the node API object when a node is turned off, the node object is removed from the internal node map cache, which means detach is not possible from that node. Keeping the node object around will fix that (though there may be other ways to work around this).
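
One rough way to observe this from the API side (a sketch; the node name is a placeholder): the node object's status records what the attach/detach controller believes is attached and what the kubelet reports as in use, so you can watch whether a detach ever actually happens for the powered-off node:

# kubectl get node [node-x] -o jsonpath='{.status.volumesAttached}'   <- volumes the controller considers attached
# kubectl get node [node-x] -o jsonpath='{.status.volumesInUse}'      <- volumes the kubelet reports as in use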

cc @saad-ali @sjenning @childsb

@nikopen (Contributor, Author) commented Aug 27, 2018

@divyenpatel @BaluDontu @SandeepPissay

As @gnufied said, #51066 was a custom vSphere patch that gave it different behaviour from all other cloud providers, but it caused a much bigger problem: freezing a whole VM for a minute or more, i.e. freezing all pods running on that VM.

This PR reverts that change and returns to the behaviour all other cloud providers have. The solution for the 6-minute shutdown wait time will be implemented with #45986.

Could you please merge this and also approve cherry-picks for all previous major versions affected? That would be 1.8, 1.9, 1.10, 1.11.

Thanks

@gnufied (Member) commented Aug 27, 2018

/lgtm

@k8s-ci-robot added the lgtm label on Aug 27, 2018
@divyenpatel (Member) commented Aug 27, 2018

@nikopen

With this change, the pod is never able to come up on another node. The disk remains attached to the powered-off node, and pod creation on the new node fails.

Events:
  Type     Reason              Age               From                       Message
  ----     ------              ----              ----                       -------
  Normal   Scheduled           9m                default-scheduler          Successfully assigned default/wordpress-mysql-779b4ddc75-xm5vb to kubernetes-node4
  Warning  FailedAttachVolume  9m                attachdetach-controller    Multi-Attach error for volume "pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c" Volume is already used by pod(s) wordpress-mysql-779b4ddc75-wwdmr
  Warning  FailedMount         21s (x4 over 7m)  kubelet, kubernetes-node4  Unable to mount volumes for pod "wordpress-mysql-779b4ddc75-xm5vb_default(1cdad0c4-aa2c-11e8-8703-0050569d768c)": timeout expired waiting for volumes to attach or mount for pod "default"/"wordpress-mysql-779b4ddc75-xm5vb". list of unmounted volumes=[mysql-persistent-storage]. list of unattached volumes=[mysql-persistent-storage default-token-4kd4n]
# kubectl get pods
NAME                               READY     STATUS              RESTARTS   AGE
wordpress-mysql-779b4ddc75-wwdmr   1/1       Unknown             0          17m
wordpress-mysql-779b4ddc75-xm5vb   0/1       ContainerCreating   0          9m48s
# kubectl get nodes
NAME                STATUS                     ROLES     AGE       VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node1    Ready                      <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node2    NotReady                   <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node3    Ready                      <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty
kubernetes-node4    Ready                      <none>    35m       v1.13.0-alpha.0.626+ed3c32c3f9af0e-dirty

Log

{"log":"W0827 19:05:07.450164       1 reconciler.go:370] Multi-Attach error for volume \"pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] c848845b-381e-d536-d784-02002c5be5de/kubernetes-dynamic-pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c.vmdk\") from node \"kubernetes-node4\" Volume is already used by pods default/wordpress-mysql-779b4ddc75-wwdmr on node kubernetes-node2\n","stream":"stderr","time":"2018-08-27T19:05:07.450201326Z"}
{"log":"I0827 19:05:07.450290       1 event.go:221] Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"wordpress-mysql-779b4ddc75-xm5vb\", UID:\"1cdad0c4-aa2c-11e8-8703-0050569d768c\", APIVersion:\"v1\", ResourceVersion:\"2113\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume \"pvc-ffcf2f53-aa2a-11e8-8703-0050569d768c\" Volume is already used by pod(s) wordpress-mysql-779b4ddc75-wwdmr\n","stream":"stderr","time":"2018-08-27T19:05:07.450420381Z"}

@gnufied (Member) commented Aug 27, 2018

@divyenpatel is that because of #67900, the issue I filed? Anyhow, using multi-attach to mask a detach problem seems incorrect.

@k8s-ci-robot added the approved label on Aug 29, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@nikopen (Contributor, Author) commented Aug 29, 2018

/test pull-kubernetes-e2e-gce

@frapposelli (Member)

/test pull-kubernetes-e2e-kops-aws

@nikopen (Contributor, Author) commented Aug 29, 2018

/test pull-kubernetes-e2e-gce

@nikopen (Contributor, Author) commented Aug 29, 2018

/test pull-kubernetes-e2e-kops-aws

1 similar comment
@nikopen (Contributor, Author) commented Aug 29, 2018

/test pull-kubernetes-e2e-kops-aws

@frapposelli (Member)

/test pull-kubernetes-e2e-gce

@frapposelli (Member)

/test pull-kubernetes-e2e-kops-aws

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 67745, 67432, 67569, 67825, 67943). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@gnufied (Member) commented Aug 30, 2018

@nikopen Can you please confirm whether your 1.9 clusters are affected by #67900? Because as per your comment #67825 (comment), detach from shutdown nodes works in 1.9.

But my testing says otherwise, and I think @divyenpatel has also confirmed this. In 1.9, when a node is shut down it gets removed from the API server and from the nodeinfo map, and hence the DetachDisk call fails. There is another PR for this, #63413, so I must not be alone in this.
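
A simple way to check which behavior a given cluster exhibits (node name is a placeholder): watch the node list while powering the VM off and see whether the node object is deleted outright or merely goes NotReady:

# kubectl get nodes -w                                                <- with the old 1.9 behavior the powered-off node disappears; otherwise it turns NotReady
# kubectl get node [node-x] -o jsonpath='{.status.volumesAttached}'   <- stays populated if the detach never happened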

@nikopen (Contributor, Author) commented Aug 30, 2018

@gnufied The repro comment still holds, tested on both non-patched and patched 1.9.10 clusters. I think the shutdown was issued via SSH and sudo shutdown; does this remove the VM from the vSphere inventory? It probably shouldn't, but I didn't verify that last detail.

I can attempt it tomorrow again and let you know.

@nikopen (Contributor, Author) commented Aug 31, 2018

@gnufied @divyenpatel Just tried a VM hard shutdown on a patched cluster and again couldn't repro #67900. The volume got re-attached in 6 minutes as expected.

k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
…5-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066

Cherry pick of #67825 on release-1.10.

#67825: Fix VMWare VM freezing bug by reverting #51066
k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
…5-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066

Cherry pick of #67825 on release-1.9.

#67825: Fix VMWare VM freezing bug by reverting #51066
k8s-ci-robot added a commit that referenced this pull request Sep 10, 2018
…5-upstream-release-1.11

Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066
@gnufied mentioned this pull request Oct 11, 2018
Successfully merging this pull request may close these issues.

Attaching and deleting volumes seems fuzzy