
Node VM failure doesn't automatically recreate a pod with attached PV #80040

Open
vmwkkwong opened this issue Jul 11, 2019 · 16 comments
Labels
  • area/provider/vmware — Issues or PRs related to vmware provider
  • kind/bug — Categorizes issue or PR as related to a bug.
  • lifecycle/frozen — Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-soon — Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/cloud-provider — Categorizes an issue or PR as relevant to SIG Cloud Provider.

@vmwkkwong

What happened:
After a deployment with a PVC is created, the node hosting the pod shuts down. After the 6-minute timeout, a replacement pod is created, but it cannot come up because the volume is still attached to the terminating pod on the shut-down node. The workaround is to force delete the original pod, which causes the volume to be detached.
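The workaround above can be sketched as the following kubectl invocation (the pod name and namespace are placeholders for the pod stuck in Terminating):

```shell
# Force delete the pod stuck in Terminating on the failed node.
# --grace-period=0 --force removes the pod object immediately without
# waiting for kubelet confirmation; the attach/detach controller can
# then detach the PV and attach it to the replacement pod's node.
# WARNING: only do this once the node is confirmed powered off;
# otherwise the volume may end up mounted on two nodes and data can
# be corrupted.
kubectl delete pod <stuck-pod-name> -n <namespace> --grace-period=0 --force
```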

What you expected to happen:
According to the Node VM Failure scenario at https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/high-availability.html, the recovery mechanism is completely automatic. In reality, manual intervention is required to force delete the pod.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.12.7-gke.19
  • Cloud provider or hardware configuration: vSphere Cloud Provider
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig vmware

@vmwkkwong vmwkkwong added the kind/bug Categorizes issue or PR as related to a bug. label Jul 11, 2019
@k8s-ci-robot k8s-ci-robot added the area/provider/vmware Issues or PRs related to vmware provider label Jul 11, 2019
@vmwkkwong
Author

/sig vmware

@rhockenbury

Out of curiosity, what 6 min timeout are you referring to?

@vmwkkwong
Author

@rhockenbury the Kubernetes attach/detach controller's 6-minute timeout, after which it force-detaches the PV.

@nikhita nikhita added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Aug 6, 2019
@alexbrand
Contributor

I was able to reproduce this issue as well by doing the following:

  1. Create a cluster on vSphere with the vSphere in-tree cloud provider configured.
  2. Create a PVC (This triggers the cloud provider to create a PV)
  3. Create a deployment that consumes the PVC
  4. Go into the HTML5 vSphere Client and Power -> Power Off the VM where the pod is running
  5. Wait 5 minutes until Kubernetes reschedules the pod
  6. Notice that the new pod is stuck in ContainerCreating:
NAME                       READY   STATUS              RESTARTS   AGE    IP              NODE                                    NOMINATED NODE   READINESS GATES
busybox-644dbf5f4b-l5lzn   1/1     Terminating         0          153m   100.115.249.6   workload-cluster-1-machineset-1-l2jmg   <none>           <none>
busybox-644dbf5f4b-ptp6v   0/1     ContainerCreating   0          28m    <none>          workload-cluster-1-machineset-1-lf2tq   <none>           <none>
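Steps 2 and 3 above can be sketched with manifests along these lines (the names, image, and storage size are assumptions, and the cluster's default StorageClass is assumed to use the in-tree vSphere provisioner):

```yaml
# Hypothetical PVC; with the in-tree vSphere cloud provider configured,
# binding this claim triggers dynamic provisioning of a PV (step 2).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: busybox-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
# Deployment consuming the PVC (step 3). Powering off the node that
# hosts this pod reproduces the stuck-Terminating behavior.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          command: ["sleep", "3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: busybox-pvc
```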

@maplain

maplain commented Sep 3, 2019

Looks a lot like an issue filed before: #71829

@andrewsykim
Member

This is a known issue across all cloud providers at the moment. @yastij is working on a KEP that addresses this: kubernetes/enhancements#1116. However, the problem is a bit thorny, as it requires coordination among a number of different components (controller manager, scheduler, and kubelet), and errors can lead to data corruption in certain situations. Will let @yastij comment further if there's anything else to add.

@yastij
Member

yastij commented Sep 3, 2019

This is intended to be fixed by kubernetes/enhancements#1116. The current KEP reflects part of the solution, but we still need to update it to make workloads opt in to this behavior.

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 3, 2019
@SandeepPissay
Contributor

/assign @SandeepPissay

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2020
@yastij
Member

yastij commented Jan 10, 2020

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 10, 2020
@cheftako
Member

cheftako commented Feb 3, 2021

@SandeepPissay Are you still looking at this?

@SandeepPissay SandeepPissay removed their assignment Feb 3, 2021
@SandeepPissay
Contributor

@SandeepPissay Are you still looking at this?

No, I'm not.

@cheftako
Member

cheftako commented Feb 4, 2021

@andrewsykim do you know anyone who can look at this? Do we know if it's still a problem?

@hassenius

I can confirm that this is very much still an issue as of Kubernetes 1.19.
Several days after shutting down a node, pods are still in a Terminating state, blocking recovery of the desired pod count.

@jingxu97
Contributor

jingxu97 commented Mar 30, 2021

cc @xing-yang @divyenpatel @jingxu97

@yuga711
Contributor

yuga711 commented Mar 30, 2021

Referencing a PR (kubernetes-sigs/vsphere-csi-driver#529) for a similar issue in the vSphere CSI driver.
