
Node VM failure doesn't automatically recreate a pod with attached PV #80040

Open
vmwkkwong opened this issue Jul 11, 2019 · 16 comments
Labels
  • area/provider/vmware — Issues or PRs related to vmware provider
  • kind/bug — Categorizes issue or PR as related to a bug.
  • lifecycle/frozen — Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-soon — Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/cloud-provider — Categorizes an issue or PR as relevant to SIG Cloud Provider.

@vmwkkwong

What happened:
After a deployment with a PVC is created, the node hosting the pod shuts down. After the 6-minute timeout, a replacement pod is created, but it cannot come up because the volume is still attached to the terminating pod on the shut-down node. The workaround is to force delete the original pod, which causes the volume to be detached.
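The workaround above can be sketched as the following kubectl invocation (the pod name and namespace are placeholders for the pod stuck in Terminating):

```shell
# Force delete the pod stuck in Terminating on the failed node.
# --grace-period=0 --force removes the pod object immediately without
# waiting for kubelet confirmation; the attach/detach controller can
# then detach the PV and attach it to the replacement pod's node.
# WARNING: only do this once the node is confirmed powered off;
# otherwise the volume may end up mounted on two nodes and data can
# be corrupted.
kubectl delete pod <stuck-pod-name> -n <namespace> --grace-period=0 --force
```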

What you expected to happen:
According to the Node VM Failure scenario at https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/high-availability.html, the recovery mechanism is completely automatic. In reality, manual intervention is required to force delete the pod.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.12.7-gke.19
  • Cloud provider or hardware configuration: vSphere Cloud Provider
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig vmware

@vmwkkwong vmwkkwong added the kind/bug Categorizes issue or PR as related to a bug. label Jul 11, 2019
@k8s-ci-robot k8s-ci-robot added the area/provider/vmware Issues or PRs related to vmware provider label Jul 11, 2019
@vmwkkwong
Author

/sig vmware

@rhockenbury

Out of curiosity, what 6 min timeout are you referring to?

@vmwkkwong
Author

@rhockenbury the Kubernetes attach/detach controller's 6-minute timeout, after which it force-detaches the PV.

@nikhita nikhita added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Aug 6, 2019
@alexbrand
Contributor

I was able to reproduce this issue as well by doing the following:

  1. Create a cluster on vSphere with the vSphere in-tree cloud provider configured.
  2. Create a PVC (This triggers the cloud provider to create a PV)
  3. Create a deployment that consumes the PVC
  4. Go into the HTML5 vSphere Client and Power -> Power Off the VM where the pod is running
  5. Wait 5 minutes until Kubernetes reschedules the pod
  6. Notice that the new pod is stuck in ContainerCreating:
NAME                       READY   STATUS              RESTARTS   AGE    IP              NODE                                    NOMINATED NODE   READINESS GATES
busybox-644dbf5f4b-l5lzn   1/1     Terminating         0          153m   100.115.249.6   workload-cluster-1-machineset-1-l2jmg   <none>           <none>
busybox-644dbf5f4b-ptp6v   0/1     ContainerCreating   0          28m    <none>          workload-cluster-1-machineset-1-lf2tq   <none>           <none>
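Steps 2 and 3 above can be sketched with manifests along these lines (the names, image, and storage size are assumptions, and the cluster's default StorageClass is assumed to use the in-tree vSphere provisioner):

```yaml
# Hypothetical PVC; with the in-tree vSphere cloud provider configured,
# binding this claim triggers dynamic provisioning of a PV (step 2).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: busybox-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
# Deployment consuming the PVC (step 3). Powering off the node that
# hosts this pod reproduces the stuck-Terminating behavior.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          command: ["sleep", "3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: busybox-pvc
```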

@maplain

maplain commented Sep 3, 2019

Looks a lot like an issue filed before: #71829

@andrewsykim
Member

This is a known issue across all cloud providers at the moment. @yastij is working on a KEP that addresses this: kubernetes/enhancements#1116. However, the problem is a bit thorny, as it requires coordination among a number of different components (controller manager, scheduler, and kubelet), and errors can lead to data corruption in certain situations. Will let @yastij comment further if there's anything else to add.

@yastij
Member

yastij commented Sep 3, 2019

This is intended to be fixed by kubernetes/enhancements#1116. The current KEP reflects part of the solution, but we still need to update it to make workloads opt in to this behavior.

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 3, 2019
@SandeepPissay
Contributor

/assign @SandeepPissay

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2020
@yastij
Member

yastij commented Jan 10, 2020

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 10, 2020
@cheftako
Member

cheftako commented Feb 3, 2021

@SandeepPissay Are you still looking at this?

@SandeepPissay SandeepPissay removed their assignment Feb 3, 2021
@SandeepPissay
Contributor

@SandeepPissay Are you still looking at this?

No, I'm not.

@cheftako
Member

cheftako commented Feb 4, 2021

@andrewsykim do you know anyone who can look at this? Do we know if it's still a problem?

@hassenius

I can confirm that this is very much still an issue as of Kubernetes 1.19.
Several days after shutting down a node, pods are still in a Terminating state, blocking recovery of the desired pod count.

@jingxu97
Contributor

jingxu97 commented Mar 30, 2021

cc @xing-yang @divyenpatel @jingxu97

@yuga711
Contributor

yuga711 commented Mar 30, 2021

Referencing a PR (kubernetes-sigs/vsphere-csi-driver#529) for a similar issue in the vSphere CSI driver.
