GCE PD Detach fails if node no longer exists #29358

saad-ali · 2016-07-21T03:42:50Z

Problem:

If a node with a GCE PD attached is deleted (before the volume is detached), subsequent attempts by the attach/detach controller to detach it continuously fail, and prevent the controller from attaching the volume to another node.

Repro steps:

1. Create a pod referencing a GCE PD.
2. Wait for the pod to get scheduled and running.
3. Delete the node VM (using gcloud) that the pod is scheduled to.
4. Check if the volume is detached by the attach/detach controller:
   - Expected: Volume detached.
   - Actual: Volume continuously fails detach.

Logs:

From `/var/log/kube-controller-manager.log`:

I0721 03:21:43.464087       7 reconciler.go:134] Started DetachVolume for volume "X" from node "Y" due to maxWaitForUnmountDuration expiry.
E0721 03:21:43.591941       7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.591985       7 attacher.go:215] Error checking if PD ("[pdname]") is already attached to current node ("Y"). Will continue and try detach anyway. err=instance not found
E0721 03:21:43.698786       7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.698828       7 attacher.go:225] Error detaching PD "[pdname]" from node "Y": error getting instance "Y"

Workarounds:

Restart the kube-controller-manager binary on the master.

-or-

Recreate a node with the same name.

Proposed Fix:

If GCE PD detach fails with instance not found, assume successful detach.
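
A minimal sketch of the proposed behavior, not the actual Kubernetes code: the `diskDetacher` interface, the `errInstanceNotFound` sentinel, and the `fakeCloud` type are hypothetical names introduced only for illustration. The idea is that an "instance not found" error during detach is treated as success, because a disk cannot still be attached to a VM that no longer exists.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical sentinel standing in for the cloud
// provider's "instance not found" (404) error.
var errInstanceNotFound = errors.New("instance not found")

// diskDetacher is a hypothetical interface for the cloud call that detaches a PD.
type diskDetacher interface {
	DetachDisk(pdName, nodeName string) error
}

// detachTolerant detaches pdName from nodeName and treats a missing instance
// as a successful detach; every other error is still returned to the caller.
func detachTolerant(d diskDetacher, pdName, nodeName string) error {
	err := d.DetachDisk(pdName, nodeName)
	if errors.Is(err, errInstanceNotFound) {
		// The VM is gone, so the disk cannot be attached to it anymore.
		return nil
	}
	if err != nil {
		return fmt.Errorf("error detaching PD %q from node %q: %w", pdName, nodeName, err)
	}
	return nil
}

// fakeCloud is a stand-in provider that only knows which instances exist.
type fakeCloud struct{ instances map[string]bool }

func (f fakeCloud) DetachDisk(pdName, nodeName string) error {
	if !f.instances[nodeName] {
		return fmt.Errorf("error getting instance %q: %w", nodeName, errInstanceNotFound)
	}
	return nil
}

func main() {
	cloud := fakeCloud{instances: map[string]bool{"X": true}}
	// Node "Y" was deleted; the detach is reported as successful.
	fmt.Println(detachTolerant(cloud, "pdname", "Y"))
}
```

With this shape, the controller can mark the volume as detached and go on to attach it to another node instead of retrying indefinitely.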

pmorie · 2016-07-22T22:04:53Z

So is the issue here that we made a bad assumption about how the GCE API will behave for detach when the node doesn't exist?

saad-ali · 2016-07-22T22:10:33Z

Not so much a bad assumption, more a missed case.

Fixes kubernetes#29358

saad-ali · 2016-08-17T00:51:37Z

The fix for this, PR #29485, missed one location where the node is fetched. It handles the case where the underlying node VM is deleted, but not the case where the node API object is deleted. As a result, detach can still fail when the node API object is missing:

Error: DetachVolume failed fetching node from API server for volume "kubernetes.io/gce-pd/panda-disk" (spec.Name: "data") from node "nodeA" with: nodes "nodeA" not found

Automatic merge from submit-queue

Skip safe to detach check if node API object no longer exists

Fixes #29358

Automatic merge from submit-queue

Add test to detach a pd whose node was deleted

**What this PR does / why we need it**: A test for the following issue: If a node with a GCE PD attached is deleted (before the volume is detached), subsequent attempts by the attach/detach controller to detach it should not fail.

**Bonus**: Added additional code to ensure that the pd can still be attached to a different node. Edit: Removed it as it was making the test much slower.

#29358

saad-ali added the kind/bug, priority/important-soon, sig/storage, and team/cluster labels Jul 21, 2016

saad-ali added this to the v1.3 milestone Jul 21, 2016

saad-ali self-assigned this Jul 21, 2016

saad-ali mentioned this issue Jul 23, 2016

Assume volume is detached if node doesn't exist #29485

Merged

saad-ali added a commit to saad-ali/kubernetes that referenced this issue Jul 23, 2016

Assume volume detached if node doesn't exist

89fd358

Fixes kubernetes#29358

k8s-github-robot closed this as completed in #29485 Jul 23, 2016

saad-ali added a commit to saad-ali/kubernetes that referenced this issue Jul 25, 2016

Assume volume detached if node doesn't exist

1bc5468

Fixes kubernetes#29358

zefciu pushed a commit to zefciu/kubernetes that referenced this issue Jul 28, 2016

Assume volume detached if node doesn't exist

a47b4c4

Fixes kubernetes#29358

saad-ali mentioned this issue Jul 30, 2016

Kubernetes keeps failing at mounting a volume #29166

Closed

saad-ali mentioned this issue Aug 9, 2016

support Azure data disk volume #29836

Merged

saad-ali reopened this Aug 17, 2016

This was referenced Aug 17, 2016

Mounting/unmounting volume error on GKE #29903

Closed

Skip safe to detach check if node API object no longer exists #30737

Merged

k8s-github-robot closed this as completed in #30737 Aug 18, 2016

k8s-github-robot pushed a commit that referenced this issue Aug 18, 2016

Merge pull request #30737 from saad-ali/fix29358Round2

9696a27

Automatic merge from submit-queue Skip safe to detach check if node API object no longer exists Fixes #29358

jingxu97 mentioned this issue Aug 18, 2016

node not exist failure during node status update flush controller's log #30898

Closed

saad-ali mentioned this issue Aug 29, 2016

Pet Set stuck in ContainerCreating #28709

Closed

ceefour mentioned this issue Aug 31, 2016

Unable to mount volume gcePersistentDisk with readOnly: true for multiple pods (timeout expired waiting for volumes to attach/mount) #31176

Closed

This was referenced Sep 2, 2016

EBS volumes get wiped after (incorrect) upgrade from 1.2 to 1.3 #31822

Closed

volume controller not handling terminated node #31088

Closed

rkouj mentioned this issue Nov 1, 2016

Add test to detach a pd whose node was deleted #36009

Merged

shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this issue Dec 1, 2016

Assume volume detached if node doesn't exist

e905e5f

Fixes kubernetes#29358
