GCE PD Detach fails if node no longer exists #29358

Closed
saad-ali opened this issue Jul 21, 2016 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@saad-ali
Member

saad-ali commented Jul 21, 2016

Problem:

If a node with a GCE PD attached is deleted before the volume is detached, every subsequent attempt by the attach/detach controller to detach the volume fails, which prevents the controller from attaching the volume to another node.

Repro steps:

  1. Create a pod referencing a GCE PD
  2. Wait for pod to get scheduled and running.
  3. Delete the node VM (using gcloud) that the pod is scheduled to.
  4. Check if volume is detached by attach/detach controller:
    • Expected: Volume detached.
    • Actual: Volume continuously fails detach.

Logs:

From `/var/log/kube-controller-manager.log`:
I0721 03:21:43.464087       7 reconciler.go:134] Started DetachVolume for volume "X" from node "Y" due to maxWaitForUnmountDuration expiry.
E0721 03:21:43.591941       7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.591985       7 attacher.go:215] Error checking if PD ("[pdname]") is already attached to current node ("Y"). Will continue and try detach anyway. err=instance not found
E0721 03:21:43.698786       7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.698828       7 attacher.go:225] Error detaching PD "[pdname]" from node "Y": error getting instance "Y"

Workarounds:

  • Restart the kube-controller-manager binary on the master.

-or-

  • Recreate a node with the same name.

Proposed Fix:

If GCE PD detach fails with instance not found, assume successful detach.
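
A minimal Go sketch of the proposed behavior, not the actual Kubernetes code. The names `errInstanceNotFound`, `diskDetacher`, `detachTolerant`, and `fakeCloud` are hypothetical and exist only for illustration; the real attacher and GCE provider code are structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical sentinel standing in for the cloud
// provider's "instance not found" result.
var errInstanceNotFound = errors.New("instance not found")

// diskDetacher is a hypothetical interface for the cloud call that detaches a PD.
type diskDetacher interface {
	DetachDisk(pdName, nodeName string) error
}

// detachTolerant treats "instance not found" as a successful detach: if the VM
// no longer exists, the disk cannot still be attached to it.
func detachTolerant(d diskDetacher, pdName, nodeName string) error {
	err := d.DetachDisk(pdName, nodeName)
	if errors.Is(err, errInstanceNotFound) {
		fmt.Printf("node %q not found; assuming PD %q is already detached\n", nodeName, pdName)
		return nil
	}
	if err != nil {
		return fmt.Errorf("error detaching PD %q from node %q: %w", pdName, nodeName, err)
	}
	return nil
}

// fakeCloud simulates a provider in which some instances have been deleted.
type fakeCloud struct {
	deleted map[string]bool
}

func (f fakeCloud) DetachDisk(pdName, nodeName string) error {
	if f.deleted[nodeName] {
		return fmt.Errorf("error getting instance %q: %w", nodeName, errInstanceNotFound)
	}
	return nil
}

func main() {
	cloud := fakeCloud{deleted: map[string]bool{"Y": true}}
	if err := detachTolerant(cloud, "pd-1", "Y"); err != nil {
		fmt.Println("detach failed:", err)
		return
	}
	fmt.Println("detach treated as successful")
}
```

Any other error is still returned to the caller, so only the missing-instance case is treated as a completed detach.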

saad-ali added the kind/bug, priority/important-soon, sig/storage, and team/cluster labels Jul 21, 2016
saad-ali added this to the v1.3 milestone Jul 21, 2016
saad-ali self-assigned this Jul 21, 2016
@pmorie
Member

pmorie commented Jul 22, 2016

So is the issue here that we made a bad assumption about how the GCE API will behave for detach when the node doesn't exist?

@saad-ali
Member Author

Not so much a bad assumption, more a missed case.

@saad-ali
Member Author

saad-ali commented Aug 17, 2016

The fix for this, PR #29485, missed one location where the node is fetched. It handles the case where the underlying node VM is deleted, but not the case where the node API object is deleted. As a result, detach can still sometimes fail because the node API object is missing:

Error: DetachVolume failed fetching node from API server for volume "kubernetes.io/gce-pd/panda-disk" (spec.Name: "data") from node "nodeA" with: nodes "nodeA" not found

k8s-github-robot pushed a commit that referenced this issue Dec 20, 2016
Automatic merge from submit-queue

Add test to detach a pd whose node was deleted

**What this PR does / why we need it**:
A test for the following issue :
If a node with a GCE PD attached is deleted (before the volume is detached), subsequent attempts by the attach/detach controller to detach it should not fail.


**Bonus**: Added additional code to ensure that the pd can still be attached to a different node.
Edit: Removed it as it was making the test much slower.

#29358