
Vsphere Cloud Provider: failed to detach volume from shutdown node #75342

Open
ksandermann opened this issue Mar 13, 2019 · 17 comments
Assignees
Labels
  • area/provider/vmware: Issues or PRs related to vmware provider
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/cloud-provider: Categorizes an issue or PR as relevant to SIG Cloud Provider.

Comments

ksandermann commented Mar 13, 2019

What happened:
I have a pod with a PV attached to it running on node1.
When I shut down node1 to simulate node failure, Kubernetes detects the unhealthy node within the configured timeframe and tries to re-schedule the pod on node2 after the --pod-eviction-timeout.
When the pod starts on node2, the PV cannot be attached because it is still attached to the shutdown node1:

  Warning  FailedAttachVolume      6m               attachdetach-controller  Multi-Attach error for volume "pvc-db44144b-457f-11e9-a7b0-005056af6878" Volume is already exclusively attached to one node and can't be attached to another

Also, the pod on the shutdown node does not get deleted.

What you expected to happen:
As documented here:
The disk should be detached from the shutdown node and attached to the new node where the pod is scheduled.

How to reproduce it (as minimally and precisely as possible; a kubectl sketch follows the list):

  1. Schedule a single pod using a PVC on node1
  2. Shut down node1
  3. Watch the FailedAttachVolume event on the new pod on node2
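
A minimal sketch of those steps; the manifest and pod names here are illustrative, not from the original report:

  # pod-with-pvc.yaml: a single pod mounting a ReadWriteOnce PVC (hypothetical manifest)
  kubectl apply -f pod-with-pvc.yaml
  kubectl get pod mypod -o wide              # note which node the pod landed on (node1)
  # Power off node1 from vSphere, then:
  kubectl get nodes -w                       # node1 goes NotReady; the pod is evicted after --pod-eviction-timeout
  kubectl describe pod <new-pod-on-node2>    # events show the Multi-Attach / FailedAttachVolume error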

Anything else we need to know?:
Also, kube-controller-manager does not log anything about this failure.
Detaching and attaching to other nodes works fine, as long as all nodes are healthy.
Force-deleting the pod on the shutdown node also has no effect.

Environment:

  • Kubernetes version (use kubectl version): 1.12.5
  • Cloud provider or hardware configuration: vsphere
  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux k8s-dev-master3 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Kubespray 2.8.3 using kubeadm
  • Others: VSphere 6.5

Thanks in advance! :)

@ksandermann ksandermann added the kind/bug Categorizes issue or PR as related to a bug. label Mar 13, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 13, 2019
ksandermann (Author)

/sig vmware

@k8s-ci-robot k8s-ci-robot added area/provider/vmware Issues or PRs related to vmware provider and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 13, 2019
ksandermann (Author)

@kubernetes/sig-vmware-bugs

k8s-ci-robot (Contributor)

@ksandermann: Reiterating the mentions to trigger a notification:
@kubernetes/sig-vmware-bugs

In response to this:

@kubernetes/sig-vmware-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

LinAnt commented Mar 14, 2019

It is even worse if you drain a node for an upgrade and then delete the VM: the disks that are still attached get deleted along with the VM. This is not a recent issue; it has been like this since 1.8.x or 1.9.x.

yastij (Member) commented Mar 14, 2019

There's a KEP open for this: kubernetes/enhancements#719

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 14, 2019
ksandermann (Author)

@yastij I see, thanks for the reference!
Is there any estimated time for this to actually get through?
I see that the PR has been stale for 10 days now.

yastij (Member) commented Mar 15, 2019

The design is still under discussion; this will land in 1.15.

sjberman commented Apr 9, 2019

@ksandermann Do you also see an issue where the NotReady node never gets cleaned up? I'm seeing the issue you mentioned, but the powered down node stays in the NotReady state and never goes away. I'm wondering if it has anything to do with the fact that the pod with the volume attached still "exists" on the downed node (though in an "Unknown" state), and these two issues are somehow tied to each other.

ksandermann (Author)

@sjberman

I didn't test that case, so I can't say anything about it.
What would your expected behaviour be in that case? That Kubernetes removes the NotReady node completely from the cluster after some time x?

sjberman

@ksandermann Yeah, I would expect a node that is stuck in NotReady for some amount of time to eventually be removed from the cluster. Other cloud providers have this behavior.

zhonglin6666

isMultiAttachForbidden returns true when the PV is set with access mode ReadWriteOnce.
node1 stays attached and is never detached when node1 goes down, so when another node tries to attach the volume it produces this error:
attachdetach-controller Multi-Attach error for volume "pvc-d0fde86c-8661-11e9-b873-0800271c9f15" Volume is already used by pod

// From isMultiAttachForbidden: a volume may be attached to multiple
// pods/nodes only if it carries a shared access mode; a plain
// ReadWriteOnce volume falls through, so multi-attach is forbidden.
for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
	if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
		return false
	}
}

// From the attach/detach reconciler: if multi-attach is forbidden and the
// volume is still recorded as attached to any node (here, the shutdown
// node1), report the Multi-Attach error and skip attaching to node2.
if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
	nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
	if len(nodes) > 0 {
		if !volumeToAttach.MultiAttachErrorReported {
			rc.reportMultiAttachError(volumeToAttach, nodes)
			rc.desiredStateOfWorld.SetMultiAttachError(volumeToAttach.VolumeName, volumeToAttach.NodeName)
		}
		continue
	}
}
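
To check whether a given PV falls into this forbidden branch, you can inspect its access modes; this is a generic kubectl query, with <pv-name> as a placeholder:

  kubectl get pv <pv-name> -o jsonpath='{.spec.accessModes}'
  # ["ReadWriteOnce"] means isMultiAttachForbidden returns true for this volume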

@nikhita nikhita added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Aug 6, 2019
fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 4, 2019
yastij (Member) commented Nov 4, 2019

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 4, 2019
dfsdevops commented Aug 29, 2020

I'm also seeing this on 1.18.6. Does anyone know any workarounds for this? FWIW I do not see any "NotReady" nodes. I simply scaled the worker nodes down using the TKG CLI; most pods got rescheduled successfully, but others did not. How can I manually detach a volume?

EDIT: Think I figured it out thanks to info in here: kubernetes-sigs/vsphere-csi-driver#221 (comment)

  1. Get the name of the PV.
  2. Find the volumeattachment name:
     kubectl get volumeattachments.storage.k8s.io | grep <pv-name>
  3. Edit the volumeattachment and remove the finalizer.
  4. Delete the volumeattachment:
     kubectl delete volumeattachments.storage.k8s.io <volumeattachment-name>

Note: you may also need to ensure that no pods are running and holding onto the volume, so scale down your deployment. A scripted sketch of these steps follows.
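
The same workaround as a scripted sketch. The kubectl patch command for step 3 is my own non-interactive way to drop the finalizer (the steps above say to edit the object by hand), and <pv-name> stays a placeholder:

  PV=<pv-name>                                        # the PV backing the stuck volume
  VA=$(kubectl get volumeattachments.storage.k8s.io | grep "$PV" | awk '{print $1}')
  # Remove the finalizer so deletion is not blocked by the CSI driver
  kubectl patch volumeattachments.storage.k8s.io "$VA" --type=merge -p '{"metadata":{"finalizers":null}}'
  # Delete the volumeattachment so the volume can attach elsewhere
  kubectl delete volumeattachments.storage.k8s.io "$VA"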

cheftako (Member) commented Feb 3, 2021

/assign @andrewsykim
/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 3, 2021
k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 7, 2023
k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
