PV can not attach to new node if the previous node is deleted #359

Closed
luwang-vmware opened this issue Sep 15, 2020 · 34 comments
Labels: kind/bug, VMware Support Request

@luwang-vmware

/kind bug

What happened:
We deployed a k8s environment in TKGI (PKS) and deployed CSI 2.0 in the cluster, then deployed a StatefulSet. When the worker node on which the pod/PV was registered went into an error state, the BOSH resurrector mechanism created a new worker node to replace the failed one. After a while, the StatefulSet pods/PV were rescheduled to other nodes, but the mount hung. The error said:

Events:
  Type     Reason       Age                        From                                           Message
  ----     ------       ----                       ----                                           -------
  Warning  FailedMount  16m (x11 over 114m)        kubelet, 23752abb-c3d8-4a4f-9058-404638deeab3  Unable to attach or mount volumes: unmounted volumes=[postgredb], unattached volumes=[default-token-dthk9 postgredb]: timed out waiting for the condition
  Warning  FailedMount  <invalid> (x41 over 111m)  kubelet, 23752abb-c3d8-4a4f-9058-404638deeab3  Unable to attach or mount volumes: unmounted volumes=[postgredb], unattached volumes=[postgredb default-token-dthk9]: timed out waiting for the condition

Checking the logs in csi-attacher.log shows the output below. Node 5844930c-f7d5-4b8d-a618-0337da1b14c7 was the worker node that was replaced by the BOSH resurrector mechanism.

I0915 01:22:56.209752       1 csi_handler.go:428] Saving detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368257       1 csi_handler.go:439] Saved detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368303       1 csi_handler.go:99] Error processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"5844930c-f7d5-4b8d-a618-0337da1b14c7". Error: node wasn't found
I0915 01:22:56.368356       1 controller.go:175] Started VA processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368391       1 csi_handler.go:89] CSIHandler: processing VA "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368397       1 csi_handler.go:140] Starting detach operation for "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368416       1 csi_handler.go:147] Detaching "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368428       1 csi_handler.go:550] Can't get CSINode 5844930c-f7d5-4b8d-a618-0337da1b14c7: csinode.storage.k8s.io "5844930c-f7d5-4b8d-a618-0337da1b14c7" not found
I0915 01:22:56.508694       1 csi_handler.go:428] Saving detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.512934       1 csi_handler.go:439] Saved detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.512969       1 csi_handler.go:99] Error processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"5844930c-f7d5-4b8d-a618-0337da1b14c7". Error: node wasn't found
I0915 01:22:57.368578       1 controller.go:175] Started VA processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:57.368615       1 csi_handler.go:89] CSIHandler: processing VA "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:57.368622       1 csi_handler.go:140] Starting detach operation for "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"

What you expected to happen:
The pods/PV should attach successfully to the new node.

How to reproduce it (as minimally and precisely as possible):

  1. Create a k8s cluster in a PKS environment.
  2. Install CSI 2.0.
  3. Create a StatefulSet workload with 1 PV.
  4. Shut down the node on which the PV/pod are registered.

I also executed the same steps with VCP; there the pods were able to run on another worker node.

Anything else we need to know?:
Before the testing, the pod-node mapping was as below. The csi-controller pod is not on the same worker node as the StatefulSet workload.

NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE                                   NOMINATED NODE   READINESS GATES
coredns-75dd94fdf-bx8hd                 1/1     Running   0          14h   10.10.30.162   5844930c-f7d5-4b8d-a618-0337da1b14c7   <none>           <none>
coredns-75dd94fdf-jndhd                 1/1     Running   0          14h   10.10.30.164   239ff390-3824-478f-a644-2e71d55c1b93   <none>           <none>
coredns-75dd94fdf-jrj8c                 1/1     Running   0          14h   10.10.30.163   af1372f5-3763-4e69-8eb0-d146355d84d1   <none>           <none>
metrics-server-6967cb5487-s67ww         1/1     Running   0          14h   10.10.30.165   239ff390-3824-478f-a644-2e71d55c1b93   <none>           <none>
vsphere-csi-controller-6dbc64cb-hqkhr   6/6     Running   0          14h   10.10.30.166   226b7b06-0319-4adb-9230-080e05f4e4b4   <none>           <none>
vsphere-csi-node-6d57j                  3/3     Running   0          14h   10.10.30.170   af1372f5-3763-4e69-8eb0-d146355d84d1   <none>           <none>
vsphere-csi-node-875qh                  3/3     Running   0          14h   10.10.30.171   30db53a7-5e3c-433a-a763-9d88dfb9057c   <none>           <none>
vsphere-csi-node-r8l7c                  3/3     Running   0          14h   10.10.30.168   5844930c-f7d5-4b8d-a618-0337da1b14c7   <none>           <none>
vsphere-csi-node-shcr6                  3/3     Running   0          14h   10.10.30.169   239ff390-3824-478f-a644-2e71d55c1b93   <none>           <none>
vsphere-csi-node-xd72x                  3/3     Running   0          14h   10.10.30.167   226b7b06-0319-4adb-9230-080e05f4e4b4   <none>           <none>
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                   NOMINATED NODE   READINESS GATES
postgres-wl-0   2/2     Running   0          19m   10.10.31.34   5844930c-f7d5-4b8d-a618-0337da1b14c7   <none>           <none>
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS             AGE

Environment:

  • csi-vsphere version: gcr.io/cloud-provider-vsphere/csi/release/syncer:v2.0.0
    gcr.io/cloud-provider-vsphere/csi/release/driver:v2.0.0
  • vsphere-cloud-controller-manager version:
  • Kubernetes version:1.18.5
  • vSphere version:70
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 15, 2020
@SandeepPissay
Contributor

failed to find VirtualMachine for node:"5844930c-f7d5-4b8d-a618-0337da1b14c7". Error: node wasn't found
I0915 01:22:57.368578 1 controller.go:175] Started VA processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"

@luwang-vmware This means CSI is unable to discover the node during startup. This usually happens if the vSphere conf secret has incorrect entries or the providerID in the Node API object is incorrect (it is set by the vSphere cloud provider). Can you check the secret and the Node object? Also, can you paste the CSI controller logs from startup?
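
For example, both can be checked roughly like this (a sketch; the secret name and namespace depend on how the driver was deployed, vsphere-config-secret in kube-system is a common default):

# Check the providerID set on the Node object by the vSphere cloud provider
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'

# Inspect the vSphere conf secret used by the CSI driver (name/namespace are assumptions)
kubectl -n kube-system get secret vsphere-config-secret -o yaml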

@luwang-vmware
Author

@SandeepPissay 5844930c-f7d5-4b8d-a618-0337da1b14c7 was the worker node, which was deleted by BOSH; a new worker node was created by BOSH as well. The scenario is essentially this: a PV is attached to a worker node, and when that worker node is deleted, by accident or on purpose, the PV cannot attach to another node.

@gohilankit
Contributor

/assign @SandeepPissay

@SandeepPissay
Contributor

@luwang-vmware since you work at VMware, can you file a bug and upload the CSI logs (all containers), the VC support bundle, and the ESX support bundle? Thanks!

@luwang-vmware
Author

@SandeepPissay We will repro in-house again and file a bug then.

@tgelter

tgelter commented Jan 26, 2021

@SandeepPissay I believe we've reproduced this issue during a routine cluster upgrade. Would new logs be useful?
I'm wondering if there's a way to invalidate the node cache, since kubectl describe csinodes shows that the new nodes are known and new PV(C)s are working as expected, but PV(C)s created prior to the cluster upgrade are having trouble and I'm seeing failed to find VirtualMachine for node:* errors in the vsphere-csi-controller pod logs.
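
For reference, this is roughly how I'm cross-checking which VolumeAttachments still reference old nodes (an illustrative sketch, nothing environment-specific):

# List each VolumeAttachment and the node it references
kubectl get volumeattachments -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached'

# Compare against the nodes that currently exist in the cluster
kubectl get nodes -o name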

@SandeepPissay
Contributor

@tgelter can you file a ticket with VMware GSS? We would need logs for all containers in CSI Pod and VC support bundle as well. Also please describe what steps were performed to repro the issue. Thanks.
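
Collecting the controller logs is roughly the following (a sketch; the pod name is taken from the listing earlier in this issue and assumes the driver runs in kube-system, so adjust for your deployment):

# Dump logs from every container in the CSI controller pod
kubectl -n kube-system logs vsphere-csi-controller-6dbc64cb-hqkhr --all-containers=true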

@tgelter

tgelter commented Jan 27, 2021

@tgelter can you file a ticket with VMware GSS? We would need logs for all containers in CSI Pod and VC support bundle as well. Also please describe what steps were performed to repro the issue. Thanks.

Support Request #21191453601 was filed on my behalf by a team member from our virtualization team. I failed to add the repro steps; I'll ask him to add those details in that ticket. The repro method was as follows:

  • New worker Nodes are launched alongside the existing worker Nodes
  • A set of non-customer-impacting Pods are evicted from the old Nodes and rescheduled onto new Nodes. Tests are run to ensure that these Pods are scheduling successfully onto the new Nodes before proceeding.
  • Pods on old worker Nodes are evicted from the old Nodes and rescheduled onto new Nodes. Nodes which contain Pods which refuse to evict after 15 minutes are killed. Nodes which are empty are deleted. This step is performed in parallel on a batch of the Nodes at a time. In between each batch, tests are run to check that cluster Pods are scheduling successfully onto the new Nodes before proceeding.

We started seeing PV(C) issues post worker-tier upgrade completion.

@brathina-spectro

brathina-spectro commented Feb 3, 2021

We ran into this issue when performing resiliency testing on our Kubernetes cluster nodes.
Steps to reproduce:

  • Launch a cluster with 3 master nodes
  • Deploy a few apps that use volumes. Let the apps run for a while so the volumes get used
  • Shut down one of the VMs that has volumes mounted and delete the VM from the vCenter inventory
  • Our Kubernetes platform detects the node failure, automatically launches a new node, and the new node joins the cluster
  • However, pods that had volumes mounted fail to run. We see this error on the Pods:
    Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[config data]: timed out waiting for the condition

And the csi-controller had these errors:

I0203 18:07:23.934626       1 controller.go:175] Started VA processing "csi-9ec0fe183d9abe5dd4e394a956b15a6a77b0ab4e1f88ed9dc9bdc3b39f5d5c6b"
I0203 18:07:23.934636       1 csi_handler.go:89] CSIHandler: processing VA "csi-9ec0fe183d9abe5dd4e394a956b15a6a77b0ab4e1f88ed9dc9bdc3b39f5d5c6b"
I0203 18:07:23.934640       1 csi_handler.go:140] Starting detach operation for "csi-9ec0fe183d9abe5dd4e394a956b15a6a77b0ab4e1f88ed9dc9bdc3b39f5d5c6b"
I0203 18:07:23.934676       1 csi_handler.go:147] Detaching "csi-9ec0fe183d9abe5dd4e394a956b15a6a77b0ab4e1f88ed9dc9bdc3b39f5d5c6b"
I0203 18:07:23.934695       1 csi_handler.go:550] Can't get CSINode vsphere-spectro-mgmt-cp-88h4x: csinode.storage.k8s.io "vsphere-spectro-mgmt-cp-88h4x" not found
I0203 18:07:24.136382       1 csi_handler.go:428] Saving detach error to "csi-4a9b146618466ee975b23c33c5a68875256c223a179d3e795d5df4c692806d48"
I0203 18:07:24.144857       1 csi_handler.go:439] Saved detach error to "csi-4a9b146618466ee975b23c33c5a68875256c223a179d3e795d5df4c692806d48"
I0203 18:07:24.144933       1 csi_handler.go:99] Error processing "csi-4a9b146618466ee975b23c33c5a68875256c223a179d3e795d5df4c692806d48": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"vsphere-spectro-mgmt-cp-88h4x". Error: node wasn't found

We were able to recover from this state by deleting the corresponding VolumeAttachment resource.
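
Roughly what that recovery looked like (a sketch; replace <va-name> with the attachment that references the deleted node):

# Find the VolumeAttachment(s) still referencing the deleted node
kubectl get volumeattachments -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName'

# Delete the stale attachment so the volume can be attached elsewhere
kubectl delete volumeattachment <va-name>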

@tgelter

tgelter commented Mar 18, 2021

I found a work-around for this issue, in case it helps anyone.
If you wait for the removal of VolumeAttachments corresponding to the node before shutting it down or deleting it, pods w/ PVCs which are evicted from the node will start up OK on other nodes, at least with:

  • vSphere CSI Driver - v2.1.1
  • vSphere CPI - v1.19.0
  • vSphere - 7.0U1

In our node eviction code, I've added the following logic directly after the code which evicts all of the pods:

...
    # Wait for Node VolumeAttachment removal(s):
    number_of_attempts = 0
    while number_of_attempts < 10 and node.status.volumes_attached:
        logger.info(
            f"Waiting for VolumeAttachment(s) for Node {node.metadata.name} to be removed..."
        )
        number_of_attempts += 1
        util.helpers.sleep_random_time(
            min_seconds=30, max_seconds=60
        )  # Avoid thundering herd via parallel drain tasks
        node = get_node(
            node_name=node.metadata.name, k8s_client=k8s_client
        )  # Refresh node metadata
    if node.status.volumes_attached:
        logger.error(
            f"VolumeAttachment(s) for Node {node.metadata.name} were not removed!"
        )
        return False
...

@SandeepPissay
Contributor

  • Shutdown one of the VM which has volumes mounted and delete the VM from VCenter inventory

@brathina-spectro Did you drain the node before you shut down the node VM? If you do not drain the node before shutdown, you may end up with a known upstream issue where the pod on the shutdown node goes to Terminating state and the replacement pod never comes up, since the volume is still attached to the shutdown node. Also see kubernetes/enhancements#1116. There are two ways to work around this problem in Kubernetes (both sketched below):

  1. Force delete the pod that is stuck in Terminating state on the shutdown node.
  2. Drain the node before shutting it down.
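
A minimal sketch of both workarounds (pod, namespace, and node names are placeholders):

# Workaround 1: force delete the pod stuck in Terminating on the shutdown node
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Workaround 2: drain the node before shutting it down
kubectl drain <node-name> --ignore-daemonsets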

@brathina-spectro

  • Shutdown one of the VM which has volumes mounted and delete the VM from VCenter inventory

@brathina-spectro Did you drain the node before you shut down the node VM? If you do not drain the node before shutdown, you may end up with a known upstream issue where the pod on the shutdown node goes to Terminating state and the replacement pod never comes up, since the volume is still attached to the shutdown node. Also see kubernetes/enhancements#1116. There are two ways to work around this problem in Kubernetes:

@SandeepPissay Our objective was to test the resiliency of the system overall and so we did not shut down the nodes gracefully.

  1. Force delete the pod that is in terminating state on shutdown node.

Pods were not stuck in Terminating state. When CPI detects the node is gone, it removes the node from the cluster, and so all the pods scheduled on the node get terminated. When the new pod comes up, it stays in "ContainerCreating" state forever with the error below:
Unable to attach or mount volumes: unmounted volumes=[mongo-data], unattached volumes=[init mongo-data]: timed out waiting for the condition

  2. Drain the node before shutting it down.

@tgelter

tgelter commented Mar 29, 2021

I agree with @brathina-spectro's findings above. While the work-around I shared above helps to prevent this issue from happening, node issues (freeze, delete without drain, etc.) can still trigger it, and it is difficult for our users to recover from, since they don't know to nullify finalizers (or in some cases don't have the RBAC permissions to do so).
I think that the volume attach/mount logic needs to be fixed to deal with this case.

@SandeepPissay
Contributor

Our objective was to test the resiliency of the system overall and so we did not shut down the nodes gracefully

Can you provide more details on how exactly you are testing the resiliency? And have you considered enabling vSphere HA on the vSphere cluster so that vSphere HA can restart the node VM if it crashes? I see node VM shutdown as a planned activity, and the component/person doing that should drain the node completely before performing the shutdown. If this is not done, we hit a known upstream issue where the volume detach does not happen (this is not unique to vSphere CSI).

@brathina-spectro

@SandeepPissay Thanks for replying. We haven't enabled vsphere HA and the node did not go through node drain. Could you share pointers to the upstream issue where volume detach does not happen?

@SandeepPissay
Contributor

@SandeepPissay Thanks for replying. We haven't enabled vsphere HA and the node did not go through node drain. Could you share pointers to the upstream issue where volume detach does not happen?

kubernetes/enhancements#1116

@brathina-spectro

@SandeepPissay Thanks.

We're seeing the same behavior with pods staying in "ContainerCreating" state even when the node was drained before shutting down. VolumeAttachment deletion started a few minutes after the new node was launched, but it never finished, and the VolumeAttachment has been stuck there ever since.

Tried restarting the CSI controller. The controller detects that the node is gone, but it never deletes the VolumeAttachment.

What logs will help you troubleshoot this, and how can I upload them? I saw the support section, but I'm not sure which product category to choose to file a ticket.

@SandeepPissay
Contributor

Tried restarting the CSI controller. The controller detects that the node is gone, but it never deletes the VolumeAttachment.

The CSI driver is not responsible for deleting the VolumeAttachment; the kube-controller-manager does that. In any case, we need to take a look at the kube-controller-manager, external-attacher, and CSI controller logs. Please talk to VMware GSS about how to file a ticket.

@BaluDontu
Contributor

We're seeing the same behavior with pods staying in "ContainerCreating" state even when the node was drained before shutting down.

After you drain the node, all the pods will be deleted, but the VolumeAttachments still remain, because pods are removed from the API server only after the unmount is successful. So wait until the VolumeAttachments are deleted, and only then delete the node.
If the node is deleted before the VolumeAttachments are deleted, the volume remains attached to the VM in the backend. There is no way for the new pods coming up on the new node to attach this volume, as it is still not detached from the old node.

So the sequence of steps would be as follows (a rough command-line sketch follows the list):

  1. Drain the node, which will evict all the pods.
  2. Wait until all the VolumeAttachments for that node are deleted.
  3. Delete the node.

This way, the pods on the new node would come up properly without any issue.
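
A rough command-line sketch of that sequence (the node name is just an example):

NODE=k8s-node-0972   # example node name

# 1. Drain the node, which evicts all the pods
kubectl drain "$NODE" --ignore-daemonsets

# 2. Wait until no VolumeAttachment references that node any more
while kubectl get volumeattachments -o custom-columns='NODE:.spec.nodeName' --no-headers | grep -qw "$NODE"; do
  sleep 10
done

# 3. Delete the node object
kubectl delete node "$NODE"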

@SandeepPissay
Contributor

SandeepPissay commented Apr 13, 2021

I tested the node shutdown scenario with a StatefulSet on a v1.20.4 Kubernetes cluster, and here's what I did:

  1. Create a StatefulSet with replicas set to 2.
  2. Shut down the node where one of the replicas is running. This could also be a node OS crash/corruption where the OS never comes up.
  3. The node reached NotReady state in 30 seconds.
  4. The replica on the node reached Terminating state in about 5 minutes.
  5. Ran kubectl drain node/k8s-node-0972 --ignore-daemonsets --force --grace-period=0 to evict the replica from the node.
  6. The Pod immediately got evicted, and a replacement pod was scheduled on a new node. This was in ContainerCreating state for some time.
  7. The volume was still attached to the shutdown node. I waited around 5 minutes for CSI to get a callback to detach the volume from the shutdown node.
  8. CSI attached the volume to the new node where the replacement pod was scheduled.
  9. The replica pod reached Running state.

Basically, for a planned shutdown we should drain the node first and wait for the volume attachments for that node to be deleted, for quick recovery of the app (see @BaluDontu's previous comment). For unplanned node-down scenarios (OS crash, hang, etc.), we should force drain the node and wait some time for Kubernetes/CSI to detach the volume from the shutdown/hung node and attach it to the new node. It is also advised to enable vSphere HA on the vSphere cluster to make sure that vSphere HA can restart crashed/hung node VMs. Hope this info helps.

@brathina-spectro

@SandeepPissay Thanks for recording your observations. In ideal scenarios, we noticed the same behavior as well.

But on multiple occasions we ran into the issue where the VolumeAttachments were not getting deleted, which prevented the volumes from being detached from the deleted node. A very similar issue was addressed in Kubernetes upstream, and we're trying to confirm whether upgrading to a newer version helps.

@ahrtr
Contributor

ahrtr commented Apr 14, 2021

When BOSH recreates a problematic worker VM, which means deleting the old one and creating a new one, the vSphere CSI driver regards them as different VMs. So the vSphere CSI controller can no longer find the old worker VM to which the PV was attached, and accordingly it cannot detach the PV from the old VM. I think this could be the key reason why @SandeepPissay did not reproduce this issue.

FYI, BOSH has auto-healing capabilities, which means it is responsible for automatically recreating VMs that become inaccessible. Refer to
https://bosh.io/docs/resurrector/

@mitchellmaler

I just ran into this same issue. The node was deleted ungracefully (VM deleted and node removed from Kubernetes). All pods running on that node were automatically rescheduled onto new nodes once it was deleted from Kubernetes. The CSI driver shows "Error: node wasn't found" errors, which makes sense because the node was deleted completely.

We are on 1.19.9, and the above "fix" #96617 was merged in 1.19.7, which shows it doesn't resolve this issue.

We would like the volume to be detachable (i.e., the VolumeAttachment object removed) if the VM no longer exists, so that the disk can be moved to another VM. We cannot expect all of our nodes to be removed gracefully (drained, with a wait for the VolumeAttachments to be removed).

Here are the kube-controller-manager logs when it tries to do this operation:

W0421 19:54:28.043345       1 reconciler.go:222] attacherDetacher.DetachVolume started for volume "pvc-15715c91-3e52-4b47-88dd-8fda2c868ae7" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^607cc6d1-79d3-4a65-8f79-48020a6ebfae") on node "dev-k8s-w1-2" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
E0421 19:54:28.065669       1 csi_attacher.go:663] kubernetes.io/csi: detachment for VolumeAttachment for volume [607cc6d1-79d3-4a65-8f79-48020a6ebfae] failed: rpc error: code = Internal desc = failed to find VirtualMachine for node:"dev-k8s-w1-2". Error: node wasn't found
E0421 19:54:28.065782       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^607cc6d1-79d3-4a65-8f79-48020a6ebfae podName: nodeName:}" failed. No retries permitted until 2021-04-21 19:56:30.065740661 +0000 UTC m=+4111.923222323 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-15715c91-3e52-4b47-88dd-8fda2c868ae7\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^607cc6d1-79d3-4a65-8f79-48020a6ebfae\") on node \"dev-k8s-w1-2\" : rpc error: code = Internal desc = failed to find VirtualMachine for node:\"dev-k8s-w1-2\". Error: node wasn't found"
W0421 19:56:06.269492       1 reconciler.go:222] attacherDetacher.DetachVolume started for volume "pvc-bd8282bd-29bf-4840-92f6-df1eeb5730fd" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^f612a702-16fe-48f9-b42b-7625f41a1340") on node "dev-k8s-w1-2" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
E0421 19:56:06.404544       1 csi_attacher.go:663] kubernetes.io/csi: detachment for VolumeAttachment for volume [f612a702-16fe-48f9-b42b-7625f41a1340] failed: rpc error: code = Internal desc = failed to find VirtualMachine for node:"dev-k8s-w1-2". Error: node wasn't found
E0421 19:56:06.404648       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^f612a702-16fe-48f9-b42b-7625f41a1340 podName: nodeName:}" failed. No retries permitted until 2021-04-21 19:58:08.404607364 +0000 UTC m=+4210.262089024 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-bd8282bd-29bf-4840-92f6-df1eeb5730fd\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^f612a702-16fe-48f9-b42b-7625f41a1340\") on node \"dev-k8s-w1-2\" : rpc error: code = Internal desc = failed to find VirtualMachine for node:\"dev-k8s-w1-2\". Error: node wasn't found"
W0421 19:56:30.111213       1 reconciler.go:222] attacherDetacher.DetachVolume started for volume "pvc-15715c91-3e52-4b47-88dd-8fda2c868ae7" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^607cc6d1-79d3-4a65-8f79-48020a6ebfae") on node "dev-k8s-w1-2" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
E0421 19:56:30.131904       1 csi_attacher.go:663] kubernetes.io/csi: detachment for VolumeAttachment for volume [607cc6d1-79d3-4a65-8f79-48020a6ebfae] failed: rpc error: code = Internal desc = failed to find VirtualMachine for node:"dev-k8s-w1-2". Error: node wasn't found
E0421 19:56:30.132040       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^607cc6d1-79d3-4a65-8f79-48020a6ebfae podName: nodeName:}" failed. No retries permitted until 2021-04-21 19:58:32.131995213 +0000 UTC m=+4233.989476875 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-15715c91-3e52-4b47-88dd-8fda2c868ae7\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^607cc6d1-79d3-4a65-8f79-48020a6ebfae\") on node \"dev-k8s-w1-2\" : rpc error: code = Internal desc = failed to find VirtualMachine for node:\"dev-k8s-w1-2\". Error: node wasn't found"

@ikogan

ikogan commented May 14, 2021

In our experience, Kubernetes does attempt to delete the VolumeAttachment. However, the attachments have a finalizer that never completes because the driver cannot find the node. If a node is actually missing from vSphere, the finalizer should assume it's gone, treat the disk as detached, and allow Kubernetes to delete the VolumeAttachment.

The moment we remove the finalizer from the VolumeAttachment, it is deleted, and the Pod that needed that PVC successfully proceeds to start.
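
For a single stuck attachment, that looks roughly like this (the VolumeAttachment name is a placeholder):

# Inspect the stuck VolumeAttachment and its finalizers
kubectl get volumeattachment <va-name> -o yaml

# Clear the finalizer so the pending deletion can complete
kubectl patch volumeattachment <va-name> --type=merge -p '{"metadata":{"finalizers":[]}}'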

@mitchellmaler

Still ran into this running Kubernetes v1.21.3. The node was deleted and the VolumeAttachment was still around, referencing the old node name. This required us to update the finalizers on each resource so that it could be deleted and the pods could start up.

@smartbit

This one-liner removes the finalizers from volumeattachments with a detachError:

kubectl get volumeattachments \
-o=custom-columns='NAME:.metadata.name,UUID:.metadata.uid,NODE:.spec.nodeName,ERROR:.status.detachError' \
--no-headers | grep -vE '<none>$' | awk '{print $1}' | \
xargs -n1 kubectl patch -p '{"metadata":{"finalizers":[]}}' --type=merge volumeattachments

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2021
@divyenpatel
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2022
@braunsonm

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2022
@tgelter

tgelter commented Jul 12, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@divyenpatel
Member

This issue is fixed by PR #1879.

@tgelter

tgelter commented Jul 25, 2022

This issue is fixed by PR #1879.

This is great news, thank you!
Is there a release version you can share that targets #1879 for inclusion?
