Storage: devicePath is empty while WaitForAttach in StatefulSets #67342
Comments
/sig storage
@fntlnz When this happens, what is the status of the EBS volume? Is it still attached to the node where you deleted and recreated the pod?
@gnufied As far as I can remember the volume is released; I remember this because I was looking at the volumes in the AWS console and they were blue (not attached).
FWIW - devicePath being empty for iSCSI is expected. iSCSI does not perform "real" attach/detach, so naturally there is no devicePath. We need to see the corresponding entry for EBS when that happens. iSCSI could be a red herring.
Oh, you're right @gnufied. I'm trying to find a way to reproduce this reliably, will keep this issue posted.
@gnufied It just happened again, and I noted that when this error occurs the volume is marked as attached on AWS.
It is also not listed in any process. The only relevant log I see in the kubelet is:
I'm starting to think that the problem here is happening because of a dirty unmount rather than a bad mount.
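For anyone hitting this state, a rough way to cross-check it (the volume ID and device name below are placeholders, not taken from this report): ask AWS whether the volume is really attached, and check on the node whether anything is actually mounted on the device.

```sh
# AWS side: attachment state of the suspect EBS volume
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].{Instance:InstanceId,Device:Device,State:State}'

# Node side: is the block device present, and is anything mounted on it?
lsblk /dev/xvdba
grep xvdba /proc/mounts || echo "not mounted"
```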
Can you confirm whether the unmount is left over from the previous pod, or whether the unmounted volume is because of the new pod (which triggered attach but did not mount)? One way of confirming that would be to check whether the device name remains the same.
Hi, we're also facing the same issue. We can confirm that the unmount is left over from the previous pod. The difference in our scenario is that we were trying to upgrade from 1.11.1 to 1.11.2. We initially thought it had something to do with the versions. But here is our hypothesis: when the pod gets deleted the first time, it leaves the mount behind. When the scheduler puts the container back on the kubelet, the kubelet tries to mount again, but the
It makes me wonder why
/cc @BenChapman
@gnufied the device name remains the same in my case
Looks like the kube e2e tests are also running into this as part of
I think I found the cause of this bug and enhanced the current unit test to reproduce it 100% of the time: wph95#1
I ran the kubelet with --v=10 in my test cluster. In our scenario (intermittent long write operations on the disk), this bug triggers about 10% of the time. By analyzing the logs I found the cause of the problem, and I succeeded in reproducing it by adding a new unit test that proves the bug exists. The cause is that AWS EBS sometimes makes attacher.UnmountDevice slow (10s or more). UnmountDevice is an asynchronous function, and while it is still running, some code in reconciler.reconcile can't run as expected. cc @gnufied - I'm glad to contribute code and want to. By the way, I think the reconciler lifecycle is complex; I've spent a long time searching for a PR/issue about the reconciler lifecycle but didn't find one, so I don't know how to fix the bug correctly. Maybe we should ensure that MarkVolumeAsDetached is executed only after UnmountDevice has finished. P.S. English is not my mother tongue; please excuse any errors on my part. If I haven't been clear, please see wph95#1 or mention me :)
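A minimal way to watch for the ordering described above on an affected node (assuming a systemd-managed kubelet running at a high log verbosity; the exact log lines vary by Kubernetes version):

```sh
# Follow kubelet logs and check whether the old pod's device unmount is still
# in flight when the new pod's WaitForAttach starts for the same volume
journalctl -u kubelet -f \
  | grep -E 'UnmountDevice|WaitForAttach|devicePath is empty'
```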
We were hit by this issue. A dirty workaround is to restart the kubelet on the affected node.
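For reference, that workaround is just a kubelet restart on the affected node (assuming a systemd-managed kubelet); it forces the kubelet to rebuild its in-memory actual state of world:

```sh
# On the node whose pod is stuck on the "devicePath is empty" error
sudo systemctl restart kubelet
```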
/assign @ddebroy
We see the same issue, in particular when the same volume is repeatedly mounted/unmounted on the same node. Here are some logs showing a successful mount, an unmount, and then a failed mount with an empty device path:
@gtie - What's your Kubernetes version?
@fntlnz, thanks for the input! I have this issue on K8s v1.10.7. An upgrade should be coming in the next few weeks; we'll see if it appears again afterwards.
I see the same issue from time to time on one of my OpenStack k8s clusters (v1.11.3).
The cluster has only one worker node. When the error occurs, the node shows the respective volume as attached:
I hit this error in one pod of a StatefulSet on k8s v1.11.5:
When I look at the node, I see the devicePath looks normal:
After deleting the pod, the problem resolved itself.
We see the same issue with Kubernetes v1.10.2. When I look at the node, I see the devicePath looks normal:
Hey guys, I think @jsafrane fixed this issue via #71074, which makes sure that if
@gnufied - We're getting this same issue with Argo (https://github.com/argoproj/argo), as is @dguendisch above. We're heavily invested technically in using kops, which is just now supporting 1.11 in beta, so we're really looking forward to a fix being backported. (We upgraded to 1.10.12, which contains the #71074 patch, but it does not remove the error for us.)
We faced this issue after upgrading one of our kops clusters from 1.10.11 to 1.11.6. FYI, we tried restarting the 1.11.6 kubelet, without success.
@DaveKriegerEmerald where do you see the #71074 patch being backported to 1.10.12? I only created backports for 1.11 and 1.12.
@gnufied Thanks for replying! My mistake; I was looking at #71145 (comment) when I wrote that. However, kops v1.11.0 is now out of beta; I upgraded my Argo K8s cluster to v1.11.6 but we're still getting "devicePath is empty" errors. So it looks like the failure mechanism for this ticket may be different from what #71074 fixes.
@DaveKriegerEmerald I traced the 1.11 branch PR and I believe the fix is not in v1.11.6, so it should land in v1.11.7.
I am seeing the problem on 1.11.5 (eks.1), when the pod is being recreated on the same node as the one it was replacing. It does not happen all the time. The current fix for me is to go into AWS and detach the volume manually, then delete the pod; it will then start up properly again.
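A sketch of that manual workaround using the AWS CLI and kubectl (the volume ID, pod name, and namespace are placeholders):

```sh
# Detach the stuck EBS volume (add --force only if a normal detach hangs)
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
# Wait until AWS reports the volume as available again
aws ec2 wait volume-available --volume-ids vol-0123456789abcdef0
# Delete the stuck pod so the StatefulSet controller recreates it and
# triggers a fresh attach
kubectl delete pod my-statefulset-0 -n my-namespace
```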
Great! We have a workaround for the problem (revising our Argo workflow so steps no longer try to share a volume) and will revisit the issue after our next K8s version upgrade.
I can confirm that this particular issue is solved in 1.12.5 -- I have been running a PVC stress tester which basically waits for a StatefulSet to become stable and then rapidly evicts pods for rescheduling. I had never been able to run it successfully for more than 2 hours before hitting this issue, but with 1.12.5 I have been running it successfully for 36 hours.
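The stress tester itself isn't linked here; a hypothetical loop in the same spirit (all names are placeholders) would wait for the StatefulSet to settle and then evict a pod:

```sh
# Repeatedly wait for the StatefulSet to be fully ready, then delete one of
# its pods so it gets rescheduled (ideally back onto the same node)
while true; do
  kubectl rollout status statefulset/my-statefulset -n my-namespace
  kubectl delete pod my-statefulset-0 -n my-namespace
done
```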
We are on
Seeing this issue in AWS EKS, for which the highest available version is currently the one listed at https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html. A backport would be greatly appreciated.
We are also seeing this in EKS 1.11.5 and would also appreciate a backport. Thanks!
@drewhemm @mmalex the fix in #71074 has been backported to the 1.11.x release branch but didn't reach 1.11.5; the first release in the 1.11 series that has it is 1.11.7. See: https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html
We ran into this in EKS as well, like @mmalex. A workaround for us was to scale the deployment to 0 replicas, wait for the EBS volume to show as detached in the EC2 console, then scale the deployment back to the original replica count, and it mounted successfully.
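A rough sketch of that workaround with kubectl and the AWS CLI (the deployment name, namespace, replica count, and volume ID are placeholders):

```sh
# Scale the workload down so the kubelet unmounts and AWS detaches the volume
kubectl scale deployment/my-app --replicas=0 -n my-namespace
# Wait until the EBS volume is no longer attached to any instance
aws ec2 wait volume-available --volume-ids vol-0123456789abcdef0
# Scale back up; the fresh attach should populate devicePath again
kubectl scale deployment/my-app --replicas=3 -n my-namespace
```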
@fntlnz Looks like the Amazon EKS-optimized AMI is still running version 1.11.5. This patch also requires the kubelet to be updated, right? Thanks.
FYI, this issue is fixed by #71074.
Shall we close this issue now?
This seems to be fixed, closing it. Feel free to reopen if needed.
Small world :D For us, recreating the job with a pause for the volume to detach made it work.
Yes - I just verified yesterday on our EKS cluster that you need to upgrade the nodes as well. If all of the nodes were created in an autoscaling group, then the process is pretty simple:
Make sure that your CloudFormation template has the right AMI.
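A rough sketch of the node-rotation part, assuming the autoscaling group (or its CloudFormation template) has already been updated to the new AMI; node and instance IDs are placeholders:

```sh
# Move workloads off the old node
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-local-data
# Terminate the old instance; the autoscaling group replaces it with a node
# built from the updated AMI
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity
```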
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
I created a StatefulSet with 4 replicas and it worked correctly. After some time I needed to restart one of the pods, and when it came back it was stuck on this error:
When that happens I can get it working by deleting the pod again; the error can happen again after that, but usually not three times in a row.
The main issue is that if you don't act on it manually, it will continue to be reconciled by reconciler.go and will never come back again. The issue seems to be in actual_state_of_world.go while doing the MarkVolumeAsAttached part: at some point the devicePath string is not written into the object.
What you expected to happen:
The pod comes back with no error.
How to reproduce it (as minimally and precisely as possible):
The problem seems to be difficult to reproduce; I can trigger it after the
Upgrading to 1.11 does not solve the problem.
It does not happen with Deployments; I haven't been able to reproduce it there.
At this point one of two things can happen:
OR
I haven't found a reliable way to make one or the other happen on demand; it seems to be very random, but I'm sure that it only happens when the pod is recreated on the same node.
StatefulSet to reproduce:
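The original manifest isn't captured above; a minimal hypothetical StatefulSet of the kind described (4 replicas, one EBS-backed PVC per pod via volumeClaimTemplates, placeholder names, and a gp2 StorageClass) would look roughly like:

```sh
# Apply a small StatefulSet with per-pod EBS volumes (all names are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None
  selector:
    app: web
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp2
      resources:
        requests:
          storage: 1Gi
EOF
```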
Anything else we need to know?:
When this happens, if one looks at devicePath in the node's status it will be reported empty; one can verify that with:
I found some other users on Slack that have this problem; @wirewc sent me this (note the empty devicePath happening in his system). Also, @ntfrnzn detailed a similar issue here: equinixmetal-archive/csi-packet#8
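The exact command isn't preserved above; one way to inspect the devicePath the node reports for its attached volumes (the node name is a placeholder):

```sh
# Print each attached volume and its devicePath from the node's status
kubectl get node <node-name> \
  -o jsonpath='{range .status.volumesAttached[*]}{.name}{"\t"}{.devicePath}{"\n"}{end}'
```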
Environment:
- Kubernetes version (use kubectl version): I'm hitting this on a production cluster on 1.10.3, but I get the same error on a testing cluster that has 1.11
- Kernel (e.g. uname -a): Linux ip-180-12-0-57 4.14.59-coreos-r2 #1 SMP Sat Aug 4 02:49:25 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux