Storage: devicePath is empty while WaitForAttach in StatefulSets #67342

Closed
fntlnz opened this issue Aug 13, 2018 · 40 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@fntlnz
Contributor

fntlnz commented Aug 13, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

I created a StatefulSet with 4 replicas and it worked correctly.
After some time I needed to restart one of the pods, and when it came back it was stuck on this error:

Events:
  Type     Reason       Age               From                                   Message
  ----     ------       ----              ----                                   -------
  Warning  FailedMount  9m (x40 over 1h)  kubelet, ip-180-12-10-58.ec2.internal  Unable to mount volumes for pod "storage-0_twodotoh(51577cea-9ccd-11e8-b024-1232e142048e)": timeout expired waiting for volumes to attach or mount for pod "myorg"/"mypod-0". list of unmounted volumes=[data]. list of unattached volumes=[data mypod-config default-token-cjfqp]
  Warning  FailedMount  3m (x55 over 1h)  kubelet, ip-180-12-10-58.ec2.internal  MountVolume.WaitForAttach failed for volume "pvc-a12b7de1-30ed-11ee-a324-2232d546216c" : WaitForAttach failed for AWS Volume "aws://us-east-1b/vol-045d3gx6hg53gz341": devicePath is empty.

When that happens I can get it working by deleting the pod again; the error may recur after that, but usually not three times in a row.

The main issue is that if you don't intervene manually, the pod keeps being reconciled by reconciler.go and never comes back.

The issue seems to be in actual_state_of_world.go during MarkVolumeAsAttached: at some point the devicePath string is not written to the object.

actual_state_of_world.go:616 ->
  reconciler.go:238 ->
    operation_executor.go:712 ->
       operation_generator.go:437 -> error on line 496
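
For reference, the "devicePath is empty" string comes from a guard early in the EBS WaitForAttach path: when the devicePath recorded in the kubelet's cache is empty, the operation fails immediately instead of polling for the block device. A minimal Go sketch of that behavior (simplified and illustrative, not the exact upstream code; the device-probing part is a placeholder):

package main

import (
	"fmt"
	"os"
	"time"
)

// waitForAttach is a simplified sketch of the guard that produces the
// "devicePath is empty" error: if the cached devicePath is empty, the mount
// operation fails right away instead of polling for the device to appear.
func waitForAttach(volumeID, devicePath string, timeout time.Duration) (string, error) {
	if devicePath == "" {
		return "", fmt.Errorf("WaitForAttach failed for AWS Volume %q: devicePath is empty", volumeID)
	}

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	deadline := time.After(timeout)

	for {
		select {
		case <-ticker.C:
			// Placeholder for the real device probing: here we just check
			// whether the device node exists on the host.
			if _, err := os.Stat(devicePath); err == nil {
				return devicePath, nil
			}
		case <-deadline:
			return "", fmt.Errorf("timed out waiting for AWS Volume %q at %q", volumeID, devicePath)
		}
	}
}

func main() {
	// With an empty devicePath the call fails immediately, matching the
	// FailedMount event shown above.
	if _, err := waitForAttach("aws://us-east-1b/vol-045d3gx6hg53gz341", "", 10*time.Second); err != nil {
		fmt.Println(err)
	}
}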

What you expected to happen:

The pod comes back with no error.

How to reproduce it (as minimally and precisely as possible):
The problem seems to be difficult to reproduce; I can trigger it with the steps below.
Upgrading to 1.11 does not solve the problem.
It does not happen on Deployments; I haven't been able to reproduce it there.

  • Start a Kubernetes cluster on AWS that is configured to use EBS volumes
  • Create a statefulset with a dynamic provisioned volume (see yaml file below)
  • Delete one of the pods of your choice

At this point one of two things can happen:

  • It just works
    OR
  • The pod is not able to come back and gives the error I reported above.

I haven't found a reliable way to make one or the other happen on demand; it seems very random, but I'm sure it only happens when the pod is recreated on the same node.

Statefulset to reproduce

apiVersion: v1
kind: Namespace
metadata:
  name: repro-devicepath
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myrepro
  namespace: repro-devicepath
  labels:
    component: myrepro
spec:
  serviceName: myrepro
  selector:
    matchLabels:
      component: myrepro
  replicas: 4
  template:
    metadata:
      name: myrepro
      labels:
        component: myrepro
    spec:
      containers:
        - name: myrepro
          image: docker.io/fntlnz/caturday:latest
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
  - metadata:
      namespace: repro-devicepath
      name: data
    spec:
      storageClassName: ebs-1
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi

Anything else we need to know?:

When this happens, the devicePath in the node's status is reported as empty; one can verify that with:

 kubectl get node -o json | jq ".items[].status.volumesAttached" 
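
The same check can be done programmatically. Below is a small sketch using client-go (assuming a recent client-go where List takes a context, and a kubeconfig available via $KUBECONFIG) that prints every attached volume whose devicePath is empty:

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the kubeconfig pointed to by $KUBECONFIG.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print every attached volume whose devicePath is empty, i.e. the
	// condition that makes MountVolume.WaitForAttach fail.
	for _, node := range nodes.Items {
		for _, av := range node.Status.VolumesAttached {
			if av.DevicePath == "" {
				fmt.Printf("node %s: volume %s has an empty devicePath\n", node.Name, av.Name)
			}
		}
	}
}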

I found some other users on Slack who have this problem; @wirewc sent me this (note the empty devicePath in his system):

 volumesAttached:
 - devicePath: ""
   name: kubernetes.io/iscsi/10.48.147.131:iqn.2016-12.org.gluster-block:b5a96cbd-926b-421f-922b-4df13ca150e0:0
 volumesInUse:
 - kubernetes.io/iscsi/10.48.147.131:iqn.2016-12.org.gluster-block:b5a96cbd-926b-421f-922b-4df13ca150e0:0
 - kubernetes.io/iscsi/pvc-bb8b444f-9a68-11e8-b661-0050569c4ace:pvc-bb8b444f-9a68-11e8-b661-0050569c4ace:0

Also @ntfrnzn detailed a similar issue here: equinixmetal-archive/csi-packet#8

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

I'm hitting this on a production cluster on 1.10.3, but I get the same error on a testing cluster running 1.11.

  • Cloud provider or hardware configuration: AWS, deployed using kubeadm
  • OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a): Linux ip-180-12-0-57 4.14.59-coreos-r2 #1 SMP Sat Aug 4 02:49:25 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Aug 13, 2018
@fntlnz
Contributor Author

fntlnz commented Aug 13, 2018

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 13, 2018
@gnufied
Member

gnufied commented Aug 13, 2018

@fntlnz When this happens, what is the status of EBS volume? Is that still attached to the node where you deleted and recreated the pod?

@fntlnz
Contributor Author

fntlnz commented Aug 13, 2018

@gnufied as far as I can remember the volume is released; I remember this because I was looking at the volumes in the AWS console and they were blue (not attached).

@gnufied
Member

gnufied commented Aug 13, 2018

FWIW - devicePath being empty for iSCSI is expected. iSCSI does not perform "real" attach/detach, so naturally there is no devicePath - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/iscsi/attacher.go#L58

We need to see the corresponding entry for EBS when that happens. iSCSI could be a red herring.

@fntlnz
Contributor Author

fntlnz commented Aug 13, 2018

Oh, you’re right @gnufied. I’m trying to find a way to reproduce this reliably; I'll keep this issue posted.

@fntlnz
Contributor Author

fntlnz commented Aug 15, 2018

@gnufied it just happened again, and I noted that when this error occurs the volume is marked as attached on AWS as xvdbp, and it is visible with lsblk but empty:

NAME    MAJ:MIN   RM  SIZE RO TYPE  MOUNTPOINT
xvda    202:0      0   50G  0 disk  
|-xvda1 202:1      0  128M  0 part  /boot
|-xvda2 202:2      0    2M  0 part  
|-xvda3 202:3      0    1G  0 part  
|-xvda4 202:4      0    1G  0 part  
| `-usr 254:0      0 1016M  1 crypt /usr
|-xvda6 202:6      0  128M  0 part  /usr/share/oem
|-xvda7 202:7      0   64M  0 part  
`-xvda9 202:9      0 47.7G  0 part  /
xvdbi   202:15360  0 1000G  0 disk  /var/lib/kubelet/pods/xxxxxxxx-91eb-11e8-a727-1232e142048e/volumes/kubernetes.io~aws-ebs/pvc-xxxxxxx-9142-11e8-a727-xxxxxxxxxx
xvdbp   202:17152  0 1000G  0 disk  
xvdca   202:19968  0    5G  0 disk  /var/lib/kubelet/pods/xxxxxxx-6348-11e8-b117-xxxxxxxxxxxe/volumes/kubernetes.io~aws-ebs/pvc-xxxxxxxx-1e08-11e8-b226-xxxxxxxxxxx
xvdcm   202:23040  0   50G  0 disk  /var/lib/kubelet/pods/xxxxxxxx-6348-11e8-b117-xxxxx/volumes/kubernetes.io~aws-ebs/pvc-xxxxxxxx-05e7-11e8-9188-xxxxxxxxx

It is also not listed in any process's mountinfo or mounts.

The only relevant log I see in the kubelet is:

Aug 15 17:01:30 ip-180-12-20-217 kubelet[856]: I0815 17:01:30.084342     856 reconciler.go:237] Starting operationExecutor.MountVolume for volume "pvc-xxxxxxxxxxx-11e8-a727-xxxxxxxx" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-xxxxxxxxxxxxx") pod "mypod-2" (UID: "xxxxxx-a0a4-11e8-b024-xxxxxxxx")
Aug 15 17:01:30 ip-xxxxxxxxxx kubelet[856]: I0815 17:01:30.084387     856 volume_host.go:219] using default mounter/exec for kubernetes.io/aws-ebs
Aug 15 17:01:30 ip-xxxxxxxxxx kubelet[856]: I0815 17:01:30.084403     856 volume_host.go:219] using default mounter/exec for kubernetes.io/aws-ebs
Aug 15 17:01:30 ip-xxxxxxxxxx kubelet[856]: I0815 17:01:30.084410     856 volume_host.go:219] using default mounter/exec for kubernetes.io/aws-ebs

I'm starting to think that the problem here is caused by a dirty unmount rather than a bad mount.

@gnufied
Member

gnufied commented Aug 15, 2018

Can you confirm whether the unmount is left over from the previous pod, or whether the unmounted volume comes from the new pod (which triggered the attach but did not mount)? One way of confirming that would be to check whether the device name (/dev/xvdbp) changes.

@akshaymankar

akshaymankar commented Aug 16, 2018

Hi, we're also facing the same issue. We can confirm that the unmount is left over from the previous pod. The difference in our scenario is that we were trying to upgrade from 1.11.1 to 1.11.2; we initially thought it had something to do with the versions. Here is our hypothesis:

When the pod gets deleted the first time, it leaves the mount behind.
When we restart the kubelet, it sees an extra mount and unmounts it. After the unmount, MarkDeviceAsUnmounted gets called, which sets devicePath to an empty string.

When the scheduler puts the container back on the kubelet, the kubelet tries to mount again, but the attachedVolumes map in actualStateOfWorld still holds the attachedVolume object with an empty devicePath, which causes WaitForAttach to fail.

It makes me wonder why MarkDeviceAsUnmounted has to set devicePath to empty.
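
To make the hypothesis concrete, here is a tiny Go model of that sequence (not the actual kubelet code; the type and function names are illustrative), showing how clearing devicePath in the attached-volume cache breaks the next mount on the same node:

package main

import "fmt"

// attachedVolume is a stripped-down stand-in for the entry the kubelet keeps
// in its actual state of the world; field names are illustrative.
type attachedVolume struct {
	devicePath    string
	deviceMounted bool
}

// markDeviceAsUnmounted models the behavior described in the hypothesis:
// the unmount also wipes the cached devicePath.
func markDeviceAsUnmounted(v *attachedVolume) {
	v.deviceMounted = false
	v.devicePath = "" // the step the hypothesis questions
}

// waitForAttach models the later mount attempt, which only consults the cache.
func waitForAttach(v *attachedVolume) error {
	if v.devicePath == "" {
		return fmt.Errorf("WaitForAttach failed: devicePath is empty")
	}
	return nil
}

func main() {
	vol := &attachedVolume{devicePath: "/dev/xvdbp", deviceMounted: true}

	// 1. The pod is deleted; the kubelet (after a restart) unmounts the
	//    leftover device and clears the cached devicePath.
	markDeviceAsUnmounted(vol)

	// 2. The pod is rescheduled onto the same node; the volume is still
	//    attached, so no new attach happens and the cache is never refreshed.
	if err := waitForAttach(vol); err != nil {
		fmt.Println(err) // WaitForAttach failed: devicePath is empty
	}
}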

/cc @BenChapman

@fntlnz
Contributor Author

fntlnz commented Aug 16, 2018

@gnufied the device name remains the same in my case /dev/xvdbp

@ddebroy
Member

ddebroy commented Aug 24, 2018

Looks like the kube e2e is also running into this as part of the pull-kubernetes-e2e-kops-aws [sig-storage] Dynamic Provisioning DynamicProvisioner "should provision storage with different parameters" tests that target EBS provisioning.

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/67530/pull-kubernetes-e2e-kops-aws/103140/

Aug 24 22:25:51.337: INFO: At 2018-08-24 22:10:49 +0000 UTC - event for pvc-volume-tester-stzb2: {kubelet ip-172-20-55-168.us-west-2.compute.internal} Created: Created container
Aug 24 22:25:51.337: INFO: At 2018-08-24 22:10:50 +0000 UTC - event for pvc-volume-tester-27r57: {default-scheduler } Scheduled: Successfully assigned e2e-tests-volume-provisioning-2qbc4/pvc-volume-tester-27r57 to ip-172-20-55-168.us-west-2.compute.internal
Aug 24 22:25:51.337: INFO: At 2018-08-24 22:10:50 +0000 UTC - event for pvc-volume-tester-27r57: {kubelet ip-172-20-55-168.us-west-2.compute.internal} FailedMount: MountVolume.WaitForAttach failed for volume "pvc-7c7f5f34-a7ea-11e8-99eb-02c590dc6280" : WaitForAttach failed for AWS Volume "aws://us-west-2b/vol-0194c1ef086c2bb42": devicePath is empty.
Aug 24 22:25:51.337: INFO: At 2018-08-24 22:10:51 +0000 UTC - event for pvc-volume-tester-stzb2: {kubelet ip-172-20-55-168.us-west-2.compute.internal} SandboxChanged: Pod sandbox changed, it will be killed and re-created.
Aug 24 22:25:51.337: INFO: At 2018-08-24 22:12:53 +0000 UTC - event for pvc-volume-tester-27r57: {kubelet ip-172-20-55-168.us-west-2.compute.internal} FailedMount: Unable to mount volumes for pod "pvc-volume-tester-27r57_e2e-tests-volume-provisioning-2qbc4(8f49de8c-a7ea-11e8-99eb-02c590dc6280)": timeout expired waiting for volumes to attach or mount for pod "e2e-tests-volume-provisioning-2qbc4"/"pvc-volume-tester-27r57". list of unmounted volumes=[my-volume]. list of unattached volumes=[my-volume default-token-zxnz8]

@wph95

wph95 commented Aug 26, 2018

I think I found the cause of this bug, and I enhanced the current unit test to reproduce it 100% of the time: wph95#1

@wph95

wph95 commented Aug 26, 2018

I ran the kubelet with --v=10 in my test cluster; in our scenario (with intermittent long write operations on the disk) the bug triggers about 10% of the time. By analyzing the logs I found the cause of the problem, and I succeeded in reproducing it by adding a new unit test that proves the bug exists.

The cause is that AWS EBS sometimes makes attacher.UnmountDevice slow (10s or more). UnmountDevice is asynchronous, and while it is running the desiredStateOfWorldPopulator updates the desired_state_of_world status, so rc.desiredStateOfWorld.VolumeExists(ErrorVolume) == true at the wrong time.

This causes part of reconciler.reconcile to not run as expected: the MarkVolumeAsDetached code path is not executed, so the volumeToMount never gets a chance to run VerifyControllerAttachedVolume and refresh devicePath to the right value, and reconcile falls into an error loop.

cc @gnufied

I'm glad to contribute code. By the way, I think the reconciler lifecycle is complex; I've searched for a long time for a PR/issue about the reconciler lifecycle but didn't find one, so I don't know how to fix the bug correctly. Maybe we should ensure that MarkVolumeAsDetached is executed after UnmountDevice completes.

P.S. English is not my mother tongue; please excuse any errors on my part. If anything is unclear, please see wph95#1 or mention me :)

@r7vme

r7vme commented Aug 28, 2018

We were hit by this issue. A dirty workaround is to restart the kubelet on the affected node.

@ddebroy
Member

ddebroy commented Aug 31, 2018

/assign @ddebroy

@gtie

gtie commented Oct 19, 2018

We see the same issue, in particular when the same volume is repeatedly mounted/unmounted on the same node. Here are some logs showing a successful mount, an unmount, and then a failed mount with an empty device path:

Oct 17 09:08:11 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1017 09:08:11.107170   27019 operation_generator.go:495] MountVolume.WaitForAttach succeeded for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") pod "APP-storage--sit-1" (UID: "250b0be6-d1ec-11e8-9483-0614d864468a") DevicePath "/dev/xvdby"
Oct 17 09:08:11 ip-XX-XX-XX-XX.eu-west-1.compute.internal systemd: Started Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-west-1b/vol-09afde99a76e386df.
Oct 17 09:08:11 ip-XX-XX-XX-XX.eu-west-1.compute.internal systemd: Starting Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-west-1b/vol-09afde99a76e386df.
Oct 17 09:08:11 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1017 09:08:11.165463   27019 operation_generator.go:514] MountVolume.MountDevice succeeded for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") pod "APP-storage--sit-1" (UID: "250b0be6-d1ec-11e8-9483-0614d864468a") device mount path "/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-west-1b/vol-09afde99a76e386df"
Oct 17 09:08:11 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1017 09:08:11.254673   27019 operation_generator.go:557] MountVolume.SetUp succeeded for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") pod "APP-storage--sit-1" (UID: "250b0be6-d1ec-11e8-9483-0614d864468a")
Oct 18 09:52:50 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1018 09:52:50.932279   27019 reconciler.go:181] operationExecutor.UnmountVolume started for volume "data" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") pod "250b0be6-d1ec-11e8-9483-0614d864468a" (UID: "250b0be6-d1ec-11e8-9483-0614d864468a")
Oct 18 09:52:51 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1018 09:52:51.137616   27019 reconciler.go:278] operationExecutor.UnmountDevice started for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") on node "ip-XX-XX-xX-XX.eu-west-1.compute.internal"
Oct 18 09:53:01 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1018 09:53:01.015286   27019 operation_generator.go:760] UnmountDevice succeeded for volume "pvc-f8041f35cf8e-11e8-9483-0614d864468a" % !(EXTRA string=UnmountDevice succeeded for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") on node "ip-XX-XX-XX-XX.eu-west-1.compute.internal" )
Oct 18 09:53:01 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1018 09:53:01.082776   27019 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") pod "APP-storage--sit-1" (UID: "952d07df-d2bb-11e8-9483-0614d864468a"
Oct 18 09:53:01 ip-XX-XX-XX-XX.eu-west-1.compute.internal kubelet: I1018 09:53:01.083038   27019 operation_generator.go:486] MountVolume.WaitForAttach entering for volume "pvc-f8041f35-cf8e-11e8-9483-0614d864468a" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-west-1b/vol-09afde99a76e386df") pod "APP-storage--sit-1" (UID: "952d07df-d2bb-11e8-9483-0614d864468a") DevicePath ""

@fntlnz
Contributor Author

fntlnz commented Oct 19, 2018

@gtie - What's your kubernetes version?
I stopped seeing this after upgrading to 1.11.2 and I don't see it now on 1.12.1

@gtie

gtie commented Oct 19, 2018

@fntlnz, thanks for the input! I have this issue on K8s v1.10.7. An upgrade should be coming in the next few weeks; we'll see if it appears again afterwards.

@dguendisch

I see the same issue from time to time on one of my OpenStack k8s clusters (v1.11.3).
I've seen it when experimenting with Argo workflows, which repeatedly create/delete pods that reuse one and the same PV. Intermittently pods fail:

...
Events:
  Type     Reason       Age                   From                                                               Message
  ----     ------       ----                  ----                                                               -------
  Normal   Scheduled    6m52s                 default-scheduler                                                  Successfully assigned default/testrun-123-xfphb-4092629371 to shoot--core--os-worker-c3xt0-z1-645ffb5849-vkgmj
  Warning  FailedMount  40s (x11 over 6m52s)  kubelet, shoot--core--os-worker-c3xt0-z1-645ffb5849-vkgmj  MountVolume.WaitForAttach failed for volume "pvc-de123641-de80-11e8-afa1-d6a1d6ffefab" : WaitForAttach failed for Cinder disk "0c6c0878-970b-4be5-b12c-a742e7cd5cfa": devicePath is empty
  Warning  FailedMount  18s (x3 over 4m49s)   kubelet, shoot--core--os-worker-c3xt0-z1-645ffb5849-vkgmj  Unable to mount volumes for pod "testrun-123-xfphb-4092629371_default(eee65d37-de80-11e8-afa1-d6a1d6ffefab)": timeout expired waiting for volumes to attach or mount for pod "default"/"testrun-123-xfphb-4092629371". list of unmounted volumes=[shared-volume]. list of unattached volumes=[podmetadata docker-lib docker-sock shared-volume default-token-qwvs6]

The cluster has only one worker node. When the error occurs the node shows the respective volume as attached:

...
  volumesAttached:
  - devicePath: /dev/vdb
    name: kubernetes.io/cinder/0c6c0878-970b-4be5-b12c-a742e7cd5cfa
  volumesInUse:
  - kubernetes.io/cinder/0c6c0878-970b-4be5-b12c-a742e7cd5cfa

@blkerby

blkerby commented Dec 8, 2018

I hit this error in one pod of a StatefulSet on k8s v1.11.5:

Warning  FailedMount  3m16s (x239 over 7h53m)  kubelet, ip-10-200-142-34.us-west-2.compute.internal  MountVolume.WaitForAttach failed for volume "pvc-cc9ba128-f32c-11e8-b598-0258099978de" : WaitForAttach failed for AWS Volume "aws://us-west-2a/vol-06cab23babe87dd7a": devicePath is empty.

When I look at the node, I see the devicePath looks normal:

 volumesAttached:
  - devicePath: /dev/xvdbi
    name: kubernetes.io/aws-ebs/aws://us-west-2a/vol-06cab23babe87dd7a
...
  volumesInUse:
  - kubernetes.io/aws-ebs/aws://us-west-2a/vol-06cab23babe87dd7a

After deleting the pod, the problem resolved.

@deshui123

deshui123 commented Dec 12, 2018

We see the same issue with Kubernetes v1.10.2.
The pod status is "ContainerCreating" and the events are:
Warning FailedMount 7m (x384 over 14h) kubelet, dev-g02-w-12 Unable to mount volumes for pod "es-data-1_paas(caaca102-fd30-11e8-9806-fa163eab274f)": timeout expired waiting for volumes to attach or mount for pod "paas"/"es-data-1". list of unmounted volumes=[datadir]. list of unattached volumes=[datadir default-token-rf4f7]
Warning FailedMount 56s (x439 over 14h) kubelet, dev-g02-w-12 MountVolume.WaitForAttach failed for volume "pvc-d9545147-c322-11e8-b95f-fa163e2dd5ca" : WaitForAttach failed for Cinder disk "2a0fe86e-4314-4816-9b57-24e56b927478": devicePath is empty

When I look at the node, I see the devicePath looks normal:
"devicePath": "/dev/vdf", "name": "kubernetes.io/cinder/2a0fe86e-4314-4816-9b57-24e56b927478"
But in lsblk, the MOUNTPOINT for vdf is empty:
[root@dev-g02-w-12 /var/log]$ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 464K 0 rom
vda 253:0 0 80G 0 disk
├─vda1 253:1 0 4G 0 part [SWAP]
└─vda2 253:2 0 76G 0 part /
vdb 253:16 0 100G 0 disk /data0
vdc 253:32 0 10G 0 disk /var/lib/kubelet/pods/c3550b44-fd30-11e8-9806-fa163eab274f/volumes/kubernetes.io~cinder/pvc-ef2a488d-c7a1-11e8-9246-fa163e2dd5ca
vdd 253:48 0 1G 0 disk /var/lib/kubelet/pods/c3550b44-fd30-11e8-9806-fa163eab274f/volumes/kubernetes.io~cinder/pvc-ef277b4e-c7a1-11e8-9246-fa163e2dd5ca
vde 253:64 0 5G 0 disk /var/lib/kubelet/pods/c3f07abc-fd30-11e8-9806-fa163eab274f/volumes/kubernetes.io~cinder/pvc-d92e93f8-f150-11e8-9806-fa163eab274f
vdf 253:80 0 200G 0 disk

@gnufied
Member

gnufied commented Dec 14, 2018

Hey folks, I think @jsafrane fixed this issue via #71074, which makes sure that when UnmountVolume is called we do not remove devicePath from the cache. If any of the folks facing this issue can try 1.13, which includes the aforementioned patch, we will consider backporting it to 1.12 and 1.11.
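
A hedged sketch of what that amounts to (illustrative only, not the actual patch in #71074): when the device is marked unmounted, the cached devicePath is left alone, so a later mount attempt on the same node still sees a usable path.

package main

import "fmt"

// cachedVolume is a stand-in for the attached-volume entry in the kubelet's
// actual state of the world; names are illustrative.
type cachedVolume struct {
	devicePath    string
	deviceMounted bool
}

// markDeviceAsUnmounted, per the description of the fix, clears only the
// mounted flag and leaves devicePath alone.
func markDeviceAsUnmounted(v *cachedVolume) {
	v.deviceMounted = false
	// devicePath is intentionally preserved, so a subsequent WaitForAttach
	// on the same node does not see an empty string.
}

func main() {
	vol := &cachedVolume{devicePath: "/dev/xvdbp", deviceMounted: true}
	markDeviceAsUnmounted(vol)
	fmt.Printf("devicePath after unmount: %q\n", vol.devicePath) // "/dev/xvdbp", not ""
}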

@DaveKriegerEmerald

DaveKriegerEmerald commented Dec 25, 2018

@gnufied - We're getting this same issue with Argo (https://github.com/argoproj/argo) as @dguendisch, see above. We're heavily invested technically in using Kops, which is just now supporting 1.11 in beta, so we're really looking forward to a fix being backported. (We upgraded to 1.10.12 which contains the #71074 patch but does not remove the error for us.)

@jjo

jjo commented Jan 7, 2019

We faced this issue after upgrading one of our kops clusters from 1.10.11 to 1.11.6. It only happened for a specific workload (deploy+pvc) that is pinned to a specific node, while it didn't show up for many other un-pinned deploy+pvc and sts workloads. After manually downgrading /usr/local/bin/kubelet to 1.10.12 on that node, the PV got mounted right away (and the pod started OK).

FYI, I had tried restarting the 1.11.6 kubelet, without success.

@gnufied
Member

gnufied commented Jan 7, 2019

@DaveKriegerEmerald where do you see the #71074 patch being backported to 1.10.12? I only created backports for 1.11 and 1.12.

@DaveKriegerEmerald

DaveKriegerEmerald commented Jan 8, 2019

@gnufied Thanks for replying! My mistake; I was looking at #71145 (comment) when I wrote that. However, kops v1.11.0 is now out of beta; I upgraded my argo K8s cluster to v1.11.6 but we're still getting "devicePath is empty" errors. So it looks like the failure mechanism for this ticket may be different than what #71074 fixes.

@alebsack

@DaveKriegerEmerald I traced the 1.11 branch PR and I believe the fix is not in v1.11.6, so it should be in v1.11.7.

@tsmgeek

tsmgeek commented Jan 24, 2019

I am seeing the problem on 1.11.5 | eks.1 when the pod is recreated on the same node as the one it replaced. It does not happen all the time. My current fix is to go into AWS and detach the volume manually, then delete the pod, and it then starts up properly again.

@DaveKriegerEmerald

Great! We have a workaround for the problem (revising our Argo workflow so steps no longer try to share a volume) and will revisit the issue after our next K8s version upgrade.

@rajiteh

rajiteh commented Jan 25, 2019

I can confirm that this particular issue is solved in 1.12.5 -- I have been running a PVC stress tester which basically waits for a StatefulSet to become stable and then rapidly evicts its pods for rescheduling. I had never been able to run it for more than 2 hours before hitting this issue, but with 1.12.5 I have been running it successfully for 36 hours.

@harshkumar1

harshkumar1 commented Feb 16, 2019

@rajiteh we are using 1.12.3 and are seeing the issue, as I have described in #73887.
I would be interested in seeing which PVC stress tester you run. We are able to reproduce this issue, but I am also looking to add some tests to our suite so we can cover this case.

@Hashfyre

Hashfyre commented Feb 20, 2019

We are on 1.12.2 and seeing this issue; we will update to 1.12.5 by next week and report back whether that solves it.

@drewhemm

Seeing this issue in AWS EKS, for which the highest available version is currently 1.11.5:

https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html

A backport would be greatly appreciated.

@mmalex

mmalex commented Mar 15, 2019

We are also seeing this in EKS 1.11.5 and would also appreciate a backport. Thanks.

@fntlnz
Contributor Author

fntlnz commented Mar 16, 2019

@drewhemm @mmalex the fix in #71074 has been backported to the 1.11.x release branch but didn't reach 1.11.5; the first release in the 1.11 series that has it is 1.11.7. I see that eks.2 is on 1.11.8, so the only way to get this fixed for the two of you is to upgrade to that.

see: https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html

@max-rocket-internet

We ran into this in EKS, also like @mmalex

A workaround for us was to scale the deployment to 0 replicas, wait for EBS volume to unmount in EC2 console, then scale deployment back to original replicas and it mounted successfully.

@whispererli

@fntlnz Looks like the Amazon EKS-optimized AMI is still running version 1.11.5. This patch also requires the kubelet to be updated, right? Thanks.
https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html

@andyzhangx
Member

FYI, this issue is fixed by #71074:

k8s version | fixed version
v1.10       | no fix
v1.11       | 1.11.7
v1.12       | 1.12.5
v1.13       | no such issue

Shall we close this issue now?

@fntlnz
Contributor Author

fntlnz commented Apr 26, 2019

This seems to be fixed, closing it. Feel free to reopen if needed.

@fntlnz fntlnz closed this as completed Apr 26, 2019
@codersofthedark

codersofthedark commented May 16, 2019

We ran into this in EKS, also like @mmalex

A workaround for us was to scale the deployment to 0 replicas, wait for EBS volume to unmount in EC2 console, then scale deployment back to original replicas and it mounted successfully.

Small world :D

For us, recreating the job with a pause for the volume to detach made it work.

@pdutta777

@fntlnz Looks like the Amazon EKS-optimized AMI is still running version 1.11.5. This patch also requires the kubelet to be updated, right? Thanks.
https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html

Yes - I just verified yesterday on our EKS cluster that you need to upgrade the nodes as well. If all of the nodes were created in an autoscaling group then the process is pretty simple:

  1. kubectl get nodes
  2. kubectl drain <node-name> --ignore-daemonsets
  3. aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <instance-id> --no-should-decrement-desired-capacity

Make sure that your CloudFormation template has the right AMI.
