Storage: devicePath is empty while WaitForAttach in StatefulSets #67342

@fntlnz

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

I created a StatefulSet with 4 replicas and it worked correctly.
After some time I needed to restart one of the pods and when it came back again it was stuck on this error:

Events:
  Type     Reason       Age               From                                   Message
  ----     ------       ----              ----                                   -------
  Warning  FailedMount  9m (x40 over 1h)  kubelet, ip-180-12-10-58.ec2.internal  Unable to mount volumes for pod "storage-0_twodotoh(51577cea-9ccd-11e8-b024-1232e142048e)": timeout expired waiting for volumes to attach or mount for pod "myorg"/"mypod-0". list of unmounted volumes=[data]. list of unattached volumes=[data mypod-config default-token-cjfqp]
  Warning  FailedMount  3m (x55 over 1h)  kubelet, ip-180-12-10-58.ec2.internal  MountVolume.WaitForAttach failed for volume "pvc-a12b7de1-30ed-11ee-a324-2232d546216c" : WaitForAttach failed for AWS Volume "aws://us-east-1b/vol-045d3gx6hg53gz341": devicePath is empty.

When that happens, deleting the pod again usually gets it working. The error can recur after that, but usually not three times in a row.

The main issue is that without manual intervention the mount keeps being retried by reconciler.go and the pod never comes back.

The problem seems to be in actual_state_of_world.go, in the MarkVolumeAsAttached path: at some point the devicePath string is not written to the object.

actual_state_of_world.go:616 ->
  reconciler.go:238 ->
    operation_executor.go:712 ->
       operation_generator.go:437 -> error on line 496
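
To make the suspected failure mode concrete, here is a toy Python model of it. This is not kubelet code: every class and function name below is hypothetical, and it assumes the empty value, once recorded, is never refreshed. It only illustrates why retrying against a cached empty devicePath cannot succeed on its own.

```python
"""Toy model of the reported behavior; NOT the kubelet implementation."""
from dataclasses import dataclass, field


@dataclass
class ActualState:
    # Hypothetical stand-in for the kubelet's actual state of world:
    # maps a volume name to the device path recorded for it.
    attached: dict = field(default_factory=dict)

    def mark_volume_as_attached(self, volume: str, device_path: str) -> None:
        # The value is stored as given, even when it is empty.
        self.attached[volume] = device_path


def wait_for_attach(state: ActualState, volume: str) -> str:
    """Return the recorded device path, or raise if it is empty."""
    device_path = state.attached.get(volume, "")
    if device_path == "":
        raise RuntimeError("devicePath is empty")
    return device_path


def reconcile(state: ActualState, volume: str, passes: int) -> list:
    """Retry the mount `passes` times and collect the outcome of each pass."""
    outcomes = []
    for _ in range(passes):
        try:
            outcomes.append(wait_for_attach(state, volume))
        except RuntimeError as err:
            outcomes.append(f"error: {err}")
    return outcomes


if __name__ == "__main__":
    state = ActualState()
    state.mark_volume_as_attached("vol-example", "")  # hypothetical volume name
    print(reconcile(state, "vol-example", 3))
```

In this model every pass fails the same way until something rewrites the recorded path, which matches the observation that only deleting the pod unblocks it.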

What you expected to happen:

The pod comes back with no error.

How to reproduce it (as minimally and precisely as possible):
The problem seems to be difficult to reproduce, I can trigger it after the
Upgrading to 1.11 does not solve the problem.
It does not seem to happen with Deployments; I haven't been able to reproduce it there.

  • Start a Kubernetes cluster on AWS that is configured to use EBS volumes
  • Create a StatefulSet with a dynamically provisioned volume (see yaml file below)
  • Delete one of the pods of your choice

At this point one of two things can happen:

  • It just works
    OR
  • The pod is not able to come back again and gives the error I reported above.

I haven't found a reliable way to trigger one outcome or the other on demand. It seems to be random, but I'm sure that it only happens when the pod is recreated on the same node.

Statefulset to reproduce

apiVersion: v1
kind: Namespace
metadata:
  name: repro-devicepath
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myrepro
  namespace: repro-devicepath
  labels:
    component: myrepro
spec:
  serviceName: myrepro
  selector:
    matchLabels:
      component: myrepro
  replicas: 4
  template:
    metadata:
      name: myrepro
      labels:
        component: myrepro
    spec:
      containers:
        - name: myrepro
          image: docker.io/fntlnz/caturday:latest
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
  - metadata:
      namespace: repro-devicepath
      name: data
    spec:
      storageClassName: ebs-1
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
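
Since the failure is intermittent, the delete-and-wait step can be looped until a pod fails to recover. The following is a minimal sketch, not a tested harness: `pod_recovers` and its parameters are hypothetical names, the pod name `myrepro-0` is derived from the manifest above, and it assumes a kubectl client that provides the `wait` subcommand (which the 1.10.3 client listed under Environment may not). A pod that is not Ready after the polling budget is only presumed stuck.

```python
import subprocess
import time


def kubectl(*args: str) -> int:
    """Run kubectl with the given arguments and return its exit code."""
    return subprocess.run(["kubectl", *args]).returncode


def pod_recovers(pod: str, namespace: str, run=kubectl,
                 max_checks: int = 30, sleep=time.sleep) -> bool:
    """Delete the pod, then poll until its replacement is Ready."""
    run("delete", "pod", pod, "-n", namespace)
    for _ in range(max_checks):
        rc = run("wait", "--for=condition=Ready", f"pod/{pod}",
                 "-n", namespace, "--timeout=20s")
        if rc == 0:
            return True
        # The replacement pod may not exist yet, in which case the
        # wait call fails right away; pause before asking again.
        sleep(5)
    return False


if __name__ == "__main__":
    for attempt in range(1, 11):
        if not pod_recovers("myrepro-0", "repro-devicepath"):
            # Stop here so the stuck pod and node status can be inspected.
            print(f"attempt {attempt}: pod did not become Ready")
            break
        print(f"attempt {attempt}: pod recovered")
```

The loop stops at the first failure on purpose, because deleting the pod again is the workaround and would hide the stuck state.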

Anything else we need to know?:

When this happens, the devicePath in the node's status is reported as empty. This can be verified with:

 kubectl get node -o json | jq ".items[].status.volumesAttached" 
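
The same check can be scripted so that only the affected volumes are listed. The helper below is a sketch rather than part of any Kubernetes tooling: `find_empty_device_paths` is a hypothetical name, and it expects the JSON document produced by `kubectl get node -o json`.

```python
import json
import subprocess


def find_empty_device_paths(node_list: dict) -> list:
    """Return (node name, volume name) pairs whose devicePath is empty."""
    hits = []
    for node in node_list.get("items", []):
        node_name = node.get("metadata", {}).get("name", "<unknown>")
        # volumesAttached may be missing or null on nodes without volumes.
        attached = node.get("status", {}).get("volumesAttached") or []
        for volume in attached:
            if not volume.get("devicePath"):
                hits.append((node_name, volume.get("name", "<unknown>")))
    return hits


if __name__ == "__main__":
    # Assumes kubectl is on PATH and configured for the affected cluster.
    raw = subprocess.run(
        ["kubectl", "get", "node", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for node_name, volume_name in find_empty_device_paths(json.loads(raw)):
        print(f"{node_name}: {volume_name} has an empty devicePath")
```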

I found some other users on Slack who have this problem. @wirewc sent me this (note the empty devicePath on his system):

 volumesAttached:
 - devicePath: ""
   name: kubernetes.io/iscsi/10.48.147.131:iqn.2016-12.org.gluster-block:b5a96cbd-926b-421f-922b-4df13ca150e0:0
 volumesInUse:
 - kubernetes.io/iscsi/10.48.147.131:iqn.2016-12.org.gluster-block:b5a96cbd-926b-421f-922b-4df13ca150e0:0
 - kubernetes.io/iscsi/pvc-bb8b444f-9a68-11e8-b661-0050569c4ace:pvc-bb8b444f-9a68-11e8-b661-0050569c4ace:0

Also @ntfrnzn detailed a similar issue here: equinixmetal-archive/csi-packet#8

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

I'm hitting this on a production cluster on 1.10.3, but I get the same error on a testing cluster running 1.11.

  • Cloud provider or hardware configuration: AWS, deployed using kubeadm
  • OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a): Linux ip-180-12-0-57 4.14.59-coreos-r2 #1 SMP Sat Aug 4 02:49:25 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others:

Labels: kind/bug, sig/storage
