Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
I created a StatefulSet with 4 replicas and it worked correctly.
After some time I needed to restart one of the pods and when it came back again it was stuck on this error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 9m (x40 over 1h) kubelet, ip-180-12-10-58.ec2.internal Unable to mount volumes for pod "storage-0_twodotoh(51577cea-9ccd-11e8-b024-1232e142048e)": timeout expired waiting for volumes to attach or mount for pod "myorg"/"mypod-0". list of unmounted volumes=[data]. list of unattached volumes=[data mypod-config default-token-cjfqp]
Warning FailedMount 3m (x55 over 1h) kubelet, ip-180-12-10-58.ec2.internal MountVolume.WaitForAttach failed for volume "pvc-a12b7de1-30ed-11ee-a324-2232d546216c" : WaitForAttach failed for AWS Volume "aws://us-east-1b/vol-045d3gx6hg53gz341": devicePath is empty.
When that happens I can get it working by deleting the pod again. The error can recur after that, but usually not three times in a row.
The main issue is that without manual intervention the pod keeps being reconciled by reconciler.go and never comes back.
The issue seems to be in actual_state_of_world.go, in the MarkVolumeAsAttached step: at some point the devicePath string is not written to the object.
actual_state_of_world.go:616 ->
reconciler.go:238 ->
operation_executor.go:712 ->
operation_generator.go:437 -> error on line 496
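To make the suspected failure mode concrete, here is a toy model in Python. It is not the kubelet code: the class and function names are hypothetical stand-ins, and it only illustrates the hypothesis above, namely that once an empty devicePath is recorded, every later reconcile pass fails the same way because nothing rewrites the stored value.

```python
class ToyActualState:
    """Hypothetical stand-in for the kubelet's record of attached volumes."""

    def __init__(self):
        self.device_paths = {}

    def mark_volume_as_attached(self, volume, device_path):
        # In the suspected bug, device_path arrives here as "".
        self.device_paths[volume] = device_path


def wait_for_attach(state, volume):
    """Fail the same way the reported event does when the stored path is empty."""
    device_path = state.device_paths.get(volume, "")
    if not device_path:
        raise RuntimeError(
            f"WaitForAttach failed for volume {volume!r}: devicePath is empty"
        )
    return device_path


def reconcile(state, volume, passes):
    """Retry the mount up to `passes` times and return how many passes failed."""
    failures = 0
    for _ in range(passes):
        try:
            wait_for_attach(state, volume)
            return failures
        except RuntimeError:
            # Nothing in this loop repairs the stored path,
            # so the next pass fails in the same way.
            failures += 1
    return failures
```

In this model, deleting the pod corresponds to building a fresh state and calling mark_volume_as_attached again, which would explain why a second delete usually clears the error.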
What you expected to happen:
The pod comes back with no error.
How to reproduce it (as minimally and precisely as possible):
The problem seems to be difficult to reproduce, I can trigger it after the
Upgrading to 1.11 does not solve the problem.
It does not happen with Deployments; I haven't been able to reproduce it there.
- Start a Kubernetes cluster on AWS that is configured to use EBS volumes
- Create a StatefulSet with a dynamically provisioned volume (see yaml file below)
- Delete one of the pods of your choice
At this point one of two things can happen:
- It just works
OR
- The pod is not able to come back again and gives the error I reported above.
I haven't found a reliable way to force one outcome or the other. It seems to be very random, but I'm sure that it only happens when the pod is recreated on the same node.
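Since the outcome is random, the delete-and-check step can be scripted and repeated. The following is a rough sketch in Python that shells out to kubectl. The function names, the fixed 300 second wait, and the choice to match on the "devicePath is empty" text from the event above are my assumptions, not part of any tool.

```python
import json
import subprocess
import time

ERROR_MARKER = "devicePath is empty"  # text taken from the FailedMount event above


def failed_mount_messages(event_list, pod_uid):
    """Return FailedMount messages for the given pod UID that mention an empty devicePath."""
    return [
        ev.get("message", "")
        for ev in event_list.get("items", [])
        if ev.get("reason") == "FailedMount"
        and ev.get("involvedObject", {}).get("uid") == pod_uid
        and ERROR_MARKER in ev.get("message", "")
    ]


def kubectl_json(*args):
    """Run a kubectl command with JSON output and return the parsed result."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def delete_and_check(namespace, pod, wait_seconds=300):
    """Delete the pod, wait for its replacement, and return matching error messages.

    An empty list means the error was not observed within the wait period.
    The wait is an arbitrary assumption; adjust it for your cluster.
    """
    subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)
    time.sleep(wait_seconds)
    # The StatefulSet recreates the pod under the same name with a new UID;
    # filtering on that UID skips stale events left by the previous pod.
    new_pod = kubectl_json("get", "pod", pod, "-n", namespace)
    events = kubectl_json("get", "events", "-n", namespace)
    return failed_mount_messages(events, new_pod["metadata"]["uid"])
```

With the manifest below applied, calling delete_and_check("repro-devicepath", "myrepro-0") in a loop until it returns a non-empty list is one way to catch the failure.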
StatefulSet to reproduce:
apiVersion: v1
kind: Namespace
metadata:
  name: repro-devicepath
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myrepro
  namespace: repro-devicepath
  labels:
    component: myrepro
spec:
  serviceName: myrepro
  selector:
    matchLabels:
      component: myrepro
  replicas: 4
  template:
    metadata:
      name: myrepro
      labels:
        component: myrepro
    spec:
      containers:
      - name: myrepro
        image: docker.io/fntlnz/caturday:latest
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      namespace: repro-devicepath
      name: data
    spec:
      storageClassName: ebs-1
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
Anything else we need to know?:
When this happens, devicePath in the node's status is reported as empty. One can verify that with:
kubectl get node -o json | jq ".items[].status.volumesAttached"
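The same check can be scripted to list only the affected volumes. This is a minimal sketch in Python, assuming the JSON shape that `kubectl get node -o json` returns (an `items` list whose entries carry `status.volumesAttached` with `name` and `devicePath`). The function names are made up for illustration.

```python
import json
import subprocess


def find_empty_device_paths(node_list):
    """Return (node name, volume name) pairs whose devicePath is empty."""
    hits = []
    for node in node_list.get("items", []):
        node_name = node.get("metadata", {}).get("name", "<unknown>")
        # volumesAttached may be absent or null when nothing is attached.
        for vol in node.get("status", {}).get("volumesAttached") or []:
            if not vol.get("devicePath"):
                hits.append((node_name, vol.get("name", "<unknown>")))
    return hits


def load_nodes_from_kubectl():
    """Fetch the node list from the current kubectl context as parsed JSON."""
    result = subprocess.run(
        ["kubectl", "get", "node", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling find_empty_device_paths(load_nodes_from_kubectl()) against the affected cluster should return one pair per attached volume that lost its devicePath.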
I found some other users on Slack who have this problem. @wirewc sent me this (note the empty devicePath on his system):
volumesAttached:
- devicePath: ""
  name: kubernetes.io/iscsi/10.48.147.131:iqn.2016-12.org.gluster-block:b5a96cbd-926b-421f-922b-4df13ca150e0:0
volumesInUse:
- kubernetes.io/iscsi/10.48.147.131:iqn.2016-12.org.gluster-block:b5a96cbd-926b-421f-922b-4df13ca150e0:0
- kubernetes.io/iscsi/pvc-bb8b444f-9a68-11e8-b661-0050569c4ace:pvc-bb8b444f-9a68-11e8-b661-0050569c4ace:0
Also @ntfrnzn detailed a similar issue here: equinixmetal-archive/csi-packet#8
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
I'm hitting this on a production cluster on 1.10.3 but I get the same error on a testing cluster that has 1.11
- Cloud provider or hardware configuration: AWS, deployed using kubeadm
- OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
- Kernel (e.g. uname -a): Linux ip-180-12-0-57 4.14.59-coreos-r2 #1 SMP Sat Aug 4 02:49:25 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux
- Install tools:
- Others: