Storage: devicePath is empty while WaitForAttach in StatefulSets #67342
Comments
/sig storage
@fntlnz When this happens, what is the status of the EBS volume? Is it still attached to the node where you deleted and recreated the pod?
@gnufied As far as I can remember the volume is released; I remember this because I was looking at the volumes in the AWS console and they were blue (not attached).
FWIW - devicePath being empty for iSCSI is expected. iSCSI does not perform "real" attach/detach, so naturally there is no devicePath. We need to see the corresponding entry for EBS when that happens. iSCSI could be a red herring.
Oh, you're right @gnufied. I'm trying to find a way to reproduce this reliably, will keep this issue posted.
@gnufied It just happened again, and I noted that when this error occurs the volume is marked as attached on AWS.
It is also not listed in any process. The only relevant log I see in the kubelet is:
I'm starting to think that the problem here is happening because of a dirty unmount rather than a bad mount.
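For anyone hitting this state, a rough way to cross-check it (the volume ID and device name below are placeholders, not taken from this report): ask AWS whether the volume is really attached, and check on the node whether anything is actually mounted on the device.

```sh
# AWS side: attachment state of the suspect EBS volume
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].{Instance:InstanceId,Device:Device,State:State}'

# Node side: is the block device present, and is anything mounted on it?
lsblk /dev/xvdba
grep xvdba /proc/mounts || echo "not mounted"
```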
Can you confirm whether the unmount is left over from the previous pod, or whether the unmounted volume is because of the new pod (which triggered attach but did not mount)? One way of confirming that would be to check whether the device name remains the same.
Hi, we're also facing the same issue. We can confirm that the unmount is left over from the previous pod. The difference in our scenario is that we were trying to upgrade from 1.11.1 to 1.11.2. We initially thought it had something to do with the versions. But here is our hypothesis: when the pod gets deleted the first time, it leaves the mount behind. When the scheduler puts the container back on the kubelet, the kubelet tries to mount again, but the
It makes me wonder why
/cc @BenChapman
@gnufied the device name remains the same in my case
Looks like the kube e2e tests are also running into this as part of
I think I found the cause of this bug and enhanced the current unit test to reproduce it 100% of the time: wph95#1
I ran the kubelet with --v=10 in my test cluster. In our scenario (intermittent long write operations on the disk), this bug triggers about 10% of the time. By analyzing the logs I found the cause of the problem, and I succeeded in reproducing it by adding a new unit test that proves the bug exists. The cause is that AWS EBS sometimes makes attacher.UnmountDevice slow (10s or more). UnmountDevice is an asynchronous function, and while it is still running, some code in reconciler.reconcile can't run as expected. cc @gnufied - I'm glad to contribute code and want to. By the way, I think the reconciler lifecycle is complex; I've spent a long time searching for a PR/issue about the reconciler lifecycle but didn't find one, so I don't know how to fix the bug correctly. Maybe we should ensure that MarkVolumeAsDetached is executed only after UnmountDevice has finished. P.S. English is not my mother tongue; please excuse any errors on my part. If I haven't been clear, please see wph95#1 or mention me :)
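A minimal way to watch for the ordering described above on an affected node (assuming a systemd-managed kubelet running at a high log verbosity; the exact log lines vary by Kubernetes version):

```sh
# Follow kubelet logs and check whether the old pod's device unmount is still
# in flight when the new pod's WaitForAttach starts for the same volume
journalctl -u kubelet -f \
  | grep -E 'UnmountDevice|WaitForAttach|devicePath is empty'
```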
We were hit by this issue. A dirty workaround is to restart the kubelet on the affected node.
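For reference, that workaround is just a kubelet restart on the affected node (assuming a systemd-managed kubelet); it forces the kubelet to rebuild its in-memory actual state of world:

```sh
# On the node whose pod is stuck on the "devicePath is empty" error
sudo systemctl restart kubelet
```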
/assign @ddebroy
We see the same issue, in particular when the same volume is repeatedly mounted/unmounted on the same node. Here are some logs showing a successful mount, an unmount, and then a failed mount with an empty device path:
@gtie - What's your Kubernetes version?
@fntlnz, thanks for the input! I have this issue on K8s v1.10.7. An upgrade should be coming in the next few weeks; we'll see if it appears again afterwards.
I see the same issue from time to time on one of my OpenStack k8s clusters (v1.11.3).
The cluster has only one worker node. When the error occurs, the node shows the respective volume as attached:
I hit this error in one pod of a StatefulSet on k8s v1.11.5:
When I look at the node, I see the devicePath looks normal:
After deleting the pod, the problem resolved itself.
We see the same issue with Kubernetes v1.10.2. When I look at the node, I see the devicePath looks normal:
Hey guys, I think @jsafrane fixed this issue via #71074, which makes sure that if
@gnufied - We're getting this same issue with Argo (https://github.com/argoproj/argo), as is @dguendisch above. We're heavily invested technically in using kops, which is just now supporting 1.11 in beta, so we're really looking forward to a fix being backported. (We upgraded to 1.10.12, which contains the #71074 patch, but it does not remove the error for us.)
We faced this issue after upgrading one of our kops clusters from 1.10.11 to 1.11.6. FYI, we tried restarting the 1.11.6 kubelet, without success.
@DaveKriegerEmerald where do you see the #71074 patch being backported to 1.10.12? I only created backports for 1.11 and 1.12.
@gnufied Thanks for replying! My mistake; I was looking at #71145 (comment) when I wrote that. However, kops v1.11.0 is now out of beta; I upgraded my Argo K8s cluster to v1.11.6 but we're still getting "devicePath is empty" errors. So it looks like the failure mechanism for this ticket may be different from what #71074 fixes.
@DaveKriegerEmerald I traced the 1.11 branch PR and I believe the fix is not in v1.11.6, so it should land in v1.11.7.
I am seeing the problem on 1.11.5 (eks.1), when the pod is being recreated on the same node as the one it was replacing. It does not happen all the time. The current fix for me is to go into AWS and detach the volume manually, then delete the pod; it will then start up properly again.
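A sketch of that manual workaround using the AWS CLI and kubectl (the volume ID, pod name, and namespace are placeholders):

```sh
# Detach the stuck EBS volume (add --force only if a normal detach hangs)
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
# Wait until AWS reports the volume as available again
aws ec2 wait volume-available --volume-ids vol-0123456789abcdef0
# Delete the stuck pod so the StatefulSet controller recreates it and
# triggers a fresh attach
kubectl delete pod my-statefulset-0 -n my-namespace
```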
Great! We have a workaround for the problem (revising our Argo workflow so steps no longer try to share a volume) and will revisit the issue after our next K8s version upgrade.
I can confirm that this particular issue is solved in 1.12.5 -- I have been running a PVC stress tester which basically waits for a StatefulSet to become stable and then rapidly evicts pods for rescheduling. I had never been able to run it successfully for more than 2 hours before hitting this issue, but with 1.12.5 I have been running it successfully for 36 hours.
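The stress tester itself isn't linked here; a hypothetical loop in the same spirit (all names are placeholders) would wait for the StatefulSet to settle and then evict a pod:

```sh
# Repeatedly wait for the StatefulSet to be fully ready, then delete one of
# its pods so it gets rescheduled (ideally back onto the same node)
while true; do
  kubectl rollout status statefulset/my-statefulset -n my-namespace
  kubectl delete pod my-statefulset-0 -n my-namespace
done
```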
We are on
Seeing this issue in AWS EKS, for which the highest available version is currently the one listed at https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html. A backport would be greatly appreciated.
We are also seeing this in EKS 1.11.5 and would also appreciate a backport. Thanks!
@drewhemm @mmalex the fix in #71074 has been backported to the 1.11.x release branch but didn't reach 1.11.5; the first release in the 1.11 series that has it is 1.11.7. See: https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html
We ran into this in EKS as well, like @mmalex. A workaround for us was to scale the deployment to 0 replicas, wait for the EBS volume to show as detached in the EC2 console, then scale the deployment back to the original replica count, and it mounted successfully.
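A rough sketch of that workaround with kubectl and the AWS CLI (the deployment name, namespace, replica count, and volume ID are placeholders):

```sh
# Scale the workload down so the kubelet unmounts and AWS detaches the volume
kubectl scale deployment/my-app --replicas=0 -n my-namespace
# Wait until the EBS volume is no longer attached to any instance
aws ec2 wait volume-available --volume-ids vol-0123456789abcdef0
# Scale back up; the fresh attach should populate devicePath again
kubectl scale deployment/my-app --replicas=3 -n my-namespace
```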
@fntlnz Looks like the Amazon EKS-optimized AMI is still running version 1.11.5. This patch also requires the kubelet to be updated, right? Thanks.
FYI, this issue is fixed by #71074.
Shall we close this issue now?
This seems to be fixed, closing it. Feel free to reopen if needed.
Small world :D For us, recreating the job with a pause for the volume to detach made it work.
Yes - I just verified yesterday on our EKS cluster that you need to upgrade the nodes as well. If all of the nodes were created in an autoscaling group, then the process is pretty simple:
Make sure that your CloudFormation template has the right AMI.
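A rough sketch of the node-rotation part, assuming the autoscaling group (or its CloudFormation template) has already been updated to the new AMI; node and instance IDs are placeholders:

```sh
# Move workloads off the old node
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-local-data
# Terminate the old instance; the autoscaling group replaces it with a node
# built from the updated AMI
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity
```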
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
I created a StatefulSet with 4 replicas and it worked correctly. After some time I needed to restart one of the pods, and when it came back it was stuck on this error:
When that happens I can get it working by deleting the pod again; the error can happen again after that, but usually not three times in a row.
The main issue is that if you don't act on it manually, it will continue to be reconciled by reconciler.go and will never come back again. The issue seems to be in actual_state_of_world.go while doing the MarkVolumeAsAttached part: at some point the devicePath string is not written into the object.
What you expected to happen:
The pod comes back with no error.
How to reproduce it (as minimally and precisely as possible):
The problem seems to be difficult to reproduce; I can trigger it after the
Upgrading to 1.11 does not solve the problem.
It does not happen with Deployments; I haven't been able to reproduce it there.
At this point one of two things can happen:
OR
I haven't found a reliable way to make one or the other happen on demand; it seems to be very random, but I'm sure that it only happens when the pod is recreated on the same node.
StatefulSet to reproduce:
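The original manifest isn't captured above; a minimal hypothetical StatefulSet of the kind described (4 replicas, one EBS-backed PVC per pod via volumeClaimTemplates, placeholder names, and a gp2 StorageClass) would look roughly like:

```sh
# Apply a small StatefulSet with per-pod EBS volumes (all names are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None
  selector:
    app: web
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp2
      resources:
        requests:
          storage: 1Gi
EOF
```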
Anything else we need to know?:
When this happens, if one looks at devicePath in the node's status it will be reported empty; one can verify that with:
I found some other users on Slack that have this problem; @wirewc sent me this (note the empty devicePath happening in his system). Also, @ntfrnzn detailed a similar issue here: equinixmetal-archive/csi-packet#8
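The exact command isn't preserved above; one way to inspect the devicePath the node reports for its attached volumes (the node name is a placeholder):

```sh
# Print each attached volume and its devicePath from the node's status
kubectl get node <node-name> \
  -o jsonpath='{range .status.volumesAttached[*]}{.name}{"\t"}{.devicePath}{"\n"}{end}'
```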
Environment:
- Kubernetes version (use kubectl version): I'm hitting this on a production cluster on 1.10.3, but I get the same error on a testing cluster that has 1.11
- Kernel (e.g. uname -a): Linux ip-180-12-0-57 4.14.59-coreos-r2 #1 SMP Sat Aug 4 02:49:25 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux