Error while tearing down pod, "device or resource busy" on service account secret #28750

Closed
smarterclayton opened this issue Jul 11, 2016 · 32 comments
Labels: priority/critical-urgent, sig/storage
Milestone: v1.3

Comments

@smarterclayton
Contributor

I'm seeing errors in the kubelet on 1.3.0+ that indicate a timeout on unmount of secret volumes from the pod, which looks like it causes some delay in shutdown (although I'm not positive, since this was a long run).

I0711 00:13:30.469559     412 reconciler.go:138] UnmountVolume operation started for volume "kubernetes.io/secret/default-token-2uhpl" (spec.Name: "default-token-2uhpl") from pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1").
I0711 00:13:30.469837     412 secret.go:249] Tearing down volume default-token-2uhpl for pod 32c22a0c-46fc-11e6-af1a-0e6a9f5112c1 at /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl
E0711 00:13:30.506681     412 goroutinemap.go:155] Operation for "kubernetes.io/secret/default-token-2uhpl" failed. No retries permitted until 2016-07-11 00:14:02.506671169 +0000 UTC (durationBeforeRetry 32s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-2uhpl" (volume.spec.Name: "default-token-2uhpl") pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1") with: rename /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/wrapped_default-token-2uhpl.deleting~719439388: device or resource busy
E0711 00:13:46.461321     412 kubelet.go:1946] Unable to mount volumes for pod "cli-with-token_test(0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1)": timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]; skipping pod
E0711 00:13:46.461353     412 pod_workers.go:183] Error syncing pod 0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1, skipping: timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]
I0711 00:13:46.461412     412 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"cli-with-token", UID:"0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1", APIVersion:"v1", ResourceVersion:"771", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]
I0711 00:13:46.461483     412 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"cli-with-token", UID:"0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1", APIVersion:"v1", ResourceVersion:"771", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cli-with-token_test(0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1)": timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]
I0711 00:14:02.570569     412 reconciler.go:138] UnmountVolume operation started for volume "kubernetes.io/secret/default-token-2uhpl" (spec.Name: "default-token-2uhpl") from pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1").
I0711 00:14:02.570848     412 secret.go:249] Tearing down volume default-token-2uhpl for pod 32c22a0c-46fc-11e6-af1a-0e6a9f5112c1 at /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl
E0711 00:14:02.590507     412 goroutinemap.go:155] Operation for "kubernetes.io/secret/default-token-2uhpl" failed. No retries permitted until 2016-07-11 00:15:06.590493939 +0000 UTC (durationBeforeRetry 1m4s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-2uhpl" (volume.spec.Name: "default-token-2uhpl") pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1") with: rename /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/wrapped_default-token-2uhpl.deleting~619277515: device or resource busy

Occurred 4 times for roughly 30 pods. This is a RHEL 7.2 system, but I don't have the kernel version handy. @kubernetes/rh-storage what frequency of device-busy errors should we expect on unmount of a tmpfs?
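
For context on that question: the rename in the TearDown path returns EBUSY ("device or resource busy") when the directory being renamed is still an active mount point, i.e. when the tmpfs unmount has not actually taken effect. Below is a minimal Go sketch of the device-ID heuristic for checking whether a path is still a mount point; it is an illustration only (the path is hypothetical), not the kubelet's actual mount-checking code.

// Minimal sketch (not kubelet code): the teardown rename in the log fails with
// EBUSY whenever the secret directory is still an active tmpfs mount point.
// A cheap way to see whether the unmount really took effect is the device-ID
// heuristic: a mount point's st_dev differs from its parent's.
package main

import (
	"fmt"
	"path/filepath"
	"syscall"
)

// isLikelyMountPoint reports whether path appears to be a mount point by
// comparing its device ID with that of its parent directory. This is a
// heuristic (bind mounts from the same filesystem are not detected).
func isLikelyMountPoint(path string) (bool, error) {
	var st, parentSt syscall.Stat_t
	if err := syscall.Lstat(path, &st); err != nil {
		return false, err
	}
	if err := syscall.Lstat(filepath.Dir(path), &parentSt); err != nil {
		return false, err
	}
	return st.Dev != parentSt.Dev, nil
}

func main() {
	// Hypothetical path for illustration only.
	dir := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/default-token-2uhpl"
	mounted, err := isLikelyMountPoint(dir)
	if err != nil {
		fmt.Println("stat error:", err)
		return
	}
	if mounted {
		// Renaming an active mount point returns EBUSY ("device or resource busy"),
		// which is exactly the error in the kubelet log above.
		fmt.Println("still a mount point; rename would fail with EBUSY")
		return
	}
	fmt.Println("not a mount point; rename/removal should succeed")
}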

@smarterclayton smarterclayton added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jul 11, 2016
@smarterclayton
Contributor Author

This delays pod shutdown by over a minute, which means that deletes take a lot longer to complete.
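
The minute-plus delay lines up with the retry backoff visible in the goroutinemap lines above: the durationBeforeRetry roughly doubles after each failed teardown (32s, then 1m4s, and 2m0s shows up later in this thread), so one stuck unmount quickly pushes retries a minute or more apart. A rough sketch of that doubling-with-a-cap behavior, using assumed values rather than the kubelet's exact constants:

// Illustration of exponential backoff with a cap, consistent with the
// durationBeforeRetry values in the logs (32s -> 1m4s -> ... -> 2m0s).
// The initial delay and cap here are assumptions for the sketch.
package main

import (
	"fmt"
	"time"
)

type backoff struct {
	delay time.Duration
	cap   time.Duration
}

// next doubles the delay after each failure, up to the cap.
func (b *backoff) next() time.Duration {
	b.delay *= 2
	if b.delay > b.cap {
		b.delay = b.cap
	}
	return b.delay
}

func main() {
	b := &backoff{delay: 16 * time.Second, cap: 2 * time.Minute}
	for i := 0; i < 5; i++ {
		fmt.Printf("retry %d no sooner than %v from now\n", i+1, b.next())
	}
	// Prints 32s, 1m4s, 2m0s, 2m0s, 2m0s.
}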

@smarterclayton
Contributor Author

Kernel 3.10.0-327.22.2.el7.x86_64

@eparis
Contributor

eparis commented Jul 11, 2016

@kubernetes/sig-storage

@smarterclayton
Contributor Author

This is also happening on pod startup. I think roughly ~10% of pods are failing to get their initial setup.

E0711 13:48:24.461255   16542 kubelet.go:1946] Unable to mount volumes for pod "database-1-hook-pre_test(db7e673b-476d-11e6-862d-0ecf6c1dd139)": timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]; skipping pod
E0711 13:48:24.461272   16542 pod_workers.go:183] Error syncing pod db7e673b-476d-11e6-862d-0ecf6c1dd139, skipping: timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461317   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"database-1-hook-pre", UID:"db7e673b-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1199", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
E0711 13:48:24.461425   16542 kubelet.go:1946] Unable to mount volumes for pod "failing-dc-mid-2-axjzw_test(db26fca5-476d-11e6-862d-0ecf6c1dd139)": timeout expired waiting for volumes to attach/mount for pod "failing-dc-mid-2-axjzw"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]; skipping pod
E0711 13:48:24.461441   16542 pod_workers.go:183] Error syncing pod db26fca5-476d-11e6-862d-0ecf6c1dd139, skipping: timeout expired waiting for volumes to attach/mount for pod "failing-dc-mid-2-axjzw"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461468   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"failing-dc-mid-2-axjzw", UID:"db26fca5-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1192", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "failing-dc-mid-2-axjzw"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461514   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"database-1-hook-pre", UID:"db7e673b-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1199", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "database-1-hook-pre_test(db7e673b-476d-11e6-862d-0ecf6c1dd139)": timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461557   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"failing-dc-mid-2-axjzw", UID:"db26fca5-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1192", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to

It appears that this did not resolve within 1m, which caused another timeout failure. Not sure if it eventually clears up.

@smarterclayton smarterclayton added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 11, 2016
@jingxu97
Contributor

Hi Clayton,

Thanks for posting the log messages. Do you also have the full log file so that we can take a look? Thanks!

Jing


@matchstick matchstick added this to the v1.3 milestone Jul 11, 2016
@matchstick
Contributor

@kubernetes/sig-storage Targeting this to 1.3 until we know more of the root cause.

@smarterclayton
Contributor Author

We're trying to get a loglevel=5 reproducer. We had log level 4 at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/3091/s3/download/test-end-to-end-docker/logs/container-origin.log which is effectively a 1.3 kubelet (there are some code differences above and around, but everything else in the product works so we still think this is a flake).

One consideration - this is from a single node run, so we have lots of pods from the same namespace with the same secrets landing on the same node. Not sure if that's correlated.


@smarterclayton
Contributor Author

And we have lots more secrets per namespace.


@yujuhong
Contributor

This might be a dup of #28616
@saad-ali's explanation from #28616 (comment): "Volume manager parallelizes work as long as the underlying volume is not the same. For secret volumes that means as long as the SecretName is different. In this case, if all the pods being batch created are identical (and therefore referencing the same volume), then they would be handled serially."
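
To illustrate the serialization described in that comment: operations are keyed on the underlying volume, so pods referencing different secrets proceed in parallel while pods referencing the same secret queue behind one another. A simplified sketch of that keyed serialization (an illustration, not the kubelet's GoroutineMap):

// Simplified sketch of per-volume serialization as described above: at most
// one operation runs per key (e.g. the secret volume name); different keys
// proceed in parallel. Illustration only, not kubelet code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedRunner struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedRunner() *keyedRunner {
	return &keyedRunner{locks: map[string]*sync.Mutex{}}
}

// Run executes op, serializing with any other op that shares the same key.
func (k *keyedRunner) Run(key string, op func()) {
	k.mu.Lock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	k.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	op()
}

func main() {
	r := newKeyedRunner()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// All three pods use the same secret, so these mounts run one at a time.
			r.Run("kubernetes.io/secret/default-token-2uhpl", func() {
				fmt.Printf("mounting secret for pod %d\n", i)
				time.Sleep(100 * time.Millisecond)
			})
		}(i)
	}
	wg.Wait()
}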

@saad-ali
Member

The f5112c1", APIVersion:"v1", ResourceVersion:"771", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cli-with-token_test(0 ba7ffca-46fc-11e6-af1a-0e6a9f5112c1)": timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volum es=[default-token-2uhpl] (basically a slow down in pod creation) can be explained by issue #28616 since creation of pods using the same secret is serialized.

But it does not explain this failure:
4:02.506671169 +0000 UTC (durationBeforeRetry 32s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-2uhpl" (volume.spec.Na me: "default-token-2uhpl") pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1") with: rename /tmp/openshift/test-end-to-en d-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl /tmp/openshift/test-end-to-end-docker/volumes/pods /32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/wrapped_default-token-2uhpl.deleting~719439388: device or resource busy
That error indicates that an unmount started but failed. That error sounds a lot like #27054 which @jingxu97 fixed for nsenter_mount.

@smarterclayton
Contributor Author

It does, except we have that patch in place. I'm currently trying to get a reproduction with v(5) so we can chase it down and see whether there's another underlying problem. I'm also not ruling out an issue in RHEL at this point.

Is the parallelization issue prioritized already or something we can assist with (whether review or otherwise)?


@saad-ali
Member

Is the parallelization issue prioritized already or something we can assist with (whether review or otherwise)?

Thanks Clayton. I'm working on it, and have a good handle on it. Tracking it with #28616. Will include patch in 1.3.1. Will make sure to loop you in on the fix.

@ShashankYadav

Facing the same issue in 1.3. I also notice that the files in /var/lib/kubelet/ are not removed even after deleting the pod. And am still seeing the message a day after deleting the pod:
Jul 13 06:10:43 coreos-02 kubelet-wrapper[23023]: E0713 06:10:43.761116 23023 goroutinemap.go:155] Operation for "kubernetes.io/secret/default-token-7vgrg" failed. No retries permitted until 2016-07-13 06:12:43.76110436 +0000 UTC (durationBeforeRetry 2m0s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-7vgrg" (volume.spec.Name: "default-token-7vgrg") pod "b3fc1a78-47fc-11e6-838e-0800278fdce2" (UID: "b3fc1a78-47fc-11e6-838e-0800278fdce2") with: rename /var/lib/kubelet/pods/b3fc1a78-47fc-11e6-838e-0800278fdce2/volumes/kubernetes.io~secret/default-token-7vgrg /var/lib/kubelet/pods/b3fc1a78-47fc-11e6-838e-0800278fdce2/volumes/kubernetes.io~secret/wrapped_default-token-7vgrg.deleting~765506473: device or resource busy

And pod startup is failing in the default namespace with the following message:
19m 19m 1 {kubelet 172.11.8.102} Warning FailedMount Unable to mount volumes for pod "dashboard-e5k5u_default(9d0138d2-48c3-11e6-b471-0800278fdce2)": timeout expired waiting for volumes to attach/mount for pod "dashboard-e5k5u"/"default". list of unattached/unmounted volumes=[default-token-7vgrg]
19m 19m 1 {kubelet 172.11.8.102} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "dashboard-e5k5u"/"default". list of unattached/unmounted volumes=[default-token-7vgrg]

So I removed all pods running on the cluster and found something interesting. There are no pods running on the cluster, yet the following logs show up in the kubelet:
Jul 13 07:28:20 coreos-01 kubelet-wrapper[11981]: I0713 07:28:20.623730 11981 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-vxqiv" (spec.Name: "default-token-vxqiv") to pod "412b65b4-4813-11e6-b471-0800278fdce2" (UID: "412b65b4-4813-11e6-b471-0800278fdce2"). Volume is already mounted to pod, but remount was requested.
Looking at the volumes mounted in /var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes, I see kubernetes.io~secret/default-token-vxqiv has been mounted twice. Is this expected? In some other pods, the same volume for default-token has been mounted 4 times.

coreos-01 pods # mount | grep 412b65b4-4813-11e6-b471-0800278fdce2 | grep -v exited
tmpfs on /var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes/kubernetes.io~secret/default-token-vxqiv type tmpfs (rw,relatime,rootcontext=system_u:object_r:var_lib_t:s0,seclabel)
tmpfs on /var/lib/rkt/pods/run/071768b4-2916-4d97-bf7f-cfbad87f9106/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes/kubernetes.io~secret/default-token-vxqiv type tmpfs (rw,relatime,rootcontext=system_u:object_r:var_lib_t:s0,seclabel)
tmpfs on /var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes/kubernetes.io~secret/default-token-vxqiv type tmpfs (rw,relatime,rootcontext=system_u:object_r:var_lib_t:s0,seclabel)
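
Those duplicate tmpfs entries can also be spotted by scanning the mount table; here is a small diagnostic sketch (a hypothetical helper, not part of the kubelet) that counts how many times each kubernetes.io~secret directory appears in /proc/mounts:

// Diagnostic sketch: count duplicate tmpfs mounts of kubelet secret
// directories by scanning /proc/mounts. Purely illustrative.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line is: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) < 3 {
			continue
		}
		mountPoint, fsType := fields[1], fields[2]
		if fsType == "tmpfs" && strings.Contains(mountPoint, "kubernetes.io~secret") {
			counts[mountPoint]++
		}
	}
	for mp, n := range counts {
		if n > 1 {
			fmt.Printf("%s is mounted %d times\n", mp, n)
		}
	}
}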

@childsb childsb added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 13, 2016
@chulkilee

I'm seeing this problem on CentOS 7 + flannel + kubernetes in docker (containerized). BTW, even after I stop all docker containers with docker stop $(docker ps -q -a), some mounted volumes remain (visible via mount).

Also, for me this error always happens on the kubernetes-dashboard pod. I don't know how hyperkube runs all pods (e.g. in serial or parallel).

@jingxu97
Contributor

Hi ShashankYadav,

Thank you for posting the log message. In a normal situation, those secret directories should only be mounted once. In order to debug, could you also share your full log file if it is still available? Thanks!

Jing


@saad-ali
Member

To be clear, based on what Clayton reported above, there are two issues.

  1. device or resource busy on secret unmount issue
  2. Unable to mount volumes for pod on secret mount issue

We can leave this issue open to track the device or resource busy on secret unmount issue.

@saad-ali
Member

Debugged this some more (thanks to @justinsb for repro logs). The device or resource busy error is the root cause of Unable to mount volumes for pod. Once the secret repeatedly fails to unmount (for whatever reason), all mounts for that secret are no longer allowed. This will change with the fix for #28616, which should limit the impact of the device or resource busy error to the original pod that fails to tear down.

@ShashankYadav

@jingxu97 I am attaching the full logs. I did not encounter this issue when using a different namespace (other than default) for the pods.
kubelet.txt

mboersma added a commit to mboersma/workflow that referenced this issue Jul 18, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
@saad-ali
Member

To be clear, based on what Clayton reported above, there are two issues.

  1. device or resource busy on secret unmount issue
  2. Unable to mount volumes for pod on secret mount issue

We can leave this issue open to track the device or resource busy on secret unmount issue.

Now that the fix for 2 has been merged, the bite of this bug should be greatly reduced (it won't prevent other pods from mounting the same secret). However, we still need to understand 1: what causes secret unmounts to sometimes fail with device or resource busy. @jingxu97 can you please take a look at this?

@smarterclayton
Contributor Author

Since we can reproduce this reliably, we're happy to dump any info on unmount in our runs and gather it, if we know what to add.

@saad-ali
Member

I'm thinking this may be the same as or similar to #22911 (comment)

That would mean the underlying medium is being detected incorrectly, which causes the wrong tear down routine to be used.

So to debug this I would start with extra logging around the GetMountMedium() logic in empty_dir/empty_dir_linux.go to see what's happening.
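
For anyone adding that logging: medium detection for these directories essentially asks the filesystem what it is. Here is a rough sketch of tmpfs detection via statfs and the tmpfs magic number (0x01021994), in the spirit of the empty_dir medium check but not a copy of the actual GetMountMedium implementation:

// Rough sketch of statfs-based medium detection: a directory backed by tmpfs
// reports the tmpfs magic number (0x01021994). Illustration only.
package main

import (
	"fmt"
	"os"
	"syscall"
)

const linuxTmpfsMagic = 0x01021994 // TMPFS_MAGIC from the kernel headers

func isTmpfs(path string) (bool, error) {
	var buf syscall.Statfs_t
	if err := syscall.Statfs(path, &buf); err != nil {
		return false, err
	}
	return buf.Type == linuxTmpfsMagic, nil
}

func main() {
	path := "/dev/shm" // typically tmpfs; replace with a volume directory to test
	ok, err := isTmpfs(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s backed by tmpfs: %v\n", path, ok)
}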

@jingxu97
Contributor

@smarterclayton, I am working on this bug. Could you please list the steps to reproduce this error so I can debug it further? Thanks a lot!

@kyleferguson

We've run into this same issue. Does anyone have a temporary workaround they've found until the fix is available? We're having a hard time deploying due to certain services not being able to create new pods. Any help is greatly appreciated :)

@jingxu97
Contributor

Hi Kyle,

I am trying to investigate this problem. Could you please send us the kubelet log from your system from when the issue occurred so that we can take a look at it? Thanks!

Best,
Jing


@jingxu97
Contributor

We are still in the process of investigating the problem, but I want to add one update: the behavior of preventing subsequent mount operations will be changed by the fix in #28939 (https://github.com/kubernetes/kubernetes/pull/28939), which should be part of v1.3.4.

Even if the error still occurs, after this fix you should be able to add new pods and mount. Thanks!

Jing


@smarterclayton
Contributor Author

@pmorie was looking into this on our end (so folks know) - he says he found a mount race.


@jingxu97
Contributor

Great! @pmorie Could you share the findings you have? Thank you!

Jing


@pmorie
Member

pmorie commented Jul 26, 2016

So far I've spent most of my time looking at the mount timeouts. I have findings to report on that (I'm trying to figure out the right place), but not yet on the umount issues.

@kyleferguson

@jingxu97 From a cursory glance at the logs I couldn't find anything, but it was a couple of days ago. I'll try to recreate it if I get some time later and capture it.

We're on GKE and don't have the ability to roll back the master, so we're kind of stranded. Some of our services are stuck and unable to deploy, and we have scheduled tasks (daily reports, etc.) that are very intermittent. It seems to depend on what node they get scheduled on.

Might try to find a healthy node and force jobs to run there in the meantime to see if that will work as a stopgap until a fix is available.

@jingxu97
Contributor

jingxu97 commented Aug 1, 2016

@kyleferguson, we fixed a few issues recently in version 1.3.4. Could you please try this new version and let us know if you have any issues? Thanks!

@kyleferguson

@jingxu97 Thanks for the update! Unfortunately we're on GKE at the moment, so I will have to wait until 1.3.4 is available there. It looks like the latest version currently available is 1.3.3.

@saad-ali
Member

saad-ali commented Aug 1, 2016

To be clear, based on what Clayton reported above, there are two issues.

  1. device or resource busy on secret unmount issue
  2. Unable to mount volumes for pod on secret mount issue

We can leave this issue open to track the device or resource busy on secret unmount issue.

Now that the fix for 2 has been merged, the bite of this bug should be greatly reduced (it won't prevent other pods from mounting the same secret). However, we still need to understand 1: what causes secret unmounts to sometimes fail with device or resource busy. @jingxu97 can you please take a look at this?

We believe that 1 was caused by issue #29555 which was fixed by PR #29673. Patches for both have been cherry-picked to the v1.3 release branch and are part of the v1.3.4 release. Closing this issue as resolved.
