Error while tearing down pod, "device or resource busy" on service account secret #28750

Closed
smarterclayton opened this issue Jul 11, 2016 · 32 comments
Labels: priority/critical-urgent, sig/storage
Milestone: v1.3

Comments

@smarterclayton
Contributor

I'm seeing errors in the kubelet on 1.3.0+ that indicate a timeout on unmount of secret volumes from the pod, which looks like it causes some delay in shutdown (although I'm not positive, since this was a long run).

I0711 00:13:30.469559     412 reconciler.go:138] UnmountVolume operation started for volume "kubernetes.io/secret/default-token-2uhpl" (spec.Name: "default-token-2uhpl") from pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1").
I0711 00:13:30.469837     412 secret.go:249] Tearing down volume default-token-2uhpl for pod 32c22a0c-46fc-11e6-af1a-0e6a9f5112c1 at /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl
E0711 00:13:30.506681     412 goroutinemap.go:155] Operation for "kubernetes.io/secret/default-token-2uhpl" failed. No retries permitted until 2016-07-11 00:14:02.506671169 +0000 UTC (durationBeforeRetry 32s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-2uhpl" (volume.spec.Name: "default-token-2uhpl") pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1") with: rename /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/wrapped_default-token-2uhpl.deleting~719439388: device or resource busy
E0711 00:13:46.461321     412 kubelet.go:1946] Unable to mount volumes for pod "cli-with-token_test(0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1)": timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]; skipping pod
E0711 00:13:46.461353     412 pod_workers.go:183] Error syncing pod 0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1, skipping: timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]
I0711 00:13:46.461412     412 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"cli-with-token", UID:"0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1", APIVersion:"v1", ResourceVersion:"771", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]
I0711 00:13:46.461483     412 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"cli-with-token", UID:"0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1", APIVersion:"v1", ResourceVersion:"771", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cli-with-token_test(0ba7ffca-46fc-11e6-af1a-0e6a9f5112c1)": timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volumes=[default-token-2uhpl]
I0711 00:14:02.570569     412 reconciler.go:138] UnmountVolume operation started for volume "kubernetes.io/secret/default-token-2uhpl" (spec.Name: "default-token-2uhpl") from pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1").
I0711 00:14:02.570848     412 secret.go:249] Tearing down volume default-token-2uhpl for pod 32c22a0c-46fc-11e6-af1a-0e6a9f5112c1 at /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl
E0711 00:14:02.590507     412 goroutinemap.go:155] Operation for "kubernetes.io/secret/default-token-2uhpl" failed. No retries permitted until 2016-07-11 00:15:06.590493939 +0000 UTC (durationBeforeRetry 1m4s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-2uhpl" (volume.spec.Name: "default-token-2uhpl") pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1") with: rename /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl /tmp/openshift/test-end-to-end-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/wrapped_default-token-2uhpl.deleting~619277515: device or resource busy

Occurred 4 times for roughly 30 pods. This is a RHEL 7.2 system, but I don't have the kernel version handy. @kubernetes/rh-storage what frequency of device-busy errors should we expect on unmount of a tmpfs?
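
For context on that question: the rename in the TearDown path returns EBUSY ("device or resource busy") when the directory being renamed is still an active mount point, i.e. when the tmpfs unmount has not actually taken effect. Below is a minimal Go sketch of the device-ID heuristic for checking whether a path is still a mount point; it is an illustration only (the path is hypothetical), not the kubelet's actual mount-checking code.

// Minimal sketch (not kubelet code): the teardown rename in the log fails with
// EBUSY whenever the secret directory is still an active tmpfs mount point.
// A cheap way to see whether the unmount really took effect is the device-ID
// heuristic: a mount point's st_dev differs from its parent's.
package main

import (
	"fmt"
	"path/filepath"
	"syscall"
)

// isLikelyMountPoint reports whether path appears to be a mount point by
// comparing its device ID with that of its parent directory. This is a
// heuristic (bind mounts from the same filesystem are not detected).
func isLikelyMountPoint(path string) (bool, error) {
	var st, parentSt syscall.Stat_t
	if err := syscall.Lstat(path, &st); err != nil {
		return false, err
	}
	if err := syscall.Lstat(filepath.Dir(path), &parentSt); err != nil {
		return false, err
	}
	return st.Dev != parentSt.Dev, nil
}

func main() {
	// Hypothetical path for illustration only.
	dir := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/default-token-2uhpl"
	mounted, err := isLikelyMountPoint(dir)
	if err != nil {
		fmt.Println("stat error:", err)
		return
	}
	if mounted {
		// Renaming an active mount point returns EBUSY ("device or resource busy"),
		// which is exactly the error in the kubelet log above.
		fmt.Println("still a mount point; rename would fail with EBUSY")
		return
	}
	fmt.Println("not a mount point; rename/removal should succeed")
}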

@smarterclayton smarterclayton added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jul 11, 2016
@smarterclayton
Contributor Author

This delays pod shutdown by over a minute, which means that deletes take a lot longer to complete.
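
The minute-plus delay lines up with the retry backoff visible in the goroutinemap lines above: the durationBeforeRetry roughly doubles after each failed teardown (32s, then 1m4s, and 2m0s shows up later in this thread), so one stuck unmount quickly pushes retries a minute or more apart. A rough sketch of that doubling-with-a-cap behavior, using assumed values rather than the kubelet's exact constants:

// Illustration of exponential backoff with a cap, consistent with the
// durationBeforeRetry values in the logs (32s -> 1m4s -> ... -> 2m0s).
// The initial delay and cap here are assumptions for the sketch.
package main

import (
	"fmt"
	"time"
)

type backoff struct {
	delay time.Duration
	cap   time.Duration
}

// next doubles the delay after each failure, up to the cap.
func (b *backoff) next() time.Duration {
	b.delay *= 2
	if b.delay > b.cap {
		b.delay = b.cap
	}
	return b.delay
}

func main() {
	b := &backoff{delay: 16 * time.Second, cap: 2 * time.Minute}
	for i := 0; i < 5; i++ {
		fmt.Printf("retry %d no sooner than %v from now\n", i+1, b.next())
	}
	// Prints 32s, 1m4s, 2m0s, 2m0s, 2m0s.
}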

@smarterclayton
Contributor Author

Kernel 3.10.0-327.22.2.el7.x86_64

@eparis
Contributor

eparis commented Jul 11, 2016

@kubernetes/sig-storage

@smarterclayton
Contributor Author

This is also happening on pod startup. I think roughly ~10% of pods are failing to get their initial setup.

E0711 13:48:24.461255   16542 kubelet.go:1946] Unable to mount volumes for pod "database-1-hook-pre_test(db7e673b-476d-11e6-862d-0ecf6c1dd139)": timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]; skipping pod
E0711 13:48:24.461272   16542 pod_workers.go:183] Error syncing pod db7e673b-476d-11e6-862d-0ecf6c1dd139, skipping: timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461317   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"database-1-hook-pre", UID:"db7e673b-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1199", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
E0711 13:48:24.461425   16542 kubelet.go:1946] Unable to mount volumes for pod "failing-dc-mid-2-axjzw_test(db26fca5-476d-11e6-862d-0ecf6c1dd139)": timeout expired waiting for volumes to attach/mount for pod "failing-dc-mid-2-axjzw"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]; skipping pod
E0711 13:48:24.461441   16542 pod_workers.go:183] Error syncing pod db26fca5-476d-11e6-862d-0ecf6c1dd139, skipping: timeout expired waiting for volumes to attach/mount for pod "failing-dc-mid-2-axjzw"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461468   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"failing-dc-mid-2-axjzw", UID:"db26fca5-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1192", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "failing-dc-mid-2-axjzw"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461514   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"database-1-hook-pre", UID:"db7e673b-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1199", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "database-1-hook-pre_test(db7e673b-476d-11e6-862d-0ecf6c1dd139)": timeout expired waiting for volumes to attach/mount for pod "database-1-hook-pre"/"test". list of unattached/unmounted volumes=[default-token-1x1ef]
I0711 13:48:24.461557   16542 server.go:655] Event(api.ObjectReference{Kind:"Pod", Namespace:"test", Name:"failing-dc-mid-2-axjzw", UID:"db26fca5-476d-11e6-862d-0ecf6c1dd139", APIVersion:"v1", ResourceVersion:"1192", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to

It appears that this did not resolve within 1m, which caused another timeout failure. Not sure if it eventually clears up.

@smarterclayton smarterclayton added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 11, 2016
@jingxu97
Contributor

Hi Clayton,

Thanks for posting the log messages. Do you also have the full log file so that we can take a look? Thanks!

Jing


@matchstick matchstick added this to the v1.3 milestone Jul 11, 2016
@matchstick
Contributor

@kubernetes/sig-storage Targeting this to 1.3 until we know more of the root cause.

@smarterclayton
Contributor Author

We're trying to get a loglevel=5 reproducer. We had log level 4 at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_integration/3091/s3/download/test-end-to-end-docker/logs/container-origin.log which is effectively a 1.3 kubelet (there are some code differences above and around, but everything else in the product works so we still think this is a flake).

One consideration - this is from a single node run, so we have lots of pods from the same namespace with the same secrets landing on the same node. Not sure if that's correlated.


@smarterclayton
Contributor Author

And we have lots more secrets per namespace.


@yujuhong
Contributor

This might be a dup of #28616
@saad-ali's explanation from #28616 (comment): "Volume manager parallelizes work as long as the underlying volume is not the same. For secret volumes that means as long as the SecretName is different. In this case, if all the pods being batch created are identical (and therefore referencing the same volume), then they would be handled serially."
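
To illustrate the serialization described in that comment: operations are keyed on the underlying volume, so pods referencing different secrets proceed in parallel while pods referencing the same secret queue behind one another. A simplified sketch of that keyed serialization (an illustration, not the kubelet's GoroutineMap):

// Simplified sketch of per-volume serialization as described above: at most
// one operation runs per key (e.g. the secret volume name); different keys
// proceed in parallel. Illustration only, not kubelet code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedRunner struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedRunner() *keyedRunner {
	return &keyedRunner{locks: map[string]*sync.Mutex{}}
}

// Run executes op, serializing with any other op that shares the same key.
func (k *keyedRunner) Run(key string, op func()) {
	k.mu.Lock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	k.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	op()
}

func main() {
	r := newKeyedRunner()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// All three pods use the same secret, so these mounts run one at a time.
			r.Run("kubernetes.io/secret/default-token-2uhpl", func() {
				fmt.Printf("mounting secret for pod %d\n", i)
				time.Sleep(100 * time.Millisecond)
			})
		}(i)
	}
	wg.Wait()
}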

@saad-ali
Member

The f5112c1", APIVersion:"v1", ResourceVersion:"771", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cli-with-token_test(0 ba7ffca-46fc-11e6-af1a-0e6a9f5112c1)": timeout expired waiting for volumes to attach/mount for pod "cli-with-token"/"test". list of unattached/unmounted volum es=[default-token-2uhpl] (basically a slow down in pod creation) can be explained by issue #28616 since creation of pods using the same secret is serialized.

But it does not explain this failure:
4:02.506671169 +0000 UTC (durationBeforeRetry 32s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-2uhpl" (volume.spec.Na me: "default-token-2uhpl") pod "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1" (UID: "32c22a0c-46fc-11e6-af1a-0e6a9f5112c1") with: rename /tmp/openshift/test-end-to-en d-docker/volumes/pods/32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/default-token-2uhpl /tmp/openshift/test-end-to-end-docker/volumes/pods /32c22a0c-46fc-11e6-af1a-0e6a9f5112c1/volumes/kubernetes.io~secret/wrapped_default-token-2uhpl.deleting~719439388: device or resource busy
That error indicates that an unmount started but failed. That error sounds a lot like #27054 which @jingxu97 fixed for nsenter_mount.

@smarterclayton
Contributor Author

It does, except we have that patch in place. I'm currently trying to get a reproduction with v(5) so we can chase it down and see whether there's another underlying problem. I'm also not ruling out an issue in RHEL at this point.

Is the parallelization issue prioritized already or something we can assist with (whether review or otherwise)?


@saad-ali
Member

Is the parallelization issue prioritized already or something we can assist with (whether review or otherwise)?

Thanks Clayton. I'm working on it, and have a good handle on it. Tracking it with #28616. Will include patch in 1.3.1. Will make sure to loop you in on the fix.

@ShashankYadav

Facing the same issue in 1.3. I also notice that the files in /var/lib/kubelet/ are not removed even after deleting the pod. And am still seeing the message a day after deleting the pod:
Jul 13 06:10:43 coreos-02 kubelet-wrapper[23023]: E0713 06:10:43.761116 23023 goroutinemap.go:155] Operation for "kubernetes.io/secret/default-token-7vgrg" failed. No retries permitted until 2016-07-13 06:12:43.76110436 +0000 UTC (durationBeforeRetry 2m0s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/default-token-7vgrg" (volume.spec.Name: "default-token-7vgrg") pod "b3fc1a78-47fc-11e6-838e-0800278fdce2" (UID: "b3fc1a78-47fc-11e6-838e-0800278fdce2") with: rename /var/lib/kubelet/pods/b3fc1a78-47fc-11e6-838e-0800278fdce2/volumes/kubernetes.io~secret/default-token-7vgrg /var/lib/kubelet/pods/b3fc1a78-47fc-11e6-838e-0800278fdce2/volumes/kubernetes.io~secret/wrapped_default-token-7vgrg.deleting~765506473: device or resource busy

And pod startup is failing in the default namespace with the following message:
19m 19m 1 {kubelet 172.11.8.102} Warning FailedMount Unable to mount volumes for pod "dashboard-e5k5u_default(9d0138d2-48c3-11e6-b471-0800278fdce2)": timeout expired waiting for volumes to attach/mount for pod "dashboard-e5k5u"/"default". list of unattached/unmounted volumes=[default-token-7vgrg]
19m 19m 1 {kubelet 172.11.8.102} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "dashboard-e5k5u"/"default". list of unattached/unmounted volumes=[default-token-7vgrg]

So I removed all pods running on the cluster and found something interesting. There are no pods running on the cluster, yet the following logs show up in the kubelet:
Jul 13 07:28:20 coreos-01 kubelet-wrapper[11981]: I0713 07:28:20.623730 11981 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-vxqiv" (spec.Name: "default-token-vxqiv") to pod "412b65b4-4813-11e6-b471-0800278fdce2" (UID: "412b65b4-4813-11e6-b471-0800278fdce2"). Volume is already mounted to pod, but remount was requested.
Looking at the volumes mounted in /var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes, I see kubernetes.io~secret/default-token-vxqiv has been mounted twice. Is this expected? In some other pods, the same volume for default-token has been mounted 4 times.

coreos-01 pods # mount | grep 412b65b4-4813-11e6-b471-0800278fdce2 | grep -v exited
tmpfs on /var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes/kubernetes.io~secret/default-token-vxqiv type tmpfs (rw,relatime,rootcontext=system_u:object_r:var_lib_t:s0,seclabel)
tmpfs on /var/lib/rkt/pods/run/071768b4-2916-4d97-bf7f-cfbad87f9106/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes/kubernetes.io~secret/default-token-vxqiv type tmpfs (rw,relatime,rootcontext=system_u:object_r:var_lib_t:s0,seclabel)
tmpfs on /var/lib/kubelet/pods/412b65b4-4813-11e6-b471-0800278fdce2/volumes/kubernetes.io~secret/default-token-vxqiv type tmpfs (rw,relatime,rootcontext=system_u:object_r:var_lib_t:s0,seclabel)
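
Those duplicate tmpfs entries can also be spotted by scanning the mount table; here is a small diagnostic sketch (a hypothetical helper, not part of the kubelet) that counts how many times each kubernetes.io~secret directory appears in /proc/mounts:

// Diagnostic sketch: count duplicate tmpfs mounts of kubelet secret
// directories by scanning /proc/mounts. Purely illustrative.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line is: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) < 3 {
			continue
		}
		mountPoint, fsType := fields[1], fields[2]
		if fsType == "tmpfs" && strings.Contains(mountPoint, "kubernetes.io~secret") {
			counts[mountPoint]++
		}
	}
	for mp, n := range counts {
		if n > 1 {
			fmt.Printf("%s is mounted %d times\n", mp, n)
		}
	}
}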

@childsb childsb added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 13, 2016
@chulkilee

I'm seeing this problem on CentOS 7 + flannel + kubernetes in docker (containerized). BTW, even after I stop all docker containers with docker stop $(docker ps -q -a), some mounted volumes remain (visible via mount).

Also, for me this error always happens on the kubernetes-dashboard pod. I don't know how hyperkube runs all pods (e.g. in serial or parallel).

@jingxu97
Contributor

Hi ShashankYadav,

Thank you for posting the log message. In a normal situation, those secret directories should only be mounted once. In order to debug, could you also share your full log file if it is still available? Thanks!

Jing


@saad-ali
Member

To be clear, based on what Clayton reported above, there are two issues.

  1. device or resource busy on secret unmount issue
  2. Unable to mount volumes for pod on secret mount issue

We can leave this issue open to track the device or resource busy on secret unmount issue.

@saad-ali
Member

Debugged this some more (thanks to @justinsb for repro logs). The device or resource busy error is the root cause of Unable to mount volumes for pod. Once the secret repeatedly fails to unmount (for whatever reason), all mounts for that secret are no longer allowed. This will change with the fix for #28616, which should limit the impact of the device or resource busy error to the original pod that fails to tear down.

@ShashankYadav

@jingxu97 I am attaching the full logs. I did not encounter this issue when using a different namespace (other than default) for the pods.
kubelet.txt

mboersma added a commit to mboersma/workflow that referenced this issue Jul 18, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
@saad-ali
Member

To be clear, based on what Clayton reported above, there are two issues.

  1. device or resource busy on secret unmount issue
  2. Unable to mount volumes for pod on secret mount issue

We can leave this issue open to track the device or resource busy on secret unmount issue.

Now that the fix for 2 has been merged, the bite of this bug should be greatly reduced (it won't prevent other pods from mounting the same secret). However, we still need to understand 1: what causes secret unmounts to sometimes fail with device or resource busy. @jingxu97 can you please take a look at this?

@smarterclayton
Contributor Author

Since we can reproduce this reliably, we're happy to dump any info on unmount in our runs and gather it, if we know what to add.

@saad-ali
Member

I'm thinking this may be the same as or similar to #22911 (comment)

That would mean the underlying medium is being detected incorrectly, which causes the wrong tear down routine to be used.

So to debug this I would start with extra logging around the GetMountMedium() logic in empty_dir/empty_dir_linux.go to see what's happening.
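
For anyone adding that logging: medium detection for these directories essentially asks the filesystem what it is. Here is a rough sketch of tmpfs detection via statfs and the tmpfs magic number (0x01021994), in the spirit of the empty_dir medium check but not a copy of the actual GetMountMedium implementation:

// Rough sketch of statfs-based medium detection: a directory backed by tmpfs
// reports the tmpfs magic number (0x01021994). Illustration only.
package main

import (
	"fmt"
	"os"
	"syscall"
)

const linuxTmpfsMagic = 0x01021994 // TMPFS_MAGIC from the kernel headers

func isTmpfs(path string) (bool, error) {
	var buf syscall.Statfs_t
	if err := syscall.Statfs(path, &buf); err != nil {
		return false, err
	}
	return buf.Type == linuxTmpfsMagic, nil
}

func main() {
	path := "/dev/shm" // typically tmpfs; replace with a volume directory to test
	ok, err := isTmpfs(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s backed by tmpfs: %v\n", path, ok)
}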

@jingxu97
Contributor

@smarterclayton, I am working on this bug. Could you please list the steps to reproduce this error so I can debug it further? Thanks a lot!

@kyleferguson

We've run into this same issue. Does anyone have a temporary workaround they've found until the fix is available? We're having a hard time deploying due to certain services not being able to create new pods. Any help is greatly appreciated :)

@jingxu97
Contributor

Hi Kyle,

I am trying to investigate this problem. Could you please send us the kubelet log from your system from when the issue occurred so that we can take a look at it? Thanks!

Best,
Jing


@jingxu97
Contributor

We are still in the process of investigating the problem, but I want to add one update: the behavior of preventing subsequent mount operations will be changed by the fix in #28939 (https://github.com/kubernetes/kubernetes/pull/28939), which should be part of v1.3.4.

Even if the error still occurs, after this fix you should be able to add new pods and mount. Thanks!

Jing


@smarterclayton
Contributor Author

@pmorie was looking into this on our end (so folks know) - he says he found a mount race.


@jingxu97
Contributor

Great! @pmorie Could you share the findings you have? Thank you!

Jing


@pmorie
Member

pmorie commented Jul 26, 2016

So far I've spent most of my time looking at the mount timeouts. I have findings to report on that (I'm trying to figure out the right place), but not yet on the umount issues.

@kyleferguson

@jingxu97 From a cursory glance at the logs I couldn't find anything, but it was a couple of days ago. I'll try to recreate it if I get some time later and capture it.

We're on GKE and don't have the ability to roll back the master, so we're kind of stranded. Some of our services are stuck and unable to deploy, and we have scheduled tasks (daily reports, etc.) that are very intermittent. It seems to depend on what node they get scheduled on.

Might try to find a healthy node and force jobs to run there in the meantime to see if that will work as a stopgap until a fix is available.

@jingxu97
Contributor

jingxu97 commented Aug 1, 2016

@kyleferguson, we fixed a few issues recently in version 1.3.4. Could you please try this new version and let us know if you have any issues? Thanks!

@kyleferguson

@jingxu97 Thanks for the update! Unfortunately we're on GKE at the moment, so I will have to wait until 1.3.4 is available there. It looks like the latest version currently available is 1.3.3.

@saad-ali
Member

saad-ali commented Aug 1, 2016

To be clear, based on what Clayton reported above, there are two issues.

  1. device or resource busy on secret unmount issue
  2. Unable to mount volumes for pod on secret mount issue

We can leave this issue open to track the device or resource busy on secret unmount issue.

Now that the fix for 2 has been merged, the bite of this bug should be greatly reduced (it won't prevent other pods from mounting the same secret). However, we still need to understand 1: what causes secret unmounts to sometimes fail with device or resource busy. @jingxu97 can you please take a look at this?

We believe that 1 was caused by issue #29555 which was fixed by PR #29673. Patches for both have been cherry-picked to the v1.3 release branch and are part of the v1.3.4 release. Closing this issue as resolved.
