
Running pods fails with AUFS mount errors #10959

Closed
krousey opened this issue Jul 8, 2015 · 30 comments
Labels: priority/important-soon, sig/node

@krousey
Contributor

krousey commented Jul 8, 2015

#10622

Occasionally our e2e tests fail with something like the following:

INFO: event for dns-test-a6805741-25ac-11e5-8fa6-42010af01555: {kubelet e2e-test-parallel-minion-4c2m} pulled: Successfully pulled image "gcr.io/google_containers/jessie-dnsutils"
INFO: event for dns-test-a6805741-25ac-11e5-8fa6-42010af01555: {kubelet e2e-test-parallel-minion-4c2m} failed: Failed to create docker container with error: API error (500): error creating aufs mount to /var/lib/docker/aufs/mnt/b2a84be51ae21ce45bc2cf1ecb944be2b4345c25798af0249260874185622e64-init: invalid argument

INFO: event for dns-test-a6805741-25ac-11e5-8fa6-42010af01555: {kubelet e2e-test-parallel-minion-4c2m} failed: Failed to create docker container with error: API error (500): error creating aufs mount to /var/lib/docker/aufs/mnt/6f337451bbaa63fade20e59c88f87a6851412e3b5fb9d9b5d1a6457d1bb4e9f5-init: invalid argument

... many, many more, then ...

• Failure [311.910 seconds]
DNS
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/dns.go:310
  should provide DNS for services [It]
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/dns.go:308

  Expected error:
      <*errors.errorString | 0xc2080e49c0>: {
          s: "gave up waiting for pod 'dns-test-a6805741-25ac-11e5-8fa6-42010af01555' to be 'running' after 5m0s",
      }
      gave up waiting for pod 'dns-test-a6805741-25ac-11e5-8fa6-42010af01555' to be 'running' after 5m0s
  not to have occurred

23.251.154.44-22-kubelet.log reports the following:

E0708 20:05:33.233502    3077 manager.go:1579] Error running pod "nginx_e2e-tests-kubectl-516xc" container "nginx": API error (500): error creating aufs mount to /var/lib/docker/aufs/mnt/66a568f6f6922b9ed2e978a31ea2ab77ddf4f65eae2291d0fe5a5a6c50eb875b-init: invalid argument
E0708 20:05:33.329154    3077 manager.go:1579] Error running pod "nginx_e2e-tests-kubectl-516xc" container "nginx": API error (500): error creating aufs mount to /var/lib/docker/aufs/mnt/c6257ab9f8ac34c01b68d6b0ad1712981bce6fe8afcd2afd1b3976b3a65bc5fa-init: invalid argument
E0708 20:05:35.510816    3077 manager.go:1579] Error running pod "dns-test-a6805741-25ac-11e5-8fa6-42010af01555_e2e-tests-dns-8q0vk" container "jessie-querier": API error (500): error creating aufs mount to /var/lib/docker/aufs/mnt/b2a84be51ae21ce45bc2cf1ecb944be2b4345c25798af0249260874185622e64-init: invalid argument
E0708 20:05:35.609901    3077 manager.go:1579] Error running pod "dns-test-a6805741-25ac-11e5-8fa6-42010af01555_e2e-tests-dns-8q0vk" container "jessie-querier": API error (500): error creating aufs mount to /var/lib/docker/aufs/mnt/6f337451bbaa63fade20e59c88f87a6851412e3b5fb9d9b5d1a6457d1bb4e9f5-init: invalid argument

Here's the logs from this particular run: https://console.developers.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-parallel/3046/

@krousey
Contributor Author

krousey commented Jul 8, 2015

It seems the message is generated here: https://github.com/docker/docker/blob/master/daemon/graphdriver/aufs/aufs.go#L392, and it boils down to a syscall.Mount call.

syscall.Mount calls syscall.BytePtrFromString on data. From the docs:

BytePtrFromString returns a pointer to a NUL-terminated array of bytes containing the text of s. If s contains a NUL byte at any location, it returns (nil, EINVAL).

Then syscall.Mount calls mount(2). From man 2 mount:

ERRORS

The error values given below result from filesystem type independent errors. Each filesystem type may have its own special errors and its own special behavior. See the Linux kernel source code for details.
...
EINVAL source had an invalid superblock. Or, a remount (MS_REMOUNT) was attempted, but source was not already mounted on target. Or, a move (MS_MOVE) was attempted, but source was not a mount point, or was '/'.

They do use MS_REMOUNT here: https://github.com/docker/docker/blob/master/daemon/graphdriver/aufs/aufs.go#L457, so maybe it could be a race condition. I don't really have any idea; it's probably more likely that this is an AUFS-specific error.
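For what it's worth, the user-space half of this is easy to demonstrate in isolation. Below is only a minimal sketch of the two ways syscall.Mount can come back with EINVAL (a NUL byte in one of its string arguments versus the kernel itself rejecting the mount), not a claim about which one Docker is hitting:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Case 1: syscall.Mount runs its string arguments through
	// BytePtrFromString, which returns EINVAL if any of them contains a
	// NUL byte. No syscall is made at all, so no privileges are needed.
	err := syscall.Mount("none", "/mnt", "aufs", 0, "br:/nonexistent\x00layer")
	fmt.Println("NUL byte in data:", err) // invalid argument

	// Case 2: the kernel itself returns EINVAL, e.g. for a bad superblock
	// or an MS_REMOUNT of a target that is not a mount point (see man 2
	// mount). As root this usually prints "invalid argument" because /mnt
	// is normally not a mount point; without root it prints "operation
	// not permitted" instead.
	err = syscall.Mount("none", "/mnt", "", syscall.MS_REMOUNT, "")
	fmt.Println("kernel-side attempt:", err)
}

Either way the caller just sees EINVAL, which is what makes the log line above so unhelpful.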

@vmarmol vmarmol added priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 10, 2015
@sroze
Contributor

sroze commented Aug 30, 2015

FYI, I'm experiencing this in "production" on Google Container Engine too.

@iameli

iameli commented Sep 10, 2015

Also experiencing this on a vanilla kube-up installation on AWS.

@ghost ghost added this to the v1.1-candidate milestone Sep 15, 2015
@ghost ghost added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 15, 2015
@ghost

ghost commented Sep 15, 2015

Tentatively increased priority to P1, as we're receiving quite a few reports of this. @dchen1107 feel free to disagree and demote if you so wish.

@dchen1107
Member

Ahh, the corresponding docker issue for this one should be: moby/moby#14026

@zviadm

zviadm commented Sep 21, 2015

FYI, I think I was running into very similar issues with docker directly (i.e. without kubernetes). Here is the issue I opened against docker: moby/moby#16376

You can try the same thing: synchronize all calls to docker pull on the application side, if possible, and see if that makes the issue go away.
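If anyone wants to try that from Go, here is a minimal sketch of what application-side serialization could look like, assuming the application pulls images by shelling out to docker pull; the mutex just makes sure only one pull is in flight at a time:

package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// pullMu serializes every docker pull issued by this process, so the daemon
// never sees concurrent pulls from it.
var pullMu sync.Mutex

func pullImage(image string) error {
	pullMu.Lock()
	defer pullMu.Unlock()
	out, err := exec.Command("docker", "pull", image).CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker pull %s: %v: %s", image, err, out)
	}
	return nil
}

func main() {
	images := []string{"nginx", "gcr.io/google_containers/jessie-dnsutils"}
	var wg sync.WaitGroup
	for _, img := range images {
		wg.Add(1)
		go func(img string) { // concurrent callers simply queue on the mutex
			defer wg.Done()
			if err := pullImage(img); err != nil {
				fmt.Println(err)
			}
		}(img)
	}
	wg.Wait()
}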

@iameli

iameli commented Oct 6, 2015

I don't have the expertise to say whether or not this is related, but I was encountering this issue constantly and it went away completely when I switched DOCKER_STORAGE='aufs' in the Saltstack init script to DOCKER_STORAGE='aufs-nolvm'.

Possibly some kind of poor interaction between the LVM set up by Kubernetes' Saltstack and the aforementioned Docker bugs? No idea. Figured I'd document this for anyone else running into this using kube-up on AWS.

EDIT: Eh, still happening. See my comment down two from this one.

@bronger

bronger commented Oct 6, 2015

Does anybody have a workaround for this on GKE?

@vishh
Contributor

vishh commented Oct 7, 2015

Can we collect the output of dmesg from the node whenever this happens again?
@bronger: I'd check the status of the docker daemon (cat /proc/$(pidof docker)/status) and dmesg. In the past, machine restarts have fixed this issue.

@iameli

iameli commented Oct 8, 2015

I retract my earlier comment; it's happening again. Perhaps a bit less often since I switched to aufs-nolvm.

Here's my dmesg. Looks pretty crazy, actually. I'll follow up on my end and see if I have any pods leaking.

cat /proc/$(pidof docker)/status:

Name:   docker
State:  S (sleeping)
Tgid:   15821
Ngid:   0
Pid:    15821
PPid:   1
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 256
Groups: 
VmPeak:  2252400 kB
VmSize:  2252240 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     52848 kB
VmRSS:     50664 kB
VmData:  2238488 kB
VmStk:       136 kB
VmExe:     13444 kB
VmLib:         0 kB
VmPTE:       768 kB
VmSwap:        0 kB
Threads:    150
SigQ:   0/15022
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: ffffffffffc1feff
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
Seccomp:    0
Cpus_allowed:   0003
Cpus_allowed_list:  0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    21317
nonvoluntary_ctxt_switches: 121712

@sroze
Contributor

sroze commented Oct 8, 2015

@iameli That's probably just because you restarted the Docker daemon; that's my workaround for now. I'm running a daemon on each Docker node, and every time it sees this error, it restarts the local Docker...
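In case it helps anyone, below is a rough sketch of that kind of watcher. The log path (/var/log/docker.log) and the service docker restart command are assumptions about the node image, so adjust them for your setup:

package main

import (
	"bufio"
	"log"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Follow the docker daemon log; tail -F keeps following across log
	// rotation and daemon restarts.
	tail := exec.Command("tail", "-F", "/var/log/docker.log")
	out, err := tail.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := tail.Start(); err != nil {
		log.Fatal(err)
	}

	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		if !strings.Contains(scanner.Text(), "error creating aufs mount") {
			continue
		}
		log.Print("aufs mount error seen, restarting docker")
		if err := exec.Command("service", "docker", "restart").Run(); err != nil {
			log.Printf("restart failed: %v", err)
		}
		// Give the daemon time to come back before reacting again.
		time.Sleep(time.Minute)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}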

@bronger

bronger commented Oct 8, 2015

My proc status output is https://gist.github.com/bronger/6185881357fc069447be
And my dmesg is https://gist.github.com/bronger/c9ccacd88cb2d7a40f95

Both seem to be similar to @iameli's observations. And restarting the docker daemon also solves the issue for me.

@saturnism
Contributor

Restarting the docker daemon temporarily fixes my issue as well.

dmesg is filled with this line:

[7105875.751556] aufs opt_add:782:docker[30564]: lookup failed /var/lib/docker/aufs/diff/840234741900807847ceb6681d36ca0790fcdfd2ed0eee5eda3220e186d408e4 (-2)

status

Name:   docker
State:  S (sleeping)
Tgid:   3359
Ngid:   0
Pid:    3359
PPid:   1
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 512
Groups: 
VmPeak:  1295612 kB
VmSize:  1295612 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     57444 kB
VmRSS:     57152 kB
VmData:  1281860 kB
VmStk:       136 kB
VmExe:     13444 kB
VmLib:         0 kB
VmPTE:       520 kB
VmSwap:        0 kB
Threads:    88
SigQ:   1/14829
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: ffffffffffc1feff
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
Seccomp:    0
Cpus_allowed:   1
Cpus_allowed_list:  0
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    612
nonvoluntary_ctxt_switches: 14738

@davidopp davidopp added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 16, 2015
@taylortrimble

> FYI, I'm experiencing this in "production" on Google Container Engine too.

I just noticed this today on GKE. I was able to restart docker. I had to manually docker pull all my images after that, but I'm up and running. This is a concerning problem though. 😬

@vishh
Contributor

vishh commented Oct 19, 2015

For Kubernetes v1.2, I plan on validating overlayfs to be the default storage driver for Kubernetes deployments.

Until then would #15856 help @tylrtrmbl @saturnism @bronger ?

@saturnism
Contributor

I'm not sure if #15856 helps (I'm not entirely sure how it determines Docker daemon health?). In my case, the docker daemon seems to run just fine; it can continue to pull other images. But in order to pull the image that got into a bad state, restarting the daemon is what works.

@dchen1107
Member

@saturnism Please see my comment: #15856 (comment)
I have the same concerns as you.

@vishh
Contributor

vishh commented Oct 19, 2015

@dchen1107 suggested subscribing to the docker event stream and restarting the daemon if an aufs-related error is returned by the daemon. However, we decided not to add that patch for v1.1.
We will try switching to OverlayFS (#15867) for v1.2. We are also hoping for newer docker releases to resolve docker-specific aufs issues.

@dchen1107
Member

I also closed another attempt at fixing this issue: #15881

There is no reliable way to detect such a failure. We also observed the same failure caused by a kernel issue, which requires rebooting the node completely. In this case, we decided to document this as a known issue for 1.1, with information on how to remedy it.

I am lowering the priority of the issue and moving it out of 1.1.

@dchen1107 dchen1107 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Oct 19, 2015
@dchen1107 dchen1107 removed this from the v1.1 milestone Oct 19, 2015
@saturnism
Contributor

Serializing pulls seems to help. Is that an option?
Secondly, I'm not 100% convinced that overlayfs will fix this issue.

@vsimon
Contributor

vsimon commented Oct 19, 2015

Agreeing with the above comment that serializing pulls was the only thing that worked in my case. Haven't seen this issue pop up in many weeks after serializing.

@iameli

iameli commented Oct 19, 2015

Is there an easy way to instruct Kubernetes to "serialize pulls"? If I create a pod with three containers, it's going to pull down all three in parallel no matter what I do, right?

@vishh
Contributor

vishh commented Oct 19, 2015

@iameli: AFAIK kubelet does not support serializing docker pulls.

@saturnism
Contributor

@vishh i like where this is going! 👍

@derekwaynecarr
Member

@vishh - we do not want to require the use of OverlayFS for Kube 1.2 at Red Hat

@dchen1107 dchen1107 added this to the v1.1 milestone Oct 21, 2015
@vishh
Contributor

vishh commented Oct 23, 2015

@derekwaynecarr: That was not the intention either. We currently use Aufs in GCE and GKE. My comment was that we will evaluate overlayfs as an alternative to Aufs on those deployments.

@vishh
Contributor

vishh commented Oct 23, 2015

I'm marking this issue as resolved for now. Serializing image pulls as part of #15914 should hopefully solve this issue. If it happens even after serialization, feel free to re-open this issue.
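(For context on the shape of that change: serialization is roughly a single worker draining a queue of pull requests, so only one pull ever hits the daemon at a time. The sketch below is just an illustration of that shape with a fake pull function, not the actual #15914 implementation.)

package main

import (
	"fmt"
	"sync"
	"time"
)

// pullRequest asks the single puller goroutine to pull one image and
// reports the result back on done.
type pullRequest struct {
	image string
	done  chan error
}

// serializedPuller drains requests one at a time, so the docker daemon only
// ever sees one pull in flight no matter how many pods are starting.
func serializedPuller(requests <-chan pullRequest, pull func(string) error) {
	for req := range requests {
		req.done <- pull(req.image)
	}
}

func main() {
	requests := make(chan pullRequest)
	go serializedPuller(requests, func(image string) error {
		time.Sleep(100 * time.Millisecond) // stand-in for a real docker pull
		fmt.Println("pulled", image)
		return nil
	})

	var wg sync.WaitGroup
	for _, img := range []string{"nginx", "redis", "busybox"} {
		wg.Add(1)
		go func(img string) { // concurrent callers; pulls still happen one by one
			defer wg.Done()
			done := make(chan error, 1)
			requests <- pullRequest{image: img, done: done}
			if err := <-done; err != nil {
				fmt.Println("pull failed:", err)
			}
		}(img)
	}
	wg.Wait()
}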

@jezell

jezell commented Dec 19, 2015

I was seeing this same problem today while resizing instance groups in GKE (not in e2e tests, but on a production GKE cluster running kube 1.1.3 / docker 1.9), so I don't think serializing the image pulls has made the problem go away. It's kind of a pain to resolve, since I had to restart the docker daemon and manually re-pull images on the node via docker pull to get pods to start.

@dchen1107
Member

#23028 is for reporting system OOM events through the node problem detector. A sample system OOM dmesg can be found at #10959 (comment)
