Running pods fails with AUFS mount errors #10959
Seems the message is generated here: https://github.com/docker/docker/blob/master/daemon/graphdriver/aufs/aufs.go#L392 and boils down to a syscall.Mount call. syscall.Mount converts its string arguments with syscall.BytePtrFromString and then issues the raw mount(2) syscall.
They do use MsRemount here: https://github.com/docker/docker/blob/master/daemon/graphdriver/aufs/aufs.go#L457, so maybe it could be a race condition. I don't really have any idea; it's probably more likely that this is an AUFS-specific error.
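For context, here is a minimal Go sketch of the two mount paths being discussed, assuming an aufs-enabled kernel. This is an illustration of the mechanism, not Docker's actual code; the mount target and branch strings are made up:

```go
package main

import (
	"fmt"
	"syscall"
)

// aufsMount sketches the initial mount path. syscall.Mount converts
// each string argument with syscall.BytePtrFromString before issuing
// the raw mount(2) call, so it is a thin wrapper over the kernel.
func aufsMount(target, branchData string) error {
	if err := syscall.Mount("none", target, "aufs", 0, branchData); err != nil {
		return fmt.Errorf("aufs mount on %s failed: %v", target, err)
	}
	return nil
}

// aufsAppendBranch sketches the MsRemount path: MS_REMOUNT plus an
// "append:" entry adds a branch to an existing union without
// unmounting it.
func aufsAppendBranch(target, branch string) error {
	data := fmt.Sprintf("append:%s=ro+wh", branch)
	return syscall.Mount("none", target, "aufs", syscall.MS_REMOUNT, data)
}

func main() {
	// Requires root and an aufs kernel; paths are illustrative only.
	if err := aufsMount("/mnt/union", "br:/tmp/rw=rw:/tmp/ro=ro+wh"); err != nil {
		fmt.Println(err)
	}
	if err := aufsAppendBranch("/mnt/union", "/tmp/extra"); err != nil {
		fmt.Println(err)
	}
}
```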
FYI, I'm experiencing this in "production" on Google Container Engine too.
Also experiencing on a vanilla
Tentatively increased priority to P1, as we're receiving quite a few reports of this. @dchen1107 feel free to disagree and demote if you so wish.
Ahh, the corresponding Docker issue for this one should be: moby/moby#14026
FYI, I think I was running into very similar issues with Docker directly (i.e. without Kubernetes). Here is an issue I opened with Docker directly: moby/moby#16376. You can try the same thing by synchronizing all calls to
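The synchronization being suggested presumably looks something like the hedged Go sketch below, which funnels every pull through a single mutex. Shelling out to the `docker pull` CLI and the helper names are illustrative assumptions, not code from either issue:

```go
package main

import (
	"os/exec"
	"sync"
)

// pullMu serializes image pulls process-wide. This is a hypothetical
// workaround sketch, not kubelet or Docker code.
var pullMu sync.Mutex

// pullImage shells out to `docker pull`, holding the mutex so that
// only one pull touches the graph driver at a time.
func pullImage(image string) error {
	pullMu.Lock()
	defer pullMu.Unlock()
	return exec.Command("docker", "pull", image).Run()
}

func main() {
	var wg sync.WaitGroup
	for _, img := range []string{"busybox", "alpine", "nginx"} {
		wg.Add(1)
		go func(img string) {
			defer wg.Done()
			_ = pullImage(img) // pulls run one at a time despite the goroutines
		}(img)
	}
	wg.Wait()
}
```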
I don't have the expertise to say whether or not this is related, but I was encountering this issue constantly, and it went away completely when I switched. Possibly some kind of poor interaction between the LVM set up by Kubernetes' Saltstack and the aforementioned Docker bugs? No idea. Figured I'd document this for anyone else running into it.

EDIT: Eh, still happening. See my comment two down from this one.
Does anybody have a workaround for this on GKE?
Can we collect the output of dmesg?
I retract my earlier comment; it's happening again, though perhaps a bit less often since I switched. Here's my dmesg. Looks pretty crazy, actually. I'll follow up on my end and see if I have any pods leaking.
@iameli probably just because you restarted the Docker daemon; that's my workaround for now: I'm running a daemon on each Docker node, and every time it sees this error, it restarts the local Docker...
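A rough sketch of such a watchdog, assuming systemd manages the Docker daemon. The journal unit name, the matched error string, and the restart command are guesses about the host, not anything this thread specifies:

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
	"strings"
)

func main() {
	// Tail the Docker daemon's journal and watch for aufs mount errors.
	tail := exec.Command("journalctl", "-u", "docker", "-f", "-o", "cat")
	out, err := tail.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := tail.Start(); err != nil {
		log.Fatal(err)
	}
	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		line := scanner.Text()
		// Heuristic match on the error message discussed in this issue.
		if strings.Contains(line, "error creating aufs mount") {
			log.Printf("aufs mount error seen, restarting docker: %s", line)
			if err := exec.Command("systemctl", "restart", "docker").Run(); err != nil {
				log.Printf("restart failed: %v", err)
			}
		}
	}
}
```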
My proc status output is https://gist.github.com/bronger/6185881357fc069447be. Both seem to be similar to @iameli's observations, and restarting the docker daemon also solves the issue for me.
Restarting the docker daemon on the node temporarily fixes my issue as well.
I just noticed this today on GKE. I was able to restart docker. I had to manually
For Kubernetes v1.2, I plan on validating overlayfs. Until then, would #15856 help @tylrtrmbl @saturnism @bronger?
I'm not sure if #15856 helps (I'm not entirely sure how it determines Docker daemon health?). In my case, the docker daemon seems to run just fine: it can continue to pull other images. But in order to pull the image that got into a bad state, restarting the daemon works.
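For illustration, a daemon-level health check presumably looks something like the sketch below. This is a guess at the kind of probe #15856 might use, not its actual implementation, and it shows why such a probe can pass while one image stays unpullable: the daemon answers a trivial CLI call even when a single aufs layer is wedged.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// dockerHealthy is a hypothetical liveness probe: if the daemon
// answers a trivial CLI call within the timeout, it is considered
// healthy. Note this passes even when one specific image layer is
// in a bad state, matching the behavior described above.
func dockerHealthy(timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "docker", "ps", "-q").Run() == nil
}

func main() {
	fmt.Println("docker healthy:", dockerHealthy(5*time.Second))
}
```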
@saturnism Please see my comment: #15856 (comment)
@dchen1107 suggested subscribing to the Docker event stream and restarting the daemon if an aufs-related error is returned by it. However, we decided not to add that patch for v1.1.
I also closed another attempt at fixing this issue: #15881. There is no reliable way to detect such a failure. We also observed the same failure caused by a kernel issue, which requires rebooting the node completely. Given that, we decided to document this for 1.1 as a known issue, with info on how to remedy it. I am lowering the priority of this issue and moving it out of 1.1.
Serializing pulls seems to help. Is that an option?
Agreeing with the above comment that serializing pulls was the only thing that worked in my case. Haven't seen this issue pop up in many weeks after serializing.
Is there an easy way to instruct Kubernetes to "serialize pulls"? If I create a pod with three containers, it's going to pull down all three in parallel no matter what I do, right?
@iameli: AFAIK kubelet does not support serializing docker pulls.
@vishh i like where this is going! 👍
@vishh - we do not want to require the use of OverlayFS for Kube 1.2 at Red Hat.
@derekwaynecarr: That was not the intention either. We currently use Aufs in GCE and GKE. My comment was that we will evaluate overlayfs as an alternative to Aufs on those deployments.
I'm marking this issue as resolved for now. Serializing image pulls as part of #15914 should hopefully solve this issue. If it happens even after serialization, feel free to re-open this issue.
Was seeing this same problem today while resizing instance groups in GKE (not in e2e tests, but on a production GKE cluster running Kube 1.1.3 / Docker 1.9), so I don't think serializing image pulls has made the problem go away. It's kind of a pain to resolve, since I had to restart the docker daemon and manually re-pull images to the node via docker pull to get pods to start.
#23028 tracks reporting system OOM events through the node problem detector. A sample system OOM dmesg can be found at #10959 (comment).
#10622
Occasionally our e2e tests fail with something like the following
23.251.154.44-22-kubelet.log reports the following
Here are the logs from this particular run: https://console.developers.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-parallel/3046/