docker ps hang on 1.13.1 due to selinux relabelling holding a container lock #32007
Comments
The CoreOS packages are not maintained here. (I also notice that containerd appears to have been built with a different Go version, 1.6 instead of 1.7; not sure if that affects this.) Have you also reported this to the CoreOS maintainers?
No. I can report it there as well, if you like?
I think that's best; their version has modifications from the "vanilla" version of docker, and it would be hard for us to tell what the differences are, and to reproduce them. I'm not excluding the possibility that there's an issue that needs to be fixed "upstream" here, but it's easier for them to do the initial triage, because they're familiar with the changes and configuration of docker on CoreOS. I'll go ahead and close this here for now, but keep us posted, and we can re-open if more information becomes available 👍
@thaJeztah As a sort-of FYI: docker 1.13.1 on Container Linux currently doesn't carry any additional patches. We used to have a few, but at this point our 1.13.1 branch is identical to upstream's. We do have a small selinux patch on runc (which Fedora also carries), but otherwise containerd/runc should both match upstream too.
Thanks for the heads-up @euank; am I right that containerd is compiled with Go 1.6? Let me know if more is known on this issue, or if you need help 👍
@thaJeztah yeah, you're right about the Go 1.6 bit. If you know of a reason that might cause this issue, a pointer would be appreciated. We'll be updating it to 1.7 soon regardless. Unless it's somehow related to Go 1.6, I would expect this not to be Container Linux specific. Help digging into these issues is always appreciated!
There are known locking issues in the daemon, which can sometimes lead to hangs like this (being worked on, but non-trivial); @mlaventure may be able to provide more info on the containerd part, but he's on PTO this week.
@euank if you see a hang/lock of the daemon, the best first step is to send a `SIGUSR1` signal to the daemon, which makes it dump the stack traces of all its running goroutines to a file.
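For reference, a minimal sketch of how to trigger such a dump (assuming the daemon binary is named `dockerd`, as it is on 1.13; the exact dump location varies by version):

```sh
# Ask the daemon to dump the stack traces of all its goroutines.
# The dump is written as goroutine-stacks-<timestamp> under the
# daemon's run/exec root (or to the daemon log on some versions).
kill -USR1 "$(pidof dockerd)"
```

The goroutine-stacks-2017-03-22T162146Z.txt file attached to the first post is the output of such a dump.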
@mlaventure this is the issue. It has the logs and traces and so on in the first post. What additional information would be helpful to re-open this and get Docker's help in figuring out the root cause?
I believe we've got an idea of exactly what's causing this. On the affected nodes we have some containers mounting network volumes. The IOps limits on these volumes are rather low, and they've accumulated a large number of files. Our hypothesis is that the docker daemon is stuck in selinux relabelling.
Inspection of the code for the given version, and for the master branch, indicates that this code path holds a container lock while relabelling: https://github.com/docker/docker/blob/daaf9ddfa9a53fb82511accb8ad0fed381670a54/daemon/start.go#L100
So it's not hung forever, but it's taking a significant amount of time (tens of minutes). See also this bug, which helped point us in this direction: prometheus-operator/prometheus-operator#239
@thaJeztah please can you reopen this bug, as all the code paths listed exist upstream. My belief is that this would affect all operating systems, not just CoreOS.
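To illustrate the cost: the relabel on that code path is effectively a recursive walk that rewrites the SELinux label of every file on the volume. A rough, hypothetical way to gauge the time from the host (not what the daemon literally runs; it relabels in-process, but the filesystem work is equivalent to a recursive `chcon`):

```sh
# Hypothetical mountpoint; substitute the network volume's real path.
VOL=/mnt/netvolume

# How many files would need relabelling?
find "$VOL" | wc -l

# Time an equivalent recursive relabel. svirt_sandbox_file_t is the
# container file type used by docker 1.13-era policies. With low IOps
# and a large file count this can take tens of minutes, which is how
# long the daemon would hold the container lock.
time chcon -R -t svirt_sandbox_file_t "$VOL"
```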
Thank you for digging into this! Running dockerd with selinux disabled might work around this. It also sounds like moving that selinux relabelling out of the docker daemon containerList lock (e.g. to containerd) could end up with better behaviour here long-term.
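For anyone landing here looking for the workaround: SELinux support in the daemon is controlled by a startup flag, so disabling it is a one-line change (where the flag lives, e.g. the systemd unit or `/etc/docker/daemon.json` as `"selinux-enabled": false`, depends on the distribution):

```sh
# Start the daemon with SELinux support disabled; volumes mounted
# with :z/:Z are then no longer relabelled on container start.
dockerd --selinux-enabled=false
```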
We'll have a look into disabling selinux, thanks for the suggestion. That sounds like a reasonable fix. (Though I guess you'd still want to make sure that the containerd list operation still functions while such a relabel is in progress? Assuming it has a list operation; I don't know much about containerd.)
@euank Is there another issue open for this somewhere?
@mikebryant are you sure that the volume is mounted with `Z`? Spec:
Are you also using k8s 1.5.4? I also tried to reproduce this on a single CoreOS node (which was doing almost nothing else):
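(A rough sketch of such an attempt; the path, file count, and image below are illustrative, not the exact commands used.)

```sh
# Create a directory containing a very large number of small files.
mkdir -p /tmp/manyfiles
cd /tmp/manyfiles && seq 1 742000 | xargs touch

# Start a container that bind-mounts it with the Z (private relabel)
# option; the daemon relabels every file before the container starts.
docker run --rm -v /tmp/manyfiles:/data:Z busybox true &

# While the relabel is in progress (the run is backgrounded), check
# whether the daemon still answers.
docker ps
```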
I couldn't get docker to hang forever with 742k files; how many files does your volume (@mikebryant) contain? I think k8s may have detached the volume while relabelling was ongoing, though, and I'm not sure if that could cause a lockup. I also stumbled on kubernetes/kubernetes#42257, which seems to be the same issue (?).
Yup, definitely mounted with Z:
k8s 1.5.2. The issue for us was the number of files versus the IO limit on the volume. Clearly, if it needs to relabel everything, that's not going to work, but it shouldn't hang the daemon while doing so; only that container should be affected.
ping @thaJeztah @mlaventure
@mikebryant if you wouldn't mind opening a new issue aggregating all the information you've provided, it would make it easier to follow.
Description
`docker ps` hangs.

Steps to reproduce the issue:
1. Use docker with Kubernetes.
2. Run `docker ps`.

Describe the results you received:
A hanging command.

Describe the results you expected:
The status of containers.
Additional information you deem important (e.g. issue happens only occasionally):
Output of `docker version`:
Output of `docker info`:
Additional environment details (AWS, VirtualBox, physical, etc.):
CoreOS on OpenStack
containerd.txt
goroutine-stacks-2017-03-22T162146Z.txt
daemon-data-2017-03-22T162146Z.txt