
docker ps hang on 1.13.1 due to selinux relabelling holding a container lock #32007

Closed
mikebryant opened this issue Mar 22, 2017 · 17 comments

@mikebryant

Description

docker ps hangs

Steps to reproduce the issue:
1. Use docker with Kubernetes
2. Run docker ps

Describe the results you received:
Hanging command

Describe the results you expected:
Status of containers

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.3
 Git commit:   092cba3
 Built:        Fri Mar 17 08:05:24 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   092cba3
 Built:        Fri Mar 17 08:05:24 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 102
 Running: 65
 Paused: 0
 Stopped: 37
Images: 28
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.10.1-coreos
Operating System: Container Linux by CoreOS 1353.1.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.68 GiB
Name: kubernetes-cit-kubernetes-cr0-0-1490197314
ID: AFAL:UR6W:I4W6:NCSW:MZ5A:NUDY:UIF6:ZEOW:EOGG:LPNV:64VE:7M2G
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://proxy.ocado.com:8080
Https Proxy: http://proxy.ocado.com:8080
No Proxy: .ocado.com,.lastmile.com
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
CoreOS on Openstack

containerd.txt
goroutine-stacks-2017-03-22T162146Z.txt
daemon-data-2017-03-22T162146Z.txt

@thaJeztah
Member

The CoreOS packages are not maintained here. I also notice that containerd appears to have been built with a different Go version (1.6 instead of 1.7); I'm not sure whether that affects this.

Have you also reported this to the CoreOS maintainers?

@mikebryant
Author

No, I can report it there as well if you like?

@thaJeztah
Member

I think that's best; their version has modifications relative to the "vanilla" version of docker, and it would be hard for us to tell what those differences are and to reproduce the issue. I'm not excluding the possibility that there's an issue that needs to be fixed "upstream" here, but it's easier for them to do the initial triage, because they're familiar with the changes and the configuration of docker on CoreOS.

I'll go ahead and close this here for now, but keep us posted, and we can re-open if more information becomes available 👍

@euank
Contributor

euank commented Mar 22, 2017

@thaJeztah As a sorta FYI, docker 1.13.1 on Container Linux currently doesn't carry any additional patches. We used to have a few, but at this point our 1.13.1 branch is identical to upstream's.

We do have a small selinux patch on runc (which Fedora also carries), but otherwise containerd/runc should both match upstream too.

@thaJeztah
Member

Thanks for the heads up @euank; am I right that containerd is compiled with Go 1.6?

Let me know if more is known on this issue, or if you need help 👍

@euank
Contributor

euank commented Mar 22, 2017

@thaJeztah yeah, you're right about that go 1.6 bit. If you know of a reason that might cause this issue, a pointer there would be appreciated. We'll be updating it to 1.7 soon regardless though.

Unless it's related to go 1.6 somehow, I would expect this to not be Container Linux specific. Help digging into these issues is always appreciated!

@thaJeztah
Member

There are known locking issues, which can sometimes lead to hangs like this (being worked on, but non-trivial); @mlaventure may be able to provide more info on the containerd part, but he's on PTO this week.

@mlaventure
Contributor

@euank if you see a hang/lock of the daemon, the best solution is to send a SIGUSR1 to both dockerd and docker-containerd and to open an issue with the traces, the logs and other info as requested in the issue template.

@euank
Contributor

euank commented Mar 29, 2017

@mlaventure this is the issue. It has the logs and traces and so on in the first post.

@thaJeztah

we can re-open if more information becomes available

What additional information would be helpful to re-open and have Docker's help in figuring out the root cause?

@mikebryant
Author

I believe we've got an idea of exactly what's causing this.

On the affected nodes we have some containers mounting network volumes. The IOPS limits on these volumes are rather low, and they've accumulated a large number of files.

Our hypothesis is that the docker daemon is stuck in selinux relabelling.
See goroutine 2675 in the supplied goroutine-stacks file:

syscall.Syscall6(0xbd, 0xc423382bb0, 0xc4216eb660, 0xc4213b4210, 0x2f, 0x0, 0x0, 0xc4213b4210, 0xc422a86d20, 0x2f)
	/usr/lib/go1.7/src/syscall/asm_linux_amd64.s:44 +0x5
github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/system.Lsetxattr(0xc423382a50, 0xa3, 0x193e00b, 0x10, 0xc4213b4210, 0x2f, 0x30, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/system/xattrs_linux.go:94 +0xfb
github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/selinux.Setfilecon(0xc423382a50, 0xa3, 0xc422a86d20, 0x2f, 0xc421c2dba0, 0xc421de2448)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/selinux/selinux.go:171 +0xa3
github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/selinux.Chcon.func1(0xc423382a50, 0xa3, 0x2491e40, 0xc423bb69c0, 0x0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/selinux/selinux.go:478 +0x47
path/filepath.walk(0xc423382a50, 0xa3, 0x2491e40, 0xc423bb69c0, 0xc420f00fe0, 0x0, 0x0)
	/usr/lib/go1.7/src/path/filepath/path.go:351 +0x81
path/filepath.walk(0xc421bd5f40, 0x91, 0x2491e40, 0xc422a66410, 0xc420f00fe0, 0x0, 0x0)
	/usr/lib/go1.7/src/path/filepath/path.go:376 +0x344
path/filepath.walk(0xc420159080, 0x8e, 0x2491e40, 0xc422730d00, 0xc420f00fe0, 0x0, 0x1928b01)
	/usr/lib/go1.7/src/path/filepath/path.go:376 +0x344
path/filepath.Walk(0xc420159080, 0x8e, 0xc420f00fe0, 0x0, 0x8)
	/usr/lib/go1.7/src/path/filepath/path.go:398 +0xd5
github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/selinux.Chcon(0xc420159080, 0x8e, 0xc422a86d20, 0x2f, 0x24d7f01, 0xc421de277a, 0xffffffffffffffff)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/selinux/selinux.go:482 +0xc8
github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/label.Relabel(0xc420159080, 0x8e, 0xc422a86d20, 0x2f, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/label/label_selinux.go:144 +0x16b
github.com/docker/docker/volume.(*MountPoint).Setup(0xc42268fc20, 0xc422a86d20, 0x2f, 0x0, 0x0, 0x0, 0x1, 0x2, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/volume/volume.go:156 +0x32c
github.com/docker/docker/daemon.(*Daemon).setupMounts(0xc420416e00, 0xc4204ba800, 0x0, 0x0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/daemon/volumes_unix.go:46 +0x3ca
github.com/docker/docker/daemon.(*Daemon).createSpec(0xc420416e00, 0xc4204ba800, 0x0, 0x0, 0xc420416e00)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/daemon/oci_linux.go:697 +0x634
github.com/docker/docker/daemon.(*Daemon).containerStart(0xc420416e00, 0xc4204ba800, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.13.1-r3/work/docker-1.13.1/.gopath/src/github.com/docker/docker/daemon/start.go:149 +0x290
github.com/docker/docker/daemon.(*Daemon).ContainerStart(0xc420416e00, 0xc420917977, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2463af0, 0x1)
...

Inspection of the code for the given version of docker, and the master branch, indicates this code path is holding a container lock: https://github.com/docker/docker/blob/daaf9ddfa9a53fb82511accb8ad0fed381670a54/daemon/start.go#L100
This then blocks docker ps etc.

So it's not hung forever, but it's taking a significant amount of time (tens of minutes).
During this time, the docker daemon doesn't respond to any queries that need to iterate over all containers. This means, for example, that Kubernetes isn't able to query the status of containers, and the node ends up being reported as unhealthy.
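As a minimal illustration of the pattern (a toy sketch, not Docker's actual code): one goroutine holds a per-container mutex while doing a long relabel-style walk, and a concurrent list that needs the same mutex blocks for the entire walk.

package main

import (
	"fmt"
	"sync"
	"time"
)

// container stands in for Docker's container object, which embeds a mutex.
type container struct {
	mu   sync.Mutex
	name string
}

// slowRelabel simulates selinux.Chcon walking millions of files on a slow,
// IOPS-limited volume.
func slowRelabel() {
	time.Sleep(3 * time.Second) // stand-in for tens of minutes of Lsetxattr calls
}

// start mimics the containerStart path above: it takes the container lock,
// then performs mount setup (including the relabel) before releasing it.
func start(c *container) {
	c.mu.Lock()
	defer c.mu.Unlock()
	slowRelabel()
}

// list mimics the part of `docker ps` that must lock each container to read
// its state; it cannot proceed while start() holds the lock.
func list(cs []*container) {
	for _, c := range cs {
		c.mu.Lock()
		fmt.Println("listed", c.name)
		c.mu.Unlock()
	}
}

func main() {
	c := &container{name: "prometheus"}
	go start(c)

	time.Sleep(100 * time.Millisecond) // let start() grab the lock first
	begin := time.Now()
	list([]*container{c})
	fmt.Println("list took", time.Since(begin))
}

Running this, the list call only returns once the simulated relabel finishes, which matches what we see with docker ps.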

See also this bug (which helped point us in this direction): prometheus-operator/prometheus-operator#239

@thaJeztah Please can you reopen this bug, as all the code paths listed exist upstream. My belief is that this would affect all operating systems, not just CoreOS.

@euank
Contributor

euank commented Apr 10, 2017

Thank you for digging into this!

Running dockerd with selinux disabled might work around this.

It also sounds like moving that selinux relabeling out of the docker daemon containerList lock (e.g. to containerd) would keep docker ps responsive.
Does that sound like the proper fix for this issue?
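For illustration, a rough sketch of that reordering, using the same toy types as the snippet in the earlier comment (again, not Docker's or containerd's actual API, and assuming the relabel can safely run before the lock is taken): the expensive walk happens outside the critical section, so a concurrent listing would only wait for the short state update at the end.

package main

import (
	"sync"
	"time"
)

type container struct{ mu sync.Mutex }

// slowRelabel again stands in for a long recursive Lsetxattr walk.
func slowRelabel() { time.Sleep(3 * time.Second) }

// startRelabelOutsideLock performs the expensive relabel before acquiring the
// container lock, so only the short state update sits in the critical section.
func startRelabelOutsideLock(c *container) {
	slowRelabel() // long filesystem walk, no lock held

	c.mu.Lock()
	defer c.mu.Unlock()
	// only the quick, state-mutating part of container start remains here
}

func main() {
	startRelabelOutsideLock(&container{})
}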

@mikebryant
Author

We'll have a look into disabling selinux, thanks for the suggestion.

That sounds like a reasonable fix. (Though I guess you'd still want to make sure that the containerd list operation still functions while such a relabel is in progress? Assuming it has a list operation, I don't know much about containerd)

@mikebryant
Author

@euank Is there another issue open for this somewhere?
If not can you reopen this one?

@mikebryant changed the title from "docker ps hang on 1.13.1" to "docker ps hang on 1.13.1 due to selinux relabelling holding a container lock" on Apr 18, 2017
@klausenbusk

@mikebryant are you sure that the volume is mounted with :Z (selinux relabeling)? I have experienced a similar or the same issue where docker hangs (due to a lot of files from Prometheus, I think), but I just checked with docker inspect and the Prometheus volume isn't mounted with :Z.

Spec:

Container Linux by CoreOS stable (1298.6.0)
Docker version 1.12.6, build d5236f0
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.4+coreos.0", GitCommit:"97c11b097b1a2b194f1eddca8ce5468fcc83331c", GitTreeState:"clean", BuildDate:"2017-03-08T23:54:21Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Are you also using k8s 1.5.4?

I also tried to reproduce this on a single CoreOS node (which was doing almost nothing else):

$ time docker run --rm -t -i -v $PWD:/foo busybox sh -c exit
real	0m0.566s
user	0m0.016s
sys	0m0.005s
$ time docker run --rm -t -i -v $PWD:/foo:Z busybox sh -c exit
real	0m0.480s
user	0m0.012s
sys	0m0.008s
$ dd if=/dev/urandom of=masterfile bs=1M count=10
$ split -b 5 -a 4 masterfile
...
$ ls -1 -f  | wc -l
742831
$ time docker run --rm -t -i -v $PWD:/foo busybox sh -c exit 
real	0m0.584s
user	0m0.015s
sys	0m0.012s
$ time docker run --rm -t -i -v $PWD:/foo:Z busybox sh -c exit
real	0m45.830s
user	0m0.023s
sys	0m0.020s

I couldn't get docker to hang forever with 742k files; how many files does your volume contain, @mikebryant? I think k8s may have detached the volume while relabeling was ongoing, though I'm not sure whether that could cause a lockup. I would try a SIGUSR1 next time to get some more info.

I also stumbled on: kubernetes/kubernetes#42257 which seems to be the same issue (?).

@mikebryant
Author

Yup, definitely mounted with Z:

            {
                "Source": "/var/lib/kubelet/pods/843908ae-1f59-11e7-bb5b-fa163eb19aeb/volumes/kubernetes.io~cinder/pvc-62ffd099-1b78-11e7-b59a-fa163eb19aeb/prometheus-db",
                "Destination": "/var/prometheus/data",
                "Mode": "Z",
                "RW": true,
                "Propagation": "rprivate"
            },

k8s 1.5.2

The issue for us was the number of files vs. the IO limit on the volume.
Our Prometheus volume has 3.2M files, and the IO limit was mistakenly set very low (100 IOPS).

Clearly, if it needs to relabel everything, that's not going to finish quickly, but it shouldn't hang the daemon while doing so. Only that container should be affected.

@mikebryant
Author

ping @thaJeztah @mlaventure
Should I file a new issue for fixing selinux relabelling, or can this issue be reopened?

@mlaventure
Contributor

@mikebryant if you wouldn't mind opening a new one aggregating all the information you provided, it would make it easier to follow.
