
[1.11.0] Possible deadlock on container object #22124

Closed

tonistiigi opened this issue Apr 18, 2016 · 26 comments · Fixed by #24642
Labels: priority/P2 Normal priority: default priority applied. version/1.11

@tonistiigi
Member

Originally reported by @mblaschke in #13885 (comment)

Creating a different issue because it may be a 1.11 regression.

https://gist.github.com/tonistiigi/9d79de62b2f7919f33a9e987619b9de8 — the goroutine trace seems to show that lots of goroutines are waiting on a container lock. There is no obvious goroutine in that trace that is holding the lock, so possibly we have a code path that returns without releasing it.
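
For context, here is a minimal Go sketch of the suspected failure mode. This is hypothetical illustration code, not the daemon's actual implementation: a method takes the container lock, hits an early-return path, and never unlocks, so every later Lock() on that container blocks forever, which would produce exactly this kind of trace.

// Hypothetical illustration of the suspected bug pattern; not daemon code.
package main

import "sync"

type container struct {
    sync.Mutex
    running bool
}

// brokenStop shows the bug: the early return skips Unlock,
// leaving the container locked forever.
func brokenStop(c *container) {
    c.Lock()
    if !c.running {
        return // BUG: returns while still holding the lock
    }
    c.running = false
    c.Unlock()
}

// safeStop avoids the problem by deferring the unlock right after locking.
func safeStop(c *container) {
    c.Lock()
    defer c.Unlock()
    if !c.running {
        return
    }
    c.running = false
}

func main() {
    c := &container{}
    safeStop(c)
    c.Lock() // would block forever if brokenStop had been used on a stopped container
    c.Unlock()
}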

Original report:

Since we updated to 1.11.0, running rspec docker image tests (~10 parallel containers running these tests on a 4 CPU machine) sometimes freezes and fails with a timeout. Docker freezes completely and doesn't respond (e.g. docker ps). This is happening on a vserver with Debian stretch (btrfs) and with a (Vagrant) Parallels VM running Ubuntu 14.04 (backported kernel 3.19.0-31-generic, ext4).

The filesystem for /var/lib/docker on both servers was cleared (the btrfs was recreated) after the first freeze. The freeze happens randomly when running these tests.

Stack traces from both servers are attached:
docker-log.zip

strace from docker-containerd and docker daemons:

# strace -p 21979 -p 22536
Process 21979 attached
Process 22536 attached
[pid 22536] futex(0x219bd90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21979] futex(0xf9b170, FUTEX_WAIT, 0, NULL

Docker info (Ubuntu 14.04 with backported kernel)

root@DEV-VM:/var/lib/docker# docker info
Containers: 11
 Running: 1
 Paused: 0
 Stopped: 10
Images: 877
Server Version: 1.11.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 400
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.19.0-31-generic
Operating System: Ubuntu 14.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 3.282 GiB
Name: DEV-VM
ID: KCQP:OGCT:3MLX:TAQD:2XG6:HBG2:DPOM:GJXY:NDMK:BXCK:QEIT:D6KM
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Docker version (Ubuntu 14.04 with backported kernel)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64
@tonistiigi tonistiigi added the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Apr 18, 2016
@tonistiigi tonistiigi added this to the 1.11.1 milestone Apr 18, 2016
@rade

rade commented Apr 18, 2016

We are seeing these hangs regularly in our test suite, but only when using Docker 1.11. Curiously we did not see this in the 1.11 release candidates, though perhaps we were just lucky.

@tonistiigi
Member Author

@rade Can you send SIGUSR1 to the daemon and copy the traces from the logs when this happens? Just want to make sure this is the same issue.

@rade

rade commented Apr 18, 2016

I notice that the stack trace referenced above contains

goroutine 759138 [chan receive, 28 minutes]:
github.com/docker/docker/daemon.(*Daemon).ContainerExecStart(0xc820063c80, 0xc822419b11, 0x40, 0x7fb8e45402b8, 0xc8209206e8, 0x7fb8e45402e8, 0xc8212739a0, 0x7fb8e45402e8, 0xc821273980, 0x0, ...)
    /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/exec.go:205 +0x119e

I saw exactly the same thing when I dumped the threads on our stuck Docker.

Our tests invoke docker exec quite a lot.

@tonistiigi
Member Author

@rade This should just mean that there was an active exec process. A goroutine is created for every exec, and it only returns when the process exits.
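
As a rough sketch of what that means in practice (a hypothetical simplification, not the daemon's real code): each exec gets its own goroutine that blocks until the process exits, so a goroutine sitting in "chan receive" at exec.go:205 only tells you that the exec is still running.

// Simplified, hypothetical model of "one goroutine per exec"; not daemon code.
package main

import (
    "fmt"
    "os/exec"
)

func main() {
    done := make(chan error, 1)

    // One goroutine per exec; it only finishes once the process has exited.
    go func() {
        done <- exec.Command("sleep", "1").Run()
    }()

    // The caller sits in "chan receive" here until the exec terminates,
    // which is what the long-lived goroutine in the trace above looks like.
    if err := <-done; err != nil {
        fmt.Println("exec failed:", err)
        return
    }
    fmt.Println("exec finished")
}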

@rade

rade commented Apr 18, 2016

This should just mean that there was an active exec process. A goroutine is created for every exec, and it only returns when the process exits.

I know that. I was simply highlighting a possible correlation, not causation. In our case the exec should have terminated.

@clkao
Contributor

clkao commented May 10, 2016

I've seen this a couple of times with 1.11.1.

@bukzor

bukzor commented May 13, 2016

We're also seeing deadlocks in 1.11.1, but how do I confirm/deny it's this same issue?

@tonistiigi
Member Author

@bukzor Send a SIGUSR1 signal to the daemon process and post the goroutine trace that will show up in the logs.
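
For reference, here is a rough sketch of how such a SIGUSR1 stack dump can be wired up in a Go daemon; the code below is a hypothetical illustration of the mechanism, not Docker's actual handler. After sending the signal to the daemon's PID, the dump shows up in the process logs.

// Hypothetical sketch of a SIGUSR1 goroutine-dump handler in a Go daemon.
package main

import (
    "log"
    "os"
    "os/signal"
    "runtime"
    "syscall"
)

func dumpStacksOnSIGUSR1() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGUSR1)
    go func() {
        for range c {
            buf := make([]byte, 1<<20)
            n := runtime.Stack(buf, true) // all=true captures every goroutine's stack
            log.Printf("=== goroutine dump ===\n%s=== end ===", buf[:n])
        }
    }()
}

func main() {
    dumpStacksOnSIGUSR1()
    // Trigger the dump with: kill -USR1 <pid of this process>
    select {}
}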

@bukzor

bukzor commented May 13, 2016

We have six examples right now, and I've dumped the debug output for each.
The result is huge (70M).

Here's the gist: https://gist.github.com/anonymous/0185edaee33ee8db8df6fca4a07df529

@bukzor

bukzor commented May 13, 2016

Here's the same in a more readable form (I expanded the \t and \n):

https://gist.github.com/394589b78225d9be727a3cd7a9381156

@bukzor

bukzor commented May 13, 2016

@tonistiigi Please take a look and let us know if there's any smoking gun ^

@rade

rade commented May 24, 2016

I've had a docker attach get stuck against 1.11.1, with the container involved supposedly having exited:

# docker inspect -f '{{json .State}}' c8
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":0,"Error":"","StartedAt":"2016-05-24T14:07:40.844464858Z","FinishedAt":"2016-05-24T14:07:40.86345932Z"}

The stack trace shows the attach-handling goroutine:

goroutine 21626 [chan receive, 10 minutes]:
github.com/docker/docker/daemon.(*Daemon).containerAttach(0xc820323b00, 0xc820ec0700, 0x0, 0x0, 0x7f327fbb65b0, 0xc8201ab680, 0x7f327fbb65b0, 0xc8201ab660, 0xc821810100, 0x2223ad8, ...)
    /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/attach.go:112 +0x543
github.com/docker/docker/daemon.(*Daemon).ContainerAttach(0xc820323b00, 0xc8216fd737, 0x2, 0xc820f4bd10, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/attach.go:51 +0x6b9
github.com/docker/docker/api/server/router/container.(*containerRouter).postContainersAttach(0xc8207cfb90, 0x7f329a5a9f88, 0xc820f4ba40, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/router/container/container_routes.go:472 +0x6f4
github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.postContainersAttach)-fm(0x7f329a5a9f88, 0xc820f4ba40, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/router/container/container.go:51 +0x74
github.com/docker/docker/api/server/middleware.NewVersionMiddleware.func1.1(0x7f329a5a9f88, 0xc820f4ba40, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:42 +0xa43
github.com/docker/docker/api/server/middleware.NewUserAgentMiddleware.func1.1(0x7f329a5a9f88, 0xc820f4b9b0, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/middleware/user_agent.go:34 +0x6b8
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/server.go:132 +0x138
net/http.HandlerFunc.ServeHTTP(0xc820be3ca0, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/local/go/src/net/http/server.go:1422 +0x3a
github.com/gorilla/mux.(*Router).ServeHTTP(0xc820b305f0, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/src/docker/vendor/src/github.com/gorilla/mux/mux.go:98 +0x29e
github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc820882920, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x72
net/http.serverHandler.ServeHTTP(0xc82031b440, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/local/go/src/net/http/server.go:1862 +0x19e
net/http.(*conn).serve(0xc821732790)
    /usr/local/go/src/net/http/server.go:1361 +0xbee
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:1910 +0x3f6
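
A hypothetical sketch (not daemon code) of why an attach can appear stuck even though the container has exited: the attach-handling goroutine blocks in a channel receive (attach.go:112 above) waiting for the container's stdio streams to be torn down, and if nothing ever closes that channel, the receive never returns and docker attach just sits there.

// Hypothetical illustration of an attach outliving its container; not daemon code.
package main

import (
    "fmt"
    "time"
)

func containerAttach(attachDone <-chan struct{}, giveUp time.Duration) error {
    select {
    case <-attachDone: // normally closed when the container's streams are torn down
        return nil
    case <-time.After(giveUp): // stand-in for "stuck forever"; the real path has no such timeout
        return fmt.Errorf("attach still blocked after %s", giveUp)
    }
}

func main() {
    attachDone := make(chan struct{})
    // Simulate the failure: the container exits, but attachDone is never closed.
    fmt.Println(containerAttach(attachDone, 2*time.Second))
}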

@tonistiigi
Member Author

@rade Why do you think this goroutine matches the container from inspect? In your stack trace I see a bunch of goroutines in the attach stream copy phase but I don't see goroutines waiting for a container lock. So I'm not sure what "getting stuck" means in this case.

@rade

rade commented May 24, 2016

Why do you think this goroutine matches the container from inspect?

It's a bit presumptuous, but it's the only goroutine dealing with an attach command.

So I'm not sure what "getting stuck" means in this case.

I have a docker attach c8 that is just sitting there, even though c8 is dead (according to the inspect).

@rade

rade commented May 24, 2016

I don't see goroutines waiting for a container lock.

I haven't looked into that, but if so, and if "waiting for a container lock" is the defining criterion for this issue, then perhaps what I am seeing is a different issue.

@runcom
Member

runcom commented Jun 7, 2016

I'm also seeing this in 1.10.3.

@GameScripting

Any updates yet?

@bukzor

bukzor commented Jun 20, 2016

This is confirmed fixed in 1.11.2, as far as I'm concerned.

On Mon, Jun 20, 2016, 03:10 Daniel Huhn notifications@github.com wrote:

Any updates yet?



@GameScripting

The changelog for 1.11.2 only has #22918 for deadlocks, not this issue. And it's still open anyway, so it probably is not confirmed, is it?

@GameScripting

GameScripting commented Jun 21, 2016

I updated our prod cluster to 1.11.2. Now our monitoring (Datadog) reports the daemons going down sometimes, but they become responsive again after a minute or two:

[screenshot: Datadog monitoring graph]

However, this now happens on all hosts, even though they all run Ubuntu 14.04.4 LTS (with KVM) (3.13.0-88-generic).

@tiborvass
Contributor

Can anyone reproduce this with 1.12.0-rc2?

@thaJeztah
Member

@GameScripting would you be able to check if 1.12-rc fixes this in your use-case?

@tiborvass tiborvass added priority/P2 Normal priority: default priority applied. and removed priority/P1 Important: P1 issues are a top priority and a must-have for the next release. labels Jul 6, 2016
@PaulFurtado

@thaJeztah @tiborvass We've been hitting this issue at my company as well, and since I've been subscribed to this issue I figured I'd respond to your query about whether anyone is attempting to reproduce it on the new version.

We were hitting some issues getting 1.12-rc2 deployed last week due to other changes in 1.12, but we have worked around those and, as of a few hours ago, we are running 1.12-rc2 on 4 hosts in the QA cluster that experiences these deadlocks. We'll likely expand that to 16 hosts in a few days if 1.12 otherwise proves stable. In this particular cluster we hit the deadlock at least once every two weeks, but sometimes more frequently (some weeks it happens almost daily, likely due to container churn).

If we hit the deadlock on 1.12-rc2, I'll be sure to post it here ASAP for you to confirm, but I wanted to let you know that it may take time.

@thaJeztah
Member

Thanks @PaulFurtado, that's really appreciated. Issues like this can be hard to reproduce.

@tonistiigi
Member Author

This was closed because the original issue is fixed in master.

I can't see a deadlock in the trace from @rade; the issue from @bukzor was fixed by #22732 in v1.11.2.

If you encounter deadlocks or strange unresponsiveness, please send a SIGUSR1 signal to the daemon process and create a new issue with the stack trace that you can find in the logs.

@GameScripting

GameScripting commented Sep 29, 2016

@PaulFurtado We were hit by this issue on one host today. Have you had any issues so far?
