
[1.11.0] Possible deadlock on container object #22124

Closed

tonistiigi opened this issue Apr 18, 2016 · 26 comments · Fixed by #24642
Labels: priority/P2 Normal priority: default priority applied. version/1.11

@tonistiigi
Member

Originally reported by @mblaschke in #13885 (comment)

Creating a different issue because it may be a 1.11 regression.

https://gist.github.com/tonistiigi/9d79de62b2f7919f33a9e987619b9de8 — the goroutine trace seems to show that lots of goroutines are waiting on a container lock. There is no obvious goroutine in that trace that is holding the lock, so possibly we have a code path that returns without releasing it.
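
For context, here is a minimal Go sketch of the suspected failure mode. This is hypothetical illustration code, not the daemon's actual implementation: a method takes the container lock, hits an early-return path, and never unlocks, so every later Lock() on that container blocks forever, which would produce exactly this kind of trace.

// Hypothetical illustration of the suspected bug pattern; not daemon code.
package main

import "sync"

type container struct {
    sync.Mutex
    running bool
}

// brokenStop shows the bug: the early return skips Unlock,
// leaving the container locked forever.
func brokenStop(c *container) {
    c.Lock()
    if !c.running {
        return // BUG: returns while still holding the lock
    }
    c.running = false
    c.Unlock()
}

// safeStop avoids the problem by deferring the unlock right after locking.
func safeStop(c *container) {
    c.Lock()
    defer c.Unlock()
    if !c.running {
        return
    }
    c.running = false
}

func main() {
    c := &container{}
    safeStop(c)
    c.Lock() // would block forever if brokenStop had been used on a stopped container
    c.Unlock()
}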

Original report:

Since we updated to 1.11.0, running rspec docker image tests (~10 parallel containers running these tests on a 4 CPU machine) sometimes freezes and fails with a timeout. Docker freezes completely and doesn't respond (e.g. docker ps). This is happening on a vserver with Debian stretch (btrfs) and with a (Vagrant) Parallels VM running Ubuntu 14.04 (backported kernel 3.19.0-31-generic, ext4).

The filesystem for /var/lib/docker on both servers was cleared (the btrfs was recreated) after the first freeze. The freeze happens randomly when running these tests.

Stack traces from both servers are attached:
docker-log.zip

strace from docker-containerd and docker daemons:

# strace -p 21979 -p 22536
Process 21979 attached
Process 22536 attached
[pid 22536] futex(0x219bd90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21979] futex(0xf9b170, FUTEX_WAIT, 0, NULL

Docker info (Ubuntu 14.04 with backported kernel)

root@DEV-VM:/var/lib/docker# docker info
Containers: 11
 Running: 1
 Paused: 0
 Stopped: 10
Images: 877
Server Version: 1.11.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 400
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.19.0-31-generic
Operating System: Ubuntu 14.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 3.282 GiB
Name: DEV-VM
ID: KCQP:OGCT:3MLX:TAQD:2XG6:HBG2:DPOM:GJXY:NDMK:BXCK:QEIT:D6KM
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Docker version (Ubuntu 14.04 with backported kernel)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64
@tonistiigi tonistiigi added the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Apr 18, 2016
@tonistiigi tonistiigi added this to the 1.11.1 milestone Apr 18, 2016
@rade

rade commented Apr 18, 2016

We are seeing these hangs regularly in our test suite, but only when using Docker 1.11. Curiously we did not see this in the 1.11 release candidates, though perhaps we were just lucky.

@tonistiigi
Member Author

@rade Can you send SIGUSR1 to the daemon and copy the traces from the logs when this happens? Just want to make sure this is the same issue.

@rade

rade commented Apr 18, 2016

I notice that the stack trace referenced above contains

goroutine 759138 [chan receive, 28 minutes]:
github.com/docker/docker/daemon.(*Daemon).ContainerExecStart(0xc820063c80, 0xc822419b11, 0x40, 0x7fb8e45402b8, 0xc8209206e8, 0x7fb8e45402e8, 0xc8212739a0, 0x7fb8e45402e8, 0xc821273980, 0x0, ...)
    /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/exec.go:205 +0x119e

I saw exactly the same thing when I dumped the threads on our stuck Docker.

Our tests invoke docker exec quite a lot.

@tonistiigi
Member Author

@rade This should just mean that there was an active exec process. A goroutine is created for every exec, and it only returns when the process exits.
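
As a rough sketch of what that means in practice (a hypothetical simplification, not the daemon's real code): each exec gets its own goroutine that blocks until the process exits, so a goroutine sitting in "chan receive" at exec.go:205 only tells you that the exec is still running.

// Simplified, hypothetical model of "one goroutine per exec"; not daemon code.
package main

import (
    "fmt"
    "os/exec"
)

func main() {
    done := make(chan error, 1)

    // One goroutine per exec; it only finishes once the process has exited.
    go func() {
        done <- exec.Command("sleep", "1").Run()
    }()

    // The caller sits in "chan receive" here until the exec terminates,
    // which is what the long-lived goroutine in the trace above looks like.
    if err := <-done; err != nil {
        fmt.Println("exec failed:", err)
        return
    }
    fmt.Println("exec finished")
}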

@rade

rade commented Apr 18, 2016

This should just mean that there was an active exec process. A goroutine is created for every exec, and it only returns when the process exits.

I know that. I was simply highlighting a possible correlation, not causation. In our case the exec should have terminated.

@clkao
Contributor

clkao commented May 10, 2016

I've seen this a couple of times with 1.11.1.

@bukzor

bukzor commented May 13, 2016

We're also seeing deadlocks in 1.11.1, but how do I confirm/deny it's this same issue?

@tonistiigi
Member Author

@bukzor Send a SIGUSR1 signal to the daemon process and post the goroutine trace that will show up in the logs.
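
For reference, here is a rough sketch of how such a SIGUSR1 stack dump can be wired up in a Go daemon; the code below is a hypothetical illustration of the mechanism, not Docker's actual handler. After sending the signal to the daemon's PID, the dump shows up in the process logs.

// Hypothetical sketch of a SIGUSR1 goroutine-dump handler in a Go daemon.
package main

import (
    "log"
    "os"
    "os/signal"
    "runtime"
    "syscall"
)

func dumpStacksOnSIGUSR1() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGUSR1)
    go func() {
        for range c {
            buf := make([]byte, 1<<20)
            n := runtime.Stack(buf, true) // all=true captures every goroutine's stack
            log.Printf("=== goroutine dump ===\n%s=== end ===", buf[:n])
        }
    }()
}

func main() {
    dumpStacksOnSIGUSR1()
    // Trigger the dump with: kill -USR1 <pid of this process>
    select {}
}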

@bukzor

bukzor commented May 13, 2016

We have six examples right now, and I've dumped the debug output for each.
The result is huge (70M).

Here's the gist: https://gist.github.com/anonymous/0185edaee33ee8db8df6fca4a07df529

@bukzor

bukzor commented May 13, 2016

Here's the same in a more readable form (I expanded the \t and \n):

https://gist.github.com/394589b78225d9be727a3cd7a9381156

@bukzor

bukzor commented May 13, 2016

@tonistiigi Please take a look and let us know if there's any smoking gun ^

@rade

rade commented May 24, 2016

I've had a docker attach get stuck against 1.11.1, with the container involved supposedly having exited:

# docker inspect -f '{{json .State}}' c8
{"Status":"exited","Running":false,"Paused":false,"Restarting":false,"OOMKilled":false,"Dead":false,"Pid":0,"ExitCode":0,"Error":"","StartedAt":"2016-05-24T14:07:40.844464858Z","FinishedAt":"2016-05-24T14:07:40.86345932Z"}

The stack trace shows the attach-handling goroutine:

goroutine 21626 [chan receive, 10 minutes]:
github.com/docker/docker/daemon.(*Daemon).containerAttach(0xc820323b00, 0xc820ec0700, 0x0, 0x0, 0x7f327fbb65b0, 0xc8201ab680, 0x7f327fbb65b0, 0xc8201ab660, 0xc821810100, 0x2223ad8, ...)
    /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/attach.go:112 +0x543
github.com/docker/docker/daemon.(*Daemon).ContainerAttach(0xc820323b00, 0xc8216fd737, 0x2, 0xc820f4bd10, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/daemon/attach.go:51 +0x6b9
github.com/docker/docker/api/server/router/container.(*containerRouter).postContainersAttach(0xc8207cfb90, 0x7f329a5a9f88, 0xc820f4ba40, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/router/container/container_routes.go:472 +0x6f4
github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.postContainersAttach)-fm(0x7f329a5a9f88, 0xc820f4ba40, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/router/container/container.go:51 +0x74
github.com/docker/docker/api/server/middleware.NewVersionMiddleware.func1.1(0x7f329a5a9f88, 0xc820f4ba40, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:42 +0xa43
github.com/docker/docker/api/server/middleware.NewUserAgentMiddleware.func1.1(0x7f329a5a9f88, 0xc820f4b9b0, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0, 0xc820f4b8f0, 0x0, 0x0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/middleware/user_agent.go:34 +0x6b8
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/server.go:132 +0x138
net/http.HandlerFunc.ServeHTTP(0xc820be3ca0, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/local/go/src/net/http/server.go:1422 +0x3a
github.com/gorilla/mux.(*Router).ServeHTTP(0xc820b305f0, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/src/docker/vendor/src/github.com/gorilla/mux/mux.go:98 +0x29e
github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc820882920, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/src/docker/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x72
net/http.serverHandler.ServeHTTP(0xc82031b440, 0x7f327fbb58d8, 0xc821732840, 0xc820e556c0)
    /usr/local/go/src/net/http/server.go:1862 +0x19e
net/http.(*conn).serve(0xc821732790)
    /usr/local/go/src/net/http/server.go:1361 +0xbee
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:1910 +0x3f6
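
A hypothetical sketch (not daemon code) of why an attach can appear stuck even though the container has exited: the attach-handling goroutine blocks in a channel receive (attach.go:112 above) waiting for the container's stdio streams to be torn down, and if nothing ever closes that channel, the receive never returns and docker attach just sits there.

// Hypothetical illustration of an attach outliving its container; not daemon code.
package main

import (
    "fmt"
    "time"
)

func containerAttach(attachDone <-chan struct{}, giveUp time.Duration) error {
    select {
    case <-attachDone: // normally closed when the container's streams are torn down
        return nil
    case <-time.After(giveUp): // stand-in for "stuck forever"; the real path has no such timeout
        return fmt.Errorf("attach still blocked after %s", giveUp)
    }
}

func main() {
    attachDone := make(chan struct{})
    // Simulate the failure: the container exits, but attachDone is never closed.
    fmt.Println(containerAttach(attachDone, 2*time.Second))
}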

@tonistiigi
Member Author

@rade Why do you think this goroutine matches the container from inspect? In your stack trace I see a bunch of goroutines in the attach stream copy phase but I don't see goroutines waiting for a container lock. So I'm not sure what "getting stuck" means in this case.

@rade

rade commented May 24, 2016

Why do you think this goroutine matches the container from inspect?

It's a bit presumptuous, but it's the only goroutine dealing with an attach command.

So I'm not sure what "getting stuck" means in this case.

I have a docker attach c8 that is just sitting there, even though c8 is dead (according to the inspect).

@rade

rade commented May 24, 2016

I don't see goroutines waiting for a container lock.

I haven't looked into that, but if so, and if "waiting for a container lock" is the defining criterion for this issue, then perhaps what I am seeing is a different issue.

@runcom
Member

runcom commented Jun 7, 2016

I'm also seeing this in 1.10.3.

@GameScripting

Any updates yet?

@bukzor

bukzor commented Jun 20, 2016

This is confirmed fixed in 1.11.2, as far as I'm concerned.

On Mon, Jun 20, 2016, 03:10 Daniel Huhn notifications@github.com wrote:

Any updates yet?



@GameScripting

The changelog for 1.11.2 only has #22918 for deadlocks, not this issue. And it's still open anyway, so it probably is not confirmed, is it?

@GameScripting

GameScripting commented Jun 21, 2016

I updated our prod cluster to 1.11.2. Now our monitoring (Datadog) reports the daemons going down sometimes, but they become responsive again after a minute or two:

[screenshot: Datadog monitoring graph]

However, this now happens on all hosts, even though they all run Ubuntu 14.04.4 LTS (with KVM) (3.13.0-88-generic).

@tiborvass
Contributor

Can anyone reproduce this with 1.12.0-rc2?

@thaJeztah
Member

@GameScripting would you be able to check if 1.12-rc fixes this in your use-case?

@tiborvass tiborvass added priority/P2 Normal priority: default priority applied. and removed priority/P1 Important: P1 issues are a top priority and a must-have for the next release. labels Jul 6, 2016
@PaulFurtado

@thaJeztah @tiborvass We've been hitting this issue at my company as well, and since I've been subscribed to this issue I figured I'd respond to your query about whether anyone is attempting to reproduce it on the new version.

We were hitting some issues getting 1.12-rc2 deployed last week due to other changes in 1.12, but we have worked around those and, as of a few hours ago, we are running 1.12-rc2 on 4 hosts in the QA cluster that experiences these deadlocks. We'll likely expand that to 16 hosts in a few days if 1.12 otherwise proves stable. In this particular cluster we hit the deadlock at least once every two weeks, but sometimes more frequently (some weeks it happens almost daily, likely due to container churn).

If we hit the deadlock on 1.12-rc2, I'll be sure to post it here ASAP for you to confirm, but I wanted to let you know that it may take time.

@thaJeztah
Member

Thanks @PaulFurtado, that's really appreciated. Issues like this can be hard to reproduce.

@tonistiigi
Member Author

This was closed because the original issue is fixed in master.

I can't see a deadlock in the trace from @rade; the issue from @bukzor was fixed by #22732 in v1.11.2.

If you encounter deadlocks or strange unresponsiveness, please send a SIGUSR1 signal to the daemon process and create a new issue with the stack trace that you can find in the logs.

@GameScripting

GameScripting commented Sep 29, 2016

@PaulFurtado We were hit by this issue on one host today. Have you had any issues so far?
