
daemon: time out reader locks for reliability #23175

Closed
wants to merge 2 commits

Conversation

sethp-jive

Hello! We're regularly seeing a critical failure with our docker daemon that causes it to get blocked and halts any forward progress. The underlying problem appears to be the kernel bug most discussed in #5618 (though there are many issues referencing the kernel bug). As mentioned in that other issue, it seems that the only remediation possible at present is a system-wide reboot.

At one point our nodes were seeing only ~10 minutes of uptime before running into the kernel race. We've improved stability by reducing the workload on those boxes via the removal of some Kubernetes resources, but we are able to reproduce the issue at will with a stripped-down Kubernetes Job as follows (with apologies to alpine linux):

apiVersion: batch/v1
kind: Job
metadata:
  generateName: job-stress-
spec:
  parallelism: 40
  template:
    metadata:
      name: stress
    spec:
      containers:
      - name: stress
        image: alpine
        command: ["sh", "-c", "apk add --update curl && curl -fsS --HEAD https://example.com ; exit 2"]
      restartPolicy: Never
      nodeSelector:
        docker-test: "yes"

The gist here is that we're asking Kubernetes to run a lot of containers that move a fairly large number of packets and then shut down very quickly. The exit code + RestartPolicy are intended to drive as many container creations as possible in a short time. After a few minutes of that behavior – with Kubernetes creating new short-lived containers multiple times a second – we start seeing the kernel error message unregister_netdevice: waiting for vethXXXXXX to become free. Usage count = 1 and the system locks up. Specifically, it locks up in the following way:

  1. One of the shutting-down containers, lucky owner of the net device whose refcount the kernel has mis-managed, is attempting to clean itself up:

    goroutine 42243 [syscall, 32 minutes, locked to thread]:
    syscall.Syscall6(0x2c, 0x1c, 0xc8245b6200, 0x20, 0x0, 0xc8245b61d4, 0xc, 0x20, 0xc823ebf4a0, 0x10)
      /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
    syscall.sendto(0x1c, 0xc8245b6200, 0x20, 0x20, 0x0, 0xc8245b61d4, 0xc80000000c, 0x0, 0x0)
      /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1729 +0x8c
    syscall.Sendto(0x1c, 0xc8245b6200, 0x20, 0x20, 0x0, 0x7f348cafa130, 0xc8245b61c8, 0x0, 0x0)
      /usr/local/go/src/syscall/syscall_unix.go:258 +0xaf
    github.com/vishvananda/netlink/nl.(*NetlinkSocket).Send(0xc8245b61c0, 0xc823ebf4a0, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:333 +0xd4
    github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute(0xc823ebf4a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:215 +0x111
    github.com/vishvananda/netlink.LinkDel(0x7f348c2b3150, 0xc822a26e10, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/vishvananda/netlink/link_linux.go:615 +0x16b
    github.com/docker/libnetwork/drivers/bridge.(*driver).DeleteEndpoint(0xc8205c6370, 0xc8201b4c40, 0x40, 0xc824e5a200, 0x40, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:1060 +0x5cf
    github.com/docker/libnetwork.(*endpoint).deleteEndpoint(0xc822541100, 0xc820413e00, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/endpoint.go:760 +0x261
    github.com/docker/libnetwork.(*endpoint).Delete(0xc822541100, 0x7f348c2b3700, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/endpoint.go:735 +0xbcb
    github.com/docker/libnetwork.(*sandbox).delete(0xc8240dfa40, 0xc823e4a300, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:217 +0xd3f
    github.com/docker/libnetwork.(*sandbox).Delete(0xc8240dfa40, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:175 +0x32
    github.com/docker/docker/daemon.(*Daemon).releaseNetwork(0xc82040c300, 0xc820b6d880)
      /go/src/github.com/docker/docker/daemon/container_operations.go:727 +0x4f1
    github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc82040c300, 0xc820b6d880)
      /go/src/github.com/docker/docker/daemon/start.go:163 +0x62
    github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc82040c300, 0xc8249faa40, 0x40, 0xc8243225a0, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
      /go/src/github.com/docker/docker/daemon/monitor.go:39 +0x60a
    github.com/docker/docker/libcontainerd.(*container).handleEvent.func2()
      /go/src/github.com/docker/docker/libcontainerd/container_linux.go:176 +0xa5
    github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc823133c00, 0x0, 0xc8241533e0, 0xc824e11b60)
      /go/src/github.com/docker/docker/libcontainerd/queue_linux.go:26 +0x47
    created by github.com/docker/docker/libcontainerd.(*queue).append
      /go/src/github.com/docker/docker/libcontainerd/queue_linux.go:28 +0x1da
    

    The exciting part of this stack is that, in daemon.StateChanged (4th frame up from the bottom), the docker daemon has locked the container at memory address 0xc820b6d880. Because of the kernel bug, that syscall will never complete, so we'll never unwind that far back up the stack and therefore never release the lock.

  2. Some hapless process (in our case, usually the kubernetes kubelet, but occasionally something else) will make a request to /containers/json and get stuck attempting to acquire that container's lock way down in reducePsContainer:

    goroutine 42256 [semacquire, 30 minutes]:
    sync.runtime_Semacquire(0xc8234a5b24)
      /usr/local/go/src/runtime/sema.go:43 +0x26
    sync.(*Mutex).Lock(0xc8234a5b20)
      /usr/local/go/src/sync/mutex.go:82 +0x1c4
    github.com/docker/docker/daemon.(*Daemon).reducePsContainer(0xc82040c300, 0xc820b6d880, 0xc824e72770, 0xc821d313f8, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/daemon/list.go:124 +0x49
    github.com/docker/docker/daemon.(*Daemon).reduceContainers(0xc82040c300, 0xc824f4b640, 0xc821d313f8, 0x0, 0x0, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/daemon/list.go:107 +0x1a0
    github.com/docker/docker/daemon.(*Daemon).Containers(0xc82040c300, 0xc824f4b640, 0x0, 0x0, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/daemon/list.go:94 +0x6a
    github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersJSON(0xc820baa240, 0x7f348e37dfc0, 0xc82442e2d0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/router/container/container_routes.go:52 +0x3d4
    github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersJSON)-fm(0x7f348e37dfc0, 0xc82442e2d0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/router/container/container.go:31 +0x74
    github.com/docker/docker/api/server/middleware.NewVersionMiddleware.func1.1(0x7f348e37dfc0, 0xc82442e2d0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/middleware/version.go:42 +0xa43
    github.com/docker/docker/api/server/middleware.NewUserAgentMiddleware.func1.1(0x7f348e37dfc0, 0xc82442e2a0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/middleware/user_agent.go:34 +0x6b8
    github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /go/src/github.com/docker/docker/api/server/server.go:132 +0x138
    net/http.HandlerFunc.ServeHTTP(0xc8210149c0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /usr/local/go/src/net/http/server.go:1422 +0x3a
    github.com/gorilla/mux.(*Router).ServeHTTP(0xc820b00000, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /go/src/github.com/docker/docker/vendor/src/github.com/gorilla/mux/mux.go:98 +0x29e
    github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc820886660, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /go/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x72
    net/http.serverHandler.ServeHTTP(0xc820301500, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /usr/local/go/src/net/http/server.go:1862 +0x19e
    net/http.(*conn).serve(0xc82009b3f0)
      /usr/local/go/src/net/http/server.go:1361 +0xbee
    created by net/http.(*Server).Serve
      /usr/local/go/src/net/http/server.go:1910 +0x3f6
    

    We know it's the same lock, here, because reducePsContainer's second argument (the first non-implicit-daemon-pointer argument) refers to just that same container, 0xc820b6d880.

  3. Since that lock will never be released, the /containers/json request never returns and the upstream caller ultimately hangs on the kernel bug. (A toy illustration of this locking pattern follows this list.)
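
To illustrate just the locking pattern at play, here is a toy Go sketch (not docker code; all names are invented for illustration): one goroutine takes a container's lock and then blocks in an operation that never returns, and every other goroutine that wants the same lock waits behind it forever.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var containerMu sync.Mutex

    // The "cleanup" path: take the container lock, then block forever,
    // standing in for the sendto() syscall that never completes.
    go func() {
        containerMu.Lock()
        select {} // never returns, never unlocks
    }()

    time.Sleep(100 * time.Millisecond) // let the cleanup goroutine win the lock

    // The "docker ps" path: it blocks on the same lock indefinitely, which
    // is why /containers/json never answers.
    acquired := make(chan struct{})
    go func() {
        containerMu.Lock()
        close(acquired)
    }()

    select {
    case <-acquired:
        fmt.Println("acquired the container lock (won't happen)")
    case <-time.After(2 * time.Second):
        fmt.Println("still waiting on the container lock, just like the kubelet")
    }
}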

At this point, we move somewhat into the realm of desperation – a single container that's hung is able to hang the entire container-status-listing mechanism, which appears to be the only way to get a complete list of the node's container status (and is thus heavily depended-on, e.g. by the kubelet). We would therefore find it desirable to isolate that failure to the single wedged container and not block all callers, for which we've considered three approaches:

  1. (represented in this PR) The most obvious workaround is to time out on the lock in reducePsContainer (and the equivalent lock in inspect) and report some flavor of "container status unknown" for the hung container. This approach gives us the isolation we're looking for, but 1) sync.Mutex does not and will not have a tryLock equivalent (non-blocking lock acquisition), and the best I've found from outside the standard library is a process that leaks resources on failed acquisitions; and 2) there is no generally appropriate timeout (too short and we'll spuriously fail under load, too long and the daemon will become effectively hung).
  2. Move the network cleanup code outside the container lock's critical section. This offers a piecemeal approach to the problem (though it is by no means exclusive with the other two): as specific operations are discovered to be problematic (like cleaning up the network), they get refactored so that the goproc performing them no longer holds the container's lock. I've made a very rough start on this work over here, but am nowhere near competent enough in the docker codebase to see it through: sethp-jive:bugfix/shorten-critical-section. For what it's worth, I'm able to confirm that it works as expected in this narrow case – docker ps remains as snappy as ever. If you'd prefer this approach, I'm happy to abandon this PR, clean that commit up a bit, and resubmit it.
  3. Re-architect the daemon's status mechanism so that hung writers do not prevent status reads. The golang maintainers' rationale for not including lock timeouts seems sensible: there's usually a better concurrency tool in Go that precludes the need for a racy timeout. In this case, it would seem better to read a slightly-stale state (e.g. one updated by sending state deltas over a channel rather than taking out a lock) than to block forever waiting for a state updater that will never return. This option being the most nebulous of the three, I've not even tried to put together a POC (though a rough sketch of the idea follows this list), but I wanted to include it for completeness.
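
To make option 3 slightly more concrete, here is a minimal sketch of the "read slightly-stale state" idea (purely illustrative: Snapshot and statusPublisher are invented names, not a proposal for the daemon's actual types). A single writer publishes immutable status snapshots, and readers load the last published snapshot without ever touching the container lock, so a wedged writer can at worst make the data stale, never unavailable.

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// Snapshot is an immutable view of a container's status. Readers only ever
// see a complete, already-published snapshot.
type Snapshot struct {
    ID    string
    State string
    At    time.Time
}

// statusPublisher is owned by a single writer goroutine; readers call Load.
type statusPublisher struct {
    current atomic.Value // always holds a Snapshot
}

// Publish replaces the visible snapshot. Only the owning writer calls this,
// so no mutex is needed; atomic.Value gives readers a consistent view.
func (p *statusPublisher) Publish(s Snapshot) {
    p.current.Store(s)
}

// Load returns the last published snapshot. It never blocks, even if the
// writer is wedged inside a syscall that will never return.
func (p *statusPublisher) Load() (Snapshot, bool) {
    v := p.current.Load()
    if v == nil {
        return Snapshot{}, false
    }
    return v.(Snapshot), true
}

func main() {
    var p statusPublisher
    p.Publish(Snapshot{ID: "abc123", State: "removing", At: time.Now()})

    // A "docker ps"-style reader gets the last known (possibly stale) state
    // instead of waiting on a lock held by a hung cleanup goroutine.
    if s, ok := p.Load(); ok {
        fmt.Printf("%s: %s (as of %s)\n", s.ID, s.State, s.At.Format(time.RFC3339))
    }
}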

So, in summary: The docker daemon's status reporting mechanism completely hangs when a single container gets in a bad state. We don't understand the bad state well enough to resolve it, but hopefully we can reduce the failure domain from "the entire docker daemon" to "the single container".

- What I did

Changed the lock acquisition for docker ps and docker inspect to time out, dropping the container from the list or returning an error message respectively, so that the failure mode induced by the kernel bug is limited to the single affected container rather than the entire daemon.

- How I did it

By leaking goprocs (& a channel), unfortunately. Back-of-the-envelope math suggests we're looking at weeks of uptime after hitting the bug before the daemon OOMs, which is much better than the minutes we're seeing now. A less general point solution that also resolves the specific issue is here: sethp-jive:bugfix/shorten-critical-section. That approach has the advantage of not leaking an unbounded number of resources.
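
For the curious, here is a minimal sketch of the general "leaking goprocs & a channel" shape (tryLockTimeout and inspectContainer are illustrative names, not the exact code in this diff): the caller owns the lock, and must Unlock it, if and only if the helper returns true; on timeout the helper goroutine and its channel are leaked, and if the blocked Lock call ever does succeed the helper releases the lock again rather than holding it forever.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// tryLockTimeout reports whether the caller now holds mu. On timeout it
// returns false and leaves the helper goroutine (and channel) behind; if the
// blocked Lock ever does succeed, the helper releases the lock again so it
// isn't held by a goroutine nobody is waiting on.
func tryLockTimeout(mu *sync.Mutex, d time.Duration) bool {
    acquired := make(chan struct{})
    abandoned := make(chan struct{})
    go func() {
        mu.Lock() // may block forever if the current holder never unlocks
        select {
        case acquired <- struct{}{}:
            // The caller is still waiting: hand ownership of the lock over.
        case <-abandoned:
            // The caller gave up: release the lock ourselves.
            mu.Unlock()
        }
    }()
    select {
    case <-acquired:
        return true
    case <-time.After(d):
        close(abandoned)
        return false
    }
}

type container struct {
    mu    sync.Mutex
    ID    string
    State string
}

var errLockTimeout = errors.New("unable to lock container for inspect")

// inspectContainer shows the call-site contract: Unlock is deferred if and
// only if tryLockTimeout returned true; otherwise we fail fast with an error
// instead of hanging the whole API request behind a wedged container.
func inspectContainer(c *container) (string, error) {
    if !tryLockTimeout(&c.mu, 500*time.Millisecond) {
        return "", errLockTimeout
    }
    defer c.mu.Unlock()
    return fmt.Sprintf("%s: %s", c.ID, c.State), nil
}

func main() {
    c := &container{ID: "abc123", State: "running"}
    out, err := inspectContainer(c)
    fmt.Println(out, err)
}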

- How to verify it

  1. Turn up a kubernetes cluster (sorry!); GKE should work fine as long as there's a way to get at the nodes' kernel message buffers to look for the unregister_netdevice message
  2. Create job-stress.yml with the contents above, starting with apiVersion: batch/v1
  3. Pick a lucky node, e.g. from the output of kubectl get nodes, and run kubectl label nodes NODE_NAME docker-test=yes
    1. Or, alternatively, create a single-node cluster and remove the nodeSelector section from job-stress.yml
  4. kubectl create -f job-stress.yml
  5. Watch the kernel buffer (e.g. dmesg -w) for the line unregister_netdevice: waiting for vethXXXXXX to become free. Usage count = 1
  6. Once that appears, run docker ps and ensure that it returns.
  7. Running docker inspect on each container (e.g. by grabbing the ids from /var/lib/docker/containers/*, since docker ps won't report on the broken one) should yield one failure with the message: Error response from daemon: Unable to lock container for inspect.

- Description for the changelog

Reduces kernel-induced hang signaled by unregister_netdevice to the single stuck container rather than the entire daemon.

- A picture of a cute animal (not mandatory but encouraged)


Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>

To mitigate an issue where the entire daemon becomes hung when a lock is held
for an operation that cannot complete, we time out the lock and report
failure instead of blocking readers forever.

This limits the failed operation to a single container rather than the
entire daemon.

Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>
@thaJeztah
Member

/cc @cpuguy83 @mrjana

@cpuguy83
Member

cpuguy83 commented Jun 1, 2016

This doesn't really work. It's not only leaking a goroutine, but also leaving something stuck trying to acquire a lock that nothing is ever going to unlock. You would have to run a 2nd leaked goroutine that waits for the lock to actually be acquired and then immediately unlocks it.

NOTLGTM sorry :(

@sethp-jive
Author

Thanks for the quick reply! The contract of tryLock here is intended to be similar to other languages like Swift and Java, where the boolean return value indicates whether the caller holds responsibility for the lock or not. In other words, the caller must call Unlock if and only if tryLock returns true – at both call-sites, the Unlock is immediately deferred if tryLock returns in the affirmative. When it does not, it does indeed fire a second goproc whose job it is to wait (possibly forever) for the lock attempt to succeed so it may Unlock.

I don't disagree it's a bit hard to use correctly, which is the reason the golang authors seem disinclined to include it in the standard library, but in this case it does work to reduce our failure domain from "the entire daemon" to "a single container's status."

@sethp-jive
Author

Also, just for clarity – I don't think this is anywhere near an ideal solution. The cleanest solution we could come up with was option 3, where it's not necessary to take out a lock to report on a container's status and other concurrency mechanisms (channels, single-writer-ownership, etc.) take care of the concurrency properties of readers and writers mixing together. This PR is merely the solution I've been able to implement so we can keep using docker.

@icecrime
Contributor

icecrime commented Jun 1, 2016

You have a small formatting issue here:

12:16:08 ---> Making bundle: validate-gofmt (in bundles/1.12.0-dev/validate-gofmt)
12:16:08 These files are not properly gofmt'd:
12:16:08  - daemon/list.go

@icecrime added the status/failing-ci label (indicates that the PR in its current state fails the test suite) on Jun 1, 2016
Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>
@LK4D4
Contributor

LK4D4 commented Jun 3, 2016

btw it's possible to check mutex status without Lock:
https://github.com/LK4D4/trylock/blob/master/trylock.go#L35
But I still think it should be fixed like in #23178
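
For readers following along, the trick in that file is roughly the following (a sketch of the idea, not the library's exact code): a single compare-and-swap on the mutex's internal state word, which relies on unexported sync.Mutex internals rather than any documented API. (Much later Go releases added an official sync.Mutex.TryLock, but nothing like it existed in the standard library at the time of this discussion.)

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "unsafe"
)

const mutexLocked = 1 // mirrors the locked bit of sync.Mutex's state word

// tryLock attempts a single compare-and-swap on the mutex's internal state,
// acquiring it only if it is currently unlocked. It never blocks, but it
// depends on the (unexported, unguaranteed) layout of sync.Mutex.
func tryLock(m *sync.Mutex) bool {
    return atomic.CompareAndSwapInt32((*int32)(unsafe.Pointer(m)), 0, mutexLocked)
}

func main() {
    var mu sync.Mutex
    fmt.Println(tryLock(&mu)) // true: we now hold the lock
    fmt.Println(tryLock(&mu)) // false: already held, and we did not block
    mu.Unlock()
}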

@sethp-jive
Author

Ah, so that's how to get at the underlying bits. Nice!

I'm not convinced this change is exclusive with #23178 or #23240 (both of which look like they'd help this specific issue), but I'm happy to close this PR in favor of one of those approaches.

@sethp-jive closed this Jun 3, 2016
@manoj0077

manoj0077 commented Jun 30, 2016

Hi @sethp-jive, with 1.12 the docker daemon can restart without affecting running containers. Would that give a correct alternative, or do you see any other issues?

@sethp-jive
Author

@manoj0077 I haven't tried restarting the daemon on 1.12 to see whether it's a valid remediation – it depends on whether the new daemon process tries again to clean up the stuck container (thus calling releaseNetwork while holding the lock as in the backtrace above) or whether it just leaks it.

@manoj0077

manoj0077 commented Jul 1, 2016

@sethp-jive Did you get a chance to try with 1.12? We are not hitting the hang in 1.12, though we're not sure what got improved.
@cpuguy83 any comments here ?

@andyxning

We're encountering a docker ps hang in Docker 1.12.0. :(

@cpuguy83
Member

@andyxning ps hang can be caused by many things, a few of which are regressions in 1.12.0 that are fixed in 1.12.x patch releases.
Please make sure to run 1.12.3 which is the current latest patch release.
