
daemon: time out reader locks for reliability #23175

Closed
wants to merge 2 commits

Conversation

sethp-jive

Hello! We're regularly seeing a critical failure with our docker daemon that causes it to get blocked and halts any forward progress. The underlying problem appears to be the kernel bug most discussed in #5618 (though there are many issues referencing the kernel bug). As mentioned in that other issue, it seems that the only remediation possible at present is a system-wide reboot.

At one point our nodes were seeing only ~10 minutes of uptime before running into the kernel race. We've improved stability by reducing the workload on those boxes via the removal of some Kubernetes resources, but we are able to reproduce the issue at will with a stripped-down Kubernetes Job as follows (with apologies to alpine linux):

apiVersion: batch/v1
kind: Job
metadata:
  generateName: job-stress-
spec:
  parallelism: 40
  template:
    metadata:
      name: stress
    spec:
      containers:
      - name: stress
        image: alpine
        command: ["sh", "-c", "apk add --update curl && curl -fsS --HEAD https://example.com ; exit 2"]
      restartPolicy: Never
      nodeSelector:
        docker-test: "yes"

The gist here is that we're asking Kubernetes to run a lot of containers that move a fairly large number of packets and then shut down very quickly. The exit code + RestartPolicy are intended to drive as many container creations as possible in a short time. After a few minutes of that behavior – with Kubernetes creating new short-lived containers multiple times a second – we start seeing the kernel error message unregister_netdevice: waiting for vethXXXXXX to become free. Usage count = 1 and the system locks up. Specifically, it locks up in the following way:

  1. One of the shutting-down containers, lucky owner of the net device whose refcount the kernel has mis-managed, is attempting to clean itself up:

    goroutine 42243 [syscall, 32 minutes, locked to thread]:
    syscall.Syscall6(0x2c, 0x1c, 0xc8245b6200, 0x20, 0x0, 0xc8245b61d4, 0xc, 0x20, 0xc823ebf4a0, 0x10)
      /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
    syscall.sendto(0x1c, 0xc8245b6200, 0x20, 0x20, 0x0, 0xc8245b61d4, 0xc80000000c, 0x0, 0x0)
      /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1729 +0x8c
    syscall.Sendto(0x1c, 0xc8245b6200, 0x20, 0x20, 0x0, 0x7f348cafa130, 0xc8245b61c8, 0x0, 0x0)
      /usr/local/go/src/syscall/syscall_unix.go:258 +0xaf
    github.com/vishvananda/netlink/nl.(*NetlinkSocket).Send(0xc8245b61c0, 0xc823ebf4a0, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:333 +0xd4
    github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute(0xc823ebf4a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:215 +0x111
    github.com/vishvananda/netlink.LinkDel(0x7f348c2b3150, 0xc822a26e10, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/vishvananda/netlink/link_linux.go:615 +0x16b
    github.com/docker/libnetwork/drivers/bridge.(*driver).DeleteEndpoint(0xc8205c6370, 0xc8201b4c40, 0x40, 0xc824e5a200, 0x40, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:1060 +0x5cf
    github.com/docker/libnetwork.(*endpoint).deleteEndpoint(0xc822541100, 0xc820413e00, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/endpoint.go:760 +0x261
    github.com/docker/libnetwork.(*endpoint).Delete(0xc822541100, 0x7f348c2b3700, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/endpoint.go:735 +0xbcb
    github.com/docker/libnetwork.(*sandbox).delete(0xc8240dfa40, 0xc823e4a300, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:217 +0xd3f
    github.com/docker/libnetwork.(*sandbox).Delete(0xc8240dfa40, 0x0, 0x0)
      /go/src/github.com/docker/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:175 +0x32
    github.com/docker/docker/daemon.(*Daemon).releaseNetwork(0xc82040c300, 0xc820b6d880)
      /go/src/github.com/docker/docker/daemon/container_operations.go:727 +0x4f1
    github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc82040c300, 0xc820b6d880)
      /go/src/github.com/docker/docker/daemon/start.go:163 +0x62
    github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc82040c300, 0xc8249faa40, 0x40, 0xc8243225a0, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
      /go/src/github.com/docker/docker/daemon/monitor.go:39 +0x60a
    github.com/docker/docker/libcontainerd.(*container).handleEvent.func2()
      /go/src/github.com/docker/docker/libcontainerd/container_linux.go:176 +0xa5
    github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc823133c00, 0x0, 0xc8241533e0, 0xc824e11b60)
      /go/src/github.com/docker/docker/libcontainerd/queue_linux.go:26 +0x47
    created by github.com/docker/docker/libcontainerd.(*queue).append
      /go/src/github.com/docker/docker/libcontainerd/queue_linux.go:28 +0x1da
    

    The exciting part of this stack is that, in daemon.StateChanged (4th frame up from the bottom), the docker daemon has locked the container at memory address 0xc820b6d880. Because of the kernel bug, that syscall will never complete, so we'll never unwind that far back up the stack and therefore never release the lock.

  2. Some hapless process (in our case, usually the kubernetes kubelet, but occasionally something else) will make a request to /containers/json and get stuck attempting to acquire that container's lock way down in reducePsContainer:

    goroutine 42256 [semacquire, 30 minutes]:
    sync.runtime_Semacquire(0xc8234a5b24)
      /usr/local/go/src/runtime/sema.go:43 +0x26
    sync.(*Mutex).Lock(0xc8234a5b20)
      /usr/local/go/src/sync/mutex.go:82 +0x1c4
    github.com/docker/docker/daemon.(*Daemon).reducePsContainer(0xc82040c300, 0xc820b6d880, 0xc824e72770, 0xc821d313f8, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/daemon/list.go:124 +0x49
    github.com/docker/docker/daemon.(*Daemon).reduceContainers(0xc82040c300, 0xc824f4b640, 0xc821d313f8, 0x0, 0x0, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/daemon/list.go:107 +0x1a0
    github.com/docker/docker/daemon.(*Daemon).Containers(0xc82040c300, 0xc824f4b640, 0x0, 0x0, 0x0, 0x0, 0x0)
      /go/src/github.com/docker/docker/daemon/list.go:94 +0x6a
    github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersJSON(0xc820baa240, 0x7f348e37dfc0, 0xc82442e2d0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/router/container/container_routes.go:52 +0x3d4
    github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersJSON)-fm(0x7f348e37dfc0, 0xc82442e2d0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/router/container/container.go:31 +0x74
    github.com/docker/docker/api/server/middleware.NewVersionMiddleware.func1.1(0x7f348e37dfc0, 0xc82442e2d0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/middleware/version.go:42 +0xa43
    github.com/docker/docker/api/server/middleware.NewUserAgentMiddleware.func1.1(0x7f348e37dfc0, 0xc82442e2a0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0, 0xc82442e240, 0x0, 0x0)
      /go/src/github.com/docker/docker/api/server/middleware/user_agent.go:34 +0x6b8
    github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /go/src/github.com/docker/docker/api/server/server.go:132 +0x138
    net/http.HandlerFunc.ServeHTTP(0xc8210149c0, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /usr/local/go/src/net/http/server.go:1422 +0x3a
    github.com/gorilla/mux.(*Router).ServeHTTP(0xc820b00000, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /go/src/github.com/docker/docker/vendor/src/github.com/gorilla/mux/mux.go:98 +0x29e
    github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc820886660, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /go/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x72
    net/http.serverHandler.ServeHTTP(0xc820301500, 0x7f348e380188, 0xc821865c30, 0xc8241642a0)
      /usr/local/go/src/net/http/server.go:1862 +0x19e
    net/http.(*conn).serve(0xc82009b3f0)
      /usr/local/go/src/net/http/server.go:1361 +0xbee
    created by net/http.(*Server).Serve
      /usr/local/go/src/net/http/server.go:1910 +0x3f6
    

    We know it's the same lock, here, because reducePsContainer's second argument (the first non-implicit-daemon-pointer argument) refers to just that same container, 0xc820b6d880.

  3. Since that lock will never be released, the /containers/json request never returns and the upstream caller ultimately hangs on the kernel bug. (A toy illustration of this locking pattern follows this list.)
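
To illustrate just the locking pattern at play, here is a toy Go sketch (not docker code; all names are invented for illustration): one goroutine takes a container's lock and then blocks in an operation that never returns, and every other goroutine that wants the same lock waits behind it forever.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var containerMu sync.Mutex

    // The "cleanup" path: take the container lock, then block forever,
    // standing in for the sendto() syscall that never completes.
    go func() {
        containerMu.Lock()
        select {} // never returns, never unlocks
    }()

    time.Sleep(100 * time.Millisecond) // let the cleanup goroutine win the lock

    // The "docker ps" path: it blocks on the same lock indefinitely, which
    // is why /containers/json never answers.
    acquired := make(chan struct{})
    go func() {
        containerMu.Lock()
        close(acquired)
    }()

    select {
    case <-acquired:
        fmt.Println("acquired the container lock (won't happen)")
    case <-time.After(2 * time.Second):
        fmt.Println("still waiting on the container lock, just like the kubelet")
    }
}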

At this point, we move somewhat into the realm of desperation – a single container that's hung is able to hang the entire container-status-listing mechanism, which appears to be the only way to get a complete list of the node's container status (and is thus heavily depended-on, e.g. by the kubelet). We would therefore find it desirable to isolate that failure to the single wedged container and not block all callers, for which we've considered three approaches:

  1. (represented in this PR) The most obvious workaround is to time out on the lock in reducePsContainer (and the equivalent lock in inspect) and report some flavor of "container status unknown" for the hung container. This approach gives us the isolation we're looking for, but 1) sync.Mutex does not and will not have a tryLock equivalent (non-blocking lock acquisition), and the best I've found from outside the standard library is a process that leaks resources on failed acquisitions; and 2) there is no generally appropriate timeout (too short and we'll spuriously fail under load, too long and the daemon will become effectively hung).
  2. Move the network cleanup code outside the container lock's critical section. This offers a piecemeal approach to the problem (though it is by no means exclusive with the other two): as specific operations are discovered to be problematic (like cleaning up the network), they get refactored so that the goproc performing them no longer holds the container's lock. I've made a very rough start on this work over here, but am nowhere near competent enough in the docker codebase to see it through: sethp-jive:bugfix/shorten-critical-section. For what it's worth, I'm able to confirm that it works as expected in this narrow case – docker ps remains as snappy as ever. If you'd prefer this approach, I'm happy to abandon this PR, clean that commit up a bit, and resubmit it.
  3. Re-architect the daemon's status mechanism so that hung writers do not prevent status reads. The golang maintainers' rationale for not including lock timeouts seems sensible: there's usually a better concurrency tool in Go that precludes the need for a racy timeout. In this case, it would seem better to read a slightly-stale state (e.g. one updated by sending state deltas over a channel rather than taking out a lock) than to block forever waiting for a state updater that will never return. This option being the most nebulous of the three, I've not even tried to put together a POC (though a rough sketch of the idea follows this list), but I wanted to include it for completeness.
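
To make option 3 slightly more concrete, here is a minimal sketch of the "read slightly-stale state" idea (purely illustrative: Snapshot and statusPublisher are invented names, not a proposal for the daemon's actual types). A single writer publishes immutable status snapshots, and readers load the last published snapshot without ever touching the container lock, so a wedged writer can at worst make the data stale, never unavailable.

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// Snapshot is an immutable view of a container's status. Readers only ever
// see a complete, already-published snapshot.
type Snapshot struct {
    ID    string
    State string
    At    time.Time
}

// statusPublisher is owned by a single writer goroutine; readers call Load.
type statusPublisher struct {
    current atomic.Value // always holds a Snapshot
}

// Publish replaces the visible snapshot. Only the owning writer calls this,
// so no mutex is needed; atomic.Value gives readers a consistent view.
func (p *statusPublisher) Publish(s Snapshot) {
    p.current.Store(s)
}

// Load returns the last published snapshot. It never blocks, even if the
// writer is wedged inside a syscall that will never return.
func (p *statusPublisher) Load() (Snapshot, bool) {
    v := p.current.Load()
    if v == nil {
        return Snapshot{}, false
    }
    return v.(Snapshot), true
}

func main() {
    var p statusPublisher
    p.Publish(Snapshot{ID: "abc123", State: "removing", At: time.Now()})

    // A "docker ps"-style reader gets the last known (possibly stale) state
    // instead of waiting on a lock held by a hung cleanup goroutine.
    if s, ok := p.Load(); ok {
        fmt.Printf("%s: %s (as of %s)\n", s.ID, s.State, s.At.Format(time.RFC3339))
    }
}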

So, in summary: The docker daemon's status reporting mechanism completely hangs when a single container gets in a bad state. We don't understand the bad state well enough to resolve it, but hopefully we can reduce the failure domain from "the entire docker daemon" to "the single container".

- What I did

Changed the lock acquisition for docker ps and docker inspect to time out, dropping the container from the list or returning an error message respectively, so that the failure mode induced by the kernel bug is limited to the single affected container rather than the entire daemon.

- How I did it

By leaking goprocs (& a channel), unfortunately. Back-of-the-envelope math suggests we're looking at weeks of uptime after hitting the bug before the daemon OOMs, which is much better than the minutes we're seeing now. A less general point solution that also resolves the specific issue is here: sethp-jive:bugfix/shorten-critical-section. That approach has the advantage of not leaking an unbounded number of resources.
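
For the curious, here is a minimal sketch of the general "leaking goprocs & a channel" shape (tryLockTimeout and inspectContainer are illustrative names, not the exact code in this diff): the caller owns the lock, and must Unlock it, if and only if the helper returns true; on timeout the helper goroutine and its channel are leaked, and if the blocked Lock call ever does succeed the helper releases the lock again rather than holding it forever.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// tryLockTimeout reports whether the caller now holds mu. On timeout it
// returns false and leaves the helper goroutine (and channel) behind; if the
// blocked Lock ever does succeed, the helper releases the lock again so it
// isn't held by a goroutine nobody is waiting on.
func tryLockTimeout(mu *sync.Mutex, d time.Duration) bool {
    acquired := make(chan struct{})
    abandoned := make(chan struct{})
    go func() {
        mu.Lock() // may block forever if the current holder never unlocks
        select {
        case acquired <- struct{}{}:
            // The caller is still waiting: hand ownership of the lock over.
        case <-abandoned:
            // The caller gave up: release the lock ourselves.
            mu.Unlock()
        }
    }()
    select {
    case <-acquired:
        return true
    case <-time.After(d):
        close(abandoned)
        return false
    }
}

type container struct {
    mu    sync.Mutex
    ID    string
    State string
}

var errLockTimeout = errors.New("unable to lock container for inspect")

// inspectContainer shows the call-site contract: Unlock is deferred if and
// only if tryLockTimeout returned true; otherwise we fail fast with an error
// instead of hanging the whole API request behind a wedged container.
func inspectContainer(c *container) (string, error) {
    if !tryLockTimeout(&c.mu, 500*time.Millisecond) {
        return "", errLockTimeout
    }
    defer c.mu.Unlock()
    return fmt.Sprintf("%s: %s", c.ID, c.State), nil
}

func main() {
    c := &container{ID: "abc123", State: "running"}
    out, err := inspectContainer(c)
    fmt.Println(out, err)
}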

- How to verify it

  1. Turn up a kubernetes cluster (sorry!); GKE should work fine as long as there's a way to get at the nodes' kernel message buffers to look for the unregister_netdevice message
  2. Create job-stress.yml with the contents above, starting with apiVersion: batch/v1
  3. Pick a lucky node, e.g. from the output of kubectl get nodes, and run kubectl label nodes NODE_NAME docker-test=yes
    1. Or, alternatively, create a single-node cluster and remove the nodeSelector section from job-stress.yml
  4. kubectl create -f job-stress.yml
  5. Watch the kernel buffer (e.g. dmesg -w) for the line unregister_netdevice: waiting for vethXXXXXX to become free. Usage count = 1
  6. Once that appears, run docker ps and ensure that it returns.
  7. Running docker inspect on each container (e.g. by grabbing the ids from /var/lib/docker/containers/*, since docker ps won't report on the broken one) should yield one failure with the message: Error response from daemon: Unable to lock container for inspect.

- Description for the changelog

Reduces kernel-induced hang signaled by unregister_netdevice to the single stuck container rather than the entire daemon.

- A picture of a cute animal (not mandatory but encouraged)


Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>

To mitigate an issue where the entire daemon becomes hung when a lock is held
for an operation that cannot complete, we time out the lock and report
failure instead of blocking readers forever.

This limits the failed operation to a single container rather than the
entire daemon.

Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>
@thaJeztah
Member

/cc @cpuguy83 @mrjana

@cpuguy83
Member

cpuguy83 commented Jun 1, 2016

This doesn't really work. It's not only leaking a goroutine, but also leaving something stuck trying to acquire a lock that nothing is ever going to unlock. You would have to run a 2nd leaked goroutine that waits for the lock to actually be acquired and then immediately unlocks it.

NOTLGTM sorry :(

@sethp-jive
Author

Thanks for the quick reply! The contract of tryLock here is intended to be similar to other languages like Swift and Java, where the boolean return value indicates whether the caller holds responsibility for the lock or not. In other words, the caller must call Unlock if and only if tryLock returns true – at both call-sites, the Unlock is immediately deferred if tryLock returns in the affirmative. When it does not, it does indeed fire a second goproc whose job it is to wait (possibly forever) for the lock attempt to succeed so it may Unlock.

I don't disagree it's a bit hard to use correctly, which is the reason the golang authors seem disinclined to include it in the standard library, but in this case it does work to reduce our failure domain from "the entire daemon" to "a single container's status."

@sethp-jive
Author

Also, just for clarity – I don't think this is anywhere near an ideal solution. The cleanest solution we could come up with was option 3, where it's not necessary to take out a lock to report on a container's status and other concurrency mechanisms (channels, single-writer-ownership, etc.) take care of the concurrency properties of readers and writers mixing together. This PR is merely the solution I've been able to implement so we can keep using docker.

@icecrime
Contributor

icecrime commented Jun 1, 2016

You have a small formatting issue here:

12:16:08 ---> Making bundle: validate-gofmt (in bundles/1.12.0-dev/validate-gofmt)
12:16:08 These files are not properly gofmt'd:
12:16:08  - daemon/list.go

@icecrime added the status/failing-ci label (indicates that the PR in its current state fails the test suite) on Jun 1, 2016
Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>
@LK4D4
Contributor

LK4D4 commented Jun 3, 2016

btw it's possible to check mutex status without Lock:
https://github.com/LK4D4/trylock/blob/master/trylock.go#L35
But I still think it should be fixed like in #23178
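
For readers following along, the trick in that file is roughly the following (a sketch of the idea, not the library's exact code): a single compare-and-swap on the mutex's internal state word, which relies on unexported sync.Mutex internals rather than any documented API. (Much later Go releases added an official sync.Mutex.TryLock, but nothing like it existed in the standard library at the time of this discussion.)

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "unsafe"
)

const mutexLocked = 1 // mirrors the locked bit of sync.Mutex's state word

// tryLock attempts a single compare-and-swap on the mutex's internal state,
// acquiring it only if it is currently unlocked. It never blocks, but it
// depends on the (unexported, unguaranteed) layout of sync.Mutex.
func tryLock(m *sync.Mutex) bool {
    return atomic.CompareAndSwapInt32((*int32)(unsafe.Pointer(m)), 0, mutexLocked)
}

func main() {
    var mu sync.Mutex
    fmt.Println(tryLock(&mu)) // true: we now hold the lock
    fmt.Println(tryLock(&mu)) // false: already held, and we did not block
    mu.Unlock()
}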

@sethp-jive
Author

Ah, so that's how to get at the underlying bits. Nice!

I'm not convinced this change is exclusive with #23178 or #23240 (both of which look like they'd help this specific issue), but I'm happy to close this PR in favor of one of those approaches.

@sethp-jive closed this Jun 3, 2016
@manoj0077

manoj0077 commented Jun 30, 2016

Hi @sethp-jive, with 1.12 the docker daemon can restart without affecting running containers. Would that give a correct alternative, or do you see any other issues?

@sethp-jive
Author

@manoj0077 I haven't tried restarting the daemon on 1.12 to see whether it's a valid remediation – it depends on whether the new daemon process tries again to clean up the stuck container (thus calling releaseNetwork while holding the lock as in the backtrace above) or whether it just leaks it.

@manoj0077

manoj0077 commented Jul 1, 2016

@sethp-jive Did you get a chance to try with 1.12? We are not hitting the hang in 1.12, though we're not sure what got improved.
@cpuguy83 any comments here ?

@andyxning

We're encountering a docker ps hang in Docker 1.12.0. :(

@cpuguy83
Member

@andyxning ps hang can be caused by many things, a few of which are regressions in 1.12.0 that are fixed in 1.12.x patch releases.
Please make sure to run 1.12.3 which is the current latest patch release.
