daemon: time out reader locks for reliability #23175
Conversation
To mitigate an issue where the entire daemon becomes hung when a lock is held for an operation that cannot complete, we time out the lock and report failure instead of blocking readers forever. This limits the failed operation to a single container rather than the entire daemon.

Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>
This doesn't really work. It's not only leaking a goroutine, but also something trying to acquire a lock. NOTLGTM sorry :(
Thanks for the quick reply! I don't disagree it's a bit hard to use correctly, which is the reason the golang authors seem disinclined to include it in the standard library, but in this case it does work to reduce our failure domain from "the entire daemon" to "a single container's status."
Also, just for clarity – I don't think this is anywhere near an ideal solution. The cleanest solution we could come up with was option 3, where it's not necessary to take out a lock to report on a container's status and other concurrency mechanisms (channels, single-writer-ownership, etc.) take care of the concurrency properties of readers and writers mixing together. This PR is merely the solution I've been able to implement so we can keep using docker.
You have a small formatting issue here:

> Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>
btw it's possible to check mutex status without Lock:
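For example, something along these lines (a sketch that peeks at `sync.Mutex`'s unexported state word via `unsafe`; this relies on an internal runtime layout, not a stable API, and the answer can be stale as soon as it's read):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// mutexLocked reports whether m is currently held by peeking at the
// mutex's internal state word (the first field of sync.Mutex). The low
// bit is the "locked" flag in the current runtime implementation.
func mutexLocked(m *sync.Mutex) bool {
	state := (*int32)(unsafe.Pointer(m))
	return atomic.LoadInt32(state)&1 != 0
}

func main() {
	var m sync.Mutex
	fmt.Println(mutexLocked(&m)) // false
	m.Lock()
	fmt.Println(mutexLocked(&m)) // true
}
```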
Hi @sethp-jive, so with 1.12 the docker daemon can restart without affecting running containers. Would that give a workable alternative? Or do you see any other issues?
@manoj0077 I haven't tried restarting the daemon on 1.12 to see whether it's a valid remediation – it depends on whether the new daemon process tries again to clean up the stuck container (thus calling releaseNetwork while holding the lock as in the backtrace above) or whether it just leaks it. |
@sethp-jive Did you get a chance to try with 1.12?... We are not hitting the hang in 1.12... not sure what got improved...
we encounter `unregister_netdevice` as well
@andyxning |
Hello! We're regularly seeing a critical failure with our docker daemon that causes it to get blocked and halts any forward progress. The underlying problem appears to be the kernel bug most discussed in #5618 (though there are many issues referencing the kernel bug). As mentioned in that other issue, it seems that the only remediation possible at present is a system-wide reboot.
At one point our nodes were seeing ~10 minutes of uptime before running into the kernel race. We've improved stability by reducing the workload on those boxes via the removal of some Kubernetes resources, but we are able to reproduce the issue at will with a stripped-down Kubernetes Job as follows (with apologies to alpine linux):
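The manifest looks roughly like the following; this is an illustrative sketch (the image version, parallelism, and the exact traffic command stand in for our real values):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-stress
spec:
  parallelism: 10    # many short-lived pods in flight at once
  completions: 1000
  template:
    metadata:
      labels:
        app: job-stress
    spec:
      nodeSelector:
        docker-test: "yes"      # only run on the node labeled below
      restartPolicy: OnFailure  # non-zero exit + OnFailure => constant container churn
      containers:
      - name: stress
        image: alpine:3.3
        # Pull a fair number of packets through the veth, then exit
        # non-zero as quickly as possible to force a new container.
        command: ["/bin/sh", "-c",
          "wget -q -O /dev/null http://dl-cdn.alpinelinux.org/alpine/v3.3/main/x86_64/APKINDEX.tar.gz; exit 1"]
```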
The gist here is that we're asking Kubernetes to run a lot of containers that move a fairly large number of packets and then shut down very quickly. The exit code + RestartPolicy are intended to drive as many creations as possible through in a short time. After a few minutes of that behavior – with Kubernetes creating new short-lived containers often, multiple times a second – we start seeing the kernel error message
unregister_netdevice: waiting for vethXXXXXX to become free. Usage count = 1
and the system locks up. Specifically, it locks up in the following way:

One of the shutting-down containers, lucky owner of the net device whose refcount the kernel has mis-managed, is attempting to clean itself up:

The exciting part of this stack is that, in daemon.StateChanged (4th frame up from the bottom), the docker daemon has locked the container at memory address `0xc820b6d880`. Due to the kernel bug, the syscall will never complete successfully, so we'll never get that far back up the stack and therefore never release the lock.

Some hapless process (in our case, usually the kubernetes kubelet, but occasionally something else) will make a request to `/containers/json` and get stuck attempting to acquire that container's lock way down in `reducePsContainer`:

We know it's the same lock, here, because `reducePsContainer`'s second argument (the first non-implicit-daemon-pointer argument) refers to just that same container, `0xc820b6d880`.

Since that lock will never be unlocked, and the `/containers/json` request never returns, the upstream caller ultimately hangs on the kernel bug.

At this point, we move somewhat into the realm of desperation – a single container that's hung is able to hang the entire container-status-listing mechanism, which appears to be the only way to get a complete list of the node's container status (and is thus heavily depended-on, e.g. by the kubelet). We would therefore find it desirable to isolate that failure to the single wedged container and not block all callers, for which we've considered three approaches:
1. Time out the lock acquisition in `reducePsContainer` (and the equivalent lock in `inspect`) and report some flavor of "container status unknown" for the hung container. This approach provides us with the isolation we're looking for, but 1) `sync.Mutex` does not and will not have a `tryLock` equivalent (non-blocking lock acquisition), and from the outside the best I've found is a process that leaks resources on failed acquisitions. Further, 2) there is no generally appropriate timeout (too short and we'll spuriously fail under load, too long and the daemon will become effectively hung).
2. Shorten the critical section so the lock is not held across the hanging syscall (the sethp-jive:bugfix/shorten-critical-section branch mentioned below); with that change `docker ps` remains as snappy as ever. If you'd prefer this approach, I'm happy to abandon this PR, clean that commit up a bit, and resubmit it.
3. Avoid taking a lock to report on a container's status at all, and let other concurrency mechanisms (channels, single-writer-ownership, etc.) take care of readers and writers mixing together. This is the cleanest option, but also the most invasive.

So, in summary: The docker daemon's status reporting mechanism completely hangs when a single container gets in a bad state. We don't understand the bad state well enough to resolve it, but hopefully we can reduce the failure domain from "the entire docker daemon" to "the single container".
- What I did

Changed the lock acquisition for `docker ps` and `docker inspect` to time out, dropping the container from the list or returning an error message respectively, to limit the failure mode induced by a kernel bug to the single affected container rather than the entire daemon.

- How I did it

By leaking goprocs (& a channel), unfortunately. Back-of-the-envelope math suggests we're looking at weeks of uptime once we hit the bug until the daemon OOMs, which is much better than the minutes we're seeing now. A less general point solution that also resolves the specific issue is here: sethp-jive:bugfix/shorten-critical-section. That approach has the advantage of not leaking an unbounded number of resources.
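In outline, the timed acquisition looks something like the following (a minimal sketch of the pattern; `lockWithTimeout` and its shape are illustrative, not the exact diff in this PR):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lockWithTimeout tries to acquire m, giving up after d. On timeout the
// helper goroutine stays blocked in m.Lock(); that goroutine and its
// channels are the unbounded leak discussed above.
func lockWithTimeout(m *sync.Mutex, d time.Duration) bool {
	acquired := make(chan struct{})
	abandoned := make(chan struct{})
	go func() {
		m.Lock()
		select {
		case acquired <- struct{}{}:
			// Handed the lock to the caller, who is now responsible
			// for calling Unlock.
		case <-abandoned:
			// The caller gave up before we got here; release the lock
			// immediately so we don't strand it as well.
			m.Unlock()
		}
	}()
	select {
	case <-acquired:
		return true
	case <-time.After(d):
		close(abandoned)
		return false
	}
}

func main() {
	var mu sync.Mutex
	mu.Lock() // simulate a container lock held across a hung syscall
	if !lockWithTimeout(&mu, 100*time.Millisecond) {
		fmt.Println("could not lock container: reporting failure instead of hanging")
	}
}
```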
- How to verify it

Unfortunately, verifying requires triggering the `unregister_netdevice` kernel bug. On a Kubernetes cluster:

1. Create a file `job-stress.yml` with the contents above, starting with `apiVersion: batch/v1`.
2. Pick a node from `kubectl get nodes`, and run `kubectl label nodes NODE_NAME docker-test=yes`.
3. Run `kubectl create -f job-stress.yml`.
4. Watch the kernel log (`dmesg -w`) for the line `unregister_netdevice: waiting for vethXXXXXX to become free. Usage count = 1`.
5. Run `docker ps` and ensure that it returns.
6. Run `docker inspect` on each container (i.e. by grabbing the ids from /var/lib/docker/containers/*, since `docker ps` won't report on the broken one; see the loop below). This should yield exactly one failure, with the message: `Error response from daemon: Unable to lock container for inspect.`
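For step 6, a loop along these lines does the job (assuming, as above, that the ids are the directory names under /var/lib/docker/containers):

```sh
for id in /var/lib/docker/containers/*; do
  docker inspect "$(basename "$id")" > /dev/null || echo "stuck: $(basename "$id")"
done
```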
- Description for the changelog

Reduces the kernel-induced hang signaled by `unregister_netdevice` to the single stuck container rather than the entire daemon.

- A picture of a cute animal (not mandatory but encouraged)
Signed-off-by: Seth Pellegrino <seth.pellegrino@jivesoftware.com>