
kube-controller-manager becomes deadlocked but still passes healthcheck #70819

jpbetz opened this Issue Nov 8, 2018 · 1 comment



jpbetz commented Nov 8, 2018

What happened:

All three controller managers in a high-availability cluster became wedged due to a deadlock: there was zero network traffic between any of the controller managers and the apiserver. All three controller manager processes were running and passing their healthchecks, but none was performing any controller manager duties.

A pprof stack dump surfaced a deadlock in the x/net/http2 library on all masters, preventing any http2 client requests from being made from the controller managers:

One goroutine holds the connection pool lock and is blocked attempting to acquire the connection lock:

goroutine 1776 [semacquire, 3556 minutes]:
sync.runtime_SemacquireMutex(0xc420970a14, 0x522100)
        /usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:134 +0xee
http2.(*ClientConn).CanTakeNewRequest(0xc4209709c0, 0xc420929500)
        /workspace/kubernetes/_output/dockerized/go/src/ +0x3f
http2.(*clientConnPool).getClientConn(0xc420985560, 0xc4295eeb00, 0xc429692cc0, 0xd, 0xc4207d9800, 0xc426376f68, 0x94bb20, 0xc4207d9810)
        /workspace/kubernetes/_output/dockerized/go/src/ +0xe8

Another holds the connection lock (while panicking in the read loop's cleanup) and is blocked attempting to acquire the pool lock:

goroutine 85 [semacquire, 3556 minutes]:
sync.runtime_SemacquireMutex(0xc42098556c, 0x528300)
        /usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:134 +0xee
http2.(*clientConnPool).MarkDead(0xc420985560, 0xc4209709c0)
        /workspace/kubernetes/_output/dockerized/go/src/ +0x4b
http2.noDialClientConnPool.MarkDead(0xc420985560, 0xc4209709c0)
        <autogenerated>:1 +0x3e
panic(0x3058280, 0x3d35e60)
        /usr/local/go/src/runtime/panic.go:491 +0x283
http2.(*pipe).closeWithError(0xc42451ea28, 0xc42451ea78, 0x0, 0x0, 0x0)
        /workspace/kubernetes/_output/dockerized/go/src/ +0x214
http2.(*pipe).CloseWithError(0xc42451ea28, 0x0, 0x0)
        /workspace/kubernetes/_output/dockerized/go/src/ +0x53
http2.(*clientConnReadLoop).cleanup(0xc420a51fb0)
        /workspace/kubernetes/_output/dockerized/go/src/ +0x219

What you expected to happen:

kube-controller-manager does not become deadlocked

kube-controller-manager should fail its healthcheck when in a broken state like this. If kube-controller-manager failed its healthcheck after being unable to reach the kube-apiserver for its regular leader-lock acquire attempts for some time threshold, this issue would have been mitigated by restarts.

How to reproduce it (as minimally and precisely as possible):

Exact steps to reproduce the state are not yet known. We found the cluster in this state, examined it in situ, and were able to capture a pprof stack dump that proves the existence of the deadlock. We believe it may be possible to recreate the http2 error state that triggers the deadlock with some careful analysis, but we have not gotten that far yet.

Anything else we need to know?:

The call from http2.(*clientConnPool).getClientConn() to http2.(*ClientConn).CanTakeNewRequest() was removed from the call path by golang/net@6a8eb5e and is available in the release-branch.go1.11 branch of x/net/http2. Bumping the vendored version of x/net/http2 to 6a8eb5e2b1816b30aa88d7e3ecf9eb7c4559d9e6 or higher should resolve the deadlock issue.


  • Kubernetes version: 1.10.6

/sig api-machinery
cc @cheftako @mml @wenjiaswe @lavalamp

/kind bug




jpbetz commented Nov 8, 2018

We have both a fix and mitigation planned and will send out PRs shortly.
