kube-controller-manager becomes deadlocked but still passes healthcheck #70819

jpbetz opened this Issue Nov 8, 2018 · 1 comment
jpbetz commented Nov 8, 2018

What happened:

All three controller managers in a high-availability cluster became wedged due to a deadlock, with zero network traffic between any of the controller managers and the API server. All three controller manager processes were running and passing their healthchecks, but none was performing any controller manager duties.

A pprof stack dump surfaced a deadlock in the golang.org/x/net/http2 library on all masters, preventing any http2 client requests from being made by the controller managers:

One goroutine already holds the clientConnPool.mu lock and is deadlocked attempting to acquire the ClientConn.mu lock:

goroutine 1776 [semacquire, 3556 minutes]:
sync.runtime_SemacquireMutex(0xc420970a14, 0x522100)
        /usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420970a10)
        /usr/local/go/src/sync/mutex.go:134 +0xee
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*ClientConn).CanTakeNewRequest(0xc4209709c0, 0xc420929500)
        /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:611 +0x3f
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn(0xc420985560, 0xc4295eeb00, 0xc429692cc0, 0xd, 0xc4207d9800, 0xc426376f68, 0x94bb20, 0xc4207d9810)
        /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:67 +0xe8

Another already holds the ClientConn.mu lock and is deadlocked attempting to acquire the clientConnPool.mu lock:

goroutine 85 [semacquire, 3556 minutes]:
sync.runtime_SemacquireMutex(0xc42098556c, 0x528300)
        /usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc420985568)
        /usr/local/go/src/sync/mutex.go:134 +0xee
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).MarkDead(0xc420985560, 0xc4209709c0)
        /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:201 +0x4b
k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialClientConnPool.MarkDead(0xc420985560, 0xc4209709c0)
        <autogenerated>:1 +0x3e
panic(0x3058280, 0x3d35e60)
        /usr/local/go/src/runtime/panic.go:491 +0x283
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*pipe).closeWithError(0xc42451ea28, 0xc42451ea78, 0x0, 0x0, 0x0)
        /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/pipe.go:106 +0x214
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*pipe).CloseWithError(0xc42451ea28, 0x0, 0x0)
        /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/pipe.go:93 +0x53
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnReadLoop).cleanup(0xc420a51fb0)
        /workspace/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:1408 +0x219
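
Taken together, the two traces show a classic lock-ordering inversion: getClientConn acquires clientConnPool.mu and then waits on ClientConn.mu, while the read-loop cleanup path holds ClientConn.mu and waits on clientConnPool.mu inside MarkDead. For illustration only, here is a minimal, self-contained Go sketch of the same inversion (hypothetical mutex names, not the actual x/net/http2 types):

package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	poolMu sync.Mutex // stands in for clientConnPool.mu
	connMu sync.Mutex // stands in for ClientConn.mu
)

func main() {
	go func() { // like getClientConn -> CanTakeNewRequest
		poolMu.Lock()
		defer poolMu.Unlock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		connMu.Lock()                     // blocks: the other goroutine holds connMu
		defer connMu.Unlock()
	}()

	go func() { // like clientConnReadLoop.cleanup -> MarkDead
		connMu.Lock()
		defer connMu.Unlock()
		time.Sleep(10 * time.Millisecond)
		poolMu.Lock() // blocks: the other goroutine holds poolMu
		defer poolMu.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are still blocked on each other's mutex")
}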

What you expected to happen:

kube-controller-manager does not become deadlocked

kube-controller-manager should fail its healthcheck when in a broken state like this. In this case, if kube-controller-manager failed its healthcheck after being unable to reach the kube-apiserver for its regular leader lock acquire attempts for some time threshold, the issue would have been mitigated by restarts (a sketch of this mitigation follows).
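
A hedged sketch of that mitigation, assuming a /healthz handler, a recordAPIServerSuccess hook, a two-minute threshold, and a listen port that are all purely illustrative (not the actual kube-controller-manager code): the healthcheck fails once the last successful apiserver contact, such as a leader-lease renewal, is older than the threshold, so a liveness probe would restart the wedged process.

package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

const maxAPIServerSilence = 2 * time.Minute // threshold chosen for illustration

var lastContactNanos int64 // unix nanos of the last successful apiserver call

// recordAPIServerSuccess would be called after every successful leader
// lease renewal (or any other successful kube-apiserver request).
func recordAPIServerSuccess() {
	atomic.StoreInt64(&lastContactNanos, time.Now().UnixNano())
}

// healthz reports failure once the apiserver has been unreachable for
// longer than the threshold, so a liveness probe can restart the process.
func healthz(w http.ResponseWriter, r *http.Request) {
	last := time.Unix(0, atomic.LoadInt64(&lastContactNanos))
	if time.Since(last) > maxAPIServerSilence {
		http.Error(w, fmt.Sprintf("no successful apiserver contact since %v", last),
			http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	recordAPIServerSuccess() // seed so the process starts out healthy
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":10252", nil) // port chosen for illustration
}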

How to reproduce it (as minimally and precisely as possible):

Exact steps to reproduce the state are not yet known. We found the cluster in this state, examined it in situ, and were able to capture a pprof stack dump that proves the existence of the deadlock (a minimal capture sketch follows). We believe it may be possible to recreate the http2 error state that triggers the deadlock with some careful analysis, but have not gotten that far yet.
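
For reference, a goroutine stack dump like the ones above can be pulled from a running kube-controller-manager roughly as follows. This assumes --profiling is enabled and that the address and port used here are reachable from wherever the program runs; both are assumptions to adjust for your environment.

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the full stack of every goroutine, including blocked ones.
	resp, err := http.Get("http://127.0.0.1:10252/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}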

Anything else we need to know?:

The call from http2.(*clientConnPool).getClientConn() to http2.(*ClientConn).CanTakeNewRequest() was removed from the call path by golang/net@6a8eb5e, which is available in the release-branch.go1.11 branch of x/net. Bumping the vendored version of golang.org/x/net/http2 to 6a8eb5e2b1816b30aa88d7e3ecf9eb7c4559d9e6 or newer should resolve the deadlock.

Environment:

  • Kubernetes version: 1.10.6

/sig api-machinery
cc @cheftako @mml @wenjiaswe @lavalamp

/kind bug

jpbetz commented Nov 8, 2018

We have both a fix and a mitigation planned and will send out PRs shortly.
