Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
kube-controller-manager becomes deadlocked but still passes healthcheck #70819
All three controller managers "wedged" in a high availability cluster--zero network traffic between any of the controller managers and the api server--due to a deadlock. All three controller manager processes running and passing healthcheck, but none performing any controller manager duties.
pprof stackdump surfaced a deadlock in the golang.org/x/net/http2 library on all masters preventing any http2 client requests from being made from the controller managers:
One goroutine already holds the
Another already holds the
What you expected to happen:
kube-controller-manager does not become deadlocked
kube-controller-manager should fail healthcheck when in a broken state like this. In this case, if the kube-controller-manager were to fail healthcheck when not able to reach the kube-apiserver for it's regular leader lock acquire attempts for some time threshold, the issue would have been mitigated by restarts.
How to reproduce it (as minimally and precisely as possible):
Exact steps to reproduce state not yet known. We found the cluster in the state and examined it in situ and were able to capture a pprof stackdump which proves the existence of the deadlock. We believe it may be possible to recreate the http2 error state that triggers deadlock with some careful analysis of the deadlock, but have not gotten that far yet.
Anything else we need to know?:
The call from