Readiness check fails with "http2: no cached connection was available", which is not a server problem
#49740
Comments
I'm seeing this as well. It doesn't happen often enough to make the pod fail liveness, but it does happen often enough to create a bunch of events. I'm on v1.7.1.
There is still some bug in the http2 implementation in golang. This only happens if the readiness/liveness endpoint supports http2, as is the case with the docker registry. I found a hack that should prevent it, but it disables http2 for probes. Not sure why you would need any of http2's features for a probe, so I don't see it hurting anything:

```diff
diff --git a/pkg/probe/http/http.go b/pkg/probe/http/http.go
index b9821be05f..1ea7a75c4d 100644
--- a/pkg/probe/http/http.go
+++ b/pkg/probe/http/http.go
@@ -38,7 +38,7 @@ func New() HTTPProber {
 // NewWithTLSConfig takes tls config as parameter.
 func NewWithTLSConfig(config *tls.Config) HTTPProber {
-	transport := utilnet.SetTransportDefaults(&http.Transport{TLSClientConfig: config, DisableKeepAlives: true})
+	transport := utilnet.SetOldTransportDefaults(&http.Transport{TLSClientConfig: config, DisableKeepAlives: true})
 	return httpProber{transport}
 }
```

This avoids running `http2.ConfigureTransport()` on the transport.

The only theory I have about what causes this is the readiness and liveness probes happening at the exact same time. In the case of the registry, both liveness and readiness probes hit the same port and endpoint within microseconds of one another. I might disable the readiness probe (without my hack) to see if the problem persists.
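For readers unfamiliar with the two helpers, here is a rough sketch of what the change amounts to. The function name and structure below are illustrative, not the actual `utilnet` source: `SetTransportDefaults` additionally attempts the HTTP/2 upgrade, while `SetOldTransportDefaults` leaves the transport HTTP/1.1-only.

```go
package main

import (
	"crypto/tls"
	"net/http"

	"golang.org/x/net/http2"
)

// buildProbeTransport contrasts the two code paths. With enableHTTP2 set,
// it performs the upgrade that SetTransportDefaults attempts; without it,
// the transport stays HTTP/1.1-only, as with SetOldTransportDefaults.
func buildProbeTransport(config *tls.Config, enableHTTP2 bool) *http.Transport {
	transport := &http.Transport{
		TLSClientConfig:   config,
		DisableKeepAlives: true,
	}
	if enableHTTP2 {
		// This is the step the hack skips: after it, HTTPS requests can
		// negotiate HTTP/2 and share its per-host connection pool.
		_ = http2.ConfigureTransport(transport)
	}
	return transport
}

func main() {
	// The patched probe path corresponds to enableHTTP2 == false.
	_ = buildProbeTransport(&tls.Config{}, false)
}
```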
@jhorwit2 I'd be interested to know: does your pod have both readiness and liveness probes defined? I'm assuming it supports http2, otherwise you wouldn't get the error.
I have run for an hour now and have yet to see another http2 error after removing the readiness probe. This supports the theory that a very small timing window, in which the readiness and liveness probes block one another getting an http2 connection, is responsible. My node is under very light load (two idle pods), so this is not about the number of connections; it is about timing. Additionally, I get errors on both liveness and readiness probes, further supporting the racing theory. Sometimes liveness wins and sometimes readiness wins.
That being said, we are largely powerless to do anything about this as the problem lies in golang. Some possible workarounds until golang can be fixed (see the jitter sketch below):

- disable http2 for probes
- add jitter to the probe timers so they are less likely to run in the same window
@smarterclayton any thoughts?
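To make the jitter workaround concrete, here is a minimal sketch of the idea. `probeLoop` and its structure are hypothetical, not the kubelet's actual worker code (which lives in pkg/kubelet/prober/worker.go):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// probeLoop is a hypothetical stand-in for a kubelet probe worker loop.
// Sleeping for the period plus up to 10% random jitter makes two probes
// whose timers start aligned drift apart instead of firing simultaneously.
func probeLoop(period time.Duration, probe func()) {
	for {
		jitter := time.Duration(rand.Int63n(int64(period) / 10))
		time.Sleep(period + jitter)
		probe()
	}
}

func main() {
	go probeLoop(10*time.Second, func() { fmt.Println("liveness probe") })
	go probeLoop(10*time.Second, func() { fmt.Println("readiness probe") })
	select {} // keep the demo loops running
}
```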
Have you filed an issue with golang yet identifying the symptoms?

We definitely do not want to share any possible connection pooling between liveness and readiness. We should *not* be using the default transport for either probe (that results in reuse).

If we can confirm we are sharing a pool with http2, I'm OK to temporarily disable it until we can track down the cause (if we can't split the pool).
On Sep 27, 2017, at 12:17 AM, Seth Jennings <notifications@github.com> wrote:
```
$ oc get events
LASTSEEN   FIRSTSEEN   COUNT   NAME                      KIND   SUBOBJECT                   TYPE      REASON      SOURCE                              MESSAGE
1h         23h         44      docker-registry-1-rn2c6   Pod    spec.containers{registry}   Warning   Unhealthy   kubelet, infra.lab.variantweb.net   Liveness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available
1h         23h         58      docker-registry-1-rn2c6   Pod    spec.containers{registry}   Warning   Unhealthy   kubelet, infra.lab.variantweb.net   Readiness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available
```
@sjenning yes, every pod experiencing this issue has a readiness and liveness probe. Also, an easier way to disable HTTP2 for probes is to set the transport's `TLSNextProto` field to a non-nil empty map.

@smarterclayton From what I can tell, we create new probe workers for each type of probe, which under the hood creates a new transport. Each transport, according to the `net/http` docs, maintains its own connection pool.
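For illustration, a minimal sketch of that approach, assuming the standard `net/http` mechanism for disabling HTTP/2 on a client transport; this is not the kubelet's actual code:

```go
package main

import (
	"crypto/tls"
	"net/http"
)

func main() {
	transport := &http.Transport{
		TLSClientConfig:   &tls.Config{},
		DisableKeepAlives: true,
		// A non-nil, empty TLSNextProto map tells net/http not to
		// negotiate HTTP/2, so probe requests stay on HTTP/1.1 and
		// never touch the http2 connection pool.
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	_ = &http.Client{Transport: transport}
}
```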
@jhorwit2 while it is true that each probe has its own worker, which contains the goroutine that periodically triggers the probe, each worker uses a common prober:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/worker.go#L201

That singular prober is created here:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L759

So all HTTP probes are using the same transport and, therefore, the same connection pool.
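Here is a minimal sketch of the failure mode being described; the endpoint and structure are illustrative, not kubelet code. One shared transport means one shared HTTP/2 connection pool, so two probes firing in the same instant can race for the cached connection:

```go
package main

import (
	"crypto/tls"
	"net/http"
	"sync"

	"golang.org/x/net/http2"
)

func main() {
	// One transport shared by both probe types, mirroring the kubelet's
	// single prober: both clients draw on the same HTTP/2 connection pool.
	shared := &http.Transport{
		TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // demo only
		DisableKeepAlives: true,
	}
	_ = http2.ConfigureTransport(shared) // opt the shared transport into HTTP/2

	liveness := &http.Client{Transport: shared}
	readiness := &http.Client{Transport: shared}

	url := "https://10.128.0.4:5000/healthz" // endpoint from the events above

	var wg sync.WaitGroup
	for _, client := range []*http.Client{liveness, readiness} {
		wg.Add(1)
		go func(c *http.Client) {
			defer wg.Done()
			// When the two GETs land at the same instant, one can lose the
			// race for the cached connection and fail with
			// "http2: no cached connection was available".
			if resp, err := c.Get(url); err == nil {
				resp.Body.Close()
			}
		}(client)
	}
	wg.Wait()
}
```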
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

**create separate transports for liveness and readiness probes**

There is currently an issue with the http2 connection pools in golang such that two GETs to the same host:port using the same Transport can collide and one gets rejected with `http2: no cached connection was available`. This happens with readiness and liveness probes if the intervals line up such that worker goroutines invoke the two probes at the exact same time. The result is a transient probe error that appears in the events. If the failureThreshold is 1, which is kinda crazy, it would cause a pod restart.

This PR creates a separate `httprobe` instance for readiness and liveness probes so that they don't share a Transport and connection pool.

Fixes #49740

@smarterclayton @jhorwit2
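The shape of the fix, sketched under the assumption that `httprobe` refers to the `pkg/probe/http` package whose `New()` constructor appears in the diff above; the surrounding package and variable names are hypothetical:

```go
package prober

import (
	httprobe "k8s.io/kubernetes/pkg/probe/http"
)

// Two independent prober instances, each owning its own Transport and
// connection pool, so liveness and readiness probes can no longer race
// each other for a cached HTTP/2 connection.
var (
	livenessProber  = httprobe.New()
	readinessProber = httprobe.New()
)
```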
This is a bug on the client side, not the server, and should not be reported as a failed health check.