
Readiness check fails with http2: no cached connection was available which is not a server problem #49740

Closed
smarterclayton opened this issue Jul 27, 2017 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@smarterclayton
Contributor

I0727 14:32:26.621390   24885 prober.go:106] Readiness probe for "docker-registry-2-8glng_default(0c3c551d-72f3-11e7-a950-42010a800004):registry" failed (failure): Get https://172.16.0.6:5000/healthz: http2: no cached connection was available

This is a bug on the client side, not the server, and should not be reported as a failed health check.

@smarterclayton smarterclayton added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jul 27, 2017
@yujuhong yujuhong added the kind/bug Categorizes issue or PR as related to a bug. label Jul 27, 2017
@jhorwit2
Contributor

I'm seeing this as well. It doesn't happen often enough to make the pod fail its liveness probe, but it does happen often enough to create a bunch of events.

FirstSeen    LastSeen    Count    From                            SubObjectPath            Type        Reason        Message
  ---------    --------    -----    ----                            -------------            --------    ------        -------
  20d        37m        351    kubelet, <host>    spec.containers{kube-authn}    Warning        Unhealthy    Readiness probe failed: Get https://<ip>:8080/healthz: http2: no cached connection was available

I'm on v1.7.1

@sjenning
Contributor

sjenning commented Sep 27, 2017

There is still some bug in the http2 implementation in golang. This only happens if the readiness/liveness endpoint supports http2, as is the case with the docker registry.

I found a hack that should prevent it, but it disables http2 for probes. I'm not sure why you would need any of http2's features for a probe, so I don't see it hurting anything:

diff --git a/pkg/probe/http/http.go b/pkg/probe/http/http.go
index b9821be05f..1ea7a75c4d 100644
--- a/pkg/probe/http/http.go
+++ b/pkg/probe/http/http.go
@@ -38,7 +38,7 @@ func New() HTTPProber {
 
 // NewWithTLSConfig takes tls config as parameter.
 func NewWithTLSConfig(config *tls.Config) HTTPProber {
-       transport := utilnet.SetTransportDefaults(&http.Transport{TLSClientConfig: config, DisableKeepAlives: true})
+       transport := utilnet.SetOldTransportDefaults(&http.Transport{TLSClientConfig: config, DisableKeepAlives: true})
        return httpProber{transport}
 }

This avoids running ConfigureTransport() on the transport, leaving http2 disabled.
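
For illustration, here is a standalone sketch (not kubelet code; the package, helper name, and timeout are made up) of the effect the hack has: a probe client whose Transport never has http2.ConfigureTransport() applied and which explicitly opts out of h2 negotiation via an empty TLSNextProto map, so every probe request goes out as HTTP/1.1 over TLS.

package probe // illustrative package, not the kubelet's pkg/probe/http

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newHTTP1ProbeClient builds an HTTP client whose Transport never speaks
// http2: ConfigureTransport() is never called on it, and the non-nil empty
// TLSNextProto map tells the standard library not to negotiate "h2" during
// the TLS handshake.
func newHTTP1ProbeClient(tlsConfig *tls.Config) *http.Client {
	transport := &http.Transport{
		TLSClientConfig:   tlsConfig,
		DisableKeepAlives: true, // same setting the kubelet prober uses
		TLSNextProto:      map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}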

The only theory I have about what causes this is the readiness and liveness probes happening at the exact same time. In the case of the registry, both liveness and readiness probes hit the same port and endpoint within microseconds of one another.

I might disable the readiness probe (without my hack) to see if the problem persists.
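
To make the timing theory concrete, here is a rough reproducer sketch (the target URL and TLS settings are assumptions, and whether the race actually fires depends on the golang.org/x/net/http2 version vendored at the time): two goroutines issue GETs at the same instant over one shared, http2-enabled Transport with keep-alives disabled, mimicking simultaneous liveness and readiness probes against the same endpoint.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/net/http2"
)

func main() {
	// Same transport settings the kubelet prober uses, plus the http2 upgrade
	// that SetTransportDefaults performs via ConfigureTransport.
	transport := &http.Transport{
		TLSClientConfig:   &tls.Config{InsecureSkipVerify: true},
		DisableKeepAlives: true,
	}
	if err := http2.ConfigureTransport(transport); err != nil {
		panic(err)
	}
	client := &http.Client{Transport: transport}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // "liveness" and "readiness" firing together
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// Assumed target; any https endpoint that negotiates h2 will do.
			resp, err := client.Get("https://10.128.0.4:5000/healthz")
			if err != nil {
				fmt.Printf("probe %d failed: %v\n", n, err)
				return
			}
			resp.Body.Close()
			fmt.Printf("probe %d: %s\n", n, resp.Status)
		}(i)
	}
	wg.Wait()
}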

@sjenning
Contributor

sjenning commented Sep 27, 2017

@jhorwit2 I'd be interested to know: does your pod have both readiness and liveness probes defined? I'm assuming it supports http2, otherwise you wouldn't get the error.

@sjenning
Contributor

I have run for an hour now and have yet to see another http2 error after removing the readiness probe. This supports the theory that the cause is a very small timing window in which the readiness and liveness probes block one another from getting an http2 connection. My node is under very light load (two idle pods), so this is not about the number of connections; it is about timing.

Additionally, I get errors on both liveness and readiness probes, further supporting the racing theory. Sometimes liveness wins and sometimes readiness wins.

$ oc get events
LASTSEEN   FIRSTSEEN   COUNT     NAME                       KIND                    SUBOBJECT                     TYPE      REASON                    SOURCE                              MESSAGE
1h         23h         44        docker-registry-1-rn2c6    Pod                     spec.containers{registry}     Warning   Unhealthy                 kubelet, infra.lab.variantweb.net   Liveness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available
1h         23h         58        docker-registry-1-rn2c6    Pod                     spec.containers{registry}     Warning   Unhealthy                 kubelet, infra.lab.variantweb.net   Readiness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available

That being said, we are largely powerless to do anything about this as the problem lies in golang.

Some possible workarounds until golang can be fixed:

  • disable http2 for probes
  • add jitter to the probe timers so they are less likely to run in the same window (see the sketch after this list)
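
A minimal sketch of the jitter idea (not an actual change to pkg/kubelet/prober/worker.go; the helper name and jitter factor are illustrative), using wait.JitterUntil from apimachinery in place of the fixed ticker the worker uses today:

package prober // illustrative sketch only

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runProbeWorker drives one probe with a jittered period instead of a fixed
// time.Ticker, so the liveness and readiness workers for the same container
// drift apart rather than firing in the exact same instant.
func runProbeWorker(doProbe func(), period time.Duration, stopCh <-chan struct{}) {
	// A jitterFactor of 0.1 spreads each interval over [period, period*1.1).
	wait.JitterUntil(doProbe, period, 0.1, true, stopCh)
}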

@smarterclayton any thoughts?

@smarterclayton
Contributor Author

smarterclayton commented Sep 27, 2017 via email

@jhorwit2
Contributor

jhorwit2 commented Sep 27, 2017

@sjenning yes, every pod experiencing this issue has a readiness and liveness probe. Also, an easier way to disable HTTP2 for probes is to set the DISABLE_HTTP2 env var if you want to test it.

@smarterclayton From what I can tell, we create a new probe worker for each type of probe, which under the hood creates a new transport. According to ConfigureTransport in the stdlib, each transport gets a separate connection pool, so I don't think the issue is pooling.

@sjenning
Contributor

sjenning commented Oct 2, 2017

@jhorwit2 while it is true that each probe has its own worker, which contains the goroutine that periodically triggers the probe, they all use a common probeManager with a common prober and a common httpProber, kubelet-wide.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/worker.go#L201

The singular probeManager for the whole kubelet is created here

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L759

So all HTTP probes are using the same transport and, therefore, the same connection pool.

k8s-github-robot pushed a commit that referenced this issue Oct 2, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

create separate transports for liveness and readiness probes

There is currently an issue with the http2 connection pools in golang such that two GETs to the same host:port using the same Transport can collide and one gets rejected with `http2: no cached connection was available`.  This happens with readiness and liveness probes if the intervals line up such that worker goroutines invoke the two probes at the exact same time.

The result is a transient probe error that appears in the events.  If the failureThreshold is 1, which is kinda crazy, it would cause a pod restart.

The PR creates a separate `httprobe` instance for readiness and liveness probes so that they don't share a Transport and connection pool.
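
Roughly, the approach looks like the sketch below (field names and wiring are illustrative, not the exact PR diff); it leans on the constructors shown in the earlier diff, each call to which builds a fresh Transport.

package prober // illustrative sketch, not the merged change

import (
	httprobe "k8s.io/kubernetes/pkg/probe/http"
)

// prober holds one HTTPProber per probe type (hypothetical field names), so
// readiness and liveness requests never share an http.Transport and therefore
// never share an http2 connection pool.
type prober struct {
	readinessHTTP httprobe.HTTPProber
	livenessHTTP  httprobe.HTTPProber
}

func newProber() *prober {
	return &prober{
		// Each constructor call builds its own Transport (see NewWithTLSConfig
		// in the diff earlier in this thread), so the two probe types can no
		// longer race for the same cached http2 connection.
		readinessHTTP: httprobe.New(),
		livenessHTTP:  httprobe.New(),
	}
}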

Fixes #49740

@smarterclayton @jhorwit2