
Readiness check fails with http2: no cached connection was available which is not a server problem #49740

Closed
smarterclayton opened this issue Jul 27, 2017 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@smarterclayton
Contributor

I0727 14:32:26.621390   24885 prober.go:106] Readiness probe for "docker-registry-2-8glng_default(0c3c551d-72f3-11e7-a950-42010a800004):registry" failed (failure): Get https://172.16.0.6:5000/healthz: http2: no cached connection was available

This is a bug on the client side, not the server, and should not be reported as a failed health check.

@smarterclayton smarterclayton added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jul 27, 2017
@yujuhong yujuhong added the kind/bug Categorizes issue or PR as related to a bug. label Jul 27, 2017
@jhorwit2
Contributor

I'm seeing this as well. It doesn't happen often enough to make the pod fail its liveness probe, but it does happen often enough to create a bunch of events.

FirstSeen    LastSeen    Count    From                            SubObjectPath            Type        Reason        Message
  ---------    --------    -----    ----                            -------------            --------    ------        -------
  20d        37m        351    kubelet, <host>    spec.containers{kube-authn}    Warning        Unhealthy    Readiness probe failed: Get https://<ip>:8080/healthz: http2: no cached connection was available

I'm on v1.7.1

@sjenning
Contributor

sjenning commented Sep 27, 2017

There is still some bug in the http2 implementation in golang. This only happens if the readiness/liveness endpoint supports http2, as is the case with the docker registry.

I found a hack that should prevent it, but it disables http2 for probes. I'm not sure why you would need any of http2's features for a probe, so I don't see it hurting anything:

diff --git a/pkg/probe/http/http.go b/pkg/probe/http/http.go
index b9821be05f..1ea7a75c4d 100644
--- a/pkg/probe/http/http.go
+++ b/pkg/probe/http/http.go
@@ -38,7 +38,7 @@ func New() HTTPProber {
 
 // NewWithTLSConfig takes tls config as parameter.
 func NewWithTLSConfig(config *tls.Config) HTTPProber {
-       transport := utilnet.SetTransportDefaults(&http.Transport{TLSClientConfig: config, DisableKeepAlives: true})
+       transport := utilnet.SetOldTransportDefaults(&http.Transport{TLSClientConfig: config, DisableKeepAlives: true})
        return httpProber{transport}
 }

This avoids running ConfigureTransport() on the transport, leaving http2 disabled.
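
For illustration, here is a standalone sketch (not kubelet code; the package, helper name, and timeout are made up) of the effect the hack has: a probe client whose Transport never has http2.ConfigureTransport() applied and which explicitly opts out of h2 negotiation via an empty TLSNextProto map, so every probe request goes out as HTTP/1.1 over TLS.

package probe // illustrative package, not the kubelet's pkg/probe/http

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newHTTP1ProbeClient builds an HTTP client whose Transport never speaks
// http2: ConfigureTransport() is never called on it, and the non-nil empty
// TLSNextProto map tells the standard library not to negotiate "h2" during
// the TLS handshake.
func newHTTP1ProbeClient(tlsConfig *tls.Config) *http.Client {
	transport := &http.Transport{
		TLSClientConfig:   tlsConfig,
		DisableKeepAlives: true, // same setting the kubelet prober uses
		TLSNextProto:      map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}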

The only theory I have about what causes this is the readiness and liveness probes happening at the exact same time. In the case of the registry, both liveness and readiness probes hit the same port and endpoint within microseconds of one another.

I might disable the readiness probe (without my hack) to see if the problem persists.
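
To make the timing theory concrete, here is a rough reproducer sketch (the target URL and TLS settings are assumptions, and whether the race actually fires depends on the golang.org/x/net/http2 version vendored at the time): two goroutines issue GETs at the same instant over one shared, http2-enabled Transport with keep-alives disabled, mimicking simultaneous liveness and readiness probes against the same endpoint.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/net/http2"
)

func main() {
	// Same transport settings the kubelet prober uses, plus the http2 upgrade
	// that SetTransportDefaults performs via ConfigureTransport.
	transport := &http.Transport{
		TLSClientConfig:   &tls.Config{InsecureSkipVerify: true},
		DisableKeepAlives: true,
	}
	if err := http2.ConfigureTransport(transport); err != nil {
		panic(err)
	}
	client := &http.Client{Transport: transport}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // "liveness" and "readiness" firing together
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// Assumed target; any https endpoint that negotiates h2 will do.
			resp, err := client.Get("https://10.128.0.4:5000/healthz")
			if err != nil {
				fmt.Printf("probe %d failed: %v\n", n, err)
				return
			}
			resp.Body.Close()
			fmt.Printf("probe %d: %s\n", n, resp.Status)
		}(i)
	}
	wg.Wait()
}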

@sjenning
Contributor

sjenning commented Sep 27, 2017

@jhorwit2 I'd be interested to know: does your pod have both readiness and liveness probes defined? I'm assuming it supports http2, otherwise you wouldn't get the error.

@sjenning
Contributor

I have run for an hour now and have yet to see another http2 error after removing the readiness probe. This supports the theory that the cause is a very small timing window in which the readiness and liveness probes block one another from getting an http2 connection. My node is under very light load (two idle pods), so this is not about the number of connections; it is about timing.

Additionally, I get errors on both liveness and readiness probes, further supporting the racing theory. Sometimes liveness wins and sometimes readiness wins.

$ oc get events
LASTSEEN   FIRSTSEEN   COUNT     NAME                       KIND                    SUBOBJECT                     TYPE      REASON                    SOURCE                              MESSAGE
1h         23h         44        docker-registry-1-rn2c6    Pod                     spec.containers{registry}     Warning   Unhealthy                 kubelet, infra.lab.variantweb.net   Liveness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available
1h         23h         58        docker-registry-1-rn2c6    Pod                     spec.containers{registry}     Warning   Unhealthy                 kubelet, infra.lab.variantweb.net   Readiness probe failed: Get https://10.128.0.4:5000/healthz: http2: no cached connection was available

That being said, we are largely powerless to do anything about this as the problem lies in golang.

Some possible workarounds until golang can be fixed:

  • disable http2 for probes
  • add jitter to the probe timers so they are less likely to run in the same window (see the sketch after this list)
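
A minimal sketch of the jitter idea (not an actual change to pkg/kubelet/prober/worker.go; the helper name and jitter factor are illustrative), using wait.JitterUntil from apimachinery in place of the fixed ticker the worker uses today:

package prober // illustrative sketch only

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runProbeWorker drives one probe with a jittered period instead of a fixed
// time.Ticker, so the liveness and readiness workers for the same container
// drift apart rather than firing in the exact same instant.
func runProbeWorker(doProbe func(), period time.Duration, stopCh <-chan struct{}) {
	// A jitterFactor of 0.1 spreads each interval over [period, period*1.1).
	wait.JitterUntil(doProbe, period, 0.1, true, stopCh)
}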

@smarterclayton any thoughts?

@smarterclayton
Contributor Author

smarterclayton commented Sep 27, 2017 via email

@jhorwit2
Contributor

jhorwit2 commented Sep 27, 2017

@sjenning yes, every pod experiencing this issue has a readiness and liveness probe. Also, an easier way to disable HTTP2 for probes is to set the DISABLE_HTTP2 env var if you want to test it.

@smarterclayton From what I can tell, we create a new probe worker for each type of probe, which under the hood creates a new transport. According to ConfigureTransport in the stdlib, each transport gets a separate connection pool, so I don't think the issue is pooling.

@sjenning
Contributor

sjenning commented Oct 2, 2017

@jhorwit2 while it is true that each probe has its own worker, which contains the goroutine that periodically triggers the probe, they all use a common probeManager with a common prober and a common httpProber, kubelet-wide.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/worker.go#L201

The singular probeManager for the whole kubelet is created here

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L759

So all HTTP probes are using the same transport and, therefore, the same connection pool.

k8s-github-robot pushed a commit that referenced this issue Oct 2, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

create separate transports for liveness and readiness probes

There is currently an issue with the http2 connection pools in golang such that two GETs to the same host:port using the same Transport can collide and one gets rejected with `http2: no cached connection was available`.  This happens with readiness and liveness probes if the intervals line up such that worker goroutines invoke the two probes at the exact same time.

The result is a transient probe error that appears in the events.  If the failureThreshold is 1, which is kinda crazy, it would cause a pod restart.

The PR creates a separate `httprobe` instance for readiness and liveness probes so that they don't share a Transport and connection pool.
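
Roughly, the approach looks like the sketch below (field names and wiring are illustrative, not the exact PR diff); it leans on the constructors shown in the earlier diff, each call to which builds a fresh Transport.

package prober // illustrative sketch, not the merged change

import (
	httprobe "k8s.io/kubernetes/pkg/probe/http"
)

// prober holds one HTTPProber per probe type (hypothetical field names), so
// readiness and liveness requests never share an http.Transport and therefore
// never share an http2 connection pool.
type prober struct {
	readinessHTTP httprobe.HTTPProber
	livenessHTTP  httprobe.HTTPProber
}

func newProber() *prober {
	return &prober{
		// Each constructor call builds its own Transport (see NewWithTLSConfig
		// in the diff earlier in this thread), so the two probe types can no
		// longer race for the same cached http2 connection.
		readinessHTTP: httprobe.New(),
		livenessHTTP:  httprobe.New(),
	}
}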

Fixes #49740

@smarterclayton @jhorwit2