Prometheus should use http2 health monitor/set http2 ReadIdleTimeout #7588

Closed
JensErat opened this issue Jul 16, 2020 · 15 comments · Fixed by #9398

Comments

@JensErat
Contributor

What did you do?

After upgrading from Prometheus v2.18.1 to v2.19.2, we had massive issues with a central Prometheus federating a bunch of others. By bisecting the changes, I was able to nail it down to enabling http2 in #7258 (which was well hidden in the commit message and not mentioned in the changelog). Federation failed for up to 15 minutes, which we explain by the kernel only getting hold of the broken connection after some time while Prometheus kept trying to use it for that long. Reverting the prometheus/common upgrade resolved the issue with 2.19.2. We assume we have rare cases of long-lasting network connections breaking at the infrastructure level.

The issue somewhat reminds me of kubernetes/kubernetes#87615, where golang/net#55 is proposed to recognize and close broken connections early. Investigating how to set the ReadIdleTimeout value in Prometheus, I realized we're lucky that Prometheus is vendoring x/net and does not require Golang 1.15, but the value is not properly exposed the way http2.Transport is usually instantiated (Kubernetes does it this way as well): golang/go#40201. A pull request to resolve this (allowing configuration through http2.ConfigureTransportWithOptions) has already been proposed, but is not merged or even discussed yet: golang/net#74. It even "will not make it into 1.15".

From discussion with the implementer on golang side, there would be three options to resolve this issue until we can finally make the value configurable in an official way:

  • Hacking the vendored x/net/http2 code to include the variable. Easy, but probably the worst way to implement it. I used it anyway to build an internal hotfix release and to verify the issue and solution through http2 health checks.
  • Using reflection to access the inadvertently hidden fields by getting hold of the http2.Transport (see the example in the golang/net PR's test case).
  • Implementing a (hidden?) feature toggle to not upgrade the http.Transport for http2, i.e. putting an if statement around the http2.ConfigureTransport call (but this would need to be passed through from Prometheus to common somehow); a rough sketch follows below. Not using http2 also resolves the issue, as I tested by reverting the entire common upgrade.
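To illustrate that third option, here is a minimal sketch of where such a toggle could sit; the newRoundTripper helper and the enableHTTP2 parameter are assumptions of mine, only http2.ConfigureTransport is the real x/net API (prometheus/common builds its transport differently):

package main

import (
	"net/http"

	"golang.org/x/net/http2"
)

// newRoundTripper is a hypothetical helper that stands in for the transport
// construction in prometheus/common; it only shows where the toggle would go.
func newRoundTripper(enableHTTP2 bool) (http.RoundTripper, error) {
	rt := &http.Transport{
		// TLS, proxy and dial settings would go here.
	}
	if enableHTTP2 {
		// Upgrade the HTTP/1.1 transport in place so it can also speak HTTP/2 over TLS.
		if err := http2.ConfigureTransport(rt); err != nil {
			return nil, err
		}
	}
	return rt, nil
}

func main() {
	if _, err := newRoundTripper(true); err != nil {
		panic(err)
	}
}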

We have been running code with the fix for ~48h now and have not had any federation-down issues any more (we averaged one such outage per hour before with 2.19.2).

What did you expect to see?

Broken http2 connections are recognized quickly and re-established.

What did you see instead? Under which circumstances?

Federation failing, paging the on-call engineer quite often.

Environment

  • System information:
$ uname -srm
Linux 5.3.0-46-generic x86_64
  • Prometheus version:

v2.19.2 as from Docker Hub, and a manually built/patched Prometheus as discussed above

  • Prometheus configuration file:
global:
  scrape_interval: 1m
  scrape_timeout: 45s
  evaluation_interval: 1m
[snip]
scrape_configs:
- job_name: some_federation
  honor_timestamps: true
  params:
    match[]:
    - '{__name__=~"kube_.*|kubelet_.*|pv_collector_bound_pvc_count"}'
  scrape_interval: 1m
  scrape_timeout: 45s
  metrics_path: /api/v1/namespaces/kube-system/services/prometheus:9090/proxy/prometheus/federate  # access through Kubernetes API server
  scheme: https
  static_configs:
  - targets:
    - 198.51.100.100:6443
    labels:
      tenant: some_cluster
  tls_config:
    ca_file: /tmp/prometheus/tls/some_cluster-ca.crt
    cert_file: /tmp/prometheus/tls/some_cluster.crt
    key_file: /tmp/prometheus/tls/some_cluster.key
    insecure_skip_verify: false
[snip]
@brian-brazil
Contributor

Thanks for the report. Note that common, where the relevant code lives, follows the client_golang Go version policy, so any fix needs to work with Go 1.9.

This will also affect more than Prometheus, as the Alertmanager also uses the code (blackbox shouldn't be affected, as it creates a new connection each time).

Not using http2 also resolves the issue, as I tested by reverting the entire common upgrade.

Using http2 where possible is desired; that we inadvertently weren't doing so was a bug.

@JensErat
Contributor Author

👍 for the not-as-easy-as-it-looks label.

I totally agree that using http2 is desired. This is why I proposed the "opt-out option" instead of entirely disabling it -- we'd rather live without http2 than be paged every hour (or silence the federation-down alert, which is critical for monitoring). :)

I did not test upgrading x/net to a pretty-much-master commit (we are pinned to golang.org/x/net v0.0.0-20190613194153-d28f0bde5980 right now). If this upgrade is compatible with golang 1.9, what's your opinion on bumping the library and using the reflection workaround until golang/net#74 is merged? Forking golang/net seems even worse to me.

I see that none of these options look really good, but this issue already cost us a good night of sleep until we decided to relax the federation alert for a while, plus days of debugging; and I assume this will also hit others.

@brian-brazil
Contributor

If this upgrade is compatible with golang 1.9, what's your opinion on bumping the library and using the reflection workaround until golang/net#74 is merged?

I'd prefer that we wait for a fix upstream and then pin, rather than messing around with reflection. Pinning to an unreleased commit of a library is not a problem.

I'd also rather not add user-visible configuration fields, as we'd likely end up having to keep them around indefinitely after this is all resolved.

@brian-brazil
Contributor

Looking at the PRs, why does the user even need to configure this? Having every single user have to specify this in order to provide a reliable connection seems suboptimal. A sane default would be to enable this by default on all connections on the Go end of things.

@JensErat
Contributor Author

The issue is that the golang default for ReadIdleTimeout is 0, which disables the connection health check. I wouldn't say this must be user-configurable (it is a sane thing to do with http2!), but it cannot even be set by the developer without quirks.
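For comparison, when an http2.Transport is constructed directly (HTTPS-only, without the HTTP/1.1 fallback Prometheus needs), the health-check knobs are plain struct fields; a minimal sketch, where the 30s/10s values are purely illustrative assumptions:

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t2 := &http2.Transport{
		// Send a ping frame after this much read inactivity on a connection ...
		ReadIdleTimeout: 30 * time.Second,
		// ... and drop the connection if the ping is not answered in time.
		PingTimeout: 10 * time.Second,
	}
	client := &http.Client{Transport: t2}
	_ = client
}

The problem is that the ConfigureTransport upgrade path used by prometheus/common never hands this *http2.Transport back to the caller.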

I think this is best expressed by this post from the open PR golang/net#74:

Why would you need to change these parameters?

In order to put the health check I added in https://go-review.googlesource.com/c/net/+/198040 in use. The feature by default is off.

Without this change, it's very difficult to change the ReadIdleTimeout and PingTimeout. One would need to use the unsafe package like what I did in the test.

The PR has not even been discussed so far. I'm not familiar with the golang release process, but we might be stuck in a merge freeze with broken http2 connections until 1.15 is released, and golang/net#74 only gets merged after that. The PR has not been worked on since the beginning of June.

@brian-brazil
Contributor

I wouldn't say this must be user-configurable (it is a sane thing to do with http2!), but it cannot even be set by the developer without quirks.

My point is more that the developer shouldn't even need to configure this, it should be enabled by default. Do we not have tcp keepalives to catch this?

I'm not familiar with the golang release process

Golang releases are every 6 months, so it sounds like best case we're talking about a 6+ month wait.

@roidelapluie
Member

Does GODEBUG=http2client=0 work as well?

@JensErat
Contributor Author

JensErat commented Jul 20, 2020

I fear not: "Manually configuring HTTP/2 via the golang.org/x/net/http2 package takes precedence over the net/http package's built-in HTTP/2 support.", which I understand refers to http2.ConfigureTransport. Giving it a try anyway; I rolled out v2.19.2 with the env variable set.

Update: no, it did not help; GODEBUG=http2client=0 does not disable http2 when it is enabled through ConfigureTransport. A check through netstat also shows very long-lasting established connections with stable client-side ports.
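For anyone who wants to double-check which protocol is actually negotiated, a generic snippet along these lines works (this is not Prometheus code, and the target URL is a placeholder):

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	rt := &http.Transport{}
	// Same upgrade path prometheus/common uses; GODEBUG=http2client=0 does not undo it.
	if err := http2.ConfigureTransport(rt); err != nil {
		log.Fatal(err)
	}
	resp, err := (&http.Client{Transport: rt}).Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// Expected to print "HTTP/2.0" whenever the server supports h2.
	fmt.Println(resp.Proto)
}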

@JensErat
Contributor Author

My point is more that the developer shouldn't even need to configure this, it should be enabled by default. Do we not have tcp keepalives to catch this?

Well, golang decided not to. I don't think tcp keepalive will help us here, at least not with the Linux kernel defaults:

tcp_keepalive_time - INTEGER
	How often TCP sends out keepalive messages when keepalive is enabled.
	Default: 2hours.

tcp_keepalive_probes - INTEGER
	How many keepalive probes TCP sends out, until it decides that the
	connection is broken. Default value: 9.

tcp_keepalive_intvl - INTEGER
	How frequently the probes are send out. Multiplied by
	tcp_keepalive_probes it is time to kill not responding connection,
	after probes started. Default value: 75sec i.e. connection
	will be aborted after ~11 minutes of retries.

A network connection might be left broken for 2h+11m. There might be more knobs and kernel timers in netfilter and other places, though, that catch the issue earlier. I still haven't understood what mechanics trigger the recovery (we've only ever seen this last up to 30 minutes until federation recovered, but with Kubernetes similar-looking issues remained forever until kubelet was restarted).

Golang releases are every 6 months, so it sounds like best case we're talking a 6+ month wait.

I guess getting the PR merged should be sufficient, which might even further reduce that time window.

@JensErat
Contributor Author

We can definitely confirm that the http2 timeout fixes lots of issues on our stack -- and looking at the kubelet issue ("use of closed connection"), we are by far not the only ones affected.

I agree this should be fixed upstream, and I share your thoughts on having a reasonable default: developers shouldn't have to do anything to get stable and robust http2 connections. Still, this is a very nasty and hard-to-trace issue (we spent weeks of work hunting for it, if we sum up the effort we invested in debugging in several places).

Do you think a temporary workaround to set this value (I guess there should be a safe and backwards-compatible way to do so even in prometheus/common) would be acceptable until there is a proper solution in golang.org/x/net?

@roidelapluie
Member

Hello,

We have planned to disable http2 in Prometheus 2.21.

@brian-brazil
Contributor

The present plan is that we're going to disable HTTP/2 again, until this is all fixed in Go. HTTP/2 will remain for the blackbox exporter, as these issues shouldn't affect it.

@JensErat
Contributor Author

This is good news for the Prometheus user community. It hurt us quite a bit to trace down the issue, but in the end we managed to understand quite a bunch of seemingly unrelated issues all over our stack. We'll notify our cluster users to skip 2.19/2.20, thank you for the roadmap.

@JensErat
Contributor Author

Good news: golang/net master has support for configuring http2 transports, which now allows setting the timeouts! golang/net@08b3837
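A minimal sketch of how the new API could be used once it is picked up (the timeout values below are assumptions, not what prometheus/common will ship):

package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	rt := &http.Transport{
		// usual TLS / proxy / dial settings
	}
	// ConfigureTransports (plural) returns the *http2.Transport,
	// so the health-check timeouts can be set without reflection or unsafe.
	t2, err := http2.ConfigureTransports(rt)
	if err != nil {
		log.Fatal(err)
	}
	t2.ReadIdleTimeout = 5 * time.Minute // ping connections that have been idle this long
	t2.PingTimeout = 15 * time.Second    // close them if the ping is not answered

	client := &http.Client{Transport: rt}
	_ = client
}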

@roidelapluie
Member

hello @JensErat

Could you test Prometheus 2.29 with a non-empty PROMETHEUS_COMMON_ENABLE_HTTP2 environment variable?

e.g. PROMETHEUS_COMMON_ENABLE_HTTP2=1

We have implemented the change but it is hidden for now.

roidelapluie added a commit to roidelapluie/prometheus that referenced this issue Sep 26, 2021
We are re-enabling HTTP 2 again. There has been a few bugfixes upstream
in go, and we have also enabled ReadIdleTimeout.

Fix prometheus#7588
Fix prometheus#9068

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
roidelapluie added a commit that referenced this issue Sep 26, 2021
We are re-enabling HTTP 2 again. There has been a few bugfixes upstream
in go, and we have also enabled ReadIdleTimeout.

Fix #7588
Fix #9068

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
@prometheus prometheus locked as resolved and limited conversation to collaborators Mar 25, 2022