Prometheus should use http2 health monitor/set http2 ReadIdleTimeout #7588

Closed
JensErat opened this issue Jul 16, 2020 · 15 comments · Fixed by #9398

Comments

@JensErat
Contributor

What did you do?

After upgrading from Prometheus v2.18.1 to v2.19.2, we had massive issues with a central Prometheus federating a bunch of others. By bisecting the changes, I was able to nail it down to enabling http2 in #7258 (which was well hidden in the commit message and not mentioned in the changelog). Federation failed for up to 15 minutes, which we explain by the kernel only getting hold of the broken connection after some time while Prometheus kept trying to use it for that long. Reverting the prometheus/common upgrade resolved the issue with 2.19.2. We assume we have rare cases of long-lasting network connections breaking at the infrastructure level.

The issue somewhat reminds me of kubernetes/kubernetes#87615, where golang/net#55 is proposed to recognize and close broken connections early. Investigating how to set the ReadIdleTimeout value in Prometheus, I realized we're lucky that Prometheus is vendoring x/net and does not require Golang 1.15, but the value is not properly exposed the way http2.Transport is usually instantiated (Kubernetes does it this way as well): golang/go#40201. A pull request to resolve this (allowing configuration through http2.ConfigureTransportWithOptions) has already been proposed, but is not merged or even discussed yet: golang/net#74. It even "will not make it into 1.15".

From discussion with the implementer on golang side, there would be three options to resolve this issue until we can finally make the value configurable in an official way:

  • Hacking the vendored x/net/http2 code to include the variable. Easy, but probably the worst way to implement it. I used it anyway to build an internal hotfix release and to verify the issue and solution through http2 health checks.
  • Using reflection to access the inadvertently hidden fields by getting hold of the http2.Transport (see the example in the golang/net PR's test case).
  • Implementing a (hidden?) feature toggle to not upgrade the http.Transport for http2, i.e. putting an if statement around the http2.ConfigureTransport call (but this would need to be passed through from Prometheus to common somehow); a rough sketch follows below. Not using http2 also resolves the issue, as I tested by reverting the entire common upgrade.
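To illustrate that third option, here is a minimal sketch of where such a toggle could sit; the newRoundTripper helper and the enableHTTP2 parameter are assumptions of mine, only http2.ConfigureTransport is the real x/net API (prometheus/common builds its transport differently):

package main

import (
	"net/http"

	"golang.org/x/net/http2"
)

// newRoundTripper is a hypothetical helper that stands in for the transport
// construction in prometheus/common; it only shows where the toggle would go.
func newRoundTripper(enableHTTP2 bool) (http.RoundTripper, error) {
	rt := &http.Transport{
		// TLS, proxy and dial settings would go here.
	}
	if enableHTTP2 {
		// Upgrade the HTTP/1.1 transport in place so it can also speak HTTP/2 over TLS.
		if err := http2.ConfigureTransport(rt); err != nil {
			return nil, err
		}
	}
	return rt, nil
}

func main() {
	if _, err := newRoundTripper(true); err != nil {
		panic(err)
	}
}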

We have been running code with the fix for ~48h now and have not had any federation-down issues any more (we averaged one such outage per hour before with 2.19.2).

What did you expect to see?

Broken http2 connections are recognized quickly and re-established.

What did you see instead? Under which circumstances?

Federation failing, paging the on-call engineer quite often.

Environment

  • System information:
$ uname -srm
Linux 5.3.0-46-generic x86_64
  • Prometheus version:

v2.19.2 as from Docker Hub, and a manually built/patched Prometheus as discussed above

  • Prometheus configuration file:
global:
  scrape_interval: 1m
  scrape_timeout: 45s
  evaluation_interval: 1m
[snip]
scrape_configs:
- job_name: some_federation
  honor_timestamps: true
  params:
    match[]:
    - '{__name__=~"kube_.*|kubelet_.*|pv_collector_bound_pvc_count"}'
  scrape_interval: 1m
  scrape_timeout: 45s
  metrics_path: /api/v1/namespaces/kube-system/services/prometheus:9090/proxy/prometheus/federate  # access through Kubernetes API server
  scheme: https
  static_configs:
  - targets:
    - 198.51.100.100:6443
    labels:
      tenant: some_cluster
  tls_config:
    ca_file: /tmp/prometheus/tls/some_cluster-ca.crt
    cert_file: /tmp/prometheus/tls/some_cluster.crt
    key_file: /tmp/prometheus/tls/some_cluster.key
    insecure_skip_verify: false
[snip]
@brian-brazil
Contributor

Thanks for the report. Note that common, where the relevant code lives, follows the client_golang Go version policy, so any fix needs to work with Go 1.9.

This will also affect more than Prometheus, as the Alertmanager also uses the code (blackbox shouldn't be affected, as it creates a new connection each time).

Not using http2 also resolves the issue, as I tested by reverting the entire common upgrade.

Using http2 where possible is desired; that we inadvertently weren't doing so was a bug.

@JensErat
Contributor Author

👍 for the not-as-easy-as-it-looks label.

I totally agree that using http2 is desired. This is why I proposed the "opt-out option" instead of entirely disabling it -- we'd rather live without http2 than be paged every hour (or silence the federation-down alert, which is critical for monitoring). :)

I did not test upgrading x/net to a pretty-much-master commit (we are pinned to golang.org/x/net v0.0.0-20190613194153-d28f0bde5980 right now). If this upgrade is compatible with golang 1.9, what's your opinion on bumping the library and using the reflection workaround until golang/net#74 is merged? Forking golang/net seems even worse to me.

I see that none of these options look really good, but this issue already cost us a good night of sleep until we decided to relax the federation alert for a while, plus days of debugging; and I assume this will also hit others.

@brian-brazil
Contributor

If this upgrade is compatible with golang 1.9, what's your opinion on bumping the library and using the reflection workaround until golang/net#74 is merged?

I'd prefer that we wait for a fix upstream and then pin, rather than messing around with reflection. Pinning to an unreleased commit of a library is not a problem.

I'd also rather not add user-visible configuration fields, as we'd likely end up having to keep them around indefinitely after this is all resolved.

@brian-brazil
Contributor

Looking at the PRs, why does the user even need to configure this? Having every single user have to specify this in order to provide a reliable connection seems suboptimal. A sane default would be to enable this by default on all connections on the Go end of things.

@JensErat
Contributor Author

The issue is that the golang default for ReadIdleTimeout is 0, which disables the connection health check. I wouldn't say this must be user-configurable (it is a sane thing to do with http2!), but it cannot even be set by the developer without quirks.
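For comparison, when an http2.Transport is constructed directly (HTTPS-only, without the HTTP/1.1 fallback Prometheus needs), the health-check knobs are plain struct fields; a minimal sketch, where the 30s/10s values are purely illustrative assumptions:

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t2 := &http2.Transport{
		// Send a ping frame after this much read inactivity on a connection ...
		ReadIdleTimeout: 30 * time.Second,
		// ... and drop the connection if the ping is not answered in time.
		PingTimeout: 10 * time.Second,
	}
	client := &http.Client{Transport: t2}
	_ = client
}

The problem is that the ConfigureTransport upgrade path used by prometheus/common never hands this *http2.Transport back to the caller.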

I think this is best expressed by this post from the open PR golang/net#74:

Why would you need to change these parameters?

In order to put the health check I added in https://go-review.googlesource.com/c/net/+/198040 in use. The feature by default is off.

Without this change, it's very difficult to change the ReadIdleTimeout and PingTimeout. One would need to use the unsafe package like what I did in the test.

The PR has not even been discussed so far. I'm not familiar with the golang release process, but we might be stuck in a merge freeze with broken http2 connections until 1.15 is released, and golang/net#74 only gets merged after that. The PR has not been worked on since the beginning of June.

@brian-brazil
Contributor

I wouldn't say this must be user-configurable (it is a sane thing to do with http2!), but it cannot even be set by the developer without quirks.

My point is more that the developer shouldn't even need to configure this, it should be enabled by default. Do we not have tcp keepalives to catch this?

I'm not familiar with the golang release process

Golang releases are every 6 months, so it sounds like best case we're talking about a 6+ month wait.

@roidelapluie
Member

Does GODEBUG=http2client=0 work as well?

@JensErat
Contributor Author

JensErat commented Jul 20, 2020

I fear not: "Manually configuring HTTP/2 via the golang.org/x/net/http2 package takes precedence over the net/http package's built-in HTTP/2 support.", which I understand refers to http2.ConfigureTransport. Giving it a try anyway; I rolled out v2.19.2 with the env variable set.

Update: no, it did not help; GODEBUG=http2client=0 does not disable http2 when it is enabled through ConfigureTransport. A check through netstat also shows very long-lasting established connections with stable client-side ports.
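For anyone who wants to double-check which protocol is actually negotiated, a generic snippet along these lines works (this is not Prometheus code, and the target URL is a placeholder):

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	rt := &http.Transport{}
	// Same upgrade path prometheus/common uses; GODEBUG=http2client=0 does not undo it.
	if err := http2.ConfigureTransport(rt); err != nil {
		log.Fatal(err)
	}
	resp, err := (&http.Client{Transport: rt}).Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// Expected to print "HTTP/2.0" whenever the server supports h2.
	fmt.Println(resp.Proto)
}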

@JensErat
Contributor Author

My point is more that the developer shouldn't even need to configure this, it should be enabled by default. Do we not have tcp keepalives to catch this?

Well, golang decided not to. I don't think tcp keepalive will help us here, at least not with the Linux kernel defaults:

tcp_keepalive_time - INTEGER
	How often TCP sends out keepalive messages when keepalive is enabled.
	Default: 2hours.

tcp_keepalive_probes - INTEGER
	How many keepalive probes TCP sends out, until it decides that the
	connection is broken. Default value: 9.

tcp_keepalive_intvl - INTEGER
	How frequently the probes are send out. Multiplied by
	tcp_keepalive_probes it is time to kill not responding connection,
	after probes started. Default value: 75sec i.e. connection
	will be aborted after ~11 minutes of retries.

A network connection might be left broken for 2h+11m. There might be more knobs and kernel timers in netfilter and other places, though, that catch the issue earlier. I still haven't understood what mechanics trigger the recovery (we've only ever seen this last up to 30 minutes until federation recovered, but with Kubernetes similar-looking issues remained forever until kubelet was restarted).

Golang releases are every 6 months, so it sounds like best case we're talking a 6+ month wait.

I guess getting the PR merged should be sufficient, which might even further reduce that time window.

@JensErat
Contributor Author

We can definitely confirm that the http2 timeout fixes lots of issues on our stack -- and looking at the kubelet issue ("use of closed connection"), we are by far not the only ones affected.

I agree this should be fixed upstream, and I share your thoughts on having a reasonable default: developers shouldn't have to do anything to get stable and robust http2 connections. Still, this is a very nasty and hard-to-trace issue (we spent weeks of work hunting for it, if we sum up the effort we invested in debugging in several places).

Do you think a temporary workaround to set this value (I guess there should be a safe and backwards-compatible way to do so even in prometheus/common) would be acceptable until there is a proper solution in golang.org/x/net?

@roidelapluie
Member

Hello,

We have planned to disable http2 in Prometheus 2.21.

@brian-brazil
Contributor

The present plan is that we're going to disable HTTP/2 again, until this is all fixed in Go. HTTP/2 will remain for the blackbox exporter, as these issues shouldn't affect it.

@JensErat
Contributor Author

This is good news for the Prometheus user community. It hurt us quite a bit to trace down the issue, but in the end we managed to understand quite a bunch of seemingly unrelated issues all over our stack. We'll notify our cluster users to skip 2.19/2.20, thank you for the roadmap.

@JensErat
Contributor Author

Good news: golang/net master has support for configuring http2 transports, which now allows setting the timeouts! golang/net@08b3837
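A minimal sketch of how the new API could be used once it is picked up (the timeout values below are assumptions, not what prometheus/common will ship):

package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	rt := &http.Transport{
		// usual TLS / proxy / dial settings
	}
	// ConfigureTransports (plural) returns the *http2.Transport,
	// so the health-check timeouts can be set without reflection or unsafe.
	t2, err := http2.ConfigureTransports(rt)
	if err != nil {
		log.Fatal(err)
	}
	t2.ReadIdleTimeout = 5 * time.Minute // ping connections that have been idle this long
	t2.PingTimeout = 15 * time.Second    // close them if the ping is not answered

	client := &http.Client{Transport: rt}
	_ = client
}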

@roidelapluie
Member

hello @JensErat

Could you test Prometheus 2.29 with a non-empty PROMETHEUS_COMMON_ENABLE_HTTP2 environment variable?

e.g. PROMETHEUS_COMMON_ENABLE_HTTP2=1

We have implemented the change but it is hidden for now.

roidelapluie added a commit to roidelapluie/prometheus that referenced this issue Sep 26, 2021
We are re-enabling HTTP 2 again. There has been a few bugfixes upstream
in go, and we have also enabled ReadIdleTimeout.

Fix prometheus#7588
Fix prometheus#9068

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
roidelapluie added a commit that referenced this issue Sep 26, 2021
We are re-enabling HTTP 2 again. There has been a few bugfixes upstream
in go, and we have also enabled ReadIdleTimeout.

Fix #7588
Fix #9068

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
@prometheus prometheus locked as resolved and limited conversation to collaborators Mar 25, 2022