Prometheus should use http2 health monitor/set http2 ReadIdleTimeout #7588
Thanks for the report. Note that `common`, where the relevant code lives, follows the client_golang Go version policy, so any fix needs to work with Go 1.9. This will also affect more than Prometheus, as the Alertmanager also uses the code (blackbox shouldn't be affected, as it creates a new connection each time).
Using http2 where possible is desired; that we inadvertently weren't doing so was a bug.
👍 to that; I totally agree that using http2 is desired. This is why I proposed the "opt-out" option instead of entirely disabling it -- we'd rather live without http2 than be paged every hour (or silence the federation-down alert, which is critical for monitoring). :) I did not test upgrading. I see none of these options really looking good, but this issue already cost us a good night of sleep until we decided to relax the federation alert for a while, and days of debugging; I assume this will also hit others.
I'd prefer that we wait for a fix upstream, and then pin, rather than messing around with reflection. Pinning to an unreleased commit of a library is not a problem. I'd also rather not add user-visible configuration fields, as we'd likely end up having to keep them around indefinitely after this is all resolved.
Looking at the PRs, why does the user even need to configure this? Having every single user have to specify this in order to get a reliable connection seems suboptimal. A sane default would be to enable this by default on all connections on the Go end of things.
The issue is the golang default, which leaves this health check disabled. I think this is best expressed by this post from the open PR golang/net#74:
The PR hasn't even been discussed so far. I'm not familiar with the golang release process, but we might be stuck with breaking http2 connections until 1.15 is released, and golang/net#74 might be merged only after that. The PR hasn't been worked on since the beginning of June.
My point is more that the developer shouldn't even need to configure this; it should be enabled by default. Do we not have TCP keepalives to catch this?
Golang releases are every 6 months, so it sounds like best case we're talking a 6+ month wait.
Does `GODEBUG=http2client=0` work as well?
I fear not: "Manually configuring HTTP/2 via the golang.org/x/net/http2 package takes precedence over the net/http package's built-in HTTP/2 support.", which I understand means the GODEBUG switch does not apply once http2 is set up via x/net. Update: no, it did not help.
Well, golang decided not to. I don't think TCP keepalive will help us here, at least not with the Linux kernel defaults:
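A quick back-of-the-envelope check of those defaults (the sysctl values below are the usual stock-kernel settings and an assumption on my side; verify with `sysctl net.ipv4.tcp_keepalive_time` etc.):

```go
package main

import "fmt"

func main() {
	// Usual Linux kernel defaults (assumed; verify via sysctl on your hosts):
	keepaliveTime := 7200  // net.ipv4.tcp_keepalive_time: idle seconds before the first probe
	keepaliveIntvl := 75   // net.ipv4.tcp_keepalive_intvl: seconds between unanswered probes
	keepaliveProbes := 9   // net.ipv4.tcp_keepalive_probes: failed probes before the conn is dropped
	worst := keepaliveTime + keepaliveIntvl*keepaliveProbes
	fmt.Printf("worst case until a dead peer is noticed: %ds\n", worst) // 7875s, i.e. ~2h11m
}
```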
A network connection might be left broken for 2h+11m. There might be more knobs and kernel timers in netfilter and other places that catch the issue earlier, though. I still haven't understood what mechanics trigger the recovery (we've only ever seen this last up to 30 minutes until federation recovered, but with Kubernetes similar-looking issues remained forever until the kubelet was restarted).
I guess getting the PR merged should be sufficient, and might even further reduce that time window.
We can definitely confirm the http2 timeout fixes lots of issues on our stack -- and looking at the kubelet issue ("use of closed connection"), we are by far not the only ones affected. I agree this should be fixed upstream, and share your thoughts on having a reasonable default; developers shouldn't have to do anything to get stable and robust http2 connections. Still, this is a very nasty and hard-to-trace issue (we were hunting it for weeks, if we sum up the effort we invested into debugging in several places). Do you think a temporary workaround to set this value would be acceptable (I guess there should be a safe and backwards-compatible way to do so even in prometheus/common) until there is a proper solution in golang/x/net?
Hello, we have planned to disable http2 in Prometheus 2.21.
The present plan is that we're going to disable HTTP/2 again, until this is all fixed in Go. HTTP/2 will remain for the blackbox exporter, as these issues shouldn't affect it. |
This is good news for the Prometheus user community. It hurt us quite a bit to trace down the issue, but in the end we managed to understand quite a bunch of seemingly unrelated issues all over our stack. We'll notify our cluster users to skip 2.19/2.20, thank you for the roadmap.
Good news: golang/net master has support for configuring http2 transports, which now allows setting the timeouts! golang/net@08b3837 |
Hello @JensErat, could you test Prometheus 2.29 with a non-empty `PROMETHEUS_COMMON_ENABLE_HTTP2` environment variable? We have implemented the change, but it is hidden behind this variable for now.
We are re-enabling HTTP/2 again. There have been a few bugfixes upstream in Go, and we have also enabled ReadIdleTimeout. Fix prometheus#7588 Fix prometheus#9068 Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
What did you do?
After upgrading from Prometheus v2.18.1 to v2.19.2, we had massive issues with a central Prometheus federating a bunch of others. By bisecting the changes, I was able to nail it down to enabling http2 in #7258 (which was well hidden in the commit message and not mentioned in the changelog). Federation failed for up to 15 minutes, which we explain by the kernel only getting hold of the broken connection after some time, while Prometheus kept trying to use it for that long. Reverting the `prometheus/common` upgrade resolved the issue with 2.19.2. We assume we have rare cases of long-lasting network connections breaking on the infrastructure level.

The issue somewhat reminds me of kubernetes/kubernetes#87615, where golang/net#55 is proposed to recognize and close broken connections early. Investigating how to set the `ReadIdleTimeout` value in Prometheus, I realized we're lucky that Prometheus is vendoring `x/net` and does not require Golang 1.15, but the value is not properly exposed the way `http2.Transport` is usually instantiated (Kubernetes is also doing it this way): golang/go#40201. A pull request to resolve this (allowing configuration through `http2.ConfigureTransportWithOptions`) is already proposed, but not merged or even discussed yet: golang/net#74. It even "will not make it into 1.15".

From discussion with the implementer on the golang side, there would be three options to resolve this issue until we can finally make the value configurable in an official way:

1. Patch the vendored `x/net/http2` code to include the variable. Easy, but probably the worst way to implement it. I used it anyway to build an internal hotfix release and verify the issue and solution through http2 health checks.
2. Manually set up the `http2.Transport` (example in the `golang/net` PR's test case).
3. Disable the `http.Transport` upgrade for http2, i.e. put an `if` statement around the `http2.ConfigureTransport` call (but this would need to be passed through from Prometheus to common somehow). Not using http2 also resolves the issue, as I verified by reverting the entire common upgrade.

We ran code with the fix for ~48h now, and did not have any federation-down issues any more (on average we had one per hour before with 2.19.2).
What did you expect to see?
Broken http2 connections are recognized early and reestablished.
What did you see instead? Under which circumstances?
Federation failing, paging the on-call engineer quite often.
Environment
v2.19.2 as from Docker Hub, and a manually built/patched Prometheus as discussed above.