net/http: metastability in Transport pool for TLS+HTTP/1.1 #63404
Labels: NeedsInvestigation
What version of Go are you using (go version)? Does this issue reproduce with the latest release?
Yes, this is present in the Go 1.21 series and in the development branch.
What operating system and processor architecture are you using (go env)?
I see this on linux/amd64, and it's easiest to reproduce there. The reproducer isn't very reliable on my darwin/arm64 (M1) laptop.
What did you do?
The complete application includes several layers of Go programs, connected over the network by TLS-encrypted HTTP/1.1 calls. The Go programs that make up each layer use net/http.Server to receive HTTP/1.1 requests over the network, use the context.Context value associated with each http.Request for function calls within the app, and eventually attach a context.Context value derived from the inbound request to outbound HTTP/1.1 requests made with an http.Transport.
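For concreteness, here's a minimal sketch (not the actual reproducer) of that middle-layer pattern: the handler threads the inbound request's Context into the outbound HTTP/1.1 call. The names backendURL, middleHandler, and the listen address are placeholders for illustration.

```go
package main

import (
	"io"
	"net/http"
)

const backendURL = "https://backend.internal/work" // hypothetical dependency

var client = &http.Client{Transport: &http.Transport{}}

func middleHandler(w http.ResponseWriter, r *http.Request) {
	// Canceled by net/http if the peer closes the inbound connection.
	ctx := r.Context()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, backendURL, nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	resp, err := client.Do(req)
	if err != nil {
		// Includes errors caused by ctx being canceled mid-call.
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	io.Copy(w, resp.Body)
}

func main() {
	http.ListenAndServe(":8080", http.HandlerFunc(middleHandler))
}
```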
The reproducer I've written includes three layers: a client, a middle (both a client and server), and a backend (server only).
Calls to dependencies take time for network and processing, and sometimes that total duration is within the expectations of the immediate caller but is slower than the top-level client would like. The top-level client can indicate that it no longer wants the results of the call by canceling the Context value it used for the http.Request. For requests made with HTTP/1.1, the http.Transport communicates that cancellation to its peer by closing the connection.
When an http.Server sees that the peer has closed the connection on which it received an HTTP/1.1 request, the http.Server will cancel the Context that it attached to the inbound http.Request, indicating that the http.Handler may stop working on the request. For the applications in question, the work done in the http.Handler includes outbound HTTP calls to sub-dependencies.
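A sketch of the client side of that cancellation chain, under assumed values (the URL and the 100ms timeout are placeholders): when the timeout fires, the HTTP/1.1 Transport abandons the request by closing the underlying connection, which the server observes as the peer going away.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://middle.internal/work", nil)
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// For HTTP/1.1, this cancellation tears down the TCP+TLS connection,
		// so it cannot return to the Transport's idle pool.
		log.Printf("request abandoned: %v", err)
		return
	}
	resp.Body.Close()
}
```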
At this point, we get to the http.Transport connection pool.
The pool is a cache of TCP+TLS connections. If a connection is available in the pool, the program can use it to make an HTTP request at a cost of very little on-CPU time. If there's no connection available in the pool, making that same HTTP request involves much more on-CPU time since it triggers establishment of a new connection. For HTTPS requests, that's a TCP handshake (mostly network latency) followed by a TLS handshake (similar network latency, but also substantial on-CPU time). The on-CPU time cost of making a request is bimodal.
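One way to observe that bimodal cost in a running program is net/http/httptrace, which reports whether a pooled connection was reused and how long any TLS handshake took. This is a generic observation sketch, not part of the reproducer; the target URL is a placeholder.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	var handshakeStart time.Time
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			// Reused == true is the cheap path; false means a fresh dial.
			log.Printf("conn reused=%v wasIdle=%v", info.Reused, info.WasIdle)
		},
		TLSHandshakeStart: func() { handshakeStart = time.Now() },
		TLSHandshakeDone: func(_ tls.ConnectionState, err error) {
			log.Printf("TLS handshake took %v (err=%v)", time.Since(handshakeStart), err)
		},
	}

	req, err := http.NewRequest(http.MethodGet, "https://example.com/", nil)
	if err != nil {
		log.Fatal(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```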
Without admission control to limit the amount of work a middle-layer app instance accepts, or to limit the concurrency of the http.Transport's TLS handshaking, the app can end up spending so much of its available on-CPU time in TLS handshakes that it creates its own timeouts through the exhaustion of its processing power. Dialing a new TCP+TLS+HTTP/1.1 connection involves several steps on several goroutines, and that work is scheduled onto software threads in a fair / round-robin sort of way, rather than prioritizing (among the handshaking goroutines) the handshakes that have the least amount of work left to complete.
Each handshake ends up taking a larger amount of wall-clock time because scarcity in on-CPU time means the involved goroutines need to wait in the runnable state for their turn to execute. That wall-clock time can end up being larger than the app's own expectations for how long its outbound calls should take, leading to the app's internal timeouts triggering and canceling the in-flight handshakes. (Or the app's remote caller can decide the timeout; the effect is the same.)
When the http.Transport's connection pool is full, the app works well and so it stays working well. When the pool is empty, the app works poorly (spending a huge amount of time on TLS handshakes) and so it stays working poorly. With two stable states, the system is metastable.
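One of the mitigations mentioned above, limiting the concurrency of the Transport's TLS handshaking, can be approximated outside the standard library by wrapping DialTLSContext with a semaphore. net/http does not provide this knob itself; the limit of 4 concurrent handshakes and the target URL below are illustrative assumptions.

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"net"
	"net/http"
)

// handshakeSlots caps concurrent TLS handshakes; the limit of 4 is illustrative.
var handshakeSlots = make(chan struct{}, 4)

// limitedDialTLS acquires a slot (or gives up if the request's Context is
// canceled first) before performing the TCP dial and TLS handshake.
func limitedDialTLS(ctx context.Context, network, addr string) (net.Conn, error) {
	select {
	case handshakeSlots <- struct{}{}:
		defer func() { <-handshakeSlots }()
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	d := &tls.Dialer{}
	return d.DialContext(ctx, network, addr)
}

func main() {
	client := &http.Client{Transport: &http.Transport{DialTLSContext: limitedDialTLS}}
	resp, err := client.Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```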
What did you expect to see?
I expected that basic use of net/http would not have such sharp edges. The key ingredients are outbound HTTP/1.1 calls, TLS encryption on those calls, and timeouts or cancelable Context values that can abandon those calls while they're in flight.
These are apps that disable HTTP/2 for outbound calls (a separate challenge), and which contact some dependencies that do not enable HTTP/2, and which usually set MaxIdleConnsPerHost to something much larger than the default of 2. But otherwise, it looks to me like the apps are built with the encouraged best practices (including threading through Context values).
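An illustrative Transport configuration matching that setup is below: HTTP/2 disabled for outbound calls and MaxIdleConnsPerHost raised well beyond the default of 2. The specific values and the target URL are placeholders, not the production settings.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"time"
)

var outbound = &http.Client{
	Timeout: 5 * time.Second, // illustrative per-call timeout
	Transport: &http.Transport{
		// A non-nil, empty TLSNextProto map keeps the Transport from
		// enabling HTTP/2 automatically.
		TLSNextProto:        map[string]func(string, *tls.Conn) http.RoundTripper{},
		MaxIdleConnsPerHost: 100, // default (http.DefaultMaxIdleConnsPerHost) is 2
		IdleConnTimeout:     90 * time.Second,
	},
}

func main() {
	resp, err := outbound.Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```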
Admission control (don't accept additional work when the app is already overloaded) is part of the complete response to this problem, but that involves work far beyond what's provided in the core Go release. (The other half of the solution for us is to remove the blockers that have kept us from using HTTP/2, so we can move away from HTTP/1.1 -- but that also only works for places where we control the dependency and its load balancer.)
What did you see instead?
Go apps that use HTTP/1.1 for outbound calls, encrypt those calls with TLS, and either set timeouts on those calls or thread through Context values that can be canceled as a result of a timeout, experience metastability: a large fraction of their on-CPU time ends up going to re-establishing TLS connections that are immediately discarded, because CPU overload in the process causes further timeouts.
I've included a reproducer below. It allows sweeping through two major dimensions -- the speed at which requests enter the system and the fraction of requests that will experience timeouts at the backend -- and enabling/disabling HTTP/2.
The first step in using the reproducer is to sweep through the request arrival rate with 3% injected timeouts to find a rate at which less than 50% of calls succeed. Repeat with 0.3% injected timeouts to confirm that only around 0.3% of calls fail.
Then hold that arrival rate fixed while varying the injected failure rate and (if you like) whether the TLS-encrypted requests use HTTP/2.
The test framework passes the -test.cpuprofile and -test.trace flags through to the middle and backend processes.
On my hardware (recent Intel, 8 hardware threads), the tipping point is around 60µs between requests. With that arrival interval (see collapsed output below for details), the "middle" application can keep its pool full of TLS connections if 0% or 0.3% of its calls time out. But if 1% or more of the calls time out, it's no longer able to keep up with demand and more than half of the calls the frontend makes to that middle layer end up timing out. This situation is better with HTTP/2, since canceling an HTTP/2 request does not require destroying the TLS connection.
CC @neild
use of reproducer
./pool_test.go