net/http: http/2 connection management magnifies the effect of network timeouts #63422
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes, this is present in the Go 1.21 series and in the development branch.
What operating system and processor architecture are you using (go env)?
What did you do?
I made HTTP requests over a network where routing changes sometimes result in TCP flows transitioning from 0% packet loss to 100% packet loss.
What did you expect to see?
I expected that a TCP flow with 100% packet loss would not be used for new requests.
What did you see instead?
With HTTP/1, each connection is used for one request at a time. Complete packet loss on a connection leads to failure by timeout of the request that was assigned to it (if any), or the failure of the next request to be assigned to that connection. The http.Transport pool responds to that situation by closing the connection, preventing its use for any other requests. This is good.
With HTTP/2, each connection may be used for multiple concurrent requests. Complete packet loss on a connection leads to failure by timeout of every request assigned to it (if any). This is understandable, and without a user-defined limit on the number of streams the client is willing to open, it seems unavoidable.
But continuing with HTTP/2, if the connection has not yet reached its limit for concurrent streams, the http.Transport pool will continue to assign new requests to that connection, even if the other requests it's assigned to that connection have failed to result in any HTTP/2 frames received at the client. If I understand correctly, a Go httptest.Server will allow up to 250 concurrent streams per connection. Failing hundreds of requests when one connection experiences packet loss is unfortunate, and seems like something we could improve.
Finally on HTTP/2, when a connection is experiencing 100% packet loss and so results in timeouts as I described above, any request that has timed out stops counting against the connection's concurrent streams limit. This results in the dead connection appearing to have capacity for new requests. This can result in thousands upon thousands of requests failing, especially if the concurrency never grows beyond what the http.Transport expects a single connection can handle. This can go on until the TCP connection's buffer within the OS fills, or until the OS times out the connection (which can take several minutes). Failing thousands of requests, beyond the connection's max streams limit, when one connection experiences packet loss is something we should be able to fix.
These stark differences between how HTTP/1 and Go's HTTP/2 respond to network perturbations make it hard to rely on Go's HTTP/2 support in production.
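To make the accounting above concrete, here is a toy model of the two pool behaviors. It is only a sketch: `simulate`, its parameters, and the single-connection pool are hypothetical simplifications for illustration, not the actual net/http data structures. It assumes sequential requests, each with its own timeout, and uses the request counts from the runs reported later in this issue.

```go
package main

import "fmt"

// simulate models n sequential requests sent through a toy pool that holds
// one connection at a time. unplugAt lists the request indexes at which the
// current connection starts dropping every packet. maxStreams is the
// connection's concurrent stream limit. closeOnTimeout selects whether the
// pool discards a connection after a request on it times out.
// This is an illustrative model, not the real http.Transport logic.
func simulate(n int, unplugAt map[int]bool, maxStreams int, closeOnTimeout bool) (failed int) {
	dead := false // whether the current connection is losing all packets
	active := 0   // streams currently counted against the limit
	for i := 0; i < n; i++ {
		if unplugAt[i] {
			dead = true
		}
		if active >= maxStreams {
			// The connection looks full, so the toy pool dials a replacement.
			dead, active = false, 0
		}
		active++ // the request takes a stream slot
		if dead {
			failed++ // no frames come back, so the request times out
			if closeOnTimeout {
				// HTTP/1-like: drop the connection; the next request dials anew.
				dead, active = false, 0
				continue
			}
		}
		active-- // completed or timed out: the slot is released either way
	}
	return failed
}

func main() {
	unplug := map[int]bool{5000: true, 15000: true, 25000: true}
	fmt.Println(simulate(30000, unplug, 1, true))    // HTTP/1-like pool: prints 3
	fmt.Println(simulate(30000, unplug, 250, false)) // HTTP/2-like pool: prints 25000
}
```

Because a timed-out request releases its slot, `active` never approaches `maxStreams` under sequential load, so the model never has a reason to replace the dead connection.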
CC @neild
The reproducer below (collapsed) is structured as a test that fails unconditionally in order to print logs. It makes one request at a time, with a timeout. Every so often, it "unplugs" one of the TCP connections in the pool to simulate (from Go's perspective, if not the OS's) 100% packet loss. Usually there's only one connection in the pool.
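The collapsed reproducer is not shown here, so the following is only a hypothetical sketch of one way a connection could be "unplugged" from Go's perspective. The `unpluggableConn` type and `demoUnplug` helper are invented for illustration and are not taken from the reproducer. The idea is to wrap a `net.Conn` so that, once unplugged, writes are silently discarded and incoming bytes are dropped.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

// unpluggableConn is a hypothetical wrapper that simulates 100% packet loss
// as seen by the Go program. The OS-level connection stays open.
type unpluggableConn struct {
	net.Conn
	unplugged atomic.Bool
}

// Unplug makes the connection behave as if every packet were lost.
func (c *unpluggableConn) Unplug() { c.unplugged.Store(true) }

// Read passes data through until unplugged. After that it discards whatever
// arrives and keeps waiting, returning only on an error such as a read
// deadline or Close.
func (c *unpluggableConn) Read(p []byte) (int, error) {
	for {
		n, err := c.Conn.Read(p)
		if !c.unplugged.Load() {
			return n, err
		}
		if err != nil {
			return 0, err
		}
	}
}

// Write pretends to succeed once unplugged, but nothing reaches the peer.
func (c *unpluggableConn) Write(p []byte) (int, error) {
	if c.unplugged.Load() {
		return len(p), nil
	}
	return c.Conn.Write(p)
}

// demoUnplug exercises the wrapper over an in-memory pipe.
func demoUnplug() (written int, readErr error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	c := &unpluggableConn{Conn: client}
	c.Unplug()

	// The write reports success even though the peer never reads it.
	written, _ = c.Write([]byte("hi"))

	// The peer sends a byte; the unplugged side drops it and then times out.
	go server.Write([]byte("x"))
	c.SetReadDeadline(time.Now().Add(50 * time.Millisecond))
	_, readErr = c.Read(make([]byte, 1))
	return written, readErr
}

func main() {
	n, err := demoUnplug()
	fmt.Println(n, err)
}
```

A wrapper like this could be returned from a custom dial function so that a test can unplug connections on demand; how the actual reproducer does it may differ.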
Here's how it looks when making 30,000 requests over HTTP/1.1 with a 10ms timeout, unplugging a connection after 5,000, 15,000, and 25,000 requests. It observes failure in 3 requests.
Here's how it looks when making 30,000 requests over HTTP/2 with a 10ms timeout, unplugging a connection after 5,000 requests and attempting to unplug after 15,000 and 25,000 requests as well (though there's no new connection to unplug). It observes failure in 25,000 requests.
./h2_unplug_test.go