net/http: setting a timeout on http requests that use TLS can result in excessive requests to server #59017
Labels
NeedsInvestigation
Performance
What version of Go are you using (go version)?
go version go1.20.2 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOOS="darwin"
What did you do?
When running a large number of concurrent requests to Amazon S3 with a request timeout set (either on the http.Client instance or via a context set directly on the request), we noticed an unexpectedly large number of long-running requests. When we removed the timeout, the number of long-running requests dropped. The long-running requests were not directly caused by the timeout being hit: all requests completed in under the timeout.
We created a standalone program to reproduce the problem and added logging via httptrace. In the httptrace output we observed a large number of requests for which the error "context canceled" was reported to the TLSHandshakeDone callback in our trace. These handshake errors did not cause failed requests as reported by the HTTP client.
Digging into the http.Transport code, it appears that when a connection is not immediately available in the connection pool, the Transport starts a race between obtaining a connection returned to the pool and dialing a new connection. In our case, the "obtain a connection returned to the pool" leg was generally winning the race. The behavior on the losing side of the race differed depending on whether the request used a timeout or not. On requests without a timeout, the losing leg continued through the TLS handshake, and the connection was then placed into the pool as a valid connection. On requests with a timeout, the losing leg was aborted mid-TLS-handshake because the request context was canceled once the request completed using the connection that was returned to the pool.
The net result of this behavior was that whenever a request legitimately required a new connection to be established, it was often queued up (probably at the server end) behind a large number of TLS handshakes that would be cancelled in flight. This manifested as excessive time to complete the request and noticeably lower throughput.
What did you expect to see?
Client does not produce a large volume of aborted TLS handshakes to server.
What did you see instead?
Slowness caused by excessive TLS handshakes to server
I created a gist that reproduces this at https://gist.github.com/mpoindexter/f6bc9dac16290343efba17129c49f9d5. If you uncomment the timeout on line 56 you can see the throughput of the test drop and stall periodically, but if the timeout remains commented throughput remains steady.
To test the analysis above, I implemented a custom DialTLSContext function that performed the TLS dial using a new context if a deadline was set on the original context. This resolved the problem.