Description
What version of Go are you using (go version)?
$ go version
go version go1.17.7 linux/amd64
Does this issue reproduce with the latest release?
Go 1.17.7 is the latest stable release at the time of writing. I was also able to reproduce the issue on the current "Go dev branch" version on the playground.
What operating system and processor architecture are you using (go env)?
go env Output:
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/morgan/.cache/go-build"
GOENV="/home/morgan/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/morgan/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/morgan/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.7"
GCCGO="gccgo"
AR="ar"
CC="x86_64-pc-linux-gnu-gcc"
CXX="x86_64-pc-linux-gnu-g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2625616170=/tmp/go-build -gno-record-gcc-switches"
GOROOT/bin/go version: go version go1.17.7 linux/amd64
GOROOT/bin/go tool compile -V: compile version go1.17.7
uname -sr: Linux 5.10.60.1-microsoft-standard-WSL2
/lib64/libc.so.6: GNU C Library (Gentoo 2.33-r7 p6) release version 2.33.
We also observed this issue in various environments using the golang:1.17 Docker image on commodity cloud hardware, and our reproduction case demonstrates the erroneous behavior in the Go playground.
What did you do?
https://go.dev/play/p/iUpPf4G8vZy is a small example demonstrating this behavior. We start a local HTTP server, and then create two http.Clients with timeouts set. One client uses http.DefaultTransport directly, and the other uses a local Transport type that trivially wraps http.DefaultTransport. Then, we make a large number of POST requests to that local HTTP server using each client, allow them to time out by neglecting to read the response body, and read the size of http.Transport.reqCanceler using reflection.
(The issue can be reproduced with const count = 1 set as well, though since there's a race condition involved it may take several runs of the test program to see an incorrect result.)
What did you expect to see?
Our understanding is that http.Transport.reqCanceler should always be empty when there are no HTTP requests in flight. Since it maps "requests" to "functions that cancel requests", once a request has completed, errored, timed out, or otherwise entered a state where "cancellation" is not a meaningful operation, we'd expect that request to be removed from reqCanceler. For this example, we would expect to see three lines of count = 0 printed.
What did you see instead?
Under certain conditions (described below), a request is not correctly removed from reqCanceler. This prevents the request and its associated objects from being garbage-collected until the HTTP transport itself is dropped, which could be "for the lifetime of the program" if that transport is http.DefaultTransport. In the example, the last line of the output has a non-zero count despite there being no outstanding HTTP requests. In production, we observed excessive heap use involving persistConn objects, in some cases leading to application servers running out of memory.
A detailed investigation
We believe the problem to be line 2220 of net/http/transport.go:
case <-rc.req.Cancel:
	alive = false
	pc.t.CancelRequest(rc.req) // [this line]
case <-rc.req.Context().Done():
	alive = false
	pc.t.cancelRequest(rc.cancelKey, rc.req.Context().Err())
case <-pc.closech:
	alive = false
}
The original HTTP request has been stored in pc.t.reqCanceler under rc.cancelKey. If the original HTTP request had a body (like with a POST), rc.req will not be the same request that rc.cancelKey refers to, and if the <-rc.req.Cancel case is executed, this code will remove the wrong key from reqCanceler (often one that doesn't exist to begin with). If no other CancelRequest or cancelRequest call is made with the correct cancel key, then the request stored in reqCanceler under rc.cancelKey will not be removed and will remain attached to the transport.
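For context, this is how the two requests come apart, abridged from our reading of go1.17's net/http/transport.go (elisions and comments ours). In (*Transport).roundTrip, the cancel key is built from the original request, but a request with a body is replaced by a shallow copy before it reaches the read loop:

	origReq := req
	cancelKey := cancelKey{origReq}
	req = setupRewindBody(req) // returns a new *Request when the request has a body

and CancelRequest builds its lookup key from whatever request it is handed, so pc.t.CancelRequest(rc.req) looks up cancelKey{rc.req} rather than the cancelKey{origReq} that the entry was stored under:

	func (t *Transport) CancelRequest(req *Request) {
		t.cancelRequest(cancelKey{req}, errRequestCanceled)
	}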
This will take place if:
- the server begins to send a response, but the request context times out before the response is read by the application (in our production case, this was caused by scheduler contention on both the client and server machines, so it's not just a case of forgetting to completely read the response body)
- net/http.knownRoundTripperImpl returns false for the given request, causing setRequestCancel to set req.Cancel to a new channel and to start a timer that closes it after the request's timeout period expires, and
- that timer triggers prior to the request context's expiry timer triggering and closing req.Context().Done(). Since both timers will have the same timeout, this does not always happen; in our testing, we found it occurs slightly more than half the time (the short sketch after this list illustrates the race).
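The race between those two timers can be seen in isolation with a toy program like the following. This is not net/http code; it only mimics the shape of the select above, with a plain channel standing in for req.Cancel:

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	func main() {
		const timeout = 2 * time.Millisecond
		wins := map[string]int{}

		for i := 0; i < 200; i++ {
			ctx, cancelCtx := context.WithTimeout(context.Background(), timeout)

			// Stand-in for the req.Cancel channel that setRequestCancel closes
			// from its own timer when the deadline passes.
			cancelCh := make(chan struct{})
			t := time.AfterFunc(timeout, func() { close(cancelCh) })

			// Both signals share the same deadline, so either case may win on a given run.
			select {
			case <-cancelCh:
				wins["req.Cancel"]++
			case <-ctx.Done():
				wins["ctx.Done()"]++
			}

			t.Stop()
			cancelCtx()
		}

		fmt.Println(wins)
	}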
Should all these things happen, only the <-rc.req.Cancel case will be executed.
We believe the correct fix for this problem would be to replace that line with a call to cancelRequest with the correct rc.cancelKey, similar to what's done later on in http.Transport (in fact, this case will trigger if you do read the response body and cancel the request correctly, which is why that's a precondition). We have been experimenting with an appropriately-patched net/http and found it to resolve the issue, both in that the test case given will correctly print three lines of count = 0 and that our production memory issues ceased.
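Concretely, the change we have been testing against go1.17 looks roughly like the following; errRequestCanceled is the package's existing sentinel error, though the exact error value to pass is open to discussion:

	case <-rc.req.Cancel:
		alive = false
		pc.t.cancelRequest(rc.cancelKey, errRequestCanceled) // was: pc.t.CancelRequest(rc.req)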