Description
Go version
go1.22.1
Output of go env
in your module/workspace:
GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/XXX/Library/Caches/go-build'
GOENV='/Users/XXX/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/XXX/go/pkg/mod'
GONOPROXY=''
GONOSUMDB='github.com/YYY/*'
GOOS='darwin'
GOPATH='/Users/XXX/go'
GOPRIVATE=''
GOPROXY='go-proxy.oss.wandera.net'
GOROOT='/opt/homebrew/opt/go/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/opt/go/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.1'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/gy/lwh57s5173s3m4058f2kqw_h0000gp/T/go-build1021772654=/tmp/go-build -gno-record-gcc-switches -fno-common'
What did you do?
During load testing, thousands of HTTP requests were generated by the Go HTTP client. The server couldn't keep up with the load and requests started timing out.
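For reference, this is roughly the shape of the load generator we used; the endpoint URL, request count, and timeout below are illustrative placeholders, not our actual values:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var processed atomic.Int64 // simple counter of finished requests
	client := &http.Client{}

	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ { // request count is a placeholder
		wg.Add(1)
		go func() {
			defer wg.Done()

			// every request carries a context with a deadline
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()

			// placeholder endpoint; the real target was the service under test
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://localhost:8080/endpoint", nil)
			if err != nil {
				return
			}
			resp, err := client.Do(req)
			if err == nil {
				resp.Body.Close()
			}
			processed.Add(1)
		}()
	}
	wg.Wait()
	fmt.Println("processed:", processed.Load())
}
```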
What did you see happen?
We added a simple atomic counter to track the progress of processed requests and realized that, when the load is too high, some client goroutines never finish even though a context with a deadline is used. A dump from the client showed the following stuck stack traces:
# 0x7379b8 net/http.(*persistConn).roundTrip+0x978 /usr/local/go/src/net/http/transport.go:2675
# 0x72b859 net/http.(*Transport).roundTrip+0x799 /usr/local/go/src/net/http/transport.go:608
# 0x70f6b2 net/http.(*Transport).RoundTrip+0x12 /usr/local/go/src/net/http/roundtrip.go:17
# 0x1fafeb3 github.com/opentracing-contrib/go-stdlib/nethttp.(*Transport).RoundTrip+0x253
What did you expect to see?
As we dug deeper, we found a place in net/http/transport.go that could cause the issue. When the readLoop is blocked reading the response body and pc.roundTrip is waiting on pcClosed/cancelChan/ctxDoneChan, it can happen that the select in readLoop handles the rc.req.Cancel or rc.req.Context().Done() case, which calls the cancel function and removes the cancelKey from transport.reqCanceler. The pcClosed case in pc.roundTrip will then never see canceled=true, nor pc.t.replaceReqCanceler()=true, because transport.reqCanceler no longer contains the cancelKey, so the goroutine waits forever.
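To illustrate the pattern (this is a deliberately simplified toy model, not the actual net/http code): whichever goroutine handles the cancellation first removes the cancelKey, and the other goroutine can then no longer detect that the request was canceled.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// toyTransport stands in for http.Transport's reqCanceler map
// (cancelKey -> cancel function); here we only track key presence.
type toyTransport struct {
	mu          sync.Mutex
	reqCanceler map[int]bool
}

// cancelRequest mimics the effect relevant here: the entry for key is removed.
func (t *toyTransport) cancelRequest(key int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.reqCanceler, key)
}

// replaceReqCanceler mimics replaceReqCanceler(key, nil): it reports whether
// the key was still present (and removes it).
func (t *toyTransport) replaceReqCanceler(key int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	ok := t.reqCanceler[key]
	delete(t.reqCanceler, key)
	return ok
}

func main() {
	const cancelKey = 1
	t := &toyTransport{reqCanceler: map[int]bool{cancelKey: true}}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	pcClosed := make(chan struct{})

	// "readLoop": wins the race for ctx.Done(), removes the cancel key,
	// and the connection is torn down afterwards.
	go func() {
		<-ctx.Done()
		t.cancelRequest(cancelKey)
		close(pcClosed)
	}()

	// "roundTrip": it only observes the closed connection; by now the cancel
	// key is gone, so it cannot tell the request was canceled. In the real
	// code it would keep waiting for a response that never arrives.
	<-pcClosed
	fmt.Println("cancellation visible to roundTrip:", t.replaceReqCanceler(cancelKey)) // prints false
}
```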
The prerequisites for this to happen are:
- the request processing timed out or was canceled
- the rc.req.Cancel or rc.req.Context().Done() case is processed by the readLoop