Skip to content

net/http: Goroutine persistConn.roundTrip stucks when HTTP call timeouts #66193

Closed as not planned
@rschone

Description

@rschone

Go version

go1.22.1

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/XXX/Library/Caches/go-build'
GOENV='/Users/XXX/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/XXX/go/pkg/mod'
GONOPROXY=''
GONOSUMDB='github.com/YYY/*'
GOOS='darwin'
GOPATH='/Users/XXX/go'
GOPRIVATE=''
GOPROXY='go-proxy.oss.wandera.net'
GOROOT='/opt/homebrew/opt/go/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/opt/go/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.1'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/gy/lwh57s5173s3m4058f2kqw_h0000gp/T/go-build1021772654=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

During the load testing thousands of HTTP requests were generated by the golang HTTP client. The server couldn't keep up with the load and requests started timeouting.

What did you see happen?

We added a simple atomic counter to see the progress of processed requests and realized, that when the load is too high, some client goroutines never finish even though a context with a deadline is used. A dump from the client showed the following stuck stack traces:

#	0x7379b8	net/http.(*persistConn).roundTrip+0x978								/usr/local/go/src/net/http/transport.go:2675
#	0x72b859	net/http.(*Transport).roundTrip+0x799								/usr/local/go/src/net/http/transport.go:608
#	0x70f6b2	net/http.(*Transport).RoundTrip+0x12								/usr/local/go/src/net/http/roundtrip.go:17
#	0x1fafeb3	github.com/opentracing-contrib/go-stdlib/nethttp.(*Transport).RoundTrip+0x253

What did you expect to see?

As we went deeper we saw a possible place in the net/http/transport.go that could cause the issue. When the readLoop is waiting when reading the body of the response and the pc.roundTrip waits until pgClosed/cancelChan/ctxDoneChan, it might happen, that the select in readLoop processes the rc.req.Cancel or rc.req.Context().Done(), calls and removes the cancelKeyFn from the transport.reqCanceler and the pc.roundTrip.pcClosed case will never have canceled=true or even pc.t.replaceReqCanceler()=true because the transport.reqCanceler won't contain the cancelKey anymore.

The prerequisites for this to happen are:

  • the processing timeouted or is canceled
  • the rc.reqCancel or rc.req.Context().Done() is processed by readLoop

Metadata

Metadata

Assignees

No one assigned

    Labels

    FrozenDueToAgeNeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.WaitingForInfoIssue is not actionable because of missing required information, which needs to be provided.

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions