Description
Copy of http://issuetracker.google.com/140973477 to github.
What version of Go are you using (go version)?
$ go version
go version go1.13.1 linux/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/usr/local/google/home/mikedanese/.cache/go-build"
GOENV="/usr/local/google/home/mikedanese/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/usr/local/google/home/mikedanese/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/google/home/mikedanese/.gimme/versions/go1.13.1.linux.amd64"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/google/home/mikedanese/.gimme/versions/go1.13.1.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build932575926=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Test case-ish is here:
https://gist.github.com/mikedanese/a9204f541d1b12740be2551e381b99fc
Trigger a connection write error after the TLS handshake but during the h2 handshake.
What did you expect to see?
The persistent connection is not cached by http.Transport.
What did you see instead?
The persistent connection is cached by http.Transport.
More
This change appears to be the culprit:
We noticed that a client can get into a state where it is consistently failing to make requests and dumping logs like:
W1016 23:30:50.108262 1 plugin.go:164] rpc error: code = Unauthenticated desc = transport: Post ...: write tcp PRIVATE_IP:47518->108.177.119.95:443: write: broken pipe
While inspecting the process with strace, we noticed that it wasn't making any syscalls that would correspond to an outbound http request. We collected a core dump from the process and inspected it:
(dlv) p "net/http".DefaultTransport.idleConn
map[net/http.connectMethodKey][]*net/http.persistConn [
{proxy: "", scheme: "https", addr: "container.googleapis.com:443", onlyH1: false}: [
*(*"net/http.persistConn")(0xc00039e6c0),
],
]
(dlv) p (*(*"net/http.persistConn")(0xc00039e6c0)).alt
net/http.RoundTripper(net/http.http2erringRoundTripper) {
err: error(*net.OpError) *{
Op: "write",
Net: "tcp",
Source: net.Addr(*net.TCPAddr) ...,
Addr: net.Addr(*net.TCPAddr) ...,
Err: error(*os.SyscallError) ...,},}
(dlv) p (*(*"net/http.persistConn")(0xc00039e6c0)).alt.err
error(*net.OpError) *{
Op: "write",
Net: "tcp",
Source: net.Addr(*net.TCPAddr) *{
IP: net.IP len: 4, cap: 4, [PRIVATE_IP],
Port: 47518,
Zone: "",},
Addr: net.Addr(*net.TCPAddr) *{
IP: net.IP len: 4, cap: 4, [108,177,119,95],
Port: 443,
Zone: "",},
Err: error(*os.SyscallError) *{
Syscall: "write",
Err: error(syscall.Errno) *(*error)(0xc0003ddc90),},}
The core dump showed that http.Transport was holding onto one persistent connection with an alt http2erringRoundTripper that wrapped a failed write error. We successfully reproduced the behavior in the test case-ish linked above and saw that a failed write here:
Lines 7157 to 7164 in 8d2ea29
Causes the h2 upgradeFn:
Lines 6652 to 6656 in 8d2ea29
To return an http2erringRoundTripper. The transport caches the persistent connection and stores the erring round tripper in pconn.alt. The erring round tripper is used in subsequent calls to:
Lines 531 to 533 in 8d2ea29
And the pconn remains in the idle connection cache.