
net/http: transport caches permanently broken persistent connections if write error happens during h2 handshake #34978

Open
mikedanese opened this issue Oct 18, 2019 · 12 comments

@mikedanese (Contributor) commented Oct 18, 2019

Copy of http://issuetracker.google.com/140973477, moved to GitHub.

What version of Go are you using (go version)?

$ go version
go version go1.13.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/usr/local/google/home/mikedanese/.cache/go-build"
GOENV="/usr/local/google/home/mikedanese/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/usr/local/google/home/mikedanese/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/google/home/mikedanese/.gimme/versions/go1.13.1.linux.amd64"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/google/home/mikedanese/.gimme/versions/go1.13.1.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build932575926=/tmp/go-build -gno-record-gcc-switches"        

What did you do?

Test case-ish is here:

https://gist.github.com/mikedanese/a9204f541d1b12740be2551e381b99fc

Trigger a connection write error after the TLS handshake but during the h2 handshake.
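
For readers without access to the gist, here is a minimal sketch (ours, not the linked test case) of one way to inject such a failure: wrap the raw TCP connection so writes start failing only after the TLS handshake completes, which means the very next writes, the HTTP/2 client preface and SETTINGS frames, hit the error. The flakyConn and dialTLS names are hypothetical. Note that on Go 1.13 a Transport with a custom DialTLS does not enable HTTP/2 automatically, so the h2 upgrade has to be turned on explicitly (e.g. via golang.org/x/net/http2's ConfigureTransport).

package main

import (
    "crypto/tls"
    "log"
    "net"
    "net/http"
    "syscall"
)

// flakyConn is a hypothetical net.Conn wrapper whose writes fail with EPIPE
// once failWrites is set, approximating a peer that drops the connection
// between the TLS handshake and the h2 handshake.
type flakyConn struct {
    net.Conn
    failWrites bool // set before the conn is handed back to the transport
}

func (c *flakyConn) Write(p []byte) (int, error) {
    if c.failWrites {
        return 0, syscall.EPIPE
    }
    return c.Conn.Write(p)
}

// dialTLS dials, completes the TLS handshake over the wrapper, then arranges
// for the next writes (the h2 client preface and SETTINGS) to fail.
func dialTLS(network, addr string) (net.Conn, error) {
    raw, err := net.Dial(network, addr)
    if err != nil {
        return nil, err
    }
    host, _, err := net.SplitHostPort(addr)
    if err != nil {
        return nil, err
    }
    fc := &flakyConn{Conn: raw}
    tc := tls.Client(fc, &tls.Config{ServerName: host, NextProtos: []string{"h2"}})
    if err := tc.Handshake(); err != nil {
        return nil, err
    }
    fc.failWrites = true // everything written after the TLS handshake now fails
    return tc, nil
}

func main() {
    tr := &http.Transport{DialTLS: dialTLS}
    // On Go 1.13 a custom DialTLS disables the automatic h2 upgrade; enable it
    // explicitly with golang.org/x/net/http2, e.g. http2.ConfigureTransport(tr).
    resp, err := (&http.Client{Transport: tr}).Get("https://example.com/")
    if err != nil {
        log.Fatal(err) // expected: write ...: broken pipe
    }
    resp.Body.Close()
}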

What did you expect to see?

The persistent connection is not cached by http.Transport.

What did you see instead?

The persistent connection is cached by http.Transport.

More

This change appears to be the culprit:

43b9fcf

We noticed that a client can get into a state where it consistently fails to make requests, logging errors like:

W1016 23:30:50.108262       1 plugin.go:164] rpc error: code = Unauthenticated desc = transport: Post ...: write tcp PRIVATE_IP:47518->108.177.119.95:443: write: broken pipe

While inspecting the process with strace, we noticed that it wasn't making any syscalls that would correspond to an outbound HTTP request. We collected a core dump from the process and inspected it:

(dlv) p "net/http".DefaultTransport.idleConn
map[net/http.connectMethodKey][]*net/http.persistConn [
        {proxy: "", scheme: "https", addr: "container.googleapis.com:443", onlyH1: false}: [
                *(*"net/http.persistConn")(0xc00039e6c0),
        ],
]

(dlv) p (*(*"net/http.persistConn")(0xc00039e6c0)).alt
net/http.RoundTripper(net/http.http2erringRoundTripper) {
        err: error(*net.OpError) *{
                Op: "write",
                Net: "tcp",
                Source: net.Addr(*net.TCPAddr) ...,
                Addr: net.Addr(*net.TCPAddr) ...,
                Err: error(*os.SyscallError) ...,},}

(dlv) p (*(*"net/http.persistConn")(0xc00039e6c0)).alt.err
error(*net.OpError) *{
        Op: "write",
        Net: "tcp",
        Source: net.Addr(*net.TCPAddr) *{
                IP: net.IP len: 4, cap: 4, [PRIVATE_IP],
                Port: 47518,
                Zone: "",},
        Addr: net.Addr(*net.TCPAddr) *{
                IP: net.IP len: 4, cap: 4, [108,177,119,95],
                Port: 443,
                Zone: "",},
        Err: error(*os.SyscallError) *{
                Syscall: "write",
                Err: error(syscall.Errno) *(*error)(0xc0003ddc90),},}

The core dump showed that http.Transport was holding onto one persistent connection whose alt field held an http2erringRoundTripper wrapping a failed write. We reproduced the behavior in the test case-ish linked above and saw that a failed write here:

go/src/net/http/h2_bundle.go

Lines 7157 to 7164 in 8d2ea29

cc.bw.Write(http2clientPreface)
cc.fr.WriteSettings(initialSettings...)
cc.fr.WriteWindowUpdate(0, http2transportDefaultConnFlow)
cc.inflow.add(http2transportDefaultConnFlow + http2initialWindowSize)
cc.bw.Flush()
if cc.werr != nil {
    return nil, cc.werr
}
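
As context for why a single check of cc.werr after Flush is enough: the bundled transport writes through a "sticky" error writer that remembers the first write error and returns it for every later write. Roughly, the pattern looks like the sketch below (simplified from http2stickyErrWriter in h2_bundle.go; not an exact copy).

import "io"

// stickyErrWriter records the first write error through a shared *error (the
// real transport points it at cc.werr) and returns it on all later writes, so
// one check after Flush catches a failure from any of the buffered writes above.
type stickyErrWriter struct {
    w   io.Writer
    err *error
}

func (s stickyErrWriter) Write(p []byte) (int, error) {
    if *s.err != nil {
        return 0, *s.err
    }
    n, err := s.w.Write(p)
    *s.err = err
    return n, err
}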

That failed write causes the h2 upgradeFn:

go/src/net/http/h2_bundle.go

Lines 6652 to 6656 in 8d2ea29

upgradeFn := func(authority string, c *tls.Conn) RoundTripper {
    addr := http2authorityAddr("https", authority)
    if used, err := connPool.addConnIfNeeded(addr, t2, c); err != nil {
        go c.Close()
        return http2erringRoundTripper{err}

to return an http2erringRoundTripper. The transport then caches the persistent connection with the erring round tripper stored in pconn.alt, and that erring round tripper is used in every subsequent call to:

// HTTP/2 path.
t.setReqCanceler(req, nil) // not cancelable with CancelRequest
resp, err = pconn.alt.RoundTrip(req)

And the pconn remains in the idle connection cache.
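
Until the transport stops caching these connections, one coarse client-side mitigation (our suggestion, not something from the original report) is to drop the transport's idle pool whenever this error signature shows up, so the next request dials a fresh connection instead of reusing the broken one.

import "net/http"

// evictIdleOnError is a hypothetical helper: after a request fails with a
// persistent write error such as "broken pipe", CloseIdleConnections should
// drop the cached, permanently broken persistConn so the next attempt dials
// fresh. This works around the symptom, not the underlying bug.
func evictIdleOnError(c *http.Client, err error) {
    if err == nil {
        return
    }
    if tr, ok := c.Transport.(*http.Transport); ok {
        tr.CloseIdleConnections()
    }
}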

@agnivade added this to the Unplanned milestone Oct 18, 2019
@bcmills (Member) commented Oct 18, 2019

Does this reproduce with Go 1.13.3 (released yesterday)?

#34498 seems like it may be related, and the fix for that was backported in 1.13.3.

@mikedanese (Contributor, Author) commented Oct 18, 2019

It reproduces on go1.13.3 and master.

@bcmills (Member) commented Oct 18, 2019

Is this a regression from 1.12?

CC @tombergan @fraenkel @bgentry

@mikedanese (Contributor, Author) commented Oct 18, 2019

Yes, this is a new regression. It was broken by 43b9fcf, which is present in go1.13.3, go1.13.2, go1.13.1, go1.13, go1.13rc2, go1.13rc1, and go1.13beta1.

@bradfitz (Member) commented Oct 18, 2019

@fraenkel, could you take a look?

I'll also try to look but it might still be in your head more than mine.

@bradfitz (Member) commented Oct 18, 2019

@mikedanese, thanks! This is a great bug report.

@gopherbot commented Oct 18, 2019

Change https://golang.org/cl/202078 mentions this issue: net/http: don't cache http2.erringRoundTripper connections

@gopherbot gopherbot closed this in 88186e5 Oct 22, 2019
@bradfitz (Member) commented Oct 22, 2019

@gopherbot, please backport to Go 1.13.

@gopherbot commented Oct 22, 2019

Backport issue(s) opened: #35087 (for 1.13).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot commented Oct 22, 2019

Change https://golang.org/cl/202642 mentions this issue: [release-branch.go1.13] net/http: don't cache http2.erringRoundTripper connections

gopherbot pushed a commit that referenced this issue Oct 23, 2019: [release-branch.go1.13] net/http: don't cache http2.erringRoundTripper connections

Fixes #35087
Updates #34978

Change-Id: I3baf1392ba7366ae6628889c47c343ef702ec438
Reviewed-on: https://go-review.googlesource.com/c/go/+/202078
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
(cherry picked from commit 88186e5)
Reviewed-on: https://go-review.googlesource.com/c/go/+/202642
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
@bradfitz bradfitz self-assigned this Oct 23, 2019
@bradfitz (Member) commented Oct 23, 2019

Reopening, as the investigation into the test failures from this CL (#35113) isn't leaving me entirely happy yet.

@bradfitz bradfitz reopened this Oct 23, 2019
@liggitt liggitt referenced this issue Oct 25, 2019