Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

Open
josharian opened this issue Jan 22, 2020 · 16 comments

Comments

@josharian
Copy link
Contributor

@josharian josharian commented Jan 22, 2020

What version of Go are you using (go version)?

Go 1.14 beta 1

Does this issue reproduce with the latest release?

Not sure; I can't compile my program with 1.13.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/josh/Library/Caches/go-build"
GOENV="/Users/josh/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/josh"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/josh/go/1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/josh/go/1.14/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/josh/x/skiff/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/1t/n61cbvls5bl293bbb0zyypqw0000gn/T/go-build473522411=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Made a series of http requests from a Go client to a Go server. The requests were vanilla HTTP requests, using http.Get.

I lost connectivity at some point. When I regained connectivity, all subsequent http requests failed with error read: connection reset by peer. I waited quite a long time, and it never recovered.

It has happened a couple of times, but doesn't reproduce reliably (which is unsurprising, since me closing my laptop lid is not exactly a precision affair).

This looks very similar to #34978, although I don't know when exactly the connectivity failed.

Tentatively marking as Go 1.14.

cc @bradfitz @fraenkel @tombergan

@josharian josharian added this to the Go1.14 milestone Jan 22, 2020
@fraenkel
Copy link
Contributor

@fraenkel fraenkel commented Jan 23, 2020

You were probably affected by #24138

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 23, 2020

You were probably affected by #24138

Ugh. Thanks. This is about (not) having timeouts on the client side, right? It's unfortunate that it affects brand new requests and that it permanently breaks all new requests.

Could we add a (when-broken-only?) connection re-use timeout without breaking the compatibility promise? It seems like connection re-use decisions are entirely up to net/http.

Any suggestions for workarounds? It's tricky since I am actually doing long polling requests, where I actively want to keep connections open for hours, so I don't want to set aggressive timeouts. Would it help to call http.DefaultClient.CloseIdleConnections() whenever an http request fails? Or start using a new http.Client whenever an http request fails?

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 23, 2020

A different culprit than lack of timeout might be lack of resume-after-suspend notification. That would let us ping the server on resume and reconnect if nec. Also discussed in #36141 (comment)

Maybe unrelated, but the CL for #23459 (net.Dial TCP keepalive on by default) didn't touch net/http. Does http.Client not use net.Dial, or is this disabled for http.Client by a different CL?

@bcmills
Copy link
Member

@bcmills bcmills commented Jan 23, 2020

Given that we don't know whether 1.13 was affected, I suspect this should not be milestoned to 1.14 at this point.

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 23, 2020

@bcmills I milestoned based on superficial similarity to #34978, which has outstanding backports to 1.13. Feel free to re-milestone, although it might be better to understand a bit more first.

@fraenkel
Copy link
Contributor

@fraenkel fraenkel commented Jan 24, 2020

@josharian Did you enable http2 on the Server? If so, curious what happens with http/1.1.

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 24, 2020

I used net/http out of the box. I will run for a bit with http2 disabled and see what happens.

@networkimprov
Copy link

@networkimprov networkimprov commented Jan 24, 2020

In 1.13 (and presumably earlier; not sure), http.DefaultTransport uses TCP keepalive. On client suspend/resume where the server drops the link (e.g. due to no traffic), subsequent client Write() disables TCP keepalive, preventing a timeout error from the last client Read() -- see #31490. I don't know whether this is preventing your http.DefaultClient from recovering, but thought I should mention it.

Setting a short keepalive helped in my case (a plain TCP/TLS context, not HTTP). Eventually I plan to disable keepalive and use a server pulse to detect a dead link.

From net/http/transport.go:

// DefaultTransport is the default implementation of Transport and is
// used by DefaultClient. It establishes network connections as needed
// and caches them for reuse by subsequent calls. It uses HTTP proxies
// as directed by the $HTTP_PROXY and $NO_PROXY (or $http_proxy and
// $no_proxy) environment variables.
var DefaultTransport RoundTripper = &Transport{
        Proxy: ProxyFromEnvironment,
        DialContext: (&net.Dialer{
                Timeout:   30 * time.Second,
                KeepAlive: 30 * time.Second,
                DualStack: true,
        }).DialContext,
        ForceAttemptHTTP2:     true,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
}
@josharian josharian modified the milestones: Go1.14, Backlog Jan 25, 2020
@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Jan 25, 2020

@networkimprov, that seems unrelated.

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 26, 2020

I finally tracked this down.

It turns out that there was a code path in which in I didn't close resp.Body. Timeouts didn't help.

Although this was user error (sorry and thanks!), I do wonder whether there's a way to detect this, since it was not easy to find.

One option is a vet check that, similar to the lostcancel check, checks that resp.Body.Close gets called on all exits from a function. Another option would be some kind of check at runtime in the http package, although I don't know net/http well enough to know whether that is feasible.

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 26, 2020

Correction: timeouts were also necessary (in addition to properly closing resp.Body).

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 26, 2020

Closing as user error.

@josharian josharian closed this Jan 26, 2020
@bcmills
Copy link
Member

@bcmills bcmills commented Jan 27, 2020

@josharian, could you provide some more detail about how this resulting from a missing resp.Body.Close?

The documentation for the Body field says:

The default HTTP client's Transport may not
reuse HTTP/1.x "keep-alive" TCP connections if the Body is
not read to completion and closed.

It does not say anything about what happens for HTTP/2 connections, so at the very least there seems to be a documentation issue here.

@bcmills bcmills reopened this Jan 27, 2020
@bcmills bcmills changed the title net/http: permanently broken connection with error "read: connection reset by peer" net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed Jan 27, 2020
@bcmills
Copy link
Member

@bcmills bcmills commented Jan 27, 2020

This also seems like something we could detect and report, using a finalizer-based approach similar to the one described in #24739: if the Body becomes unreachable without being closed, that always indicates a caller error, and could be reported as such (perhaps as a panic, or perhaps by some other means).

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Jan 27, 2020

Yeah, I don't think forgetting to close resp.Body should cause anything worse than a TCP connection leak. It certainly shouldn't affect unrelated HTTP fetches.

@josharian
Copy link
Contributor Author

@josharian josharian commented Jan 27, 2020

I'm honestly not sure. I tried to reproduce using multiple simpler ssetup and failed. There was clearly something else going on as well that closing the Body helped with. I went looking to try to understand better, but I've already spent over a full day on this, and I'm not really inclined to spend more time digging.

I do think a finalizer-based unclosed unreachable Body detection would be a good idea, since that is definitely a bug, regardless of what precise consequences it has. I suspect it'd be better to log instead of panicking--net/http already logs when it notices an ineffectual WriteHeader call, so there's some precedent for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.