
x/net/http2: random read deadlock #39812

Closed · Xfennec opened this issue Jun 24, 2020 · 8 comments

Xfennec commented Jun 24, 2020

Hi!

What version of Go are you using (go version)?

$ go version
go version go1.13.1 linux/amd64

(seen from at least go1.10.4)

Does this issue reproduce with the latest release?

Currently testing with 1.14.3

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xfennec/.cache/go-build"
GOENV="/home/xfennec/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xfennec/Quiris/Go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build988333864=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We use a chain of two HTTP proxies written in Go to route incoming traffic to our virtual machines:

          incoming traffic
                 |
                 v
          +-------------+
          |   primary   |
          +--+-------+--+
             |       |
        +----+       +----+   HTTP/2
        |                 |
        v                 v
+---------------+    +---------------+
| secondary (1) |    | secondary (2) |
+---+---+---+---+    +---+---+---+---+
    |   |   |            |   |   |
    v   v   v            v   v   v

             VMs, HTTP/1.1

What did you expect to see?

Stable service over time.

What did you see instead?

Every ~10 days, we detect a total stall of requests between our front ("primary") proxy and one of the secondary proxies. New requests are accepted, sent to the secondary, then to the VM, but the response stalls somewhere between the secondary and the primary proxy.

All requests to this secondary proxy then wait forever, since all requests are multiplexed inside the same HTTP/2 connection.

Restarting one of the proxies, or even killing the "faulty" HTTP/2 connection between our proxies (using ss -K), makes the service work perfectly again.

We've had this issue for months now, and I still fail to see any common pattern (time of day, server load, requests, …) that could be the trigger. I've also tried to write a few test/repro programs, without success.

Here's a sample of the stack on the primary proxy during one of the outages:

goroutine 72163481 [sync.Cond.Wait, 5 minutes]:
runtime.goparkunlock(...)
        /usr/lib/golang/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc001b55300, 0x0)
        /usr/lib/golang/src/runtime/sema.go:510 +0xf8
sync.(*Cond).Wait(0xc001b552f0)
        /usr/lib/golang/src/sync/cond.go:56 +0x9d
net/http.(*http2pipe).Read(0xc001b552e8, 0xc0004e0000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/http/h2_bundle.go:3512 +0xa6
net/http.http2transportResponseBody.Read(0xc001b552c0, 0xc0004e0000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/http/h2_bundle.go:8448 +0xac
net/http/httputil.(*ReverseProxy).copyBuffer(0xc00014f810, 0x7f4cf39cc878, 0xc0000100a0, 0x7f4cf3a120f8, 0xc001b552c0, 0xc0004e0000, 0x8000, 0x8000, 0xc0006876d0, 0x920ca887, ...)
        /usr/lib/golang/src/net/http/httputil/reverseproxy.go:408 +0xa7
net/http/httputil.(*ReverseProxy).copyResponse(0xc00014f810, 0x7f4cf39cc878, 0xc0000100a0, 0x7f4cf3a120f8, 0xc001b552c0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/http/httputil/reverseproxy.go:396 +0xb5
net/http/httputil.(*ReverseProxy).ServeHTTP(0xc00014f810, 0x859e00, 0xc0000100a0, 0xc00111fc00)
        /usr/lib/golang/src/net/http/httputil/reverseproxy.go:289 +0x7e2
main.(*ProxyServer).serveReverseProxy(0xc00006f1a0, 0xc00148e680, 0x7d6772, 0x5, 0x859e00, 0xc0000100a0, 0xc00111fc00, 0xc001f07100)
        /home/xfennec/Quiris/Go/src/github.com/OnitiFR/mulch/cmd/mulch-proxy/proxy.go:208 +0x277
main.(*ProxyServer).handleRequest(0xc00006f1a0, 0x859e00, 0xc0000100a0, 0xc00111fc00)
        /home/xfennec/Quiris/Go/src/github.com/OnitiFR/mulch/cmd/mulch-proxy/proxy.go:287 +0x608
net/http.HandlerFunc.ServeHTTP(0xc0000d98b0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
        /usr/lib/golang/src/net/http/server.go:2007 +0x44
net/http.(*ServeMux).ServeHTTP(0xc0000124c0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
        /usr/lib/golang/src/net/http/server.go:2387 +0x1bd
net/http.serverHandler.ServeHTTP(0xc0001460e0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
        /usr/lib/golang/src/net/http/server.go:2802 +0xa4
net/http.initNPNRequest.ServeHTTP(0x85a8c0, 0xc0018b72f0, 0xc000a38380, 0xc0001460e0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
        /usr/lib/golang/src/net/http/server.go:3366 +0x8d
net/http.(*http2serverConn).runHandler(0xc0008b9500, 0xc0000100a0, 0xc00111ac00, 0xc0003f52c0)
        /usr/lib/golang/src/net/http/h2_bundle.go:5713 +0x9f
created by net/http.(*http2serverConn).processHeaders
        /usr/lib/golang/src/net/http/h2_bundle.go:5447 +0x4eb

At the same time, on the secondary proxy, I see this:

goroutine 32086483 [select, 5 minutes]:
net/http.(*http2serverConn).writeDataFromHandler(0xc0024f7e00, 0xc0002b8d10, 0xc0007b2000, 0x53, 0x1000, 0xc000d4fc01, 0x1d, 0x6779ed)
        /usr/lib/golang/src/net/http/h2_bundle.go:4577 +0x23f
net/http.(*http2responseWriterState).writeChunk(0xc001188380, 0xc0007b2000, 0x53, 0x1000, 0xc00170fdb8, 0x1, 0xc000bf0340)
        /usr/lib/golang/src/net/http/h2_bundle.go:6043 +0x475
net/http.http2chunkWriter.Write(0xc001188380, 0xc0007b2000, 0x53, 0x1000, 0xc00144c420, 0x855e20, 0xc0003e0c80)
        /usr/lib/golang/src/net/http/h2_bundle.go:5926 +0x49
bufio.(*Writer).Flush(0xc000bf1340, 0x0, 0x0)
        /usr/lib/golang/src/bufio/bufio.go:593 +0x75
net/http.(*http2responseWriter).Flush(0xc001582088)
        /usr/lib/golang/src/net/http/h2_bundle.go:6122 +0x42
net/http.(*http2responseWriter).handlerDone(0xc001582088)
        /usr/lib/golang/src/net/http/h2_bundle.go:6258 +0x46
net/http.(*http2serverConn).runHandler.func1(0xc001582088, 0xc00170ff67, 0xc0024f7e00)
        /usr/lib/golang/src/net/http/h2_bundle.go:5711 +0x2bf
net/http.(*http2serverConn).runHandler(0xc0024f7e00, 0xc001582088, 0xc0020c2a00, 0xc0003509e0)
        /usr/lib/golang/src/net/http/h2_bundle.go:5715 +0xa5
created by net/http.(*http2serverConn).processHeaders
        /usr/lib/golang/src/net/http/h2_bundle.go:5447 +0x4eb
  • how can I make this bug report more helpful?
  • any idea for sample code to replicate this issue without waiting 10 days each time?
  • as a workaround, what timeout/deadline setting at the Transport level would mitigate this issue?
davecheney (Contributor) commented Jun 24, 2020

how can I make this bug report more helpful?

Have you built and deployed a race-enabled version of your application?
See https://blog.golang.org/race-detector.

Xfennec (Author) commented Jun 24, 2020

Thanks for this link, @davecheney. I've just deployed a race-enabled binary on our primary proxy. No race detected so far.

Side question: are race-enabled binaries supposed to be noticeably slower?

cagedmantis added this to the Unplanned milestone Jun 24, 2020
davecheney (Contributor) commented Jun 25, 2020

Side question: are race-enabled binaries supposed to be noticeably slower?

Yes, race detection does add overhead. I always recommend that people use go test -race, but that relies on having integration-style tests that can exercise 'racy' code paths. The best way I've found to spot data races is to deploy a race-enabled build to a percentage of your production traffic. This seems to strike the right balance between coverage and race-detector overhead.

fraenkel (Contributor) commented Jun 27, 2020

This is a duplicate of #32388 and #33425.

davecheney (Contributor) commented Jun 29, 2020

Duplicate of #32388 and #33425

davecheney closed this Jun 29, 2020
Xfennec (Author) commented Jul 20, 2020

If it's of any help: with Go 1.14.3 and a race-enabled build, the issue popped up again 24 days later. Same symptoms, same stack trace, no race condition reported. That's an unusually long interval (the average between outages is around 10 days), but I can't tell whether it's related to the Go upgrade, the race-enabled build, a change in our load over this period, or just pure randomness.

davecheney (Contributor) commented Jul 20, 2020

@Xfennec please follow up on one of the related issues. Thank you.
