Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/http2: Stuck http requests #56385

Open
mhr3 opened this issue Oct 23, 2022 · 14 comments
Open

x/net/http2: Stuck http requests #56385

mhr3 opened this issue Oct 23, 2022 · 14 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mhr3
Copy link

mhr3 commented Oct 23, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18.7 linux/amd64

Does this issue reproduce with the latest release?

Haven't tried

What did you do?

We're running a http server which uses a http client to proxy the request body to another server (using h2c) and return its response body, we've recently noticed there's sometimes a bunch of requests getting stuck waiting for a sync.Cond, unfortunately I do not have a reproducer (this works fine 99.9% of the time), but I do have goroutine stacktraces from both the servers. Hoping someone more familiar with the http code internals might offer some help, as from what I can tell from the stacktraces, the client is waiting for the server to do something and the server is waiting for the client (ie looks like a deadlock).

One thing to note is that only requests which are proxied directly (io.Copy from the server Request into the client Request) get stuck, other requests where the body is constructed in memory were never affected.

Here's stacktraces from the server doing the proxying (some lines containing private code trimmed):

goroutine profile: total 806
242 @ 0x43d156 0x435b37 0x467da9 0x4e1812 0x4e2b7a 0x4e2b68 0x548689 0x55b705 0x75591f 0x46d9c1
#	0x467da8	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:302
#	0x4e1811	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:83
#	0x4e2b79	internal/poll.(*pollDesc).waitRead+0x259	internal/poll/fd_poll_runtime.go:88
#	0x4e2b67	internal/poll.(*FD).Read+0x247			internal/poll/fd_unix.go:167
#	0x548688	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x55b704	net.(*conn).Read+0x44				net/net.go:183
#	0x75591e	net/http.(*connReader).backgroundRead+0x3e	net/http/server.go:672

242 @ 0x43d156 0x44cbf2 0x905589 0x904b7e 0x46d9c1
#	0x905588	golang.org/x/net/http2.(*clientStream).writeRequest+0x9c8	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:1355
#	0x904b7d	golang.org/x/net/http2.(*clientStream).doRequest+0x1d		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:1217

⬇️ these requests are stuck
201 @ 0x43d156 0x4698bd 0x46989d 0x475cec 0x8efa6b 0x90bf25 0x4d8cf2 0x920b32 0x920b0c 0x132afe2 0x1234c7b 0x12dccce 0x122b084 0x12f64f5 0x122c2fe 0x122a570 0x923e71 0x122e098 0x9702a5 0x96f872 0x12df80b 0x924e52 0x122dc3b 0x122f511 0x924c7c 0x76057b 0x75ba37 0x46d9c1
#	0x46989c	sync.runtime_notifyListWait+0x11c							runtime/sema.go:513
#	0x475ceb	sync.(*Cond).Wait+0x8b									sync/cond.go:56
#	0x8efa6a	golang.org/x/net/http2.(*pipe).Read+0xea						golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/pipe.go:76
#	0x90bf24	golang.org/x/net/http2.transportResponseBody.Read+0x84					golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2417
#	0x4d8cf1	io.copyBuffer+0x1b1									io/io.go:426
#	0x920b31	io.Copy+0x71										io/io.go:385
#	0x920b0b	github.com/labstack/echo/v4.(*context).Stream+0x4b					github.com/labstack/echo/v4@v4.5.0/context.go:561
#	0x132afe1	<private_pkg>.handleEndpoint+0xd81				private.go:234
...
#	0x924e51	github.com/labstack/echo/v4.(*Echo).ServeHTTP.func1+0x131				github.com/labstack/echo/v4@v4.5.0/echo.go:648
#	0x122dc3a	github.com/labstack/echo/v4/middleware.MethodOverrideWithConfig.func1.1+0xba		github.com/labstack/echo/v4@v4.5.0/middleware/method_override.go:65
#	0x122f510	github.com/labstack/echo/v4/middleware.RemoveTrailingSlashWithConfig.func1.1+0x210	github.com/labstack/echo/v4@v4.5.0/middleware/slash.go:118
#	0x924c7b	github.com/labstack/echo/v4.(*Echo).ServeHTTP+0x3bb					github.com/labstack/echo/v4@v4.5.0/echo.go:654
#	0x76057a	net/http.serverHandler.ServeHTTP+0x43a							net/http/server.go:2916
#	0x75ba36	net/http.(*conn).serve+0x5d6								net/http/server.go:1966

35 @ 0x43d156 0x4698bd 0x46989d 0x475cec 0x8efa6b 0x90bf25 0x570354 0x4d87ba 0x6f9105 0x6f90b7 0x6f8d1b 0x6f89fe 0x133580f 0x132bdb9 0x1234c7b 0x12dccce 0x122b084 0x12f64f5 0x122c2fe 0x122a570 0x923e71 0x122e098 0x9702a5 0x96f872 0x12df80b 0x924e52 0x122dc3b 0x122f511 0x924c7c 0x76057b 0x75ba37 0x46d9c1
#	0x46989c	sync.runtime_notifyListWait+0x11c							runtime/sema.go:513
#	0x475ceb	sync.(*Cond).Wait+0x8b									sync/cond.go:56
#	0x8efa6a	golang.org/x/net/http2.(*pipe).Read+0xea						golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/pipe.go:76
#	0x90bf24	golang.org/x/net/http2.transportResponseBody.Read+0x84					golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2417
#	0x570353	bufio.(*Reader).Read+0x1b3								bufio/bufio.go:236
#	0x4d87b9	io.ReadAtLeast+0x99									io/io.go:331
#	0x6f9104	io.ReadFull+0xa4									io/io.go:350
#	0x6f90b6	compress/gzip.(*Reader).readHeader+0x56							compress/gzip/gunzip.go:175
#	0x6f8d1a	compress/gzip.(*Reader).Reset+0x2ba							compress/gzip/gunzip.go:113
#	0x6f89fd	compress/gzip.NewReader+0x3d								compress/gzip/gunzip.go:94
#	0x133580e	<private_pkg>.createRequest+0x28e			private.go:1225
...
#	0x924e51	github.com/labstack/echo/v4.(*Echo).ServeHTTP.func1+0x131				github.com/labstack/echo/v4@v4.5.0/echo.go:648
#	0x122dc3a	github.com/labstack/echo/v4/middleware.MethodOverrideWithConfig.func1.1+0xba		github.com/labstack/echo/v4@v4.5.0/middleware/method_override.go:65
#	0x122f510	github.com/labstack/echo/v4/middleware.RemoveTrailingSlashWithConfig.func1.1+0x210	github.com/labstack/echo/v4@v4.5.0/middleware/slash.go:118
#	0x924c7b	github.com/labstack/echo/v4.(*Echo).ServeHTTP+0x3bb					github.com/labstack/echo/v4@v4.5.0/echo.go:654
#	0x76057a	net/http.serverHandler.ServeHTTP+0x43a							net/http/server.go:2916
#	0x75ba36	net/http.(*conn).serve+0x5d6								net/http/server.go:1966

17 @ 0x43d156 0x435b37 0x467da9 0x4e1812 0x4e2b7a 0x4e2b68 0x548689 0x55b705 0x755ded 0x56fd83 0x56fedd 0x757338 0x75b78b 0x46d9c1
#	0x467da8	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:302
#	0x4e1811	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:83
#	0x4e2b79	internal/poll.(*pollDesc).waitRead+0x259	internal/poll/fd_poll_runtime.go:88
#	0x4e2b67	internal/poll.(*FD).Read+0x247			internal/poll/fd_unix.go:167
#	0x548688	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x55b704	net.(*conn).Read+0x44				net/net.go:183
#	0x755dec	net/http.(*connReader).Read+0x16c		net/http/server.go:780
#	0x56fd82	bufio.(*Reader).fill+0x102			bufio/bufio.go:106
#	0x56fedc	bufio.(*Reader).Peek+0x5c			bufio/bufio.go:144
#	0x757337	net/http.(*conn).readRequest+0x1f7		net/http/server.go:985
#	0x75b78a	net/http.(*conn).serve+0x32a			net/http/server.go:1891

7 @ 0x43d156 0x44cbf2 0x778a95 0x46d9c1
#	0x778a94	net/http.(*persistConn).writeLoop+0xf4	net/http/transport.go:2392

6 @ 0x43d156 0x4698bd 0x46989d 0x475cec 0x8efa6b 0x90bf25 0x7df5df 0x7df1db 0x7dee38 0x11d5f5a 0x11d69c7 0x11f3a7d 0x132adae 0x1234c7b 0x12dccce 0x122b084 0x12f64f5 0x122c2fe 0x122a570 0x923e71 0x122e098 0x9702a5 0x96f872 0x12df80b 0x924e52 0x122dc3b 0x122f511 0x924c7c 0x76057b 0x75ba37 0x46d9c1
#	0x46989c	sync.runtime_notifyListWait+0x11c							runtime/sema.go:513
#	0x475ceb	sync.(*Cond).Wait+0x8b									sync/cond.go:56
#	0x8efa6a	golang.org/x/net/http2.(*pipe).Read+0xea						golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/pipe.go:76
#	0x90bf24	golang.org/x/net/http2.transportResponseBody.Read+0x84					golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2417
#	0x7df5de	encoding/json.(*Decoder).refill+0x17e							encoding/json/stream.go:165
#	0x7df1da	encoding/json.(*Decoder).readValue+0xba							encoding/json/stream.go:140
#	0x7dee37	encoding/json.(*Decoder).Decode+0x77							encoding/json/stream.go:63
#	0x11d5f59	<private_pkg>.doRequest+0x119				private.go:204
...
#	0x924e51	github.com/labstack/echo/v4.(*Echo).ServeHTTP.func1+0x131				github.com/labstack/echo/v4@v4.5.0/echo.go:648
#	0x122dc3a	github.com/labstack/echo/v4/middleware.MethodOverrideWithConfig.func1.1+0xba		github.com/labstack/echo/v4@v4.5.0/middleware/method_override.go:65
#	0x122f510	github.com/labstack/echo/v4/middleware.RemoveTrailingSlashWithConfig.func1.1+0x210	github.com/labstack/echo/v4@v4.5.0/middleware/slash.go:118
#	0x924c7b	github.com/labstack/echo/v4.(*Echo).ServeHTTP+0x3bb					github.com/labstack/echo/v4@v4.5.0/echo.go:654
#	0x76057a	net/http.serverHandler.ServeHTTP+0x43a							net/http/server.go:2916
#	0x75ba36	net/http.(*conn).serve+0x5d6								net/http/server.go:1966

4 @ 0x43d156 0x435b37 0x467da9 0x4e1812 0x4e2b7a 0x4e2b68 0x548689 0x55b705 0x570354 0x4d87ba 0x8e708e 0x8e7048 0x8e78f5 0x909f70 0x9094af 0x46d9c1
#	0x467da8	internal/poll.runtime_pollWait+0x88			runtime/netpoll.go:302
#	0x4e1811	internal/poll.(*pollDesc).wait+0x31			internal/poll/fd_poll_runtime.go:83
#	0x4e2b79	internal/poll.(*pollDesc).waitRead+0x259		internal/poll/fd_poll_runtime.go:88
#	0x4e2b67	internal/poll.(*FD).Read+0x247				internal/poll/fd_unix.go:167
#	0x548688	net.(*netFD).Read+0x28					net/fd_posix.go:55
#	0x55b704	net.(*conn).Read+0x44					net/net.go:183
#	0x570353	bufio.(*Reader).Read+0x1b3				bufio/bufio.go:236
#	0x4d87b9	io.ReadAtLeast+0x99					io/io.go:331
#	0x8e708d	io.ReadFull+0x6d					io/io.go:350
#	0x8e7047	golang.org/x/net/http2.readFrameHeader+0x27		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:237
#	0x8e78f4	golang.org/x/net/http2.(*Framer).ReadFrame+0x94		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:498
#	0x909f6f	golang.org/x/net/http2.(*clientConnReadLoop).run+0x12f	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2134
#	0x9094ae	golang.org/x/net/http2.(*ClientConn).readLoop+0x6e	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/transport.go:2030

4 @ 0x43d156 0x435b37 0x467da9 0x4e1812 0x4e2b7a 0x4e2b68 0x548689 0x55b705 0x775f8e 0x56fd83 0x56fedd 0x776dac 0x46d9c1
#	0x467da8	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:302
#	0x4e1811	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:83
#	0x4e2b79	internal/poll.(*pollDesc).waitRead+0x259	internal/poll/fd_poll_runtime.go:88
#	0x4e2b67	internal/poll.(*FD).Read+0x247			internal/poll/fd_unix.go:167
#	0x548688	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x55b704	net.(*conn).Read+0x44				net/net.go:183
#	0x775f8d	net/http.(*persistConn).Read+0x4d		net/http/transport.go:1929
#	0x56fd82	bufio.(*Reader).fill+0x102			bufio/bufio.go:106
#	0x56fedc	bufio.(*Reader).Peek+0x5c			bufio/bufio.go:144
#	0x776dab	net/http.(*persistConn).readLoop+0x1ab		net/http/transport.go:2093

3 @ 0x43d156 0x435b37 0x467da9 0x4e1812 0x4e2b7a 0x4e2b68 0x548689 0x55b705 0x6a2fdd 0x50e5b8 0x6a31c5 0x6a07d6 0x6a622f 0x6a6230 0x775f8e 0x56fd83 0x56fedd 0x776dac 0x46d9c1
#	0x467da8	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:302
#	0x4e1811	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:83
#	0x4e2b79	internal/poll.(*pollDesc).waitRead+0x259	internal/poll/fd_poll_runtime.go:88
#	0x4e2b67	internal/poll.(*FD).Read+0x247			internal/poll/fd_unix.go:167
#	0x548688	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x55b704	net.(*conn).Read+0x44				net/net.go:183
#	0x6a2fdc	crypto/tls.(*atLeastReader).Read+0x3c		crypto/tls/conn.go:785
#	0x50e5b7	bytes.(*Buffer).ReadFrom+0x97			bytes/buffer.go:204
#	0x6a31c4	crypto/tls.(*Conn).readFromUntil+0xe4		crypto/tls/conn.go:807
#	0x6a07d5	crypto/tls.(*Conn).readRecordOrCCS+0x115	crypto/tls/conn.go:614
#	0x6a622e	crypto/tls.(*Conn).readRecord+0x16e		crypto/tls/conn.go:582
#	0x6a622f	crypto/tls.(*Conn).Read+0x16f			crypto/tls/conn.go:1285
#	0x775f8d	net/http.(*persistConn).Read+0x4d		net/http/transport.go:1929
#	0x56fd82	bufio.(*Reader).fill+0x102			bufio/bufio.go:106
#	0x56fedc	bufio.(*Reader).Peek+0x5c			bufio/bufio.go:144
#	0x776dab	net/http.(*persistConn).readLoop+0x1ab		net/http/transport.go:2093

And here's stacktraces from the h2c server:

goroutine profile: total 287
⬇️ these are the stuck ones (as you can see this is after the http Handler returns, which seems to point to the client being in bad state
243 @ 0x43d0d6 0x44cb72 0xcba285 0xcc1e65 0xcc175e 0x565f02 0xcc260f 0xcc3318 0xcc0c79 0xcc09f7 0x46db41
#	0xcba284	golang.org/x/net/http2.(*serverConn).writeDataFromHandler+0x204	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:1064
#	0xcc1e64	golang.org/x/net/http2.(*responseWriterState).writeChunk+0x4c4	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:2589
#	0xcc175d	golang.org/x/net/http2.chunkWriter.Write+0x1d			golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:2469
#	0x565f01	bufio.(*Writer).Flush+0x61					bufio/bufio.go:628
#	0xcc260e	golang.org/x/net/http2.(*responseWriter).Flush+0x2e		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:2669
#	0xcc3317	golang.org/x/net/http2.(*responseWriter).handlerDone+0x37	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:2832
#	0xcc0c78	golang.org/x/net/http2.(*serverConn).runHandler.func1+0x218	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:2253
#	0xcc09f6	golang.org/x/net/http2.(*serverConn).runHandler+0x96		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:2257

4 @ 0x43d0d6 0x435ab7 0x467e49 0x4d8632 0x4d999a 0x4d9988 0x5c9f49 0x5dc685 0x6f95ad 0x564303 0x564eef 0x565147 0x685a19 0x6f4979 0x6f497a 0x6fab4a 0x6fef4b 0x46db41
#	0x467e48	internal/poll.runtime_pollWait+0x88		runtime/netpoll.go:302
#	0x4d8631	internal/poll.(*pollDesc).wait+0x31		internal/poll/fd_poll_runtime.go:83
#	0x4d9999	internal/poll.(*pollDesc).waitRead+0x259	internal/poll/fd_poll_runtime.go:88
#	0x4d9987	internal/poll.(*FD).Read+0x247			internal/poll/fd_unix.go:167
#	0x5c9f48	net.(*netFD).Read+0x28				net/fd_posix.go:55
#	0x5dc684	net.(*conn).Read+0x44				net/net.go:183
#	0x6f95ac	net/http.(*connReader).Read+0x16c		net/http/server.go:780
#	0x564302	bufio.(*Reader).fill+0x102			bufio/bufio.go:106
#	0x564eee	bufio.(*Reader).ReadSlice+0x2e			bufio/bufio.go:371
#	0x565146	bufio.(*Reader).ReadLine+0x26			bufio/bufio.go:400
#	0x685a18	net/textproto.(*Reader).readLineSlice+0x98	net/textproto/reader.go:57
#	0x6f4978	net/textproto.(*Reader).ReadLine+0x78		net/textproto/reader.go:38
#	0x6f4979	net/http.readRequest+0x79			net/http/request.go:1029
#	0x6fab49	net/http.(*conn).readRequest+0x249		net/http/server.go:988
#	0x6fef4a	net/http.(*conn).serve+0x32a			net/http/server.go:1891

4 @ 0x43d0d6 0x435ab7 0x467e49 0x4d8632 0x4d999a 0x4d9988 0x5c9f49 0x5dc685 0xdc0f84 0x4b11da 0xcac56e 0xcac528 0xcacdd5 0xcb7fb1 0x46db41
#	0x467e48	internal/poll.runtime_pollWait+0x88			runtime/netpoll.go:302
#	0x4d8631	internal/poll.(*pollDesc).wait+0x31			internal/poll/fd_poll_runtime.go:83
#	0x4d9999	internal/poll.(*pollDesc).waitRead+0x259		internal/poll/fd_poll_runtime.go:88
#	0x4d9987	internal/poll.(*FD).Read+0x247				internal/poll/fd_unix.go:167
#	0x5c9f48	net.(*netFD).Read+0x28					net/fd_posix.go:55
#	0x5dc684	net.(*conn).Read+0x44					net/net.go:183
#	0xdc0f83	golang.org/x/net/http2/h2c.(*bufConn).Read+0xa3		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/h2c/h2c.go:214
#	0x4b11d9	io.ReadAtLeast+0x99					io/io.go:331
#	0xcac56d	io.ReadFull+0x6d					io/io.go:350
#	0xcac527	golang.org/x/net/http2.readFrameHeader+0x27		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:237
#	0xcacdd4	golang.org/x/net/http2.(*Framer).ReadFrame+0x94		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/frame.go:498
#	0xcb7fb0	golang.org/x/net/http2.(*serverConn).readFrames+0x90	golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:786

4 @ 0x43d0d6 0x44cb72 0xcb8e6c 0xcb72ac 0xdc029d 0xdc9527 0x70074f 0x703d3b 0x6ff1f7 0x46db41
#	0xcb8e6b	golang.org/x/net/http2.(*serverConn).serve+0x88b		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:900
#	0xcb72ab	golang.org/x/net/http2.(*Server).ServeConn+0xb8b		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/server.go:504
#	0xdc029c	golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP+0x2bc		golang.org/x/net@v0.0.0-20220909164309-bea034e7d591/http2/h2c/h2c.go:88
#	0xdc9526	<private_pkg>.NewServer.func1+0x86	private.go:59
#	0x70074e	net/http.HandlerFunc.ServeHTTP+0x2e				net/http/server.go:2084
#	0x703d3a	net/http.serverHandler.ServeHTTP+0x43a				net/http/server.go:2916
#	0x6ff1f6	net/http.(*conn).serve+0x5d6					net/http/server.go:1966
@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 24, 2022
@mhr3
Copy link
Author

mhr3 commented Nov 7, 2022

Sorry, why is this marked "WaitingForInfo"? I don't see any requests for additional information.

@ianlancetaylor
Copy link
Contributor

CC @neild @bradfitz

@neild neild removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 7, 2022
@seankhliao seankhliao added this to the Unplanned milestone Nov 19, 2022
@Yuhjiang
Copy link

Yuhjiang commented Nov 24, 2022

I'v also encountered a similar problem. go version is 1.16.15, os version is SMP Debian 5.4.143.bsk.8 Wed Jul 20 08:43:38 UTC 2022 x86_64 GNU/Linux. There are a lot of goroutines blocked due to http.(*http2clientStream).cleanupWriteRequest when getting mutex lock of wmu. I can't reproduce it because the problem appears occasionally.
Most of the requested urls are the same domain. The urls with same domain share the same http2ClientConn which meanings they share the same wmu(mutex). Maybe one broken request held the mutex lock and didn't release it properly.
The stack of goroutines:
image
And the code is https://github.com/golang/go/blob/go1.16.15/src/net/http/h2_bundle.go#L7977
1.16.15 is outdated, so I think I should use golang.org/x/net/http2.ConfigureTransports to use the latest HTTP2 implentation later.

@Yuhjiang
Copy link

Yuhjiang commented Dec 8, 2022

I use golang.org/x/net v0.0.0-20220822230855-b0a4917ee28c to configure Transport。(I'v found latest version golang.org/x/net v0.3.0 which however is not compatible with go 1.16)
Configuration:

var DefaultTransport = &http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second}).DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
	MaxConnsPerHost:       20,
	MaxIdleConnsPerHost:   100,
}

var defaultClient *http.Client

func InitDefaultClient() {
	err := http2.ConfigureTransport(DefaultTransport)
	if err != nil { 
                // just for exception
		logs.Info("config default http client transport err, %v", err)
		defaultClient = http.DefaultClient
	} else {
                // in use
		defaultClient = &http.Client{Transport: DefaultTransport}
	}
}

But there's still the same problem.
image

@stampy88
Copy link

stampy88 commented Dec 23, 2022

I have seen the same problem as @Yuhjiang, but haven't been able to reproduce it either. Here are the stacks when it occurs:

goroutine profile: total 1451
1258 @ 0x438936 0x449473 0x44944d 0x464865 0x4726e5 0x81d5af 0x81d59a 0x81bfee 0x468741
#	0x464864	sync.runtime_SemacquireMutex+0x24			/usr/local/go/src/runtime/sema.go:71
#	0x4726e4	sync.(*Mutex).lockSlow+0x164				/usr/local/go/src/sync/mutex.go:162
#	0x81d5ae	sync.(*Mutex).Lock+0x30e				/usr/local/go/src/sync/mutex.go:81
#	0x81d599	net/http.(*http2clientStream).cleanupWriteRequest+0x2f9	/usr/local/go/src/net/http/h2_bundle.go:8159
#	0x81bfed	net/http.(*http2clientStream).doRequest+0x2d		/usr/local/go/src/net/http/h2_bundle.go:7905

36 @ 0x438936 0x406a0c 0x406478 0x1005f13 0x468741
#	0x1005f12	main.main.func3.1+0xf2	/XXXXXX/main.go:383

16 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x787c1d 0x4f8b78 0x787e05 0x785416 0x78ae6f 0x78ae70 0x51af74 0x4c5f7a 0x7fe96e 0x7fe928 0x7ff1d5 0x8214f0 0x82094f 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28				/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44				/usr/local/go/src/net/net.go:183
#	0x787c1c	crypto/tls.(*atLeastReader).Read+0x3c		/usr/local/go/src/crypto/tls/conn.go:785
#	0x4f8b77	bytes.(*Buffer).ReadFrom+0x97			/usr/local/go/src/bytes/buffer.go:204
#	0x787e04	crypto/tls.(*Conn).readFromUntil+0xe4		/usr/local/go/src/crypto/tls/conn.go:807
#	0x785415	crypto/tls.(*Conn).readRecordOrCCS+0x115	/usr/local/go/src/crypto/tls/conn.go:614
#	0x78ae6e	crypto/tls.(*Conn).readRecord+0x16e		/usr/local/go/src/crypto/tls/conn.go:582
#	0x78ae6f	crypto/tls.(*Conn).Read+0x16f			/usr/local/go/src/crypto/tls/conn.go:1285
#	0x51af73	bufio.(*Reader).Read+0x1b3			/usr/local/go/src/bufio/bufio.go:236
#	0x4c5f79	io.ReadAtLeast+0x99				/usr/local/go/src/io/io.go:331
#	0x7fe96d	io.ReadFull+0x6d				/usr/local/go/src/io/io.go:350
#	0x7fe927	net/http.http2readFrameHeader+0x27		/usr/local/go/src/net/http/h2_bundle.go:1566
#	0x7ff1d4	net/http.(*http2Framer).ReadFrame+0x94		/usr/local/go/src/net/http/h2_bundle.go:1830
#	0x8214ef	net/http.(*http2clientConnReadLoop).run+0x12f	/usr/local/go/src/net/http/h2_bundle.go:8820
#	0x82094e	net/http.(*http2ClientConn).readLoop+0x6e	/usr/local/go/src/net/http/h2_bundle.go:8716

13 @ 0x438936 0x449473 0x44944d 0x464745 0x474172 0x100ab15 0x10099af 0x1006073 0x468741
#	0x464744	sync.runtime_Semacquire+0x24		/usr/local/go/src/runtime/sema.go:56
#	0x474171	sync.(*WaitGroup).Wait+0x51		/usr/local/go/src/sync/waitgroup.go:136
#	0x100ab14	main.(*server).doProcessPacket+0x1034	XXXXXX/processpacket.go:168
#	0x10099ae	main.(*server).processPacket+0x18e	XXXXXX/processpacket.go:45
#	0x1006072	main.main.func3.1+0x252			XXXXXX/main.go:390

12 @ 0x438936 0x448392 0x857275 0x468741
#	0x857274	net/http.(*persistConn).writeLoop+0xf4	/usr/local/go/src/net/http/transport.go:2392

12 @ 0x438936 0x449473 0x44944d 0x464865 0x4726e5 0x825fb2 0x825f7e 0x81d545 0x81bfee 0x468741
#	0x464864	sync.runtime_SemacquireMutex+0x24			/usr/local/go/src/runtime/sema.go:71
#	0x4726e4	sync.(*Mutex).lockSlow+0x164				/usr/local/go/src/sync/mutex.go:162
#	0x825fb1	sync.(*Mutex).Lock+0x51					/usr/local/go/src/sync/mutex.go:81
#	0x825f7d	net/http.(*http2ClientConn).writeStreamReset+0x1d	/usr/local/go/src/net/http/h2_bundle.go:9574
#	0x81d544	net/http.(*http2clientStream).cleanupWriteRequest+0x2a4	/usr/local/go/src/net/http/h2_bundle.go:8145
#	0x81bfed	net/http.(*http2clientStream).doRequest+0x2d		/usr/local/go/src/net/http/h2_bundle.go:7905

11 @ 0x438936 0x448392 0x81bb71 0x817cb7 0x8264db 0x8264d5 0x84b790 0x832ab9 0x7f43f8 0x7f3c9b 0x7f60d5 0x100e0e6 0x100e0d9 0x468741
#	0x81bb70	net/http.(*http2ClientConn).RoundTrip+0x450		/usr/local/go/src/net/http/h2_bundle.go:7874
#	0x817cb6	net/http.(*http2Transport).RoundTripOpt+0x1b6		/usr/local/go/src/net/http/h2_bundle.go:7218
#	0x8264da	net/http.(*http2Transport).RoundTrip+0x1a		/usr/local/go/src/net/http/h2_bundle.go:7179
#	0x8264d4	net/http.http2noDialH2RoundTripper.RoundTrip+0x14	/usr/local/go/src/net/http/h2_bundle.go:9682
#	0x84b78f	net/http.(*Transport).roundTrip+0x38f			/usr/local/go/src/net/http/transport.go:539
#	0x832ab8	net/http.(*Transport).RoundTrip+0x18			/usr/local/go/src/net/http/roundtrip.go:17
#	0x7f43f7	net/http.send+0x5d7					/usr/local/go/src/net/http/client.go:252
#	0x7f3c9a	net/http.(*Client).send+0x9a				/usr/local/go/src/net/http/client.go:176
#	0x7f60d4	net/http.(*Client).do+0x8f4				/usr/local/go/src/net/http/client.go:725
#	0x100e0e5	net/http.(*Client).Do+0x985				/usr/local/go/src/net/http/client.go:593
#	0x100e0d8	main.(*server).sendRestMessage+0x978			/XXXXXX/rest.go:89

10 @ 0x438936 0x406a0c 0x406438 0x8f57d8 0x468741
#	0x8f57d7	google.golang.org/grpc.(*addrConn).resetTransport+0x477	/var/go/go/pkg/mod/google.golang.org/grpc@v1.28.1/clientconn.go:1155

10 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x787c1d 0x4f8b78 0x787e05 0x785416 0x78ae6f 0x78ae70 0x51af74 0x4c5f7a 0x86c2ce 0x86c288 0x86cb35 0x889c73 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88					/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31					/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259				/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247						/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28							/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44							/usr/local/go/src/net/net.go:183
#	0x787c1c	crypto/tls.(*atLeastReader).Read+0x3c					/usr/local/go/src/crypto/tls/conn.go:785
#	0x4f8b77	bytes.(*Buffer).ReadFrom+0x97						/usr/local/go/src/bytes/buffer.go:204
#	0x787e04	crypto/tls.(*Conn).readFromUntil+0xe4					/usr/local/go/src/crypto/tls/conn.go:807
#	0x785415	crypto/tls.(*Conn).readRecordOrCCS+0x115				/usr/local/go/src/crypto/tls/conn.go:614
#	0x78ae6e	crypto/tls.(*Conn).readRecord+0x16e					/usr/local/go/src/crypto/tls/conn.go:582
#	0x78ae6f	crypto/tls.(*Conn).Read+0x16f						/usr/local/go/src/crypto/tls/conn.go:1285
#	0x51af73	bufio.(*Reader).Read+0x1b3						/usr/local/go/src/bufio/bufio.go:236
#	0x4c5f79	io.ReadAtLeast+0x99							/usr/local/go/src/io/io.go:331
#	0x86c2cd	io.ReadFull+0x6d							/usr/local/go/src/io/io.go:350
#	0x86c287	golang.org/x/net/http2.readFrameHeader+0x27				/var/go/go/pkg/mod/golang.org/x/net@v0.0.0-20220425223048-2871e0cb64e4/http2/frame.go:237
#	0x86cb34	golang.org/x/net/http2.(*Framer).ReadFrame+0x94				/var/go/go/pkg/mod/golang.org/x/net@v0.0.0-20220425223048-2871e0cb64e4/http2/frame.go:498
#	0x889c72	google.golang.org/grpc/internal/transport.(*http2Client).reader+0x232	/var/go/go/pkg/mod/google.golang.org/grpc@v1.28.1/internal/transport/http2_client.go:1270

10 @ 0x438936 0x448392 0x8790b5 0x879745 0x882045 0x468741
#	0x8790b4	google.golang.org/grpc/internal/transport.(*controlBuffer).get+0x114	/var/go/go/pkg/mod/google.golang.org/grpc@v1.28.1/internal/transport/controlbuf.go:395
#	0x879744	google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x84	/var/go/go/pkg/mod/google.golang.org/grpc@v1.28.1/internal/transport/controlbuf.go:513
#	0x882044	google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x64	/var/go/go/pkg/mod/google.golang.org/grpc@v1.28.1/internal/transport/http2_client.go:346

9 @ 0x438936 0x448392 0x81c16c 0x81bfde 0x468741
#	0x81c16b	net/http.(*http2clientStream).writeRequest+0x14b	/usr/local/go/src/net/http/h2_bundle.go:7929
#	0x81bfdd	net/http.(*http2clientStream).doRequest+0x1d		/usr/local/go/src/net/http/h2_bundle.go:7904

6 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x787c1d 0x4f8b78 0x787e05 0x785416 0x78ae6f 0x78ae70 0x85476e 0x51a9a3 0x51aafd 0x85558c 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28				/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44				/usr/local/go/src/net/net.go:183
#	0x787c1c	crypto/tls.(*atLeastReader).Read+0x3c		/usr/local/go/src/crypto/tls/conn.go:785
#	0x4f8b77	bytes.(*Buffer).ReadFrom+0x97			/usr/local/go/src/bytes/buffer.go:204
#	0x787e04	crypto/tls.(*Conn).readFromUntil+0xe4		/usr/local/go/src/crypto/tls/conn.go:807
#	0x785415	crypto/tls.(*Conn).readRecordOrCCS+0x115	/usr/local/go/src/crypto/tls/conn.go:614
#	0x78ae6e	crypto/tls.(*Conn).readRecord+0x16e		/usr/local/go/src/crypto/tls/conn.go:582
#	0x78ae6f	crypto/tls.(*Conn).Read+0x16f			/usr/local/go/src/crypto/tls/conn.go:1285
#	0x85476d	net/http.(*persistConn).Read+0x4d		/usr/local/go/src/net/http/transport.go:1929
#	0x51a9a2	bufio.(*Reader).fill+0x102			/usr/local/go/src/bufio/bufio.go:106
#	0x51aafc	bufio.(*Reader).Peek+0x5c			/usr/local/go/src/bufio/bufio.go:144
#	0x85558b	net/http.(*persistConn).readLoop+0x1ab		/usr/local/go/src/net/http/transport.go:2093

3 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x787c1d 0x4f8b78 0x787e05 0x785416 0x78ae6f 0x78ae70 0x85476e 0x51a9a3 0x51b58f 0x7f12a5 0x7f0d28 0x7f0fee 0x8492dc 0x8491a5 0x858ea2 0x517fbf 0x517bbb 0x517818 0xf2b36b 0xf2afba 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88				/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31				/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259			/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247					/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28						/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44						/usr/local/go/src/net/net.go:183
#	0x787c1c	crypto/tls.(*atLeastReader).Read+0x3c				/usr/local/go/src/crypto/tls/conn.go:785
#	0x4f8b77	bytes.(*Buffer).ReadFrom+0x97					/usr/local/go/src/bytes/buffer.go:204
#	0x787e04	crypto/tls.(*Conn).readFromUntil+0xe4				/usr/local/go/src/crypto/tls/conn.go:807
#	0x785415	crypto/tls.(*Conn).readRecordOrCCS+0x115			/usr/local/go/src/crypto/tls/conn.go:614
#	0x78ae6e	crypto/tls.(*Conn).readRecord+0x16e				/usr/local/go/src/crypto/tls/conn.go:582
#	0x78ae6f	crypto/tls.(*Conn).Read+0x16f					/usr/local/go/src/crypto/tls/conn.go:1285
#	0x85476d	net/http.(*persistConn).Read+0x4d				/usr/local/go/src/net/http/transport.go:1929
#	0x51a9a2	bufio.(*Reader).fill+0x102					/usr/local/go/src/bufio/bufio.go:106
#	0x51b58e	bufio.(*Reader).ReadSlice+0x2e					/usr/local/go/src/bufio/bufio.go:371
#	0x7f12a4	net/http/internal.readChunkLine+0x24				/usr/local/go/src/net/http/internal/chunked.go:129
#	0x7f0d27	net/http/internal.(*chunkedReader).beginChunk+0x27		/usr/local/go/src/net/http/internal/chunked.go:48
#	0x7f0fed	net/http/internal.(*chunkedReader).Read+0x14d			/usr/local/go/src/net/http/internal/chunked.go:98
#	0x8492db	net/http.(*body).readLocked+0x3b				/usr/local/go/src/net/http/transfer.go:844
#	0x8491a4	net/http.(*body).Read+0x124					/usr/local/go/src/net/http/transfer.go:836
#	0x858ea1	net/http.(*bodyEOFSignal).Read+0x141				/usr/local/go/src/net/http/transport.go:2774
#	0x517fbe	encoding/json.(*Decoder).refill+0x17e				/usr/local/go/src/encoding/json/stream.go:165
#	0x517bba	encoding/json.(*Decoder).readValue+0xba				/usr/local/go/src/encoding/json/stream.go:140
#	0x517817	encoding/json.(*Decoder).Decode+0x77				/usr/local/go/src/encoding/json/stream.go:63
#	0xf2b36a	XXXXXX/kuberesolver.(*streamWatcher).Decode+0x6a	XXXXXX/kuberesolver@v2.1.7+incompatible/stream.go:96
#	0xf2afb9	XXXXXX/kuberesolver.(*streamWatcher).receive+0x119	XXXXXX/kuberesolver@v2.1.7+incompatible/stream.go:72

3 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x8345ed 0x51a9a3 0x51b58f 0x51b7e7 0x7d4ff9 0x82f9b9 0x82f9ba 0x835b8a 0x839f8b 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28				/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44				/usr/local/go/src/net/net.go:183
#	0x8345ec	net/http.(*connReader).Read+0x16c		/usr/local/go/src/net/http/server.go:780
#	0x51a9a2	bufio.(*Reader).fill+0x102			/usr/local/go/src/bufio/bufio.go:106
#	0x51b58e	bufio.(*Reader).ReadSlice+0x2e			/usr/local/go/src/bufio/bufio.go:371
#	0x51b7e6	bufio.(*Reader).ReadLine+0x26			/usr/local/go/src/bufio/bufio.go:400
#	0x7d4ff8	net/textproto.(*Reader).readLineSlice+0x98	/usr/local/go/src/net/textproto/reader.go:57
#	0x82f9b8	net/textproto.(*Reader).ReadLine+0x78		/usr/local/go/src/net/textproto/reader.go:38
#	0x82f9b9	net/http.readRequest+0x79			/usr/local/go/src/net/http/request.go:1029
#	0x835b89	net/http.(*conn).readRequest+0x249		/usr/local/go/src/net/http/server.go:988
#	0x839f8a	net/http.(*conn).serve+0x32a			/usr/local/go/src/net/http/server.go:1891

3 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x85476e 0x51a9a3 0x51aafd 0x85558c 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28				/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44				/usr/local/go/src/net/net.go:183
#	0x85476d	net/http.(*persistConn).Read+0x4d		/usr/local/go/src/net/http/transport.go:1929
#	0x51a9a2	bufio.(*Reader).fill+0x102			/usr/local/go/src/bufio/bufio.go:106
#	0x51aafc	bufio.(*Reader).Peek+0x5c			/usr/local/go/src/bufio/bufio.go:144
#	0x85558b	net/http.(*persistConn).readLoop+0x1ab		/usr/local/go/src/net/http/transport.go:2093

3 @ 0x438936 0x448392 0x856179 0x468741
#	0x856178	net/http.(*persistConn).readLoop+0xd98	/usr/local/go/src/net/http/transport.go:2213

3 @ 0x438936 0x448392 0x8ed3cf 0x468741
#	0x8ed3ce	google.golang.org/grpc.(*ccBalancerWrapper).watcher+0x8e	/var/go/go/pkg/mod/google.golang.org/grpc@v1.28.1/balancer_conn_wrappers.go:69

3 @ 0x438936 0x448392 0xf29867 0xf289db 0xf2b5de 0xf2b50e 0x468741
#	0xf29866	XXXXXX/kuberesolver.(*kResolver).watch+0x206		XXXXXX/kuberesolver@v2.1.7+incompatible/builder.go:276
#	0xf289da	XXXXXX/kuberesolver.(*kubeBuilder).Build.func1+0x3a	XXXXXX/kuberesolver@v2.1.7+incompatible/builder.go:161
#	0xf2b5dd	XXXXXX/kuberesolver.until.func1+0x3d			XXXXXX/kuberesolver@v2.1.7+incompatible/util.go:19
#	0xf2b50d	XXXXXX/kuberesolver.until+0x4d				XXXXXX/kuberesolver@v2.1.7+incompatible/util.go:20

2 @ 0x438936 0x448392 0x81c9e9 0x81bfde 0x468741
#	0x81c9e8	net/http.(*http2clientStream).writeRequest+0x9c8	/usr/local/go/src/net/http/h2_bundle.go:8042
#	0x81bfdd	net/http.(*http2clientStream).doRequest+0x1d		/usr/local/go/src/net/http/h2_bundle.go:7904

2 @ 0x438936 0x448392 0x858134 0x84bbc9 0x832ab9 0x7f43f8 0x7f3c9b 0x7f60d5 0x100e0e6 0x100e0d9 0x468741
#	0x858133	net/http.(*persistConn).roundTrip+0x973	/usr/local/go/src/net/http/transport.go:2620
#	0x84bbc8	net/http.(*Transport).roundTrip+0x7c8	/usr/local/go/src/net/http/transport.go:594
#	0x832ab8	net/http.(*Transport).RoundTrip+0x18	/usr/local/go/src/net/http/roundtrip.go:17
#	0x7f43f7	net/http.send+0x5d7			/usr/local/go/src/net/http/client.go:252
#	0x7f3c9a	net/http.(*Client).send+0x9a		/usr/local/go/src/net/http/client.go:176
#	0x7f60d4	net/http.(*Client).do+0x8f4		/usr/local/go/src/net/http/client.go:725
#	0x100e0e5	net/http.(*Client).Do+0x985		/usr/local/go/src/net/http/client.go:593
#	0x100e0d8	main.(*server).sendRestMessage+0x978	XXXXXXs/rest.go:89

1 @ 0x40bf14 0x464eaf 0xf2c6d9 0x468741
#	0x464eae	os/signal.signal_recv+0x2e	/usr/local/go/src/runtime/sigqueue.go:151
#	0xf2c6d8	os/signal.loop+0x18		/usr/local/go/src/os/signal/signal_unix.go:23

1 @ 0x438936 0x406a0c 0x406438 0x10055e9 0x438572 0x468741
#	0x10055e8	main.main+0x1028	XXXXXX/main.go:425
#	0x438571	runtime.main+0x211	/usr/local/go/src/runtime/proc.go:250

1 @ 0x438936 0x406a0c 0x406438 0xf38312 0x468741
#	0xf38311	XXXXXX/app.(*Server).handleSignals+0x151	XXXXXX/app/server.go:194

1 @ 0x438936 0x406a0c 0x406478 0xea7614 0x468741
#	0xea7613	github.com/eclipse/paho%2emqtt%2egolang.startComms.func2+0x73	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:430

1 @ 0x438936 0x406a0c 0x406478 0xea76b1 0x468741
#	0xea76b0	github.com/eclipse/paho%2emqtt%2egolang.startComms.func1+0x70	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:407

1 @ 0x438936 0x406a0c 0x406478 0xeaab1c 0x468741
#	0xeaab1b	github.com/eclipse/paho%2emqtt%2egolang.(*router).matchAndDispatch.func2+0xbb	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/router.go:174

1 @ 0x438936 0x406a0c 0x406478 0xf62628 0x468741
#	0xf62627	github.com/streadway/amqp.(*consumers).buffer+0x107	/var/go/go/pkg/mod/github.com/streadway/amqp@v0.0.0-20190404075320-75d898a42a94/consumers.go:61

1 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x51af74 0x4c5f7a 0xf63993 0xf6396f 0xf6048a 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88			/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31			/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259		/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247				/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28					/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44					/usr/local/go/src/net/net.go:183
#	0x51af73	bufio.(*Reader).Read+0x1b3				/usr/local/go/src/bufio/bufio.go:236
#	0x4c5f79	io.ReadAtLeast+0x99					/usr/local/go/src/io/io.go:331
#	0xf63992	io.ReadFull+0x52					/usr/local/go/src/io/io.go:350
#	0xf6396e	github.com/streadway/amqp.(*reader).ReadFrame+0x2e	/var/go/go/pkg/mod/github.com/streadway/amqp@v0.0.0-20190404075320-75d898a42a94/read.go:49
#	0xf60489	github.com/streadway/amqp.(*Connection).reader+0x1c9	/var/go/go/pkg/mod/github.com/streadway/amqp@v0.0.0-20190404075320-75d898a42a94/connection.go:521

1 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x787c1d 0x4f8b78 0x787e05 0x785416 0x78ae6f 0x78ae70 0x4c5f7a 0xe8111f 0xe81102 0xea4ce5 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88					/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31					/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259				/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247						/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28							/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44							/usr/local/go/src/net/net.go:183
#	0x787c1c	crypto/tls.(*atLeastReader).Read+0x3c					/usr/local/go/src/crypto/tls/conn.go:785
#	0x4f8b77	bytes.(*Buffer).ReadFrom+0x97						/usr/local/go/src/bytes/buffer.go:204
#	0x787e04	crypto/tls.(*Conn).readFromUntil+0xe4					/usr/local/go/src/crypto/tls/conn.go:807
#	0x785415	crypto/tls.(*Conn).readRecordOrCCS+0x115				/usr/local/go/src/crypto/tls/conn.go:614
#	0x78ae6e	crypto/tls.(*Conn).readRecord+0x16e					/usr/local/go/src/crypto/tls/conn.go:582
#	0x78ae6f	crypto/tls.(*Conn).Read+0x16f						/usr/local/go/src/crypto/tls/conn.go:1285
#	0x4c5f79	io.ReadAtLeast+0x99							/usr/local/go/src/io/io.go:331
#	0xe8111e	io.ReadFull+0x5e							/usr/local/go/src/io/io.go:350
#	0xe81101	github.com/eclipse/paho.mqtt.golang/packets.ReadPacket+0x41		/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/packets/packets.go:131
#	0xea4ce4	github.com/eclipse/paho%2emqtt%2egolang.startIncoming.func1+0xe4	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:124

1 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ccb5a 0x4ccb48 0x6939a9 0x6a5345 0x83411f 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ccb59	internal/poll.(*pollDesc).waitRead+0x259	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4ccb47	internal/poll.(*FD).Read+0x247			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x6939a8	net.(*netFD).Read+0x28				/usr/local/go/src/net/fd_posix.go:55
#	0x6a5344	net.(*conn).Read+0x44				/usr/local/go/src/net/net.go:183
#	0x83411e	net/http.(*connReader).backgroundRead+0x3e	/usr/local/go/src/net/http/server.go:672

1 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4ceb05 0x4ceaf2 0x694229 0x6a5525 0x788528 0x7889f1 0x789cf4 0x8176e9 0x51c5a2 0x825fdc 0x81d545 0x81bfee 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88			/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31			/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4ceb04	internal/poll.(*pollDesc).waitWrite+0x2c4		/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4ceaf1	internal/poll.(*FD).Write+0x2b1				/usr/local/go/src/internal/poll/fd_unix.go:391
#	0x694228	net.(*netFD).Write+0x28					/usr/local/go/src/net/fd_posix.go:96
#	0x6a5524	net.(*conn).Write+0x44					/usr/local/go/src/net/net.go:195
#	0x788527	crypto/tls.(*Conn).write+0x107				/usr/local/go/src/crypto/tls/conn.go:920
#	0x7889f0	crypto/tls.(*Conn).writeRecordLocked+0x350		/usr/local/go/src/crypto/tls/conn.go:988
#	0x789cf3	crypto/tls.(*Conn).Write+0x3f3				/usr/local/go/src/crypto/tls/conn.go:1159
#	0x8176e8	net/http.http2stickyErrWriter.Write+0x148		/usr/local/go/src/net/http/h2_bundle.go:7133
#	0x51c5a1	bufio.(*Writer).Flush+0x61				/usr/local/go/src/bufio/bufio.go:628
#	0x825fdb	net/http.(*http2ClientConn).writeStreamReset+0x7b	/usr/local/go/src/net/http/h2_bundle.go:9576
#	0x81d544	net/http.(*http2clientStream).cleanupWriteRequest+0x2a4	/usr/local/go/src/net/http/h2_bundle.go:8145
#	0x81bfed	net/http.(*http2clientStream).doRequest+0x2d		/usr/local/go/src/net/http/h2_bundle.go:7905

1 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4d0a0c 0x4d09f9 0x695ab5 0x6ae528 0x6ad6fd 0x83f2e5 0x83ef1d 0x10069cf 0x100695b 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88		/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31		/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4d0a0b	internal/poll.(*pollDesc).waitRead+0x22b	/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4d09f8	internal/poll.(*FD).Accept+0x218		/usr/local/go/src/internal/poll/fd_unix.go:614
#	0x695ab4	net.(*netFD).accept+0x34			/usr/local/go/src/net/fd_unix.go:172
#	0x6ae527	net.(*TCPListener).accept+0x27			/usr/local/go/src/net/tcpsock_posix.go:139
#	0x6ad6fc	net.(*TCPListener).Accept+0x3c			/usr/local/go/src/net/tcpsock.go:288
#	0x83f2e4	net/http.(*Server).Serve+0x384			/usr/local/go/src/net/http/server.go:3039
#	0x83ef1c	net/http.(*Server).ListenAndServe+0x7c		/usr/local/go/src/net/http/server.go:2968
#	0x10069ce	net/http.ListenAndServe+0xce			/usr/local/go/src/net/http/server.go:3222
#	0x100695a	main.main.func1+0x5a				XXXXXX/main.go:286

1 @ 0x438936 0x4315b7 0x4630e9 0x4cb7f2 0x4d0a0c 0x4d09f9 0x695ab5 0x6ae528 0x6ad6fd 0x83f2e5 0xf37dff 0x468741
#	0x4630e8	internal/poll.runtime_pollWait+0x88				/usr/local/go/src/runtime/netpoll.go:302
#	0x4cb7f1	internal/poll.(*pollDesc).wait+0x31				/usr/local/go/src/internal/poll/fd_poll_runtime.go:83
#	0x4d0a0b	internal/poll.(*pollDesc).waitRead+0x22b			/usr/local/go/src/internal/poll/fd_poll_runtime.go:88
#	0x4d09f8	internal/poll.(*FD).Accept+0x218				/usr/local/go/src/internal/poll/fd_unix.go:614
#	0x695ab4	net.(*netFD).accept+0x34					/usr/local/go/src/net/fd_unix.go:172
#	0x6ae527	net.(*TCPListener).accept+0x27					/usr/local/go/src/net/tcpsock_posix.go:139
#	0x6ad6fc	net.(*TCPListener).Accept+0x3c					/usr/local/go/src/net/tcpsock.go:288
#	0x83f2e4	net/http.(*Server).Serve+0x384					/usr/local/go/src/net/http/server.go:3039
#	0xf37dfe	XXXXXX/app.(*Server).Start.func1+0xbe	XXXXXX/app/server.go:146

1 @ 0x438936 0x448392 0x1006492 0x468741
#	0x1006491	main.main.func2+0x151	XXXXXX/main.go:356

1 @ 0x438936 0x448392 0xe9c8fb 0x468741
#	0xe9c8fa	github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers.func2+0xfa	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:663

1 @ 0x438936 0x448392 0xe9ce5c 0x468741
#	0xe9ce5b	github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers.func1+0x17b	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:628

1 @ 0x438936 0x448392 0xea51cd 0x468741
#	0xea51cc	github.com/eclipse/paho%2emqtt%2egolang.startIncomingComms.func1+0x18c	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:175

1 @ 0x438936 0x448392 0xea6565 0x468741
#	0xea6564	github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms.func1+0x1e4	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:277

1 @ 0x438936 0x448392 0xea9569 0x468741
#	0xea9568	github.com/eclipse/paho%2emqtt%2egolang.keepalive+0x1e8	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/ping.go:48

1 @ 0x438936 0x448392 0xead22a 0xf480d3 0xf47f0d 0xf48226 0x10017bf 0x100b2c5 0x10099af 0x1006073 0x468741
#	0xead229	github.com/eclipse/paho%2emqtt%2egolang.(*baseToken).WaitTimeout+0x69			/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/token.go:84
#	0xf480d2	XXXXXX/mqtt.(*MqttBuddy).SendMessage.func1+0x32	XXXXXX/mqtt/buddy.go:212
#	0xf47f0c	XXXXXX/mqtt.(*MqttBuddy).SendMessage+0x14c		XXXXXX/mqtt/buddy.go:222
#	0xf48225	XXXXXX/mqtt.BuddySendAndDone+0x105			XXXXXX/mqtt/buddy.go:244
#	0x10017be	main.(*server).sendAzureMessage+0x73e							XXXXXX/azure.go:98
#	0x100b2c4	main.(*server).doProcessPacket+0x17e4							XXXXXX/processpacket.go:157
#	0x10099ae	main.(*server).processPacket+0x18e							XXXXXX/processpacket.go:45
#	0x1006072	main.main.func3.1+0x252									XXXXXX/main.go:390

1 @ 0x438936 0x448392 0xf60738 0x468741
#	0xf60737	github.com/streadway/amqp.(*Connection).heartbeater+0x1b7	/var/go/go/pkg/mod/github.com/streadway/amqp@v0.0.0-20190404075320-75d898a42a94/connection.go:551

1 @ 0x438936 0x448392 0xf73aa5 0x468741
#	0xf73aa4	github.com/assembla/cony.(*Client).Loop.func1+0x104	/var/go/go/pkg/mod/github.com/assembla/cony@v0.3.2/client.go:156

1 @ 0x438936 0x448392 0xf742b9 0x468741
#	0xf742b8	github.com/assembla/cony.(*Consumer).serve+0x1d8	/var/go/go/pkg/mod/github.com/assembla/cony@v0.3.2/consumer.go:83

1 @ 0x438936 0x449473 0x44944d 0x464745 0x474172 0xea750a 0x468741
#	0x464744	sync.runtime_Semacquire+0x24					/usr/local/go/src/runtime/sema.go:56
#	0x474171	sync.(*WaitGroup).Wait+0x51					/usr/local/go/src/sync/waitgroup.go:136
#	0xea7509	github.com/eclipse/paho%2emqtt%2egolang.startComms.func3+0x29	/var/go/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:438

1 @ 0x438936 0x449473 0x44944d 0x464865 0x4726e5 0x81ce1b 0x81cdf1 0x81c4c5 0x81bfde 0x468741
#	0x464864	sync.runtime_SemacquireMutex+0x24				/usr/local/go/src/runtime/sema.go:71
#	0x4726e4	sync.(*Mutex).lockSlow+0x164					/usr/local/go/src/sync/mutex.go:162
#	0x81ce1a	sync.(*Mutex).Lock+0x9a						/usr/local/go/src/sync/mutex.go:81
#	0x81cdf0	net/http.(*http2clientStream).encodeAndWriteHeaders+0x70	/usr/local/go/src/net/http/h2_bundle.go:8064
#	0x81c4c4	net/http.(*http2clientStream).writeRequest+0x4a4		/usr/local/go/src/net/http/h2_bundle.go:7986
#	0x81bfdd	net/http.(*http2clientStream).doRequest+0x1d			/usr/local/go/src/net/http/h2_bundle.go:7904

1 @ 0x462bc5 0xee1f75 0xee1d8d 0xedeccb 0xeed9a5 0xf38167 0x83b78f 0x83d129 0xefa162 0x83ed7b 0x83a237 0x468741
#	0x462bc4	runtime/pprof.runtime_goroutineProfileWithLabels+0x24		/usr/local/go/src/runtime/mprof.go:753
#	0xee1f74	runtime/pprof.writeRuntimeProfile+0xb4				/usr/local/go/src/runtime/pprof/pprof.go:725
#	0xee1d8c	runtime/pprof.writeGoroutine+0x4c				/usr/local/go/src/runtime/pprof/pprof.go:685
#	0xedecca	runtime/pprof.(*Profile).WriteTo+0x14a				/usr/local/go/src/runtime/pprof/pprof.go:332
#	0xeed9a4	net/http/pprof.handler.ServeHTTP+0x4a4				/usr/local/go/src/net/http/pprof/pprof.go:253
#	0xf38166	XXXXXX/app.pprofIndexOverride+0x206	XXXXXX/app/server.go:161
#	0x83b78e	net/http.HandlerFunc.ServeHTTP+0x2e				/usr/local/go/src/net/http/server.go:2084
#	0x83d128	net/http.(*ServeMux).ServeHTTP+0x148				/usr/local/go/src/net/http/server.go:2462
#	0xefa161	github.com/gorilla/mux.(*Router).ServeHTTP+0x201		/var/go/go/pkg/mod/github.com/gorilla/mux@v1.7.2/mux.go:212
#	0x83ed7a	net/http.serverHandler.ServeHTTP+0x43a				/usr/local/go/src/net/http/server.go:2916
#	0x83a236	net/http.(*conn).serve+0x5d6					/usr/local/go/src/net/http/server.go:1966

@rhysh
Copy link
Contributor

rhysh commented Dec 23, 2022

@mhr3 , you've said you're using go version go1.18.7 linux/amd64, with a direct dependency on golang.org/x/net/http2. What version of the golang.org/x/net module are you using?

The goroutine profile from the proxy server reports a total of 806 goroutines in the app, but the stack traces shown only cover 242+242+201+35+17+7+6+4+4+3 = 761 goroutines. What are the call stacks of the other 45 goroutines? There's probably a single goroutine that's holding a lock of interest, and there's probably only one goroutine with that particular call stack.


@stampy88 , you're using the http2 support that comes bundled with net/http. What version of Go are you using? The line numbers make sense with go1.18.6. Can you please correct or confirm?

Assuming that's the version in play, the goroutine profile shows that there are 1258 goroutines waiting on cc.wmu at https://github.com/golang/go/blob/go1.18.6/src/net/http/h2_bundle.go#L8159, 12 waiting on cc.wmu at https://github.com/golang/go/blob/go1.18.6/src/net/http/h2_bundle.go#L9574, 1 waiting on cc.wmu at https://github.com/golang/go/blob/go1.18.6/src/net/http/h2_bundle.go#L8064, and one goroutine holding cc.wmu while blocked on a network connection to become writable at https://github.com/golang/go/blob/go1.18.6/src/net/http/h2_bundle.go#L9576 in a call to Flush.


@Yuhjiang , thank you for including the version info. The screenshots are tricky, text is better. If you're starting with go tool pprof, the output of the -traces command is probably the most descriptive (paired with -focus with an argument of whatever you used for the peek command, and maybe with -lines or granularity=lines).

Looking again at the goroutines that have a unique call stack and considering if they are holding locks that many other goroutines are trying to obtain:

It looks like this also is a call to Flush https://github.com/golang/go/blob/go1.16.15/src/net/http/h2_bundle.go#L8032 while holding cc.wmu (https://github.com/golang/go/blob/go1.16.15/src/net/http/h2_bundle.go#L7916 vs https://github.com/golang/go/blob/go1.16.15/src/net/http/h2_bundle.go#L7882-L7883).

@stampy88
Copy link

stampy88 commented Dec 23, 2022

yes, you are correct @rhysh, it is 1.18.6. When this occurs, we see a bunch of " context deadline exceeded (Client.Timeout exceeded while awaiting headers)" POSTing to the same URL (timeout is 5 seconds). There are 1,701 timeouts over the course of ~16 minutes, and no POSTs to that host succeed. It eventually clears up when the server on the other end starts responding.

@mhr3
Copy link
Author

mhr3 commented Dec 23, 2022

@rhysh the versions are in the stacktrace filenames: golang.org/x/net@v0.0.0-20220909164309-bea034e7d591, we've recently updated to v0.2.0 and are still seeing the same problem every now and then. And yes, I've dropped a few goroutines from the stacktrace, but they were mostly redis/postgres conn handling, nothing with x/net in them.

@rhysh
Copy link
Contributor

rhysh commented Dec 24, 2022

@stampy88 , I think that you and @Yuhjiang have a different bug than this issue describes. Could you please open a new issue for it?

Issue #36026 may be similar, but doesn't include call stacks.

Issue #32388 described a problem where an attempt to write to an http2-over-TLS connection could tie up an entire net/http.Transport while the Write call remained blocked. If I understand correctly, when there's a TCP connection that has outbound data in flight and which has received zero inbound packets (including ACKs), common configurations of Linux will time out the connection after about 15 or 16 minutes. The fix for 32388 (backported to go1.17.3 and go1.16.10) may have limited the impact of that to affect only a single connection, but the problem you're seeing might be a single-connection version. (Or a few simultaneous instances of a single-connection problem).


Thanks @mhr3 .

We're running a http server which uses a http client to proxy the request body to another server (using h2c) and return its response body

One thing to note is that only requests which are proxied directly (io.Copy from the server Request into the client Request) get stuck, other requests where the body is constructed in memory were never affected.

I'm not sure how those two statements go together. Does it mean that some requests are served completely by the Go http/1.1 server (in net/http.(*conn).serve goroutines) without involving a request to a backend over h2c? Or that the request body is sometimes passed along to the backend over h2c?

If the difference you're describing is in the request body that the proxy machine sends to the backend, then maybe there's a difference in how/whether the content-length is passed (to the extent that's still relevant with http2).

The stuck requests have golang.org/x/net/http2.transportResponseBody.Read on the call stack, rather than something related to sending the request. So it seems strange that the way the request was sent would have an effect.

The goroutine profile you shared shows a snapshot where four h2c connections were stalled in the same way, with a total of either 201+35+6 = 242 (proxy's view) or 243 (backend's view) requests in progress. Can you say any more about the dynamics?

  • When the system works, does it use 4 connections or only 1 (and the additional connections are created as a result of the first connection not completing its requests)?
  • Do the stuck requests become un-stuck, or do they last for at least an hour beyond the app's timeouts?
  • Do the stuck connections become un-stuck, or do they last for at least an hour beyond the app's timeouts?
  • Does the system ever run in a state where some connections are stuck but others are working? (Or is that a mode of operation that could go undetected, as it's not completely broken?)
  • Are all 201+35+6 of the proxy's requests in a state you consider "stuck"?

If it's a flow-control problem, I think it'll be hard to solve without a reproducer.

@mhr3
Copy link
Author

mhr3 commented Dec 24, 2022

I'm not sure how those two statements go together. Does it mean that some requests are served completely by the Go http/1.1 server (in net/http.(*conn).serve goroutines) without involving a request to a backend over h2c? Or that the request body is sometimes passed along to the backend over h2c?

If the difference you're describing is in the request body that the proxy machine sends to the backend, then maybe there's a difference in how/whether the content-length is passed (to the extent that's still relevant with http2).

Yes, the content-length is probably the difference here. To make it clearer, we have two types of endpoint handlers (greatly simplified):

func canGetStuck(c echo.Context) error {
  backendReq := prepareRequest(c.Request(), c.Request().Body) // also uses context from the request
  backendResp := getH2cClient().Do(backendReq)
  return c.Stream(backendResp.Body) // ultimately io.Copy(responseWriter, backendResp.Body)
}

func worksFine(c echo.Context) error {
  payload, _ := io.ReadAll(c.Request().Body)
  backendReq := prepareRequest(c.Request(), transform(payload)) // ends up as bytes.Buffer
  backendResp := getH2cClient().Do(backendReq)
  return c.Blob(io.ReadAll(backendResp.Body))
}

The goroutine profile you shared shows a snapshot where four h2c connections were stalled in the same way, with a total of either 201+35+6 = 242 (proxy's view) or 243 (backend's view) requests in progress. Can you say any more about the dynamics?

Fwiw the two snapshots weren't taken at the exact same millisecond, so do not expect the numbers to match up perfectly. The strange thing here is that the server gets into this stuck state after a few days or running fine (we used to see this happen about twice a week), yet if we restart the server every ~three days, it never gets into this broken state (we only restart the main server, not the backend server), perhaps it's related to cancellations (we do see small number of context cancellations coming from the client (it uses fairly small timeout)), but that's just a guess. FWIW here's a chart with number of active goroutines from last time this happened:

image

Some requests started failing around 11:20, and we restarted everything at 11:30. This also shows it's not like there was a gradual increase in number of stuck requests over the few days with some threshold being reached (or at least if there is, it doesn't consume goroutines).

When the system works, does it use 4 connections or only 1 (and the additional connections are created as a result of the first connection not completing its requests)?

We don't really have those numbers, but I think we usually use just one connection.

Do the stuck requests become un-stuck, or do they last for at least an hour beyond the app's timeouts?
Do the stuck connections become un-stuck, or do they last for at least an hour beyond the app's timeouts?

We restart the server after it gets into this state, so can't really say.

Does the system ever run in a state where some connections are stuck but others are working? (Or is that a mode of operation that could go undetected, as it's not completely broken?)

Yes, even when the server is stuck lots of requests still finish successfully.

Are all 201+35+6 of the proxy's requests in a state you consider "stuck"?

Not 100% certain, but yes, I think all those are stuck. But as I said, there are still other requests that do get processed correctly (by the "stuck" server, not just the other instances).

@Yuhjiang
Copy link

@rhysh thanks for you suggestion. Since last updating, I'v only seen the problem once which is the comment I posted two weeks ago. It's working properly over two weeks.
I upload the full goroutines stack of comment with some private code hided(I'm sorry for the confidentiality of my company)
goroutine.txt.
And you mentioned the trace, unfortunately I didn't catch the trace at that time, but I'll try to catch when the problem happens again.

@Yuhjiang
Copy link

There is a snapshot of trace I caught before I used golang/x/net

Goroutines:
github.com/cloudwego/netpoll.(*defaultPoll).Wait N=1
github.com/bytedance/gopkg/util/gopool.(*worker).run.func1 N=923
github.com/cloudwego/kitex/internal/wpool.(*Pool).Go.func1 N=127
private.mirror.org/pipeline.runSubPipeline.func1 N=936
private.mirror.org/gopkg/logs.(*Logger).StartLogger.func1 N=1
private.mirror.org/localcache.doCompute.func1 N=540
private.mirror.org/backgroundUpdate N=2
private.mirror.org/sentry/sentry_common/tengo.(*RpcAdapterImp).callByArguments.func1 N=187
runtime.bgscavenge N=1
net/http.(*http2ClientConn).readLoop N=9
private.mirror.org/extend/localcache.(*LocalFileCacheImp).syncToFile N=1
private.mirror.org/log_market/ttlogagent_gosdk.(*LogSender).run N=7
private.mirror.org/gopkg/metrics/v3.(*client).timerSend N=10
private.mirror.org/sentry/model_sdk/extend/localcache.(*LocalFileCacheImp).vacuum N=1
net/http.(*http2clientStream).doRequest N=498
net/http.(*persistConn).readLoop N=15
private.mirror.org/gopkg/metrics/v3.(*client).send N=10
private.mirror.org/log_market/tracelog.(*TraceLogger).start.func1 N=2
private.mirror.org/gopkg/metrics.flushClients N=1
private.mirror.org/gopkg/logs/v2.refreshTask N=1
github.com/bytedance/gopkg/cloud/circuitbreaker.(*sharedTicker).tick N=1
net/http.(*persistConn).writeLoop N=15
context.WithDeadline.func2 N=609
runtime/trace.Start.func1 N=1
private.mirror.org/data/databus.(*DatabusCollector).Consumer N=1
private.mirror.org/gopkg/idgenerator/utils.refreshTask N=1
net/http.(*persistConn).addTLS.func2 N=1
private.mirror.org/gopkg/stats.(*Stats).EmitLoop N=1
internal/singleflight.(*Group).doCall N=3
net.(*Resolver).goLookupIPCNAMEOrder.func3.1 N=6
github.com/cloudwego/netpoll.(*defaultPoll).detaches.func1 N=29
private.mirror.org/gopkg/circuitbreaker.tick N=1
net/http.(*Transport).dialConnFor N=4
private.mirror.org/sdk/repo.(*CacheImp).syncLoop N=1
private.mirror.org/sdk/repo.(*CacheImp).syncLoop N=1
private.mirror.org/sdk/repo.(*RepoImp).syncLoop N=1
net.(*sysDialer).dialParallel.func1 N=3
private.mirror.org/trace-client-go.(*agentReporter).run N=1
private.mirror.org/gopkg/client/general.(*cache).refresh N=1
private.mirror.org/gopkg/stats.(*StatsEmitor).Loop N=1
private.mirror.org/gopkg/stats.(*StatsEmitor).TickStat.func1 N=1
private.mirror.org/trace-client-go/utils.(*qpsCatcher).tick N=1
private.mirror.org/log_market/tracelog.(*fileOutput).flushRoutine N=4
private.mirror.org/gopkg/tccclient.(*emiter).EmitCache N=3
net/http.(*conn).serve N=1
net/http.(*http2addConnCall).run N=1
private.mirror.org/gopkg/logs/v2/writer.newRotatedFile.func1 N=8
private.mirror.org/gopkg/metrics.flushloop N=1
private.mirror.org/trace-client-go.(*agentProvider).run N=1
github.com/panjf2000/ants/v2.(*Pool).periodicallyPurge N=1
private.mirror.org/cdn_schedule_sdk.(*Client).backgroundUpdate.func1 N=2
net.(*netFD).connect.func2 N=4
private.mirror.org/log_market/tracelog.(*rotateTicker).start.func1 N=1
private.mirror.org/gopkg/general.(*emiter).emitCache N=1
private.mirror.org/gopkg/client.(*ClientV2).refresh N=1
net/http.(*persistConn).closeConnIfStillIdle N=1
net/http.(*connReader).backgroundRead N=1
N=36365

It seems also weird that the total goroutines is 36365 which didn't match with the sum of each stack's goroutines.

@stampy88
Copy link

stampy88 commented Dec 27, 2022

Thanks for point out those issues, @rhysh. #36026 does feel very similar. I am going to see if I can create a reproducer before opening an issue.

@stampy88
Copy link

@rhysh here's the new issue: #57478. Still working on a reproducer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

7 participants