
net: goroutine leak when calling File on a TCPConn #24455

Closed
twmb opened this issue Mar 19, 2018 · 16 comments

@twmb (Contributor) commented Mar 19, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version devel +2086f3503e Mon Mar 12 22:29:04 2018 +0000 linux/amd64

(Also tested on 1.7.6, 1.10)

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/twmb/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/twmb/go"
GORACE=""
GOROOT="/home/twmb/go/go"
GOTMPDIR=""
GOTOOLDIR="/home/twmb/go/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build789317015=/tmp/go-build -gno-record-gcc-switches"

What did you do?

File 1, the producer:
https://play.golang.org/p/eNMNVTGutd9

File 2 (kept separate so that its goroutines do not show up in the first program's profile):
https://play.golang.org/p/Mz1h9F3963o
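
For reference, the overall shape of the producer is roughly the following (a condensed, single-file sketch, not the exact playground code; chk is the panic-on-error helper used in the snippets below):

package main

import (
	"fmt"
	"net"
	"net/http"
	_ "net/http/pprof"
)

func chk(err error) {
	if err != nil {
		panic(err)
	}
}

// fileListener calls File() on every accepted TCPConn, which is what
// triggers the leaked goroutines.
type fileListener struct{ net.Listener }

func (fl fileListener) Accept() (net.Conn, error) {
	c, err := fl.Listener.Accept()
	if err != nil {
		return nil, err
	}
	tc := c.(*net.TCPConn)
	f, err := tc.File()
	chk(err)
	chk(f.Close())
	return c, nil
}

func main() {
	go http.ListenAndServe("localhost:9090", nil) // pprof endpoint

	l, err := net.Listen("tcp", "localhost:0")
	chk(err)
	fmt.Println(l.Addr())
	go http.Serve(fileListener{l}, http.NotFoundHandler())

	for n := 1000; n > 0; n-- {
		resp, err := http.Get("http://" + l.Addr().String())
		chk(err)
		chk(resp.Body.Close())
	}
	select {}
}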

What did you expect to see?

No goroutines hung.

What did you see instead?

$ curl 'localhost:9090/debug/pprof/goroutine?debug=1'
goroutine profile: total 3508
3504 @ 0x479e95 0x477e0f 0x476759 0x48f728 0x4f6def 0x507eaa 0x678ed6 0x566c1e 0x566d6a 0x679995 0x45b1b1
#	0x479e94	syscall.Syscall+0x4			/home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
#	0x477e0e	syscall.read+0x5e			/home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
#	0x476758	syscall.Read+0x48			/home/twmb/go/go/src/syscall/syscall_unix.go:162
#	0x48f727	internal/poll.(*FD).Read+0x117		/home/twmb/go/go/src/internal/poll/fd_unix.go:153
#	0x4f6dee	net.(*netFD).Read+0x4e			/home/twmb/go/go/src/net/fd_unix.go:202
#	0x507ea9	net.(*conn).Read+0x69			/home/twmb/go/go/src/net/net.go:176
#	0x678ed5	net/http.(*persistConn).Read+0x135	/home/twmb/go/go/src/net/http/transport.go:1453
#	0x566c1d	bufio.(*Reader).fill+0x11d		/home/twmb/go/go/src/bufio/bufio.go:100
#	0x566d69	bufio.(*Reader).Peek+0x39		/home/twmb/go/go/src/bufio/bufio.go:132
#	0x679994	net/http.(*persistConn).readLoop+0x184	/home/twmb/go/go/src/net/http/transport.go:1601

These goroutines do not go away.

If tc.File() is commented out in the first play link, the goroutines do go
away. I suspect there is a race between a connection being switched to blocking
mode (and removed from epoll management) by the File call and a goroutine
falling into epoll to wait for a read.

The examples use an HTTP server / client, but I suspect this issue is also visible with raw connections.

@twmb changed the title from "Potential goroutine leak when calling `File` on a `TCPConn`" to "net: goroutine leak when calling File on a TCPConn" on Mar 20, 2018

@bcmills (Member) commented Mar 20, 2018

https://golang.org/pkg/net/#TCPConn.File says:
“It is the caller's responsibility to close f when finished. Closing c does not affect f, and closing f does not affect c.”

Is the File call paired with a Close call in the real code you've observed the leak in? If so, a more realistic reproducer would be helpful. Otherwise, it's unfortunate that the Close requirement is only enforced by documentation but there's not much we can do about it in the short term.
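
That is, a use of File is expected to look roughly like this (sketch only):

f, err := tc.File()
if err != nil {
	// handle the error
}
defer f.Close() // required: closing c does not close f, and vice versa
// ... use f ...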

@andybons added this to the Go1.11 milestone on Mar 20, 2018

@twmb (Contributor, Author) commented Mar 20, 2018

The File code internally calls into os's NewFile, which sets a finalizer on the fd itself to close it when it is garbage collected. However, the tc.File() line in the code above can be changed to

                f, err := tc.File() // <-- this is bad!
                chk(err)
                go func() {
                        chk(f.Close())
                }()

to confirm that failing to close the file is not the cause. For convenience, the following play link (which contains that change) also reproduces the issue: https://play.golang.org/p/tnNDto5l8vf

@fraenkel (Contributor) commented Mar 21, 2018

You do realize you are calling Do(req) twice and never closing those responses?
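
For reference, each request would normally be handled roughly like this (chk and the client/request construction are assumed to be as in the reproducer):

resp, err := client.Do(req)
chk(err)
// Drain and close the body so the Transport can reuse or release the connection.
_, err = io.Copy(ioutil.Discard, resp.Body)
chk(err)
chk(resp.Body.Close())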

@twmb (Contributor, Author) commented Mar 21, 2018

It was a typo from repeatedly modifying the same source to get a "minimal" reproducer. Here is an updated file: https://play.golang.org/p/v2RRGMZJAY3. The fixed playground code closes the responses in a goroutine, the same way the prior code did.

I welcome pointers to any other mistakes I have made, but this is a minimal reproducer for a problem I have been trying to track down on and off for 2.5 years now. I suspect the epoll implementation itself is a bit off, based on other stuck goroutines I am seeing (a process that has been up for 140 days has many goroutines stuck in runtime_pollWait), but I will try to create a minimal reproducer for that separately.

@fraenkel (Contributor) commented Mar 21, 2018

The problem is that the socket was added to epoll with EPOLLET (edge triggered), an assumption that no longer holds once the File call switches the descriptor to blocking mode.

@bcmills (Member) commented Mar 21, 2018

If you use a sync.WaitGroup instead of a spin-loop counter (#10958) and don't hard-code the port, you can make the program runnable in the Playground.

After removing the duplicated Do call (well-spotted, @fraenkel!), I'm not seeing a leak:
https://play.golang.org/p/kWWS6-gOxw9

Does that program produce a different result on your machine?
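
The shape of that change is roughly the following (sketch only; the per-connection request logic is elided):

l, err := net.Listen("tcp", "127.0.0.1:0") // let the kernel pick a free port
if err != nil {
	log.Fatal(err)
}
go http.Serve(l, http.NotFoundHandler()) // stand-in for the reproducer's server side

var wg sync.WaitGroup
for n := 1000; n > 0; n-- {
	wg.Add(1)
	go func() {
		defer wg.Done()
		// ... issue one request to l.Addr() and close the response ...
	}()
}
wg.Wait() // replaces the spin-loop counter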

@bcmills (Member) commented Mar 21, 2018

If I remove the runtime.GC() calls, I do see a short-term goroutine leak, but it's the one described in #21597, not the one reported above:
https://play.golang.org/p/WWD7CjCSFzS

@twmb (Contributor, Author) commented Mar 21, 2018

Yes, I do still see the leak. Part of the reason for the code being structured so tightly was to attempt to force the races. I ran your code; nothing leaked for one run at 1,000 connections, but leaks did occur at 10,000.

goroutine profile: total 330
163 @ 0x42e26a 0x42958a 0x428c07 0x48dcdb 0x48dd5d 0x48ebbd 0x4f1b5f 0x502c1a 0x65c518 0x561aae 0x56278c 0x5629a4 0x5ecae0 0x5ec8eb 0x656ebc 0x65d7bf 0x6614bc 0x45b211
#	0x428c06	internal/poll.runtime_pollWait+0x56		/home/twmb/go/go/src/runtime/netpoll.go:173
#	0x48dcda	internal/poll.(*pollDesc).wait+0x9a		/home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
#	0x48dd5c	internal/poll.(*pollDesc).waitRead+0x3c		/home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
#	0x48ebbc	internal/poll.(*FD).Read+0x17c			/home/twmb/go/go/src/internal/poll/fd_unix.go:157
#	0x4f1b5e	net.(*netFD).Read+0x4e				/home/twmb/go/go/src/net/fd_unix.go:202
#	0x502c19	net.(*conn).Read+0x69				/home/twmb/go/go/src/net/net.go:176
#	0x65c517	net/http.(*connReader).Read+0xf7		/home/twmb/go/go/src/net/http/server.go:764
#	0x561aad	bufio.(*Reader).fill+0x11d			/home/twmb/go/go/src/bufio/bufio.go:100
#	0x56278b	bufio.(*Reader).ReadSlice+0x2b			/home/twmb/go/go/src/bufio/bufio.go:341
#	0x5629a3	bufio.(*Reader).ReadLine+0x33			/home/twmb/go/go/src/bufio/bufio.go:370
#	0x5ecadf	net/textproto.(*Reader).readLineSlice+0x6f	/home/twmb/go/go/src/net/textproto/reader.go:55
#	0x5ec8ea	net/textproto.(*Reader).ReadLine+0x2a		/home/twmb/go/go/src/net/textproto/reader.go:36
#	0x656ebb	net/http.readRequest+0x8b			/home/twmb/go/go/src/net/http/request.go:929
#	0x65d7be	net/http.(*conn).readRequest+0x16e		/home/twmb/go/go/src/net/http/server.go:944
#	0x6614bb	net/http.(*conn).serve+0x4db			/home/twmb/go/go/src/net/http/server.go:1768

163 @ 0x479ef5 0x477e6f 0x4767b9 0x48eb58 0x4f1b5f 0x502c1a 0x673db6 0x561aae 0x561bfa 0x674875 0x45b211
#	0x479ef4	syscall.Syscall+0x4			/home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
#	0x477e6e	syscall.read+0x5e			/home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
#	0x4767b8	syscall.Read+0x48			/home/twmb/go/go/src/syscall/syscall_unix.go:162
#	0x48eb57	internal/poll.(*FD).Read+0x117		/home/twmb/go/go/src/internal/poll/fd_unix.go:153
#	0x4f1b5e	net.(*netFD).Read+0x4e			/home/twmb/go/go/src/net/fd_unix.go:202
#	0x502c19	net.(*conn).Read+0x69			/home/twmb/go/go/src/net/net.go:176
#	0x673db5	net/http.(*persistConn).Read+0x135	/home/twmb/go/go/src/net/http/transport.go:1453
#	0x561aad	bufio.(*Reader).fill+0x11d		/home/twmb/go/go/src/bufio/bufio.go:100
#	0x561bf9	bufio.(*Reader).Peek+0x39		/home/twmb/go/go/src/bufio/bufio.go:132
#	0x674874	net/http.(*persistConn).readLoop+0x184	/home/twmb/go/go/src/net/http/transport.go:1601

Here is my current diff on your code (with the GC):

$ diff three.go his.go 
47c47
< 	for n := 10000; n > 0; n-- {
---
> 	for n := 1000; n > 0; n-- {
74,75d73
< 	fmt.Println(l.Addr().String())
< 
82d79
< 	select {}

What is most exciting to me, actually, is that re-running the code (with 10k connections) a few times in a row without the tc.File call occasionally produces the other leak I have been trying to build a reproducer for as well (edit: this one may just be from keep-alives):

goroutine profile: total 12
6 @ 0x42e26a 0x42958a 0x428c07 0x48dcdb 0x48dd5d 0x48ebbd 0x4f1b5f 0x502c1a 0x65c518 0x561aae 0x56278c 0x5629a4 0x5ecae0 0x5ec8eb 0x656ebc 0x65d7bf 0x6614bc 0x45b211
#	0x428c06	internal/poll.runtime_pollWait+0x56		/home/twmb/go/go/src/runtime/netpoll.go:173
#	0x48dcda	internal/poll.(*pollDesc).wait+0x9a		/home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
#	0x48dd5c	internal/poll.(*pollDesc).waitRead+0x3c		/home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
#	0x48ebbc	internal/poll.(*FD).Read+0x17c			/home/twmb/go/go/src/internal/poll/fd_unix.go:157
#	0x4f1b5e	net.(*netFD).Read+0x4e				/home/twmb/go/go/src/net/fd_unix.go:202
#	0x502c19	net.(*conn).Read+0x69				/home/twmb/go/go/src/net/net.go:176
#	0x65c517	net/http.(*connReader).Read+0xf7		/home/twmb/go/go/src/net/http/server.go:764
#	0x561aad	bufio.(*Reader).fill+0x11d			/home/twmb/go/go/src/bufio/bufio.go:100
#	0x56278b	bufio.(*Reader).ReadSlice+0x2b			/home/twmb/go/go/src/bufio/bufio.go:341
#	0x5629a3	bufio.(*Reader).ReadLine+0x33			/home/twmb/go/go/src/bufio/bufio.go:370
#	0x5ecadf	net/textproto.(*Reader).readLineSlice+0x6f	/home/twmb/go/go/src/net/textproto/reader.go:55
#	0x5ec8ea	net/textproto.(*Reader).ReadLine+0x2a		/home/twmb/go/go/src/net/textproto/reader.go:36
#	0x656ebb	net/http.readRequest+0x8b			/home/twmb/go/go/src/net/http/request.go:929
#	0x65d7be	net/http.(*conn).readRequest+0x16e		/home/twmb/go/go/src/net/http/server.go:944
#	0x6614bb	net/http.(*conn).serve+0x4db			/home/twmb/go/go/src/net/http/server.go:1768

It is much more subtle, but is much more real; let me know if I should make a separate issue for that or if I should add more details here.

I recommend running the code with 10k connections and maybe on worse hardware.

@fraenkel (Contributor) commented Mar 21, 2018

The only "leak" I see is with

net/http.(*persistConn).readLoop(0xc422be0c60)
	/snap/go/1473/src/net/http/transport.go:1601 +0x185
created by net/http.(*Transport).dialConn
	/snap/go/1473/src/net/http/transport.go:1237 +0x95a

The problem there is that it will never break out of the read, since it is blocking and all deadlines are useless.

I do not see any other leaks.

I do see the persistConn.readLoop stacks when the initial stacks are dumped, but if I wait and then dump again, they are gone. The dialConn ones are still present.

@twmb (Contributor, Author) commented Mar 21, 2018

I have just re-run this by copying the code out of the play link in #24455 (comment) and applying my bump-to-10k and hang-with-select modifications.

I ran this and then waited 5 minutes before checking the goroutines. At 5 minutes:

1020 @ 0x4795a5 0x4774ff 0x475e49 0x48e0e9 0x4fed2f 0x5118c8 0x67e348 0x56d1a5 0x56d2fa 0x67eea8 0x45ad31
#	0x4795a4	syscall.Syscall+0x4			/home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
#	0x4774fe	syscall.read+0x5e			/home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
#	0x475e48	syscall.Read+0x48			/home/twmb/go/go/src/syscall/syscall_unix.go:172
#	0x48e0e8	internal/poll.(*FD).Read+0x118		/home/twmb/go/go/src/internal/poll/fd_unix.go:159
#	0x4fed2e	net.(*netFD).Read+0x4e			/home/twmb/go/go/src/net/fd_unix.go:202
#	0x5118c7	net.(*conn).Read+0x67			/home/twmb/go/go/src/net/net.go:176
#	0x67e347	net/http.(*persistConn).Read+0x77	/home/twmb/go/go/src/net/http/transport.go:1453
#	0x56d1a4	bufio.(*Reader).fill+0x104		/home/twmb/go/go/src/bufio/bufio.go:100
#	0x56d2f9	bufio.(*Reader).Peek+0x39		/home/twmb/go/go/src/bufio/bufio.go:132
#	0x67eea7	net/http.(*persistConn).readLoop+0x187	/home/twmb/go/go/src/net/http/transport.go:1601

988 @ 0x42f085 0x42a4ec 0x429b86 0x48d23a 0x48d2bd 0x48e149 0x4fed2f 0x5118c8 0x666d7c 0x56d1a5 0x56de4c 0x56e064 0x5f7f50 0x5f7d4b 0x6616fc 0x668012 0x66bc1a 0x45ad31
#	0x429b85	internal/poll.runtime_pollWait+0x65		/home/twmb/go/go/src/runtime/netpoll.go:173
#	0x48d239	internal/poll.(*pollDesc).wait+0x99		/home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
#	0x48d2bc	internal/poll.(*pollDesc).waitRead+0x3c		/home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
#	0x48e148	internal/poll.(*FD).Read+0x178			/home/twmb/go/go/src/internal/poll/fd_unix.go:163
#	0x4fed2e	net.(*netFD).Read+0x4e				/home/twmb/go/go/src/net/fd_unix.go:202
#	0x5118c7	net.(*conn).Read+0x67				/home/twmb/go/go/src/net/net.go:176
#	0x666d7b	net/http.(*connReader).Read+0xfb		/home/twmb/go/go/src/net/http/server.go:765
#	0x56d1a4	bufio.(*Reader).fill+0x104			/home/twmb/go/go/src/bufio/bufio.go:100
#	0x56de4b	bufio.(*Reader).ReadSlice+0x2b			/home/twmb/go/go/src/bufio/bufio.go:341
#	0x56e063	bufio.(*Reader).ReadLine+0x33			/home/twmb/go/go/src/bufio/bufio.go:370
#	0x5f7f4f	net/textproto.(*Reader).readLineSlice+0x6f	/home/twmb/go/go/src/net/textproto/reader.go:55
#	0x5f7d4a	net/textproto.(*Reader).ReadLine+0x2a		/home/twmb/go/go/src/net/textproto/reader.go:36
#	0x6616fb	net/http.readRequest+0x8b			/home/twmb/go/go/src/net/http/request.go:929
#	0x668011	net/http.(*conn).readRequest+0x161		/home/twmb/go/go/src/net/http/server.go:945
#	0x66bc19	net/http.(*conn).serve+0x4b9			/home/twmb/go/go/src/net/http/server.go:1769

I would not be surprised if your hardware is better than mine and the problem simply does not show up there. What more can I do in this situation?

@fraenkel (Contributor) commented Mar 21, 2018

I can tell you that I ran strace with and without the File() call to see what the underlying difference was. I needed fewer than 5 connections to show that on the normal path a read would fail with EAGAIN, which did not and would not occur once the connection was blocking.

@bcmills (Member) commented Mar 21, 2018

@fraenkel Do you have a fix in mind for the dialConn / readLoop hang, or does this need further investigation?

@twmb (Contributor, Author) commented Mar 21, 2018

Potentially related to #24483, but I have not seen the same race trace while repeatedly running this.

@twmb (Contributor, Author) commented Mar 21, 2018

Also potentially related (outside of the File call in particular):
https://groups.google.com/forum/#!topic/golang-nuts/zqsC6xcnP24
#16060
go-sql-driver/mysql#285
jlaffaye/ftp#112

@fraenkel (Contributor) commented Mar 21, 2018

@bcmills It would need further investigation, but this is a slippery slope: the poller assumes edge-triggered mode is enabled.
This is very similar to #24331, which is attempting to add a new method to prevent the flip.

@twmb (Contributor, Author) commented Apr 19, 2018

Fixed by #24942 (but maybe the underlying issue still exists with nonblocking connections)

@twmb closed this on Apr 19, 2018

@golang locked and limited conversation to collaborators on Apr 19, 2019
