net/http: apparent deadlock in TestTransportDialCancelRace #45211

Open · bcmills opened this issue Mar 24, 2021 · 16 comments

@bcmills (Member) commented Mar 24, 2021

It's not obvious to me whether this deadlock is a bug in the specific test function, the net/http package, or *httptest.Server in particular.

2021-03-24T14:20:32-747f426/linux-arm64-aws

…

goroutine 3475 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff94117d10, 0x72, 0xffffffffffffffff)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400030a898, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400030a880, 0x400048b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400030a880, 0x400048b000, 0x1000, 0x1000, 0xffff94e94f01, 0x40003bd680, 0x4000432668)
	/workdir/go/src/net/fd_posix.go:56 +0x44
net.(*conn).Read(0x400037c168, 0x400048b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/net/net.go:183 +0x4c
net/http.(*connReader).Read(0x40000a1b60, 0x400048b000, 0x1000, 0x1000, 0xffff94275900, 0x1, 0x4000432788)
	/workdir/go/src/net/http/server.go:800 +0x268
bufio.(*Reader).fill(0x40005e6300)
	/workdir/go/src/bufio/bufio.go:101 +0x10c
bufio.(*Reader).ReadSlice(0x40005e6300, 0x1df0a, 0x40004327e8, 0x1000101d7b4, 0xffff94279340, 0x0, 0xf0)
	/workdir/go/src/bufio/bufio.go:360 +0x38
bufio.(*Reader).ReadLine(0x40005e6300, 0x40004ea000, 0x1000, 0x4000352b40, 0x77, 0xf0, 0x4000432898)
	/workdir/go/src/bufio/bufio.go:389 +0x30
net/textproto.(*Reader).readLineSlice(0x40000a1440, 0x40003a5200, 0x4000432968, 0xab164, 0x400015cc20, 0x0)
	/workdir/go/src/net/textproto/reader.go:57 +0x84
net/textproto.(*Reader).ReadLine(...)
	/workdir/go/src/net/textproto/reader.go:38
net/http.readRequest(0x40005e6300, 0x0, 0x40003a5200, 0x0, 0x0)
	/workdir/go/src/net/http/request.go:1028 +0x74
net/http.(*conn).readRequest(0x400011f900, 0x539c98, 0x40003bd640, 0x0, 0x0, 0x0)
	/workdir/go/src/net/http/server.go:986 +0x21c
net/http.(*conn).serve(0x400011f900, 0x539d40, 0x40003bd640)
	/workdir/go/src/net/http/server.go:1878 +0x844
created by net/http.(*Server).Serve
	/workdir/go/src/net/http/server.go:3013 +0x4b0

…

goroutine 3439 [semacquire, 2 minutes]:
internal/poll.runtime_Semacquire(0x400015ce28)
	/workdir/go/src/runtime/sema.go:61 +0x38
internal/poll.(*FD).Close(0x400015ce00, 0x400015ce00, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:116 +0x80
net.(*netFD).Close(0x400015ce00, 0x0, 0x40000ca1c8)
	/workdir/go/src/net/fd_posix.go:38 +0x48
net.(*TCPListener).close(...)
	/workdir/go/src/net/tcpsock_posix.go:157
net.(*TCPListener).Close(0x40003a1308, 0x4, 0x40004213a0)
	/workdir/go/src/net/tcpsock.go:275 +0x34
net/http/httptest.(*Server).Close(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:204 +0xd0
net/http_test.TestTransportDialCancelRace(0x40004ec180)
	/workdir/go/src/net/http/transport_test.go:4015 +0x308
testing.tRunner(0x40004ec180, 0x4b91d8)
	/workdir/go/src/testing/testing.go:1235 +0x100
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1280 +0x350

…

goroutine 3490 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff941180b0, 0x72, 0xffffffffffffffff)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400015cf18, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400015cf00, 0x40004df000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400015cf00, 0x40004df000, 0x1000, 0x1000, 0x400035cc68, 0xe53, 0x0)
	/workdir/go/src/net/fd_posix.go:56 +0x44
net.(*conn).Read(0x40006a61f0, 0x40004df000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/workdir/go/src/net/net.go:183 +0x4c
net/http.(*persistConn).Read(0x40000b0900, 0x40004df000, 0x1000, 0x1000, 0x60, 0x0, 0x101801)
	/workdir/go/src/net/http/transport.go:1922 +0x64
bufio.(*Reader).fill(0x40004c68a0)
	/workdir/go/src/bufio/bufio.go:101 +0x10c
bufio.(*Reader).Peek(0x40004c68a0, 0x1, 0x40003fb620, 0x483939, 0x3, 0x486b25, 0xb)
	/workdir/go/src/bufio/bufio.go:139 +0x74
net/http.(*persistConn).readLoop(0x40000b0900)
	/workdir/go/src/net/http/transport.go:2083 +0x170
created by net/http.(*Transport).dialConn
	/workdir/go/src/net/http/transport.go:1743 +0x18b0

…

goroutine 3440 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff94117c28, 0x72, 0x0)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400015ce18, 0x72, 0x0, 0x0, 0x484ab9)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x400015ce00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:402 +0x204
net.(*netFD).accept(0x400015ce00, 0x4000182cb8, 0x3755834c00, 0x12b9526b)
	/workdir/go/src/net/fd_unix.go:173 +0x2c
net.(*TCPListener).accept(0x40003a1308, 0x400037c168, 0x40002a1e08, 0x53a74)
	/workdir/go/src/net/tcpsock_posix.go:140 +0x2c
net.(*TCPListener).Accept(0x40003a1308, 0x40002a1e70, 0x40002a1e78, 0x18, 0x40001f1680)
	/workdir/go/src/net/tcpsock.go:262 +0x34
net/http.(*Server).Serve(0x400067eee0, 0x539100, 0x40003a1308, 0x0, 0x0)
	/workdir/go/src/net/http/server.go:2981 +0x364
net/http/httptest.(*Server).goServe.func1(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:308 +0x68
created by net/http/httptest.(*Server).goServe
	/workdir/go/src/net/http/httptest/server.go:306 +0x58

…

CC @neild @bradfitz @empijei

@bcmills added this to the Backlog milestone Mar 24, 2021

@neild (Contributor) commented Mar 24, 2021

I think this is an internal/poll race, although I might be missing something:

goroutine 3439 [semacquire, 2 minutes]:
internal/poll.runtime_Semacquire(0x400015ce28)
	/workdir/go/src/runtime/sema.go:61 +0x38
internal/poll.(*FD).Close(0x400015ce00, 0x400015ce00, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:116 +0x80
net.(*netFD).Close(0x400015ce00, 0x0, 0x40000ca1c8)
	/workdir/go/src/net/fd_posix.go:38 +0x48
net.(*TCPListener).close(...)
	/workdir/go/src/net/tcpsock_posix.go:157
net.(*TCPListener).Close(0x40003a1308, 0x4, 0x40004213a0)
	/workdir/go/src/net/tcpsock.go:275 +0x34
net/http/httptest.(*Server).Close(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:204 +0xd0
net/http_test.TestTransportDialCancelRace(0x40004ec180)
	/workdir/go/src/net/http/transport_test.go:4015 +0x308
testing.tRunner(0x40004ec180, 0x4b91d8)
	/workdir/go/src/testing/testing.go:1235 +0x100
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1280 +0x350

goroutine 3440 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0xffff94117c28, 0x72, 0x0)
	/workdir/go/src/runtime/netpoll.go:229 +0xa4
internal/poll.(*pollDesc).wait(0x400015ce18, 0x72, 0x0, 0x0, 0x484ab9)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x38
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x400015ce00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/workdir/go/src/internal/poll/fd_unix.go:402 +0x204
net.(*netFD).accept(0x400015ce00, 0x4000182cb8, 0x3755834c00, 0x12b9526b)
	/workdir/go/src/net/fd_unix.go:173 +0x2c
net.(*TCPListener).accept(0x40003a1308, 0x400037c168, 0x40002a1e08, 0x53a74)
	/workdir/go/src/net/tcpsock_posix.go:140 +0x2c
net.(*TCPListener).Accept(0x40003a1308, 0x40002a1e70, 0x40002a1e78, 0x18, 0x40001f1680)
	/workdir/go/src/net/tcpsock.go:262 +0x34
net/http.(*Server).Serve(0x400067eee0, 0x539100, 0x40003a1308, 0x0, 0x0)
	/workdir/go/src/net/http/server.go:2981 +0x364
net/http/httptest.(*Server).goServe.func1(0x40004f6000)
	/workdir/go/src/net/http/httptest/server.go:308 +0x68
created by net/http/httptest.(*Server).goServe
	/workdir/go/src/net/http/httptest/server.go:306 +0x58

One goroutine is blocked in internal/poll.(*FD).Close waiting to acquire fd.csema:
https://go.googlesource.com/go/+/747f426944/src/internal/poll/fd_unix.go#116

The other goroutine is blocked in internal/poll.(*FD).Accept waiting for the FD to become readable:
https://go.googlesource.com/go/+/747f426944/src/internal/poll/fd_unix.go#402
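
For readers skimming the traces, here is a minimal sketch of the interaction those two goroutines represent (a bare listener here, not the actual test, which goes through *httptest.Server): Close is expected to evict the descriptor, wake the blocked Accept, and then wait for Accept to drop its reference.

```go
// Minimal sketch of the Accept/Close interplay in the traces above; the
// real code path goes through *httptest.Server rather than a bare listener.
package main

import (
	"net"
	"sync"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			c, err := ln.Accept() // parks in internal/poll.(*FD).Accept
			if err != nil {
				return // expected once Close evicts the descriptor
			}
			c.Close()
		}
	}()

	// Close should wake the blocked Accept (via pd.evict/runtime_pollUnblock)
	// and then wait on fd.csema for Accept to release its reference. In the
	// hang above, Close is stuck on that semaphore because Accept never woke.
	ln.Close()
	wg.Wait()
}
```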

@bradfitz (Contributor) commented Mar 24, 2021

Maybe unrelated, but we're actively debugging an issue (with Go 1.16.2) where some users get into a state where it seems internal/poll is forever broken with tens of thousands of UDP writes stuck in goroutines like:

goroutine 181363 [semacquire, 152 minutes]:
internal/poll.runtime_Semacquire(0xc00011028c)
	runtime/sema.go:61 +0x45
internal/poll.(*fdMutex).rwlock(0xc000110280, 0x10d599400, 0x4)
	internal/poll/fd_mutex.go:154 +0xb3
internal/poll.(*FD).writeLock(...)
	internal/poll/fd_mutex.go:239
internal/poll.(*FD).WriteTo(0xc000110280, 0xc0007809f0, 0x28, 0x28, 0x10d7080c8, 0xc000333d80, 0x0, 0x0, 0x0)
	internal/poll/fd_unix.go:331 +0x5a
net.(*netFD).writeTo(0xc000110280, 0xc0007809f0, 0x28, 0x28, 0x10d7080c8, 0xc000333d80, 0x10, 0x0, 0x4)
	net/fd_posix.go:79 +0x6f
net.(*UDPConn).writeTo(0xc00057e080, 0xc0007809f0, 0x28, 0x28, 0xc00071cff0, 0xc008948768, 0xc0089489a0, 0x3)
	net/udpsock_posix.go:79 +0xd6
net.(*UDPConn).WriteTo(0xc00057e080, 0xc0007809f0, 0x28, 0x28, 0x10d70a4a0, 0xc00071cff0, 0xad0000c005f028c0, 0x10d1f65b6, 0xadab3d94fe203440)
	net/udpsock.go:176 +0x86
tailscale.com/wgengine/magicsock.(*RebindingUDPConn).WriteTo(0xc000214120, 0xc0007809f0, 0x28, 0x28, 0x10d70a4a0, 0xc00071cff0, 0xc0003c4880, 0xc000679f01, 0x10d506c05)
	tailscale.com@v0.0.0-00010101000000-000000000000/wgengine/magicsock/magicsock.go:2947 +0xbe

There's no evidence (panics in logs) that we hit internal/poll's limit of one million concurrent users of that UDP socket, though.

@rsc (Contributor) commented Mar 24, 2021

@bradfitz what architecture?

@ianlancetaylor (Contributor) commented Mar 24, 2021

@neild What is supposed to happen there is that (*FD).Close earlier called fd.pd.evict, which calls runtime_pollUnblock, and that is supposed to wake up the call sleeping in runtime_pollWait.

Looking at the netpoll code, it does seem possible that poll_runtime_pollUnblock sets fd.closing just before a different goroutine calls poll_runtime_pollWait. I'm not immediately sure what prevents that.
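
To make the intended ordering concrete, here is a deliberately simplified, lock-based model of that wait/unblock handshake. pollState, wait, and evict are made-up names, not runtime identifiers, and the real netpoll code is lock-free, which is exactly where the ordering question arises.

```go
// Simplified, lock-based model of the wait/unblock handshake. The mutex is
// what makes "waiter checks closing just before the closer sets it" safe in
// this model; the question above is what provides the equivalent guarantee
// in the lock-free netpoll code.
package main

import (
	"errors"
	"sync"
)

type pollState struct { // hypothetical stand-in for the runtime's pollDesc
	mu      sync.Mutex
	closing bool
	wake    chan struct{} // closed by evict to release parked waiters
}

func newPollState() *pollState {
	return &pollState{wake: make(chan struct{})}
}

// wait models runtime_pollWait: return immediately if the descriptor is
// closing, otherwise park until evict wakes us (I/O readiness is omitted).
func (p *pollState) wait() error {
	p.mu.Lock()
	if p.closing {
		p.mu.Unlock()
		return errors.New("use of closed file descriptor")
	}
	ch := p.wake
	p.mu.Unlock()
	<-ch
	return errors.New("use of closed file descriptor")
}

// evict models fd.pd.evict / runtime_pollUnblock: mark the descriptor as
// closing and wake every parked waiter.
func (p *pollState) evict() {
	p.mu.Lock()
	p.closing = true
	close(p.wake)
	p.mu.Unlock()
}

func main() {
	p := newPollState()
	go p.evict()
	_ = p.wait() // must return promptly, never hang
}
```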

@ianlancetaylor (Contributor) commented Mar 24, 2021

@bradfitz That seems like a different problem to me. Any idea what is holding the write lock? Do we even need to hold a write lock during (*FD).WriteTo and (*FD).WriteMsg?

@bradfitz (Contributor) commented Mar 24, 2021

@rsc, at least amd64 for the user we're debugging.

@ianlancetaylor, it's not obvious from the stack traces. Nothing stuck in a system call doing a write, for instance. I can email it to you privately.

@neild (Contributor) commented Mar 24, 2021

@ianlancetaylor I think the atomic.StorepNoWB in poll_runtime_pollUnblock is supposed to create a memory barrier ensuring that the write to pd.closing is observable in netpollblock after the atomic.Casuintptr call.

@ianlancetaylor (Contributor) commented Mar 24, 2021

@neild Thanks. Now I wonder whether runtime.arm64HasATOMICS gets set to true on this system.

@ianlancetaylor (Contributor) commented Mar 24, 2021

@bradfitz Sure, I can take a look, though I don't suppose I'll see anything. But something must be holding the lock, somehow.

@ianlancetaylor (Contributor) commented Mar 24, 2021

@bradfitz sent a stack trace offline. The stack trace showed a goroutine holding the lock that all the other goroutines are waiting for:

goroutine 180969 [IO wait, 150 minutes]:
internal/poll.runtime_pollWait(0x10df1bb40, 0x77, 0x10d708d08)
	runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000110298, 0x77, 0x0, 0x200, 0x0)
	internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
	internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).WriteTo(0xc000110280, 0xc00058e200, 0x5c, 0x200, 0x10d7080c8, 0xc0001547e0, 0x0, 0x0, 0x0)
	internal/poll/fd_unix.go:344 +0x1b5
net.(*netFD).writeTo(0xc000110280, 0xc00058e200, 0x5c, 0x200, 0x10d7080c8, 0xc0001547e0, 0xc0005a6001, 0x1da0acc6ae8f8, 0x11606b1c8)
	net/fd_posix.go:79 +0x6f
net.(*UDPConn).writeTo(0xc00057e080, 0xc00058e200, 0x5c, 0x200, 0xc0007d6090, 0x0, 0xc0005e9aa8, 0x10d1f74c8)
	net/udpsock_posix.go:79 +0xd6
net.(*UDPConn).WriteTo(0xc00057e080, 0xc00058e200, 0x5c, 0x200, 0x10d70a4a0, 0xc0007d6090, 0x0, 0xc0005e9b48, 0x10d1f6c8e)
	net/udpsock.go:176 +0x86

This means that the sendto system call returned EAGAIN, and the goroutine is waiting for epoll to report that the socket is available for writing.

Normally EAGAIN means that the socket buffer is full. However, looking at the man page for sendto, it appears that EAGAIN can be returned if the socket was not bound to a local port, and there are no ports available in the ephemeral range. Could that possibly be the case here? If so, that could be a problem, because we use epoll in edge-triggered mode, so if the socket was ready for writing although it did not have an ephemeral port, epoll would never tell the program to try the sendto again.

How is this UDP socket created? Is it remotely plausible that the ephemeral ports were exhausted on this system?
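
Roughly, the failure mode being described looks like this. The sketch below is over raw syscalls rather than internal/poll itself, and waitWritable is a hypothetical stand-in for the netpoll-based wait.

```go
// Sketch of the write path being described, over raw syscalls. With
// edge-triggered epoll/kqueue, a spurious EAGAIN on a socket that is
// already writable means no readiness event is coming, so the wait below
// would never return.
package sketch

import "syscall"

func writeOnce(fd int, p []byte, to syscall.Sockaddr, waitWritable func(fd int) error) error {
	for {
		err := syscall.Sendto(fd, p, 0, to)
		if err == nil {
			return nil
		}
		if err != syscall.EAGAIN {
			return err
		}
		// EAGAIN: park until the poller says the socket is writable.
		// If the EAGAIN was spurious (e.g. ephemeral ports exhausted on
		// an unbound socket, per the man page reading above), this call
		// can block forever, matching the stuck goroutine in the trace.
		if err := waitWritable(fd); err != nil {
			return err
		}
	}
}
```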

@josharian (Contributor) commented Mar 24, 2021

I just recalled from the distant past (#18541) that Little Snitch running on macOS can cause EAGAIN in unexpected places. It is possible that the user that those stack traces are from is running Little Snitch (or some other network filter extension).

Possibly related, I just encountered some runtime poller failures on a new M1, and I also run Little Snitch. I will write them up for you tomorrow, or you can ask Brad for details if you want them sooner.

@josharian (Contributor) commented Mar 24, 2021

Also #18751. And FWIW, runtime.arm64HasATOMICS is true on my machine.

@bradfitz (Contributor) commented Mar 24, 2021

@ianlancetaylor, forgot to tell you in the email: those stacks were from macOS, not Linux. (not super obvious, but some have e.g. "golang.org/x/sys@v0.0.0-20210301091718-77cc2087c03b/unix/zsyscall_darwin_amd64.go:1691 +0x66").

The UDP socket is created from:

https://github.com/tailscale/tailscale/blob/c99f260e403f8aa6665c54bf9a7f25ca696fa480/wgengine/magicsock/magicsock.go#L2669

... which is effectively net.ListenPacket("udp4", ":41641") in this case, which I just confirmed from the user logs corresponding to those stacks.
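
For reference, a minimal sketch of that effective call (the destination address below is a placeholder, not anything from the logs); the point is that the local port is bound up front.

```go
// Minimal sketch of the effective call described above. The bind to :41641
// gives the socket a fixed local port; the destination is a placeholder.
package main

import (
	"log"
	"net"
)

func main() {
	pc, err := net.ListenPacket("udp4", ":41641")
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	dst := &net.UDPAddr{IP: net.IPv4(192, 0, 2, 1), Port: 41641} // placeholder peer
	if _, err := pc.WriteTo([]byte("ping"), dst); err != nil {
		log.Fatal(err)
	}
}
```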

Is it remotely plausible that the ephemeral ports were exhausted on this system?

That's quite likely, actually.

@neild (Contributor) commented Mar 24, 2021

FWIW, the net/http flake in TestTransportDialCancelRace is very reproducible on the linux-arm64-aws builder. This reproduces it for me every time:

gomote run $GOMOTE go/bin/go test net/http -run=DialCancelRace -cpu=4  -count=10000 -v -timeout=20s

@ianlancetaylor (Contributor) commented Mar 25, 2021

@bradfitz @josharian If the UDP socket is being created using net.ListenPacket then it's not an ephemeral port problem. But if Little Snitch can cause EAGAIN errors then it could be a kqueue problem, as our use of kqueue is also edge-triggered rather than level-triggered. If sendto can return an EAGAIN error when the UDP socket is writable, we can indeed block forever waiting for the socket to become writable when it already is.

Unfortunately I'm not sure how to verify that that is the problem, other than observing that I don't see how it could be anything else. The goroutine is clearly blocking waiting for kqueue to say that the socket is writable, and that clearly isn't happening.

Even more unfortunately I'm not sure how to fix this. We could reduce the severity by changing (*FD).WriteTo and (*FD).WriteMsg to attempt the call to syscall.Sendto and syscall.SendmsgN before acquiring the write lock. There is really no need to acquire the write lock to write a single packet. But if the system call returns EAGAIN, then we more or less do have to take the write lock, and then try the system call again, and if it returns EAGAIN again, we have to call into the netpoll system. And if the problem is indeed a spurious EAGAIN, then we can wait there forever.
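
A rough sketch of that reordering, with hypothetical stand-in names rather than internal/poll identifiers:

```go
// Rough sketch of the proposed reordering. sendOnce wraps one non-blocking
// syscall.Sendto/SendmsgN attempt, and lockedSendAndWait is the existing
// slow path that holds the write lock and parks on the poller; both are
// hypothetical stand-ins, not internal/poll identifiers.
package sketch

import "syscall"

func writePacket(
	sendOnce func() error,
	lock func(), unlock func(),
	lockedSendAndWait func() error,
) error {
	// Fast path: try the system call before taking the write lock; a
	// single packet send does not need the lock for correctness.
	if err := sendOnce(); err != syscall.EAGAIN {
		return err
	}
	// Slow path: take the lock, retry, and wait on the poller as today.
	// If the EAGAIN was spurious, this can still block forever; the
	// reordering only limits how many goroutines pile up behind the lock.
	lock()
	defer unlock()
	return lockedSendAndWait()
}
```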

@ianlancetaylor (Contributor) commented Mar 25, 2021

By the way, I think we're pretty clearly tracking two independent problems in this one issue.
