net/http: apparent deadlock in TestSOCKS5Proxy on plan9-arm builder #38769

Open
bcmills opened this issue Apr 30, 2020 · 4 comments

Comments

bcmills (Member) commented Apr 30, 2020

This looks like a possible deadlock involving the network poller on Plan 9:

goroutine 4622 [chan receive, 17 minutes]:
internal/poll.(*asyncIO).Wait(...)
	/boot/workdir/go/src/internal/poll/fd_io_plan9.go:84
internal/poll.(*FD).Read(0x10e5d730, 0x10daeca8, 0x111bc000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/boot/workdir/go/src/internal/poll/fd_plan9.go:73 +0x124
net.(*netFD).Read(0x10e5d730, 0x111bc000, 0x8000, 0x8000, 0x111bc000, 0x4b, 0x8000)
	/boot/workdir/go/src/net/fd_plan9.go:86 +0xb0
net.(*conn).Read(0x11097078, 0x111bc000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/boot/workdir/go/src/net/net.go:182 +0x6c
io.copyBuffer(0x4c14a0, 0x11097080, 0x10b400b0, 0x11097078, 0x111bc000, 0x8000, 0x8000, 0x10c1aa48, 0x10b47070, 0x3ff, ...)
	/boot/workdir/go/src/io/io.go:409 +0xd8
io.Copy(...)
	/boot/workdir/go/src/io/io.go:368
net.genericReadFrom(0x4c0ca8, 0x11097000, 0x10b400b0, 0x11097078, 0x3, 0x10e49a00, 0x10b50f70, 0x6e1a4)
	/boot/workdir/go/src/net/net.go:639 +0x7c
net.(*TCPConn).readFrom(...)
	/boot/workdir/go/src/net/tcpsock_plan9.go:14
net.(*TCPConn).ReadFrom(0x11097000, 0x10b400b0, 0x11097078, 0x10b47070, 0x11097000, 0x10b40001, 0x3f048)
	/boot/workdir/go/src/net/tcpsock.go:103 +0x44
io.copyBuffer(0x4c0ca8, 0x11097000, 0x10b400b0, 0x11097078, 0x0, 0x0, 0x0, 0x0, 0xf, 0x0, ...)
	/boot/workdir/go/src/io/io.go:395 +0x2ec
io.Copy(...)
	/boot/workdir/go/src/io/io.go:368
net/http_test.TestSOCKS5Proxy.func1(0x10efccc0)
	/boot/workdir/go/src/net/http/transport_test.go:1280 +0x930
created by net/http_test.TestSOCKS5Proxy.func3
	/boot/workdir/go/src/net/http/transport_test.go:1302 +0x94

2020-04-28T21:46:16-b1b6784/plan9-arm
2020-04-27T16:18:30-f1a2a0e/plan9-arm
2020-04-24T01:48:48-f5fcc9b/plan9-arm
2020-04-21T00:09:00-eec981e/plan9-arm
2020-04-11T16:20:57-b78109e/plan9-arm
2020-03-15T08:14:24-dc32553/plan9-arm
2020-02-24T15:55:01-90c71ce/plan9-arm
2020-02-22T04:31:20-f9c51e3/plan9-arm
2020-02-13T16:25:03-e237df5/plan9-arm
2020-01-19T14:04:09-8e0be05/plan9-arm
2020-01-02T22:26:56-73b657e/plan9-arm
2019-11-25T23:37:03-87805c9/plan9-arm
2019-11-15T22:47:41-72f333a/plan9-arm
2019-11-15T21:02:07-d856e05/plan9-arm
2019-11-12T05:35:33-99957b6/plan9-arm
2019-11-01T21:32:33-08a72c8/plan9-arm
2019-10-24T13:58:59-05ee506/plan9-arm
2019-10-16T00:05:37-c4817f5/plan9-arm
2019-10-10T22:38:38-03ef105/plan9-arm
2019-10-07T18:50:14-a0894ea/plan9-arm
2019-10-04T21:22:55-5f4aa5d/plan9-arm
2019-09-21T21:01:41-9c0e56b/plan9-arm
2019-09-20T05:02:16-66e562c/plan9-arm
2019-09-18T18:32:20-e70e0a6/plan9-arm
2019-09-16T21:37:16-0e0bff8/plan9-arm
2019-08-28T18:38:52-7ed973b/plan9-arm
2019-08-06T11:13:05-546ea78/plan9-arm
2019-07-29T20:29:54-7b8234b/plan9-arm
2019-06-14T18:41:07-9838f4d/plan9-arm
2019-06-05T23:57:39-e9073ef/plan9-arm
2019-06-05T21:21:59-fc70527/plan9-arm
2019-04-29T17:42:27-08b956f/plan9-arm

CC @millerresearch @ianlancetaylor @0intro

bcmills added this to the Unplanned milestone Apr 30, 2020

millerresearch commented May 20, 2020

This might not be specific to TestSOCKS5Proxy. There are similar-looking http deadlocks in runs where that test is not running, for example today.

But internal/poll/fd*_plan9.go still seems to be involved here too.

millerresearch commented May 21, 2020

Here's a possibly related clue: I'm monitoring a test machine now which seems to have deadlocked in the x/net/http2 test, and the last message on the system console is:

Fsnewcall: incall queue full (64) on port 39931

It may be that too many simultaneous connection attempts are causing some to be dropped. I would have expected an error condition that the test would notice, not a deadlock, but I'll see if increasing the queue size helps.

millerresearch commented May 25, 2020

> I'll see if increasing the queue size helps.

It didn't.

millerresearch commented May 25, 2020

Looking at internal/poll/fd*_plan9.go, I think there are at least three potential data races when a network i/o is being started and cancelled concurrently in different threads. An example of this situation, seen in the logs cited above, is an http server calling http.(*connReader).abortPendingRead to cancel an i/o which is being started by http.(*connReader).startBackgroundRead.

  1. Cancelling a read by setting a deadline in the past: (*FD).Read checks fd.rtimedout and finds it false, so it calls newAsyncIO to launch a goroutine to perform the i/o. Now setDeadlineImpl sets fd.rtimedout to true (too late), checks fd.raio and finds the i/o goroutine hasn't been launched, so it returns without cancelling it. Then newAsyncIO launches the i/o goroutine, and returns to Read which sets fd.raio (too late). Meanwhile the i/o goroutine starts a read syscall which may never terminate (that's why it needed to be cancelled). The callers of Read and SetReadDeadline both deadlock waiting for the read to end. (A simplified model of this interleaving is sketched after the list.)

  2. Deadline expiry while async i/o is being launched: setDeadlineImpl registers a timer function whose purpose is to set fd.rtimedout and cancel the i/o goroutine if it's running. On a heavily loaded system, the i/o goroutine may be created after a delay, just as the timer interval expires, exposing the same races on fd.rtimedout and fd.raio as in 1. above.

  3. Signalling the async i/o goroutine before the read syscall is executed: in this sequence setDeadlineImpl, or its timer expiry function, observes that fd.raio is set, and calls (*asyncIO).Cancel to send a hangup signal to the OS process running the i/o goroutine. The intention is to interrupt the read syscall so control can return to the i/o goroutine, allowing that to send a finish message to (*asyncIO).Wait. But if the hangup arrives before the syscall instruction has been executed, there's nothing to interrupt: the runtime.sighandler function will simply ignore the signal (because of runtime.ignoreHangup), and the read syscall will then proceed and possibly never terminate.
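To make the window in item 1 concrete, here is a minimal, self-contained Go model of that interleaving. It is only a sketch, not the real internal/poll code: the rtimedout/raio fields and the newAsyncIO/Cancel/Wait names are borrowed from fd_plan9.go and fd_io_plan9.go purely to mirror the description above, the read syscall is faked with a channel, and the checked/cancelled channels exist only to force the bad ordering deterministically (the real code has no such synchronization between the two paths, which is the point).

```go
package main

import (
	"fmt"
	"time"
)

// asyncIO stands in for the goroutine in fd_io_plan9.go that performs the
// blocking read syscall. Its done channel is only ever closed by Cancel;
// nothing else in this model will ever complete the "read".
type asyncIO struct{ done chan struct{} }

func newAsyncIO() *asyncIO { return &asyncIO{done: make(chan struct{})} }

func (a *asyncIO) Cancel() { close(a.done) } // stand-in for interrupting the syscall
func (a *asyncIO) Wait()   { <-a.done }      // (*FD).Read's caller blocks here

// fd carries only the two fields involved in the race. The checked and
// cancelled channels are model-only scaffolding that forces the interleaving
// from item 1; the real code has no equivalent ordering between the paths.
type fd struct {
	rtimedout bool
	raio      *asyncIO
	checked   chan struct{}
	cancelled chan struct{}
}

// read mimics (*FD).Read.
func (f *fd) read() {
	if f.rtimedout { // observes false: the deadline has not been set yet
		fmt.Println("read: deadline already expired, not starting i/o")
		return
	}
	close(f.checked)      // model-only: open the window for the canceller
	<-f.cancelled         // model-only: let the cancellation attempt finish first
	f.raio = newAsyncIO() // recorded only now, too late for the canceller to see
	f.raio.Wait()         // blocks forever: the cancellation has already come and gone
}

// setPastDeadline mimics setDeadlineImpl with a deadline in the past.
func (f *fd) setPastDeadline() {
	<-f.checked        // model-only: run inside the window, after read's check
	f.rtimedout = true // too late: read has already decided to start the i/o
	if f.raio != nil { // still nil, so there is nothing to cancel
		f.raio.Cancel()
	}
	close(f.cancelled)
}

func main() {
	f := &fd{checked: make(chan struct{}), cancelled: make(chan struct{})}
	done := make(chan struct{})
	go func() { f.read(); close(done) }()
	f.setPastDeadline()
	select {
	case <-done:
		fmt.Println("read returned (does not happen in this interleaving)")
	case <-time.After(2 * time.Second):
		fmt.Println("read is still blocked in Wait: the lost cancellation from item 1")
	}
}
```

Under that forced ordering the cancellation path inspects raio while it is still nil, so nothing ever interrupts the fake read and Wait blocks for good, which matches the goroutine stuck in internal/poll.(*asyncIO).Wait in the traceback above. Item 2 reaches the same window via the timer callback, and item 3 is the analogous lost wakeup on the hangup note itself. One possible direction, purely as a sketch, is to make the rtimedout check and the recording of raio atomic with respect to the deadline path (for example, a mutex held across both); whether that is the right shape for the real poller is exactly what needs deciding here.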
