Here's a possibly related clue: I'm monitoring a test machine now which seems to have deadlocked in the x/net/http2 test, and the last message on the system console is:
Fsnewcall: incall queue full (64) on port 39931
It may be that too many simultaneous connection attempts are causing some to be dropped. I would have expected an error condition that the test would notice, not a deadlock, but I'll see if increasing the queue size helps.
Looking at internal/poll/fd*_plan9.go, I think there are at least three potential data races when a network i/o is being started and cancelled concurrently in different threads. An example of this situation, seen in the logs cited above, is an http server calling http.(*connReader).abortPendingRead to cancel an i/o which is being started by http.(*connReader).startBackgroundRead.
1. Cancelling a read by setting a deadline in the past: (*FD).Read checks fd.rtimedout and finds it false, so it calls newAsyncIO to launch a goroutine to perform the i/o. Now setDeadlineImpl sets fd.rtimedout to true (too late), checks fd.raio and finds the i/o goroutine hasn't been launched, so it returns without cancelling it. Then newAsyncIO launches the i/o goroutine, and returns to Read which sets fd.raio (too late). Meanwhile the i/o goroutine starts a read syscall which may never terminate (that's why it needed to be cancelled). The callers of Read and SetReadDeadline both deadlock waiting for the read to end.
2. Deadline expiry while async i/o is being launched: setDeadlineImpl registers a timer function whose purpose is to set fd.rtimedout and cancel the i/o goroutine if it's running. On a heavily loaded system, the i/o goroutine may be created after a delay, just as the timer interval expires, exposing the same races on fd.rtimedout and fd.raio as in 1. above.
3. Signalling the async i/o goroutine before the read syscall is executed: in this sequence setDeadlineImpl, or its timer expiry function, observes that fd.raio is set, and calls (*asyncIO).Cancel to send a hangup signal to the OS process running the i/o goroutine. The intention is to interrupt the read syscall so control can return to the i/o goroutine, allowing that to send a finish message to (*asyncIO).Wait. But if the hangup arrives before the syscall instruction has been executed, there's nothing to interrupt: the runtime.sighandler function will simply ignore the signal (because of runtime.ignoreHangup), and the read syscall will then proceed and possibly never terminate.
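To make the ordering in the first case above easier to follow, here is a minimal single-goroutine replay of it. Everything in it is a hypothetical model: fakeFD, fakeAIO and the method names are invented for illustration and are not the real internal/poll types. It only mirrors the two fields discussed (rtimedout and raio) and runs the steps in the problematic order.

```go
package main

import "fmt"

// fakeAIO is a hypothetical stand-in for an in-flight async read.
type fakeAIO struct{ cancelled bool }

// fakeFD is a hypothetical model of the two fields discussed above;
// it is not the real internal/poll.FD.
type fakeFD struct {
	rtimedout bool     // set once the read deadline has passed
	raio      *fakeAIO // non-nil while a read is considered in flight
}

// readCheck models the first half of Read: test the timeout flag.
func (fd *fakeFD) readCheck() bool { return !fd.rtimedout }

// readPublish models the second half of Read: record the in-flight i/o.
func (fd *fakeFD) readPublish(aio *fakeAIO) { fd.raio = aio }

// setDeadlinePast models setting a read deadline in the past: mark the
// fd as timed out and cancel whatever i/o has been published so far.
func (fd *fakeFD) setDeadlinePast() {
	fd.rtimedout = true
	if fd.raio != nil {
		fd.raio.cancelled = true
	}
}

// racyInterleaving replays the order of events from the first case,
// with the deadline call landing between Read's check and its publish.
func racyInterleaving() (started, cancelled bool) {
	fd := &fakeFD{}
	aio := &fakeAIO{}
	ok := fd.readCheck() // Read: rtimedout is false, so go ahead
	fd.setDeadlinePast() // deadline setter: raio is nil, nothing to cancel
	if ok {
		fd.readPublish(aio) // Read: raio is published too late
	}
	return ok, aio.cancelled
}

func main() {
	started, cancelled := racyInterleaving()
	fmt.Printf("read started: %v, read cancelled: %v\n", started, cancelled)
	// prints "read started: true, read cancelled: false"
}
```

In this model the read is started but never cancelled, which corresponds to the read syscall that may never terminate. The third case (a hangup arriving before the syscall) is a different kind of problem and is not captured by this sketch.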
There are data races on fd.[rw]aio and fd.[rw]timedout when Read/Write
is called on a polled fd concurrently with SetDeadline (see #38769).
Adding a mutex around accesses to each pair (read and write) prevents
the race, which was causing deadlocks in net/http tests on the builders.
Run-TryBot: David du Colombier <firstname.lastname@example.org>
TryBot-Result: Gobot Gobot <email@example.com>
Reviewed-by: David du Colombier <firstname.lastname@example.org>
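As a rough illustration of the mutex approach described in the commit message above, the sketch below guards a timed-out flag and an in-flight i/o pointer with one mutex, so that the check-and-publish in the reader and the flag-and-cancel in the deadline setter cannot interleave. The type and method names (guardedFD, aio, startRead, expireDeadline) are hypothetical and the actual change to internal/poll may be structured differently; this only demonstrates the locking idea.

```go
package main

import (
	"fmt"
	"sync"
)

// aio is a hypothetical stand-in for an in-flight async read.
type aio struct{ cancelled bool }

// guardedFD is a hypothetical model, not the real internal/poll.FD.
// rmu protects the pair (rtimedout, raio) as a unit.
type guardedFD struct {
	rmu       sync.Mutex
	rtimedout bool
	raio      *aio
}

// startRead checks the timeout flag and publishes the i/o under one
// lock. It returns nil if the deadline has already passed.
func (fd *guardedFD) startRead() *aio {
	fd.rmu.Lock()
	defer fd.rmu.Unlock()
	if fd.rtimedout {
		return nil
	}
	fd.raio = &aio{}
	return fd.raio
}

// expireDeadline sets the flag and cancels any published i/o under
// the same lock, so it sees either no read or a fully published one.
func (fd *guardedFD) expireDeadline() {
	fd.rmu.Lock()
	defer fd.rmu.Unlock()
	fd.rtimedout = true
	if fd.raio != nil {
		fd.raio.cancelled = true
	}
}

// trial races one read start against one deadline expiry and reports
// whether the outcome is safe: the read was refused or was cancelled.
func trial() bool {
	fd := &guardedFD{}
	var a *aio
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); a = fd.startRead() }()
	go func() { defer wg.Done(); fd.expireDeadline() }()
	wg.Wait()
	return a == nil || a.cancelled
}

func main() {
	lost := 0
	for i := 0; i < 1000; i++ {
		if !trial() {
			lost++
		}
	}
	fmt.Println("lost cancellations:", lost)
	// prints "lost cancellations: 0"
}
```

Whichever goroutine takes the lock first, the read is either refused or cancelled. This sketch says nothing about a hangup signal arriving before the read syscall has been entered, which is a separate concern from the races on the two fields.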