New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net, internal/poll, runtime: remove mutex from UDP reads/writes #17520

Open
neild opened this Issue Oct 19, 2016 · 6 comments

Comments

Projects
None yet
5 participants
@neild
Contributor

neild commented Oct 19, 2016

Reads and writes on net.UDPConns are guarded by a mutex. Contention on the mutex makes it difficult to efficiently handle UDP requests concurrently. Or perhaps I'm overlooking the right way to do this.

The attached benchmark attempts to demonstrate the problem:
socks_test.go.txt

Annotated benchmark results from my desktop:

# All tests are of a server reading 64-byte UDP messages and responding to them.
#
# /echo tests are of an echo server--no processing is done of messages, so
# the test time is entirely spent in socket operations.
#
# /sha tests compute a SHA-256 sum of the input message 50 times, to simulate
# doing a small amount of real work per message.

# The read_1 tests process messages in serial on a single goroutine.
#
# Increasing GOMAXPROCS introduces a minor inefficiency for some reason,
# but these results are largely what you would expect from a non-concurrent server.
BenchmarkUDP/read_1/echo                 1000000              8698 ns/op
BenchmarkUDP/read_1/echo-2               1000000             11229 ns/op
BenchmarkUDP/read_1/echo-4               1000000             11873 ns/op
BenchmarkUDP/read_1/sha                   200000             29676 ns/op
BenchmarkUDP/read_1/sha-2                 200000             30997 ns/op
BenchmarkUDP/read_1/sha-4                 200000             35817 ns/op

# The read_n tests start multiple goroutines, each of which reads from
# and writes to a shared UDP socket.
#
# Increasing the number of goroutines causes the server to become slower,
# presumably due to lock contention on the socket.
BenchmarkUDP/read_n/echo                 1000000             10201 ns/op
BenchmarkUDP/read_n/echo-2                500000             19274 ns/op
BenchmarkUDP/read_n/echo-4                300000             24263 ns/op
BenchmarkUDP/read_n/sha                   200000             29522 ns/op
BenchmarkUDP/read_n/sha-2                 200000             41015 ns/op
BenchmarkUDP/read_n/sha-4                 200000             58748 ns/op

# The read_1n1 tests start one reader, one writer, and multiple worker goroutines
# connected by channels.
#
# Increasing the number of worker goroutines does not improve performance here either,
# presumably due to lock contention on the channels.
BenchmarkUDP/read_1n1/echo               1000000             11194 ns/op
BenchmarkUDP/read_1n1/echo-2              500000             20991 ns/op
BenchmarkUDP/read_1n1/echo-4              300000             28297 ns/op
BenchmarkUDP/read_1n1/sha                 200000             39178 ns/op
BenchmarkUDP/read_1n1/sha-2               200000             45770 ns/op
BenchmarkUDP/read_1n1/sha-4               200000             38197 ns/op

# The read_fake tests just run the work function in a loop without network operations.
# Performance scales mostly linearly with the number of worker goroutines.
BenchmarkUDP/read_fake/echo             2000000000               4.05 ns/op
BenchmarkUDP/read_fake/echo-2           3000000000               2.00 ns/op
BenchmarkUDP/read_fake/echo-4           10000000000              1.02 ns/op
BenchmarkUDP/read_fake/sha                300000             21178 ns/op
BenchmarkUDP/read_fake/sha-2              500000             10691 ns/op
BenchmarkUDP/read_fake/sha-4             1000000              5609 ns/op
@neild

This comment has been minimized.

Contributor

neild commented Oct 19, 2016

Forgot to mention the Go version used:

$ go version
go version go1.7.1 linux/amd64

@rsc rsc added this to the Go1.9Early milestone Oct 20, 2016

@rsc

This comment has been minimized.

Contributor

rsc commented Oct 20, 2016

We might be able to fall back to using the lock only in the blocking-on-fd path.

@mikioh

This comment has been minimized.

Contributor

mikioh commented Apr 12, 2017

Just skimmed runtime and internal/poll packages. It looks like we need to do two things: a) relax runtime.netpollblock for accommodating multiple IO wait requests from consumer goroutines using datagram protocols, b) replace the use of existing read/write{Lock,Unlock} methods of poll.fdMutex with new IO wait contact between runtime package.

@mikioh mikioh changed the title from net: remove mutex from UDP reads/writes to net, internal/poll, runtime: remove mutex from UDP reads/writes Apr 12, 2017

@bradfitz bradfitz added the Performance label May 3, 2017

@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.9Early May 3, 2017

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jun 2, 2017

I can't recreate the slowdown on tip. This is what I see from go test -bench=. -cpu=1,2,4,8 socks_test.go. There is some fluctuation, but I see no sign of the clear slowdown in the original bug report. Of course, I don't see a speedup either.

goos: linux
goarch: amd64
BenchmarkUDP/read_1/echo           	  200000	     10298 ns/op
BenchmarkUDP/read_1/echo-2         	  200000	     12372 ns/op
BenchmarkUDP/read_1/echo-4         	  200000	     11717 ns/op
BenchmarkUDP/read_1/echo-8         	  200000	     11989 ns/op
BenchmarkUDP/read_1/sha            	   20000	     58078 ns/op
BenchmarkUDP/read_1/sha-2          	   30000	     58502 ns/op
BenchmarkUDP/read_1/sha-4          	   30000	     58602 ns/op
BenchmarkUDP/read_1/sha-8          	   30000	     57665 ns/op
BenchmarkUDP/read_n/echo           	  200000	     10120 ns/op
BenchmarkUDP/read_n/echo-2         	  100000	     13474 ns/op
BenchmarkUDP/read_n/echo-4         	  100000	     13023 ns/op
BenchmarkUDP/read_n/echo-8         	  100000	     13747 ns/op
BenchmarkUDP/read_n/sha            	   30000	     58163 ns/op
BenchmarkUDP/read_n/sha-2          	   20000	     51673 ns/op
BenchmarkUDP/read_n/sha-4          	   30000	     48593 ns/op
BenchmarkUDP/read_n/sha-8          	   50000	     40292 ns/op
BenchmarkUDP/read_1n1/echo         	  100000	     16703 ns/op
BenchmarkUDP/read_1n1/echo-2       	  100000	     14138 ns/op
BenchmarkUDP/read_1n1/echo-4       	  100000	     14869 ns/op
BenchmarkUDP/read_1n1/echo-8       	  100000	     13253 ns/op
BenchmarkUDP/read_1n1/sha          	   30000	     60431 ns/op
BenchmarkUDP/read_1n1/sha-2        	   30000	     44566 ns/op
BenchmarkUDP/read_1n1/sha-4        	   50000	     30539 ns/op
BenchmarkUDP/read_1n1/sha-8        	  100000	     18808 ns/op
BenchmarkUDP/read_fake/echo        	500000000	         2.95 ns/op
BenchmarkUDP/read_fake/echo-2      	2000000000	         1.50 ns/op
BenchmarkUDP/read_fake/echo-4      	2000000000	         0.78 ns/op
BenchmarkUDP/read_fake/echo-8      	2000000000	         0.52 ns/op
BenchmarkUDP/read_fake/sha         	   30000	     45475 ns/op
BenchmarkUDP/read_fake/sha-2       	  100000	     19642 ns/op
BenchmarkUDP/read_fake/sha-4       	  200000	     10297 ns/op
BenchmarkUDP/read_fake/sha-8       	  200000	      7180 ns/op
PASS
ok  	command-line-arguments	63.536s
@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jun 2, 2017

I don't see a clear advantage to modifying the runtime netpoll code to support multiple goroutines handling a single socket. The netpoll code is only really relevant if we have to wait for a packet, and if we have to wait for a packet then serializing the reads doesn't matter too much. For real cases we should be able to get a new goroutine ready to read in the time it takes a new packet to arrive anyhow. I think the interesting case is when there is a packet available already, which is what Russ's suggestions is intended to address.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jun 2, 2017

I can get a small speedup on this test case by calling syscall.Recvfrom as we enter internal/poll.FD.ReadFrom, and returning immediately if we get anything other than EAGAIN. But if I add the incref/decref locking that that requires for safety, then I don't get any speedup any more.

So this requires more investigation to see if there is anything to be done. Moving to unplanned unless and until someone comes up with more information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment