net: poor performance of Dial & friends #18601
help wanted NeedsInvestigation
@bradfitz asked me to file a bug as a continuation of https://groups.google.com/forum/#!topic/golang-nuts/52gePwVq2sc
What version of Go are you using?
What operating system and processor architecture are you using?
The numbers below are all from a 24-core Intel machine, running Linux with Go 1.7.3, cross compiled from OS X. The machine has a multi-queue nic with RSS enabled.
What did you do?
Attempt ~100k TCP connections per second to remote hosts (same local network) using DialContext().
What did you expect to see?
At least 10k TCP connections/second per core, scaling close to linearly with cores. Ideally with error rates close to those seen in C.
What did you see instead?
GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second.
Full email below.
My problem domain is such that I need to make a large number of TCP connections from a small set of hosts to many other hosts (targets), on a local network. The connections are short lived, usually <200ms, and transfer <100 bytes in each direction. I need to do about 100k connections/second per source host.
The numbers below are all from a 24-core Intel machine, running Linux with Go 1.7.3, cross compiled from OS X. The machine has a multi-queue nic with RSS enabled. The targets are multiple machines running Go servers listening on 200 ports each (to avoid 5-tuple exhaustion).
My Go code spawns N goroutines, each of which calls net.Dial(), performs the transaction and then sleeps for 1s.
With this approach, GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second. Removing the context timeout passed to Dial() improves performance to the point where GOMAXPROCS=1 can do 10k conns/second at a 1% timeout rate with a 200ms deadline.
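For readers who want to reproduce the shape of this workload, here is a minimal sketch of the goroutine-per-connection approach described above. It is not my production code: the loopback echo server stands in for the remote targets, and the names startEchoServer, transact and run, the "ping" payload, and the bounded round count are all assumptions made so the example terminates on its own.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

// startEchoServer is a hypothetical stand-in for the remote targets: it
// listens on a loopback port and echoes back whatever it receives.
func startEchoServer() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(c, c)
			}(c)
		}
	}()
	return ln, nil
}

// transact dials addr, sends a small payload and reads the echo back.
// The whole exchange, including the dial, must finish within deadline.
func transact(addr string, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()

	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	if dl, ok := ctx.Deadline(); ok {
		conn.SetDeadline(dl) // apply the same deadline to the read and write
	}

	msg := []byte("ping")
	if _, err := conn.Write(msg); err != nil {
		return err
	}
	buf := make([]byte, len(msg))
	_, err = io.ReadFull(conn, buf)
	return err
}

// run spawns n goroutines. Each performs `rounds` transactions, sleeping
// for `pause` after each one, and the totals are returned once all finish.
func run(addr string, n, rounds int, deadline, pause time.Duration) (ok, failed int64) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := 0; r < rounds; r++ {
				if err := transact(addr, deadline); err != nil {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				time.Sleep(pause)
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	ln, err := startEchoServer()
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer ln.Close()

	ok, failed := run(ln.Addr().String(), 100, 3, 400*time.Millisecond, 10*time.Millisecond)
	fmt.Printf("ok=%d failed=%d\n", ok, failed)
}
```

Against real targets you would pass remote addresses, raise n and set pause to 1s. A loopback run will not show the timeout rates reported here, since it skips the NIC entirely.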
I've written a C++ solution that uses N threads, each calling epoll(). Targets are assigned to threads and then the sockets stay local to the thread for the duration of the transaction. On the same host a single thread can do 20k conns/second with a 0.12% timeout rate at a 200ms deadline. 6 threads with 10k conn/s each produce <2% timeouts @ 200ms, and with 16 threads at 10k each, <2% of requests exceed 200ms and <0.5% exceed 300ms.
I believe the Go solution suffers from at least two issues:
i) net.Dial() is fairly expensive, both in terms of allocations & syscalls.
ii) syscalls cause the goroutine to be rescheduled, bouncing the work for a single socket across CPU cores and hurting locality. Correct me if I'm wrong here, but from my reading that's what's occurring.
I've tried a number of workarounds:
Does anyone have suggestions on speeding this up? I'd prefer to keep this component written in Go, but I'm running out of options to meet the performance & efficiency targets.
Results from https://github.com/prashantv/go-bench/blob/master/dial_test.go:
BenchmarkDial/dialer.DialContext-8 1000 1344 B/op 28 allocs/op
BenchmarkDial/net.Dial-8 3000 863 B/op 20 allocs/op
BenchmarkDial/net.DialTCP-8 2000 638 B/op 15 allocs/op
BenchmarkDial/net.DialTimeout-8 2000 1344 B/op 28 allocs/op
BenchmarkDial/net.dialTCP-8 1000 1120 B/op 23 allocs/op
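A rough way to compare per-dial allocation counts without the linked benchmark file is sketched below. It is not the code from that repository: it uses testing.AllocsPerRun against a loopback listener, and the helper names (startSink, plainDial, ctxDial, dialAllocs) and the run count of 200 are assumptions for illustration. Because the listener runs in the same process, the counts also include accept-side allocations, so only the difference between the two variants is meaningful.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"testing"
	"time"
)

// startSink is a hypothetical local target: it accepts connections on a
// loopback port and closes them immediately.
func startSink() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			c.Close()
		}
	}()
	return ln, nil
}

// plainDial returns a function that dials addr with net.Dial (no timeout).
func plainDial(addr string) func() (net.Conn, error) {
	return func() (net.Conn, error) {
		return net.Dial("tcp", addr)
	}
}

// ctxDial returns a function that dials addr through Dialer.DialContext
// with a context that expires after timeout.
func ctxDial(addr string, timeout time.Duration) func() (net.Conn, error) {
	return func() (net.Conn, error) {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		var d net.Dialer
		return d.DialContext(ctx, "tcp", addr)
	}
}

// dialAllocs reports the average number of heap allocations per call of
// dial (plus the immediate Close), as measured by testing.AllocsPerRun.
func dialAllocs(runs int, dial func() (net.Conn, error)) float64 {
	return testing.AllocsPerRun(runs, func() {
		c, err := dial()
		if err != nil {
			panic(err)
		}
		c.Close()
	})
}

func main() {
	ln, err := startSink()
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer ln.Close()
	addr := ln.Addr().String()

	fmt.Printf("net.Dial:           %.0f allocs/dial\n", dialAllocs(200, plainDial(addr)))
	fmt.Printf("Dialer.DialContext: %.0f allocs/dial\n", dialAllocs(200, ctxDial(addr, 400*time.Millisecond)))
}
```

The absolute numbers will differ from the table above depending on the Go version and on the accept-side noise, but the gap between the plain and context-with-timeout variants should point in the same direction.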