@bradfitz asked me to file a bug as a continuation of https://groups.google.com/forum/#!topic/golang-nuts/52gePwVq2sc
The numbers below are all from a 24-core Intel machine, running Linux with Go 1.7.3, cross compiled from OS X. The machine has a multi-queue nic with RSS enabled.
Attempt ~100k TCP connections per second to remote hosts (same local network) using DialContext().
At least 10k TCP connections/second per core, scaling close to linearly with cores, ideally with error rates close to those seen in C.
GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second.
Full email below.
My problem domain is such that I need to make a large number of TCP connections from a small set of hosts to many other hosts (targets) on a local network. The connections are short-lived, usually <200ms, and transfer <100 bytes in each direction. I need to do about 100k connections/second per source host.
The numbers below are all from a 24-core Intel machine, running Linux with Go 1.7.3, cross compiled from OS X. The machine has a multi-queue nic with RSS enabled. The targets are multiple machines running Go servers listening on 200 ports each (to avoid 5-tuple exhaustion).
My Go code spawns N goroutines, each of which calls net.Dial(), performs the transaction and then sleeps for 1s.
With this approach, setting GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second. Removing the context timeout passed to Dial() improves performance to the point where GOMAXPROCS=1 can do 10k conns/second at a 1% timeout rate with a 200ms deadline.
I've written a C++ solution that uses N threads, each calling epoll(). Targets are assigned to threads, and the sockets then stay local to their thread for the duration of the transaction. On the same host a single thread can do 20k conns/second with a 0.12% timeout rate at a 200ms deadline. 6 threads at 10k conns/second each produce <2% timeouts at 200ms, and with 16 threads at 10k each, <2% of requests exceed 200ms and <0.5% exceed 300ms.
I believe the Go solution suffers from at least two issues:
i) net.Dial() is fairly expensive, both in terms of allocations & syscalls. 
ii) syscalls cause the goroutine to be rescheduled, bouncing the work for a single socket across CPU cores and hurting locality. Correct me if I'm wrong here, but from my reading that's what's occurring.
I've tried a number of workarounds:
Does anyone have suggestions on speeding this up? I'd prefer to keep this component written in Go, but I'm running out of options to meet the performance & efficiency targets.
Results from https://github.com/prashantv/go-bench/blob/master/dial_test.go
BenchmarkDial/dialer.DialContext-8 1000 1344 B/op 28 allocs/op
BenchmarkDial/net.Dial-8 3000 863 B/op 20 allocs/op
BenchmarkDial/net.DialTCP-8 2000 638 B/op 15 allocs/op
BenchmarkDial/net.DialTimeout-8 2000 1344 B/op 28 allocs/op
BenchmarkDial/net.dialTCP-8 1000 1120 B/op 23 allocs/op
/cc @dvyukov @aclements @mikioh @mdempsky @ianlancetaylor
One way to evaluate the cost of unnecessary syscall rescheduling would be to replace some of the syscalls in the Dial path with syscall.RawSyscall calls. (I wouldn't do this for connect, and any others that could obviously block.)