At least 10k TCP connections/second per core, scaling close to linearly with cores. Ideally with error rates close to those seen in C.
What did you see instead?
GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second.
Full email below.
My problem domain is such that I need to make a large number of TCP connections from a small set of hosts to many other hosts (targets) on a local network. The connections are short lived, usually <200ms, and transfer <100 bytes in each direction. I need to do about 100k connections/second per source host.
The numbers below are all from a 24-core Intel machine running Linux, with Go 1.7.3 cross compiled from OS X. The machine has a multi-queue NIC with RSS enabled. The targets are multiple machines running Go servers listening on 200 ports each (to avoid 5-tuple exhaustion).
My Go code spawns N goroutines, each of which calls net.Dial(), performs the transaction and then sleeps for 1s.
With this approach, setting GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second. Removing the context timeout passed to Dial() improves performance to the point where GOMAXPROCS=1 can do 10k conns/second at a 1% timeout rate with a 200ms deadline.
I've written a C++ solution that uses N threads, each calling epoll_wait(). Targets are assigned to threads, and each socket stays local to its thread for the duration of the transaction. On the same host a single thread can do 20k conns/second with a 0.12% timeout rate at a 200ms deadline. 6 threads at 10k conns/second each produce <2% timeouts at 200ms; with 16 threads at 10k conns/second each, <2% of requests exceed 200ms and <0.5% exceed 300ms.
I believe the Go solution suffers from at least two issues:
i) net.Dial() is fairly expensive, both in terms of allocations and syscalls.
ii) syscalls cause the goroutine to be rescheduled, bouncing the work for a single socket across CPU cores and hurting locality. Correct me if I'm wrong here, but from my reading that's what's occurring.
I've tried a number of workarounds:
i) Using net.DialTCP(): at GOMAXPROCS=4 and 40k conns/second, all requests complete in <200ms. That's an improvement, but it doesn't allow me to provide a timeout.
ii) Exposing net.tcpDial() directly: this gives 5% timeouts at 200ms with GOMAXPROCS=4 at 40k conns/second. Setting GOMAXPROCS=24 produces a 0% timeout rate and can scale up to 80k conns/second before timeouts start appearing (1% at 100k conns/second). This is the best option I've found so far, but it requires use of an internal API.
iii) Using syscall.Socket() directly: the problem here is receiving notification when the socket is writable (connected). There doesn't appear to be a way to hook into the netpoller. I wrote a solution using syscall.EPoll() directly, but that had even worse performance than the native Go solution.
Does anyone have suggestions on speeding this up? I'd prefer to keep this component written in Go, but I'm running out of options to meet the performance and efficiency targets.
One way to evaluate the cost of unnecessary syscall rescheduling would be to replace some of the syscalls in the Dial path with syscall.RawSyscall calls. (I wouldn't do this for connect, and any others that could obviously block.)