net: poor performance of Dial & friends #18601

Open
nomis52 opened this Issue Jan 10, 2017 · 2 comments

nomis52 commented Jan 10, 2017

@bradfitz asked me to file a bug as a continuation of https://groups.google.com/forum/#!topic/golang-nuts/52gePwVq2sc

What version of Go are you using (go version)?

1.7.3

What operating system and processor architecture are you using (go env)?

The numbers below are all from a 24-core Intel machine, running Linux with Go 1.7.3, cross compiled from OS X. The machine has a multi-queue nic with RSS enabled.

What did you do?

Attempt ~100k TCP connections per second to remote hosts (same local network) using DialContext().

https://gist.github.com/nomis52/7b8405644132a09d2e8f9b8f769297cb

What did you expect to see?

At least 10k TCP connections/second per core, scaling close to linearly with cores. Ideally, with error rates close to those seen in C++.

What did you see instead?

GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second.

Full email below.


My problem domain is such that I need to make a large number of TCP connections from a small set of hosts to many other hosts (targets) on a local network. The connections are short-lived, usually <200ms, and transfer <100 bytes in each direction. I need to do about 100k connections/second per source host.

The numbers below are all from a 24-core Intel machine, running Linux with Go 1.7.3, cross compiled from OS X. The machine has a multi-queue nic with RSS enabled. The targets are multiple machines running Go servers listening on 200 ports each (to avoid 5-tuple exhaustion).

My Go code [1] spawns N goroutines, each of which calls net.Dial(), performs the transaction and then sleeps for 1s.

With this approach, GOMAXPROCS=1 can't sustain 10k conns/second without triggering connection timeouts at a 400ms deadline. Similarly, GOMAXPROCS=24 can't sustain 100k conns/second. Removing the context timeout passed to Dial() improves performance to the point where GOMAXPROCS=1 can do 10k conns/second at a 1% timeout rate with a 200ms deadline.

I've written a C++ solution that uses N threads, each calling epoll(). Targets are assigned to threads and then the sockets stay local to the thread for the duration of the transaction. On the same host a single thread can do 20k conns/second with a 0.12% timeout rate at a 200ms deadline. 6 threads at 10k conns/second each produce <2% timeouts @ 200ms, and with 16 threads at 10k each, <2% of requests exceed 200ms and <0.5% exceed 300ms.

I believe the Go solution suffers from at least two issues:

i) net.Dial() is fairly expensive, both in terms of allocations & syscalls. [2]
ii) syscalls cause the goroutine to be rescheduled, bouncing the work for a single socket across CPU cores and hurting locality. Correct me if I'm wrong here, but from my reading that's what's occurring.

I've tried a number of workarounds:

  • Using net.DialTCP(): at GOMAXPROCS=4 and 40k conns/second, all requests complete in <200ms. That's an improvement, but it doesn't allow me to provide a timeout.
  • Exposing net.tcpDial() directly: gives 5% timeouts @ 200ms with GOMAXPROCS=4 at 40k conns/second. Setting GOMAXPROCS=24 produces a 0% timeout rate and can scale up to 80k conns/second before timeouts start appearing (1% @ 100k conns/second). This is the best option I've found so far, but it requires use of an internal API.
  • Using syscall.Socket() directly: the problem here is receiving notification when the socket is writable (connected). There doesn't appear to be a way to hook into the netpoller. I wrote a solution using the syscall.Epoll* functions directly, but that had even worse performance than the native Go solution.

Does anyone have suggestions on speeding this up? I'd prefer to keep this component written in Go, but I'm running out of options to meet the performance & efficiency targets.

[1] https://gist.github.com/nomis52/7b8405644132a09d2e8f9b8f769297cb
[2] Results from https://github.com/prashantv/go-bench/blob/master/dial_test.go

BenchmarkDial/dialer.DialContext-8 1000 1344 B/op 28 allocs/op
BenchmarkDial/net.Dial-8 3000 863 B/op 20 allocs/op
BenchmarkDial/net.DialTCP-8 2000 638 B/op 15 allocs/op
BenchmarkDial/net.DialTimeout-8 2000 1344 B/op 28 allocs/op
BenchmarkDial/net.dialTCP-8 1000 1120 B/op 23 allocs/op

@bradfitz bradfitz added this to the Go1.9Maybe milestone Jan 10, 2017
@bradfitz bradfitz added the HelpWanted label Jan 10, 2017
@bradfitz bradfitz changed the title from Poor performance of net.Dial() & friends to net: poor performance of Dial & friends Jan 10, 2017
@philhofer
Contributor

One way to evaluate the cost of unnecessary syscall rescheduling would be to replace some of the syscalls in the Dial path with syscall.RawSyscall calls. (I wouldn't do this for connect, and any others that could obviously block.)
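As a small illustration of the difference between the two entry points (not a patch to the net package), the sketch below issues the same cheap, non-blocking syscall through syscall.Syscall and syscall.RawSyscall and times both. It assumes Linux, uses getpid only because it can never block, and the helper names are invented. syscall.Syscall notifies the scheduler on entry and exit so the P can be handed off if the call blocks; RawSyscall skips that bookkeeping, which is why it is only safe for calls that return promptly.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// getpidSyscall goes through the scheduler-aware syscall path.
func getpidSyscall() int {
	r, _, _ := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
	return int(r)
}

// getpidRaw bypasses the scheduler hooks; only safe for non-blocking calls.
func getpidRaw() int {
	r, _, _ := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)
	return int(r)
}

// timeIt runs f n times and returns the elapsed wall-clock time.
func timeIt(n int, f func() int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		f()
	}
	return time.Since(start)
}

func main() {
	const n = 1000000
	fmt.Println("pids agree:", getpidSyscall() == os.Getpid() && getpidRaw() == os.Getpid())
	fmt.Println("Syscall:   ", timeIt(n, getpidSyscall))
	fmt.Println("RawSyscall:", timeIt(n, getpidRaw))
}
```

A single-goroutine microbenchmark like this only shows the fixed per-call overhead; it does not capture the cross-core rescheduling effect described in the report, which would need a loaded, multi-P test.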
