Summary: An idiomatic serve loop that reads from a file (as opposed to a socket) pauses for 40us before the request goroutine starts. We isolated this from a FUSE filesystem into a small benchmark.
We compared it against a no-concurrency serial server:
We measured the overhead (from the end of accept() to the beginning of process()). Median of 100k runs for each strategy:
For context, we've found the rest of our logic can run in ~10us, so a 40us Go scheduling delay would add 400% overhead.
We tried a channel:
Which is comparable to a goroutine:
Most servers read/accept from a socket, which the Go runtime implements using polling via the netpoller. FUSE reads from a file, which uses a blocking syscall. AIUI, the Go runtime will let the thread make the syscall, and then another thread of the runtime will notice it's blocked after ~20 us, and then reschedule work. These numbers seem similar which makes me think they're related.
Our workaround: service one request (both accept and process) on one goroutine and hand off the accept of the next request to a new goroutine. Strategy "handoff" looks like:
I'm seeing this on Go version 1.6 darwin/amd64
Is this a known issue? Does it affect other platforms? Is our rewriting of the serve loop a known workaround?
Our full benchmark: http://play.golang.org/p/wc6hPnN778
Build it into a binary named fileping. To run:
(Modify the -strategy argument in the first command to try different strategies)
The measurement ignores the time a connection spends in the kernel queue before Go picks it up. If you measure request latency from the client, I doubt the handoff-based solution helps much: it just moves the latency from measured time to unmeasured time, so the measured latency might appear lower while the latency is actually still there. Optimizing the delay before a new goroutine starts running will have a negative impact on system throughput, so we must find a balance here. As you've discovered, the obvious solution is for the runtime to immediately preempt the current goroutine and execute the newly created one, but then the accept goroutine must migrate to another thread, and that will hurt future connections.
The benchmark that I showed has two servers/clients talking to each other. With strategy goroutine, the program takes 19s. With strategy handoff it takes 1.9s, so there is an improvement in the RTT.
Would it help if I made a benchmark that captured the latency from the client side?