server.go: use worker goroutines for fewer stack allocations #3204
Per offline discussions:
Turns out to be insignificant. Also, since we're now allowing the user to set the number of stream workers, enforcing it to be an exponent of two is probably not a good idea. Switched to modulo.
I just realised that there's a slight issue with using stream ID -- it's biased towards lower numbers. If a server receives one unary request from several clients (a common workload), they'll all go to the same channel because the first stream ID is always 0.
Switched to a round-robin method that's actually fairer and 1% faster (~compared to master, round robin is a 5.33% improvement while stream ID is a ~4.54% improvement).
See previous comment.
Currently (go1.13.4), the default stack size for newly spawned goroutines is 2048 bytes. This is insufficient when processing gRPC requests as the we often require more than 4 KiB stacks. This causes the Go runtime to call runtime.morestack at least twice per RPC, which causes performance to suffer needlessly as stack reallocations require all sorts of internal work such as changing pointers to point to new addresses. Since this stack growth is guaranteed to happen at least twice per RPC, reusing goroutines gives us two wins: 1. The stack is already grown to 8 KiB after the first RPC, so subsequent RPCs do not call runtime.morestack. 2. We eliminate the need to spawn a new goroutine for each request (even though they're relatively inexpensive). Performance improves across the board. The improvement is especially visible in small, unary requests as the overhead of stack reallocation is higher, percentage-wise. QPS is up anywhere between 3% and 5% depending on the number of concurrent RPC requests in flight. Latency is down ~3%. There is even a 1% decrease in memory footprint in some cases, though that is an unintended, but happy coincidence. unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_8-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false Title Before After Percentage TotalOps 2613512 2701705 3.37% SendOps 0 0 NaN% RecvOps 0 0 NaN% Bytes/op 8657.00 8654.17 -0.03% Allocs/op 173.37 173.28 0.00% ReqT/op 348468.27 360227.33 3.37% RespT/op 348468.27 360227.33 3.37% 50th-Lat 174.601µs 167.378µs -4.14% 90th-Lat 233.132µs 229.087µs -1.74% 99th-Lat 438.98µs 441.857µs 0.66% Avg-Lat 183.263µs 177.26µs -3.28%