runtime: idle mark workers run when there's work to do #16528
Currently, idle mark workers always run for a whole Go scheduler quantum (10ms), even when more work arrives in the middle of that quantum (e.g., newly runnable goroutines or incoming network packets). This is really bad for latency-sensitive systems.
The effect of this is particularly noticeable in RPC servers with very short handlers, such as in issue #16432. In these systems, all of the Ps have frequent, short idle periods. With the current idle worker behavior, these idle periods immediately trigger the idle worker, putting that P out of commission for 10ms. This cascades through the Ps until it reaches the point where some set of Ps is completely and continuously saturated.
We should fix idle workers to really have idle priority, returning control to the scheduler as soon as possible if there is other work the P could be doing.
Ideally, the idle worker would be able to check a simple flag on each iteration through the drain loop.
A simpler solution would be to only scan one object or root job at a time and then return to the scheduler. This is essentially how the background sweep currently achieves an "idle priority". This would have fairly high overhead for the idle worker, but it would accomplish the desired effect. We could batch objects into, say, units of 10k bytes, which would limit the idle worker to ~10 microseconds, while somewhat amortizing the cost of returning to the scheduler. I have a version of this prototyped (though with a 1ms time-based bound) at https://go-review.googlesource.com/c/24706/.
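To make the yield points concrete, here's a rough sketch of what such a drain loop could look like. This is not the actual runtime code and all of the names (`grayQueue`, `otherWork`, `drainIdle`, `idleScanBudget`) are invented; it just combines the per-iteration flag check with the ~10k-byte batch bound:

```go
// Invented sketch, not the real runtime: an idle-priority drain loop that
// yields the P back to the scheduler as soon as other work shows up, and in
// any case after scanning roughly 10k bytes.
package sketch

import "sync/atomic"

const idleScanBudget = 10 << 10 // ~10k bytes of scanning per call

// grayQueue stands in for the per-P mark work buffer; values are object scan sizes.
type grayQueue struct{ objSizes []int }

func (q *grayQueue) tryGet() (int, bool) {
	if len(q.objSizes) == 0 {
		return 0, false
	}
	size := q.objSizes[len(q.objSizes)-1]
	q.objSizes = q.objSizes[:len(q.objSizes)-1]
	return size, true
}

// otherWork would be set by the scheduler when runnable goroutines or
// network events arrive for this P.
var otherWork atomic.Bool

// drainIdle returns true if mark work ran out, false if it yielded because
// the P has something better to do or the byte budget was used up.
func drainIdle(q *grayQueue) bool {
	scanned := 0
	for {
		// Ideal case: a cheap flag check on every iteration.
		if otherWork.Load() {
			return false
		}
		size, ok := q.tryGet()
		if !ok {
			return true // no more gray objects
		}
		scanned += size // the actual object scan would happen here
		// Simpler fallback: cap each batch at ~10k bytes (~10µs of work)
		// so the scheduler gets control back quickly regardless.
		if scanned >= idleScanBudget {
			return false
		}
	}
}
```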
Another possible solution (from #16432 (comment)) would be to always keep one P "truly idle". If that P finds itself with work, it would signal some idle worker to exit immediately.
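A rough sketch of that coordination, again with invented names and building on the `drainIdle` sketch above: the truly idle P sets a flag when it finds work, and idle workers poll it at the top of their drain loop.

```go
// Invented sketch of the "one truly idle P" idea. The designated idle P runs
// no mark work; when it picks up something runnable, it flips a flag that the
// idle workers' drain loops poll, so one of them exits immediately and hands
// its P back to the scheduler.
var stopIdleWorker atomic.Bool

// Called from the truly idle P when it finds work to do.
func trulyIdlePFoundWork() {
	stopIdleWorker.Store(true)
}

// Polled by idle workers each drain iteration; Swap ensures only one worker
// exits per signal.
func idleWorkerShouldExit() bool {
	return stopIdleWorker.Swap(false)
}
```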
I explored this problem more today. The rpc benchmark from gcbench shows this off particularly well. This benchmark fires up trivial client and server processes, which communicate over a set of TCP sockets. The client uses open-loop control to send requests according to a Poisson process, which models the behavior of real RPC systems.
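For reference, the open-loop part of such a client looks roughly like the sketch below. This is not the actual gcbench code; the target rate and the send function are placeholders.

```go
// Simplified sketch of an open-loop Poisson client: inter-arrival times are
// exponentially distributed and each request is fired on schedule regardless
// of whether earlier responses have returned, so a slow server cannot
// throttle the offered load (unlike a closed-loop benchmark).
package main

import (
	"math/rand"
	"time"
)

func openLoopClient(targetQPS float64, send func()) {
	for {
		// Exponential inter-arrival time with mean 1/targetQPS seconds.
		wait := time.Duration(rand.ExpFloat64() / targetQPS * float64(time.Second))
		time.Sleep(wait)
		go send() // issue the request without waiting for previous ones
	}
}

func main() {
	openLoopClient(10000, func() {
		// Placeholder for one RPC round trip; latency would be recorded here.
		time.Sleep(time.Millisecond)
	})
}
```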
This is what the execution trace looks like during a GC:
G18 and G19 are the background workers. You can see that they always run for a full quantum, and often run simultaneously, blocking both Ps. Also, even though there's only a little overall idle time, together they're clearly taking up much more than 25% of the total CPU. In the goroutines row, you can see the system regularly falling behind throughout the GC cycle, and in the network row, you can see it mostly falls back to the periodic sysmon network poll.
In this second trace, G8 and G9 are the background workers. There are still large regions where they're running the fractional worker, but there's only one at a time, and they add up to almost exactly 25% of the CPU. In the goroutines and network rows, you can see that the system still falls behind at the beginning of GC and switches to the sysmon network poll. However, about 80ms into the cycle, it catches up and things run quite smoothly for the rest of the cycle: the goroutine queue stays short, and network events are coming in rapidly.
There are a few extra colors in this trace. In the lower half of each proc row, red means "syscall" (which means user code in this case), teal means "assist", and blue means "idle worker". At the beginning of GC, when the system is falling behind, there are a lot of assists. In fact, this phase lasts about 70ms, which is right around the P99.9 and max latencies of this run. Eventually, however, the background workers get ahead, the assists stop, the runnable goroutine queue drains, we start having idle time again, and we start seeing the idle worker kicking in for very brief periods.
So I ran it with a few different versions of Go, something like:
So each run does 5 million queries total in the benchmark, and then I checked how many of those queries took more than 30ms.
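To be concrete, the stall count is just an over-threshold count, roughly like this (illustrative sketch, not the benchmark's actual code):

```go
// Illustrative only: count how many requests exceeded the 30ms threshold.
package bench

import "time"

const stallThreshold = 30 * time.Millisecond

func countStalls(latencies []time.Duration) int {
	stalls := 0
	for _, l := range latencies {
		if l > stallThreshold {
			stalls++
		}
	}
	return stalls
}
```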
So overall it looks like there are actually more requests that are slowed down or stuck than there were before.
Note that this experiment is running on a pretty beefy machine, so there is almost no CPU utilization, since it is only doing 10k QPS. Also, as before, the GC lines look something like:
So there is almost no STW GC time; it is all in the concurrent phase.
I can also try to augment the benchmark to get the perf traces too if that would be helpful.
Actually, it looks like I made the mistake of not properly rebuilding things with Go master, so I have rebuilt it properly and things do look better with the patch compared to Go 1.7:
Finishing Up. GO Version: devel +8ddf8d4 Sun Oct 30 20:27:14 2016 -0400,
Finishing Up. GO Version: devel +d70b0fe Fri Oct 28 18:07:30 2016 +0000,
Finishing Up. GO Version: go1.7.1, Total Queries: 2000000, STALLS: 6952
So there are almost 3 times fewer stalls with +8ddf8d4 compared to Go 1.7. But it looks like at
One more data point: I ran a more real/complex application experiment with 8ddf8d4. There is definitely improvement, but having lots of goroutines is still quite problematic.
The p75 times are great; however, because of huge p95/p99 times, even the average latency is at ~5ms.
This real application also has a similar GC situation, where actual STW times are <1-2ms. The concurrent part of GC takes 100-150ms, and it still causes massive p95/p99 request spikes.
Some more numbers from a scale perspective:
When the same experiment is done with 6k total connections but the exact same QPS (i.e., fewer clients that are each more active), performance is significantly better, with p99 times in the <=5ms range, because the concurrent GC time is much smaller.
@zviadm, thanks for the data. Hopefully the commit I just pushed helps, but it sounds like there are other latency issues affecting your benchmark. This isn't too surprising; we're aware there are other issues. I also have an RPC microbenchmark that shows similar tail latency behavior to yours. I don't completely understand what's going on in my benchmark, but I've analyzed it enough to determine that it's something subtler than the GC just taking away long time slices.
You might want to follow #14812, though it's possible that issue has been fixed (waiting on confirmation). If it has, I'll open up a new tracking bug.