Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: idle mark workers run when there's work to do #16528

aclements opened this issue Jul 28, 2016 · 9 comments

runtime: idle mark workers run when there's work to do #16528

aclements opened this issue Jul 28, 2016 · 9 comments
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.


Copy link

@aclements aclements commented Jul 28, 2016

Currently idle mark workers always run for a whole Go scheduler quantum (10ms), even when more work comes in in the middle of that quantum (e.g., more runnable goroutines or incoming network packets). This is really bad for latency-sensitive systems.

The effect of this is particularly noticeable in RPC servers with very short handlers, such as in issue #16432. In these systems, all of the Ps have frequent, short idle periods. With the current idle worker behavior, these idle periods immediately trigger the idle worker, putting that P out of commission for 10ms. This cascades through the Ps until it reaches the point where some set of Ps is completely and continuously saturated.

We should fix idle workers to really have idle priority, returning control to the scheduler as soon as possible if there is other work the P could be doing.

Ideally, the idle worker would be able to check a simple flag on each iteration through the drain loop.

A simpler solution would be to only scan one object or root job at a time and then return to the scheduler. This is essentially how the background sweep currently achieves an "idle priority". This would have fairly high overhead for the idle worker, but it would accomplish the desired effect. We could batch objects into, say, units of 10k bytes, which would limit the idle worker to ~10 microseconds, while somewhat amortizing the cost of returning to the scheduler. I have a version of this prototyped (though with a 1ms time-based bound) at

Another possible solution (from #16432 (comment)) would be to always keep one P "truly idle". If that P finds itself with work, it would signal some idle worker to exit immediately.

/cc @RLH @rhysh

Copy link

@genez genez commented Sep 29, 2016

I can apply the patch at to an updated master source and test it.
Could it be useful for you?

Copy link
Member Author

@aclements aclements commented Sep 29, 2016

@genez, thanks for offering. There's no harm in testing with that patch, but I'm not sure it would be useful.

Copy link
Member Author

@aclements aclements commented Sep 29, 2016

I explored this problem more today. The rpc benchmark from gcbench shows this off particularly well. This benchmark fires up trivial client and server processes, which communicate over a set of TCP sockets. The client uses open-loop control to send requests according to a Poisson process, which models the behavior of real RPC systems.

This is what the execution trace looks like during a GC:

screenshot from 2016-09-29 14 59 07

G18 and G19 are the background workers. You can see that they always run for a full quantum, and often run simultaneously, blocking both Ps. Also, even though there's only a little overall idle time, together they're clearly taking up much more than 25% of the total CPU. In the goroutines row, you can see the system regularly falling behind throughout the GC cycle, and in the network row, you can see it mostly falls back to the periodic sysmon network poll.

CL 30031 is an experiment that changes the idle worker to constantly poll for either run queue work or network wake-ups. With this CL, the trace looks very different (trace file):

screenshot from 2016-09-29 15 06 42

In this trace, G8 and G9 are the background workers. There are still large regions where they're running the fractional worker, but there's only one at a time, and they add up to almost exactly 25% of the CPU. In the goroutines and network rows, you can see that it still falls behind at the beginning of GC and switches to the sysmon network poll. However, about 80ms in to the cycle, it catches up and things run quite smoothly for the rest of the cycle: the goroutine queue stays short, and network events are coming in rapidly.

There are a few extra colors in this trace. In the lower half of each proc row, red means "syscall" (which means user code in this case), teal means "assist", and blue means "idle worker". At the beginning of GC, when the system is falling behind, there are a lot of assists. In fact, this phase lasts about 70ms, which is right around the P99.9 and max latencies of this run. Eventually, however, the background workers get ahead, the assists stop, the runnable goroutine queue drains, we start having idle time again, and we start seeing the idle worker kicking in for very brief periods.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Sep 30, 2016
@rsc rsc modified the milestones: Go1.8, Go1.8Early Oct 20, 2016
Copy link

@gopherbot gopherbot commented Oct 31, 2016

CL mentions this issue.

Copy link

@zviadm zviadm commented Nov 2, 2016

@aclements just as an additional data point, I got around to running the small micro benchmark that I wrote for: #17341

So I ran it with few different versions of GO, something like:

GODEBUG=gctrace=1 ./net_gctest_GO17 -port=5001 -nroutines=30000 -sleep_t=3s -max_latency=30ms -total_qps=5000000 &> gcbench_5mln_go17.txt

Go: d70b0fe...
GODEBUG=gctrace=1 ./net_gctest_GOd70b0fe -port=5001 -nroutines=30000 -sleep_t=3s -max_latency=30ms -total_qps=5000000 &> gcbench_5mln_god70b0fe.txt

Go: 8ddf8d4ea...
GODEBUG=gctrace=1 ./net_gctest_GO8ddf8d4ea -port=5001 -nroutines=30000 -sleep_t=3s -max_latency=30ms -total_qps=5000000 &> gcbench_5mln_go8ddf8d4ea.txt

So each one does 5mln QPS total in the benchmark, and then I checked how many out of these 5mln QPS took more than 30ms.

cat gcbench_5mln_go17.txt | grep "STALL" | wc -l

cat gcbench_5mln_god70b0fe.txt | grep "STALL" | wc -l

cat gcbench_5mln_go8ddf8d4ea.txt | grep "STALL" | wc -l

So overall it looks like, there are actually more requests that are slowed down/stuck than there were before.

Note that this experiment is running on a pretty beefy machine in general, so there is almost no CPU utilization, since it is only doing 10k QPS. Also as before the GC lines look something like:

gc 31 @314.612s 0%: 0.35+174+1.1 ms clock, 10+590/1383/3164+35 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 32 @330.130s 0%: 0.32+174+1.4 ms clock, 10+595/1358/3004+45 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P

So there is almost no STW gc time, it is all in the concurrent phase.

I can also try to augment the benchmark to to get the perf traces too if that will be helpful.

Copy link

@zviadm zviadm commented Nov 3, 2016

Actually looks like I made a mistake of not properly rebuilding things with GO master, so I have rebuilt it properly and thinks do look better with the patch compared to Go1.7:

Finishing Up. GO Version: devel +8ddf8d4 Sun Oct 30 20:27:14 2016 -0400,
Total Queries: 2000000, STALLS: 2301

Finishing Up. GO Version: devel +d70b0fe Fri Oct 28 18:07:30 2016 +0000,
TOTAL Queries: 2000000, STALLS: 11606

Finishing Up. GO Version: go1.7.1, Total Queries: 2000000, STALLS: 6952

So there are almost 3 times less stalls with +8ddf8d4 compared to GO 1.7. But it looks like at d70b0fe things got worse with related to stalls compared to GO 1.7.

Copy link
Member Author

@aclements aclements commented Nov 4, 2016

Thanks for the data points @zviadm. Just to clarify, do you mean commit d70b0fe specifically made the stalls worse (compared to d70b0fe^), or just that they got worse between go1.7.1 and d70b0fe?

Copy link

@zviadm zviadm commented Nov 7, 2016

I only did tests with Go1.7 and d70b0fe. So things just got worse sometime between Go 1.7.1 and d70b0fe.

One more data point, I ran a more real/complex application experiment with: 8ddf8d4. There is definitely improvement, but having lots of go routines is still quite problematic.
Here are some numbers from experiment for a RPC call (times are end to end measured from client side):

  • p75: ~0.5ms (which is great)
  • p95: ~27ms (which is quite bad for low latency service)
  • p99: ~100ms

The p75 times are great, however because of huge p95/p99 times, even average latency is at ~5ms.

This real application also has similar GC situation where actual STW times are <1-2ms. The concurrent part of GC is taking 100-150ms, and it still causes massive p95/p99 request spikes.

Some more numbers from scale perspective:
-> there were ~60k open mostly idle connections per process.
-> single process was doing ~40-45k QPS.
-> single process runs on a single NUMA socket with 16 hyperthreaded cores.

When same experiment is done with 6k total connections, but same exact QPS (i.e. less clients that are more active), performance is significantly better, with p99 times in <=5ms range, because concurrent GC time is much smaller.

Copy link
Member Author

@aclements aclements commented Nov 20, 2016

@zviadm, thanks for the data. Hopefully the commit I just pushed helps, but it sounds like there are other latency issues affecting your benchmark. This isn't too surprising; we're aware there are other issues. I also have an RPC microbenchmark that shows similar tail latency behavior to yours. I don't completely understand what's going on in my benchmark, but I've analyzed it enough to determine that it's something subtler than the GC just taking away long time slices.

You might want to follow #14812, though it's possible that issue has been fixed (waiting on confirmation). If it has, I'll open up a new tracking bug.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
None yet

No branches or pull requests

6 participants