runtime: GC turning on/off disruptive to scheduler and locality #20307

Open
josharian opened this Issue May 10, 2017 · 9 comments

josharian (Contributor) commented May 10, 2017

I generated a trace of the concurrent compiler backend, and see some unfortunate things in it.

$ go tool compile -traceprofile=t.p -c=8 fmt/{doc,format,print,scan}.go
$ go tool trace t.p

[Screenshot: compile_fmt_trace_zoomed]

On the left hand side of this screenshot is the beginning of concurrent backend compilation. It continues off-screen to the right. The disconcerting things I see are:

  • Large chunks of idle time. For example, proc 7 sits idle for 5 ms right as GC turns off, around 73 ms, with plenty of work available to be scheduled.
  • Goroutines being shuffled around between procs when GC flips on/off (the sharp vertical lines in the trace), which is bad for data locality and caches. Should the STW phase note where Gs are and attempt to put them back there when it is done? Or some such?

For reference, the concurrency (fan-out, etc.) is all contained in a single function, compileFunctions, in pgen.go: https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/pgen.go#L255
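
For those who don't want to click through, the fan-out looks roughly like this (a simplified sketch with hypothetical names; the real code is at the link above):

```go
// Simplified sketch of the compileFunctions fan-out (hypothetical
// names; see pgen.go for the real code). One goroutine feeds a
// channel of functions; GOMAXPROCS workers drain it.
package main

import (
	"runtime"
	"sync"
)

type fn struct{ name string }

func compile(f fn) {
	// stand-in for the SSA backend work done per function
}

func compileFunctions(fns []fn) {
	c := make(chan fn) // unbuffered: every send is a scheduler rendezvous
	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range c {
				compile(f)
			}
		}()
	}
	for _, f := range fns {
		c <- f
	}
	close(c)
	wg.Wait()
}

func main() {
	compileFunctions([]fn{{"a"}, {"b"}, {"c"}})
}
```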

Apologies if this is a duplicate. I can split this into two issues if that would be helpful.

cc @aclements @RLH

josharian (Contributor) commented May 10, 2017

Ah. I can fix the idle time by using a buffered channel. (I shouldn't have to, but it works.) The G/proc shuffling remains. New trace:

[Screenshot: with_buffered_chan]
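
In terms of the sketch in the first comment, the fix amounts to one line; the actual change is in the CL below:

```go
// Buffered: the producer can queue all work up front, so an idle
// worker never waits on a rendezvous with the producer (hypothetical;
// see CL 43110 below for the real change).
c := make(chan fn, len(fns))
```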

gopherbot commented May 10, 2017

CL https://golang.org/cl/43110 mentions this issue.

josharian added a commit to josharian/go that referenced this issue May 10, 2017

cmd/compile: use a buffered channel for the function queue
Updates #20307

With -c=2:

name        old time/op       new time/op       delta
Template          140ms ± 3%        139ms ± 4%  -1.06%  (p=0.003 n=50+50)
Unicode          81.1ms ± 4%       81.9ms ± 4%  +0.96%  (p=0.006 n=50+49)
GoTypes           375ms ± 3%        374ms ± 3%    ~     (p=0.094 n=48+48)
Compiler          1.69s ± 2%        1.68s ± 2%  -0.41%  (p=0.004 n=49+48)
SSA               3.05s ± 1%        3.05s ± 2%    ~     (p=0.953 n=47+49)
Flate            86.3ms ± 2%       85.9ms ± 2%  -0.49%  (p=0.011 n=49+48)
GoParser         99.5ms ± 3%       99.3ms ± 3%    ~     (p=0.394 n=48+49)
Reflect           262ms ± 3%        261ms ± 3%    ~     (p=0.354 n=47+49)
Tar              81.4ms ± 3%       79.7ms ± 4%  -1.98%  (p=0.000 n=47+50)
XML               133ms ± 3%        133ms ± 3%    ~     (p=0.992 n=50+49)
[Geo mean]        236ms             235ms       -0.36%

name        old user-time/op  new user-time/op  delta
Template          249ms ± 5%        242ms ± 7%  -2.61%  (p=0.000 n=48+50)
Unicode           111ms ± 4%        111ms ± 6%    ~     (p=0.407 n=46+47)
GoTypes           753ms ± 2%        748ms ± 3%  -0.65%  (p=0.010 n=48+50)
Compiler          3.28s ± 2%        3.27s ± 2%  -0.40%  (p=0.026 n=49+47)
SSA               7.03s ± 2%        7.01s ± 3%    ~     (p=0.154 n=45+50)
Flate             154ms ± 3%        154ms ± 3%    ~     (p=0.306 n=49+49)
GoParser          180ms ± 4%        179ms ± 4%    ~     (p=0.148 n=48+48)
Reflect           427ms ± 2%        428ms ± 3%    ~     (p=0.502 n=46+49)
Tar               142ms ± 5%        135ms ± 9%  -4.83%  (p=0.000 n=46+50)
XML               247ms ± 3%        247ms ± 4%    ~     (p=0.921 n=49+49)
[Geo mean]        426ms             422ms       -0.92%


Change-Id: I4746234439ddb9a7e5840fc783b8857da6a4a680

gopherbot pushed a commit that referenced this issue May 10, 2017

cmd/compile: use a buffered channel for the function queue
Updates #20307

Change-Id: I4746234439ddb9a7e5840fc783b8857da6a4a680
Reviewed-on: https://go-review.googlesource.com/43110
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
RLH (Contributor) commented May 10, 2017

josharian (Contributor) commented May 10, 2017

It's running on my laptop (8-core 2.9 GHz Intel Core i7, macOS), though presumably this happens on all hardware, not just mine.

If it's easy to hack together a patch (unsightly or not) that experiments with remembering G locations, I'm happy to benchmark it. The concurrent compiler seems like a good test case for it: NumCPU long-running, memory-heavy calculations.

> That said the GC tracing is likely to do far more to smash the cache and TLB than moving goroutines between cores on the same chip.

Perhaps so. Seems worth an experiment, if feasible.

griesemer (Contributor) commented May 10, 2017

@josharian Do you know why the buffered channel makes such a big difference? Is it because a lot of time is wasted (for reasons not yet understood) on context switching in the unbuffered case, and less context switching is needed with a buffered channel?

josharian (Contributor) commented May 10, 2017

@griesemer I think that's exactly it: those gaps in the first trace are (I suspect) caused by waiting for the scheduler to coordinate handing over work to a new worker, which requires scheduling both the producer and the consumer. The buffered channel means that (to a first approximation) there's always work available for a worker without involving the scheduler, thus no gaps.
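
A toy benchmark, not from this thread, that isolates the rendezvous cost described above (save as handoff_test.go in a package `ch` and run `go test -bench=Handoff`):

```go
// Contrast unbuffered and buffered channel handoff. With capacity 0,
// every send blocks until the consumer goroutine is scheduled to
// receive; with a large capacity, most sends complete without waiting
// for the consumer at all.
package ch

import "testing"

func benchHandoff(b *testing.B, capacity int) {
	c := make(chan int, capacity)
	done := make(chan struct{})
	go func() {
		for range c { // single consumer draining the channel
		}
		close(done)
	}()
	for i := 0; i < b.N; i++ {
		c <- i
	}
	close(c)
	<-done
}

func BenchmarkHandoffUnbuffered(b *testing.B) { benchHandoff(b, 0) }
func BenchmarkHandoffBuffered(b *testing.B)   { benchHandoff(b, 1024) }
```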

RLH (Contributor) commented May 10, 2017

josharian (Contributor) commented May 10, 2017

> If I'm reading the goroutine line in the trace correctly there are runnable goroutines that aren't being run during these gaps.

That's correct.

> As far as adding CPU affinity to our goroutines I'm a bit nervous.

I'm not suggesting CPU affinity. (At least, I don't think I am.) I'm suggesting that when we go into STW, we record which Ps are running which Gs, and put those Gs back on those Ps when we leave STW. That's it. But maybe that does involve breaking many layers of abstraction; if so, I understand your reluctance.
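
For concreteness, a toy model of that suggestion (all names and structures here are hypothetical stand-ins, not the runtime's real internals):

```go
// Toy model of "remember G/P placement across STW" (hypothetical
// types; the runtime's real p and g structures look nothing like this).
package main

import "fmt"

type G struct{ id int }

type P struct {
	id      int
	running *G
}

// snapshot records which G each P was running when the world stopped.
func snapshot(ps []*P) map[*P]*G {
	m := make(map[*P]*G)
	for _, p := range ps {
		if p.running != nil {
			m[p] = p.running
		}
	}
	return m
}

// restore puts each recorded G back on the P it was running on.
func restore(m map[*P]*G) {
	for p, g := range m {
		p.running = g
	}
}

func main() {
	ps := []*P{{id: 0, running: &G{id: 10}}, {id: 1, running: &G{id: 11}}}
	saved := snapshot(ps) // entering STW
	for _, p := range ps {
		p.running = nil // Gs preempted while the world is stopped
	}
	restore(saved) // leaving STW: same Gs back on the same Ps
	fmt.Println(ps[0].running.id, ps[1].running.id) // 10 11
}
```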

aclements (Member) commented May 11, 2017

> I'm suggesting that when we go into STW, we record which Ps are running which Gs, and put those Gs back on those Ps when we leave STW.

We do that already, actually. STW leaves the per-P run queues in place. However, it does kick the currently running G to the end of the P's run queue (because it uses the regular preemption path), which is probably why it doesn't look like we have affinity in the execution traces.

That might be relatively easy to hack in a fix for, but it's not going to matter much, because we definitely don't maintain P-to-M affinity over STW, so the goroutine will almost certainly be resumed on a different OS thread even if it's still on the same P. That's harder to fix.
