Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: check build reproducibility #19961

Open
josharian opened this issue Apr 13, 2017 · 3 comments

Comments

@josharian
Copy link
Contributor

commented Apr 13, 2017

It'd be nice to have the builder (and ideally also a trybot) check reproducibility of builds. This is not an good thing to add to all.bash, because it is slow and computationally expensive.

The implementation is fairly straightforward. I usually do it by running something like:

$ ./make.bash
$ toolstash save
$ for f in `seq 15`; do go build -a -toolexec='toolstash -cmp' std cmd; done

However, a simpler (if slower) approach that should also work is to run make.bash many times (n=5? 10?) and check that all the resulting .a and executable files are identical. In fact, I think all the filesystem contents should be identical.

This is about to become more important because the introduction of a concurrent compiler backend provides lots more opportunity to mess up reproducibility. Detection is currently manual (e.g. #19872).

@bradfitz @adams-sarah

@josharian

This comment has been minimized.

Copy link
Contributor Author

commented Apr 13, 2017

I'm happy to do some legwork here, but my experience from adding the vetall builder is that I'll need some help and direction.

@josharian josharian added the Builders label Apr 13, 2017

@gopherbot

This comment has been minimized.

Copy link

commented Apr 13, 2017

CL https://golang.org/cl/40693 mentions this issue.

josharian added a commit to josharian/go that referenced this issue Apr 19, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in golang#18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 1-2% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         193ms ± 4%        195ms ± 5%  +0.99%  (p=0.007 n=50+50)
Unicode         82.0ms ± 3%       84.7ms ± 4%  +3.26%  (p=0.000 n=47+48)
GoTypes          539ms ± 4%        549ms ± 3%  +1.90%  (p=0.000 n=46+46)
SSA              5.92s ± 2%        5.95s ± 3%  +0.49%  (p=0.018 n=42+48)
Flate            121ms ± 4%        122ms ± 3%    ~     (p=0.783 n=48+46)
GoParser         143ms ± 6%        144ms ± 3%  +0.90%  (p=0.001 n=49+47)
Reflect          342ms ± 4%        347ms ± 3%  +1.47%  (p=0.000 n=48+49)
Tar              104ms ± 5%        105ms ± 4%  +1.06%  (p=0.003 n=47+47)
XML              197ms ± 4%        199ms ± 5%  +0.85%  (p=0.014 n=48+49)

name       old user-time/op  new user-time/op  delta
Template         238ms ±10%        241ms ±10%    ~     (p=0.066 n=50+50)
Unicode          104ms ± 4%        106ms ± 5%  +2.23%  (p=0.000 n=46+49)
GoTypes          706ms ± 3%        714ms ± 5%  +1.15%  (p=0.002 n=45+49)
SSA              8.21s ± 3%        8.27s ± 2%  +0.78%  (p=0.002 n=43+47)
Flate            144ms ± 7%        145ms ± 5%    ~     (p=0.098 n=49+49)
GoParser         175ms ± 4%        177ms ± 4%  +1.35%  (p=0.003 n=47+50)
Reflect          435ms ± 6%        433ms ± 7%    ~     (p=0.852 n=50+50)
Tar              121ms ± 6%        122ms ± 7%    ~     (p=0.143 n=48+50)
XML              240ms ± 4%        241ms ± 4%    ~     (p=0.231 n=50+49)

name       old alloc/op      new alloc/op      delta
Template        38.7MB ± 0%       38.7MB ± 0%    ~     (p=0.690 n=5+5)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.222 n=5+5)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.310 n=5+5)
SSA             1.24GB ± 0%       1.24GB ± 0%  +0.04%  (p=0.008 n=5+5)
Flate           25.2MB ± 0%       25.2MB ± 0%    ~     (p=0.841 n=5+5)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.151 n=5+5)
Reflect         77.5MB ± 0%       77.5MB ± 0%    ~     (p=0.690 n=5+5)
Tar             26.4MB ± 0%       26.4MB ± 0%    ~     (p=0.690 n=5+5)
XML             42.0MB ± 0%       42.0MB ± 0%    ~     (p=0.690 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          378k ± 0%         377k ± 1%    ~     (p=0.841 n=5+5)
Unicode           321k ± 0%         322k ± 0%    ~     (p=0.310 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.310 n=5+5)
SSA              9.67M ± 0%        9.69M ± 0%  +0.20%  (p=0.008 n=5+5)
Flate             233k ± 0%         233k ± 1%    ~     (p=1.000 n=5+5)
GoParser          315k ± 0%         314k ± 1%    ~     (p=0.151 n=5+5)
Reflect           971k ± 0%         970k ± 0%    ~     (p=0.690 n=5+5)
Tar               249k ± 0%         249k ± 0%    ~     (p=0.310 n=5+5)
XML               391k ± 0%         390k ± 0%    ~     (p=0.841 n=5+5)

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 19, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.


CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.


BACKGROUND

The compiler current consists (very roughly) of a handful of phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.


USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.


MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.Hashmu. ctxt.Hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in golang#18177 may
  provide some free speed-ups here.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* types.Pkg.Symsmu. Looking up symbols in a package
  happens a fair amount during backend compilation,
  including the construction of gcargs/gclocals symbols,
  and types, particularly via ngotype.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.


TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.


REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.


NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.


PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 3% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         194ms ± 4%        195ms ± 4%  +0.91%  (p=0.002 n=49+47)
Unicode         82.9ms ± 4%       85.2ms ± 3%  +2.68%  (p=0.000 n=47+48)
GoTypes          518ms ± 3%        527ms ± 2%  +1.81%  (p=0.000 n=46+46)
SSA              5.59s ± 2%        5.77s ± 2%  +3.12%  (p=0.000 n=48+50)
Flate            120ms ± 3%        122ms ± 3%  +1.54%  (p=0.000 n=49+48)
GoParser         140ms ± 3%        143ms ± 4%  +2.24%  (p=0.000 n=47+49)
Reflect          333ms ± 3%        342ms ± 3%  +2.67%  (p=0.000 n=47+49)
Tar              102ms ± 6%        104ms ± 4%  +2.37%  (p=0.000 n=50+47)
XML              202ms ±13%        196ms ± 4%  -2.63%  (p=0.036 n=50+48)

name       old user-time/op  new user-time/op  delta
Template         236ms ± 9%        237ms ±10%    ~     (p=0.750 n=50+50)
Unicode          104ms ± 7%        107ms ± 4%  +2.09%  (p=0.000 n=49+47)
GoTypes          691ms ± 3%        701ms ± 3%  +1.40%  (p=0.000 n=50+49)
SSA              7.91s ± 3%        8.07s ± 4%  +1.98%  (p=0.000 n=48+49)
Flate            142ms ± 4%        145ms ± 5%  +2.12%  (p=0.000 n=47+48)
GoParser         172ms ± 6%        175ms ± 5%  +1.76%  (p=0.002 n=50+49)
Reflect          425ms ± 9%        436ms ± 8%  +2.55%  (p=0.001 n=50+50)
Tar              119ms ± 6%        121ms ± 5%  +2.52%  (p=0.000 n=49+49)
XML              242ms ± 8%        239ms ± 6%  -1.54%  (p=0.039 n=49+49)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       38.8MB ± 0%    ~     (p=0.247 n=10+10)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.631 n=10+10)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.218 n=10+10)
SSA             1.25GB ± 0%       1.25GB ± 0%  +0.02%  (p=0.000 n=10+10)
Flate           25.3MB ± 0%       25.3MB ± 0%    ~     (p=0.315 n=9+10)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.089 n=10+10)
Reflect         78.2MB ± 0%       78.2MB ± 0%  +0.07%  (p=0.019 n=10+10)
Tar             26.5MB ± 0%       26.6MB ± 0%    ~     (p=0.165 n=10+10)
XML             42.4MB ± 0%       42.4MB ± 0%    ~     (p=0.497 n=9+10)

name       old allocs/op     new allocs/op     delta
Template          378k ± 1%         379k ± 1%    ~     (p=0.353 n=10+10)
Unicode           321k ± 0%         321k ± 1%    ~     (p=0.684 n=10+10)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.247 n=10+10)
SSA              9.71M ± 0%        9.72M ± 0%  +0.08%  (p=0.000 n=10+10)
Flate             234k ± 1%         234k ± 1%    ~     (p=0.280 n=10+10)
GoParser          316k ± 0%         315k ± 1%  -0.26%  (p=0.040 n=9+9)
Reflect           980k ± 0%         981k ± 0%    ~     (p=0.052 n=10+10)
Tar               249k ± 1%         250k ± 1%    ~     (p=0.190 n=10+10)
XML               391k ± 1%         391k ± 1%    ~     (p=0.829 n=8+10)


Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)


From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)


Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.


Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 20, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in golang#18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 1-2% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         193ms ± 4%        195ms ± 5%  +0.99%  (p=0.007 n=50+50)
Unicode         82.0ms ± 3%       84.7ms ± 4%  +3.26%  (p=0.000 n=47+48)
GoTypes          539ms ± 4%        549ms ± 3%  +1.90%  (p=0.000 n=46+46)
SSA              5.92s ± 2%        5.95s ± 3%  +0.49%  (p=0.018 n=42+48)
Flate            121ms ± 4%        122ms ± 3%    ~     (p=0.783 n=48+46)
GoParser         143ms ± 6%        144ms ± 3%  +0.90%  (p=0.001 n=49+47)
Reflect          342ms ± 4%        347ms ± 3%  +1.47%  (p=0.000 n=48+49)
Tar              104ms ± 5%        105ms ± 4%  +1.06%  (p=0.003 n=47+47)
XML              197ms ± 4%        199ms ± 5%  +0.85%  (p=0.014 n=48+49)

name       old user-time/op  new user-time/op  delta
Template         238ms ±10%        241ms ±10%    ~     (p=0.066 n=50+50)
Unicode          104ms ± 4%        106ms ± 5%  +2.23%  (p=0.000 n=46+49)
GoTypes          706ms ± 3%        714ms ± 5%  +1.15%  (p=0.002 n=45+49)
SSA              8.21s ± 3%        8.27s ± 2%  +0.78%  (p=0.002 n=43+47)
Flate            144ms ± 7%        145ms ± 5%    ~     (p=0.098 n=49+49)
GoParser         175ms ± 4%        177ms ± 4%  +1.35%  (p=0.003 n=47+50)
Reflect          435ms ± 6%        433ms ± 7%    ~     (p=0.852 n=50+50)
Tar              121ms ± 6%        122ms ± 7%    ~     (p=0.143 n=48+50)
XML              240ms ± 4%        241ms ± 4%    ~     (p=0.231 n=50+49)

name       old alloc/op      new alloc/op      delta
Template        38.7MB ± 0%       38.7MB ± 0%    ~     (p=0.690 n=5+5)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.222 n=5+5)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.310 n=5+5)
SSA             1.24GB ± 0%       1.24GB ± 0%  +0.04%  (p=0.008 n=5+5)
Flate           25.2MB ± 0%       25.2MB ± 0%    ~     (p=0.841 n=5+5)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.151 n=5+5)
Reflect         77.5MB ± 0%       77.5MB ± 0%    ~     (p=0.690 n=5+5)
Tar             26.4MB ± 0%       26.4MB ± 0%    ~     (p=0.690 n=5+5)
XML             42.0MB ± 0%       42.0MB ± 0%    ~     (p=0.690 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          378k ± 0%         377k ± 1%    ~     (p=0.841 n=5+5)
Unicode           321k ± 0%         322k ± 0%    ~     (p=0.310 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.310 n=5+5)
SSA              9.67M ± 0%        9.69M ± 0%  +0.20%  (p=0.008 n=5+5)
Flate             233k ± 0%         233k ± 1%    ~     (p=1.000 n=5+5)
GoParser          315k ± 0%         314k ± 1%    ~     (p=0.151 n=5+5)
Reflect           971k ± 0%         970k ± 0%    ~     (p=0.690 n=5+5)
Tar               249k ± 0%         249k ± 0%    ~     (p=0.310 n=5+5)
XML               391k ± 0%         390k ± 0%    ~     (p=0.841 n=5+5)

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 21, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in golang#18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 1-2% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         193ms ± 4%        195ms ± 5%  +0.99%  (p=0.007 n=50+50)
Unicode         82.0ms ± 3%       84.7ms ± 4%  +3.26%  (p=0.000 n=47+48)
GoTypes          539ms ± 4%        549ms ± 3%  +1.90%  (p=0.000 n=46+46)
SSA              5.92s ± 2%        5.95s ± 3%  +0.49%  (p=0.018 n=42+48)
Flate            121ms ± 4%        122ms ± 3%    ~     (p=0.783 n=48+46)
GoParser         143ms ± 6%        144ms ± 3%  +0.90%  (p=0.001 n=49+47)
Reflect          342ms ± 4%        347ms ± 3%  +1.47%  (p=0.000 n=48+49)
Tar              104ms ± 5%        105ms ± 4%  +1.06%  (p=0.003 n=47+47)
XML              197ms ± 4%        199ms ± 5%  +0.85%  (p=0.014 n=48+49)

name       old user-time/op  new user-time/op  delta
Template         238ms ±10%        241ms ±10%    ~     (p=0.066 n=50+50)
Unicode          104ms ± 4%        106ms ± 5%  +2.23%  (p=0.000 n=46+49)
GoTypes          706ms ± 3%        714ms ± 5%  +1.15%  (p=0.002 n=45+49)
SSA              8.21s ± 3%        8.27s ± 2%  +0.78%  (p=0.002 n=43+47)
Flate            144ms ± 7%        145ms ± 5%    ~     (p=0.098 n=49+49)
GoParser         175ms ± 4%        177ms ± 4%  +1.35%  (p=0.003 n=47+50)
Reflect          435ms ± 6%        433ms ± 7%    ~     (p=0.852 n=50+50)
Tar              121ms ± 6%        122ms ± 7%    ~     (p=0.143 n=48+50)
XML              240ms ± 4%        241ms ± 4%    ~     (p=0.231 n=50+49)

name       old alloc/op      new alloc/op      delta
Template        38.7MB ± 0%       38.7MB ± 0%    ~     (p=0.690 n=5+5)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.222 n=5+5)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.310 n=5+5)
SSA             1.24GB ± 0%       1.24GB ± 0%  +0.04%  (p=0.008 n=5+5)
Flate           25.2MB ± 0%       25.2MB ± 0%    ~     (p=0.841 n=5+5)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.151 n=5+5)
Reflect         77.5MB ± 0%       77.5MB ± 0%    ~     (p=0.690 n=5+5)
Tar             26.4MB ± 0%       26.4MB ± 0%    ~     (p=0.690 n=5+5)
XML             42.0MB ± 0%       42.0MB ± 0%    ~     (p=0.690 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          378k ± 0%         377k ± 1%    ~     (p=0.841 n=5+5)
Unicode           321k ± 0%         322k ± 0%    ~     (p=0.310 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.310 n=5+5)
SSA              9.67M ± 0%        9.69M ± 0%  +0.20%  (p=0.008 n=5+5)
Flate             233k ± 0%         233k ± 1%    ~     (p=1.000 n=5+5)
GoParser          315k ± 0%         314k ± 1%    ~     (p=0.151 n=5+5)
Reflect           971k ± 0%         970k ± 0%    ~     (p=0.690 n=5+5)
Tar               249k ± 0%         249k ± 0%    ~     (p=0.310 n=5+5)
XML               391k ± 0%         390k ± 0%    ~     (p=0.841 n=5+5)

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 22, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
late in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  Since funcsyms may now be added in any order,
  we must sort them to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in golang#18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1 has almost no impact.

name        old time/op       new time/op       delta
Template          195ms ± 3%        194ms ± 5%    ~     (p=0.370 n=30+29)
Unicode          86.6ms ± 3%       87.0ms ± 7%    ~     (p=0.958 n=29+30)
GoTypes           548ms ± 3%        555ms ± 4%  +1.35%  (p=0.001 n=30+28)
Compiler          2.51s ± 2%        2.54s ± 2%  +1.17%  (p=0.000 n=28+30)
SSA               5.16s ± 3%        5.16s ± 2%    ~     (p=0.910 n=30+29)
Flate             124ms ± 5%        124ms ± 4%    ~     (p=0.947 n=30+30)
GoParser          146ms ± 3%        146ms ± 3%    ~     (p=0.150 n=29+28)
Reflect           354ms ± 3%        352ms ± 4%    ~     (p=0.096 n=29+29)
Tar               107ms ± 5%        106ms ± 3%    ~     (p=0.370 n=30+29)
XML               200ms ± 4%        201ms ± 4%    ~     (p=0.313 n=29+28)
[Geo mean]        332ms             333ms       +0.10%

name        old user-time/op  new user-time/op  delta
Template          227ms ± 5%        225ms ± 5%    ~     (p=0.457 n=28+27)
Unicode           109ms ± 4%        109ms ± 5%    ~     (p=0.758 n=29+29)
GoTypes           713ms ± 4%        721ms ± 5%    ~     (p=0.051 n=30+29)
Compiler          3.36s ± 2%        3.38s ± 3%    ~     (p=0.146 n=30+30)
SSA               7.46s ± 3%        7.47s ± 3%    ~     (p=0.804 n=30+29)
Flate             146ms ± 7%        147ms ± 3%    ~     (p=0.833 n=29+27)
GoParser          179ms ± 5%        179ms ± 5%    ~     (p=0.866 n=30+30)
Reflect           431ms ± 4%        429ms ± 4%    ~     (p=0.593 n=29+30)
Tar               124ms ± 5%        123ms ± 5%    ~     (p=0.140 n=29+29)
XML               243ms ± 4%        242ms ± 7%    ~     (p=0.404 n=29+29)
[Geo mean]        415ms             415ms       +0.02%

name        old obj-bytes     new obj-bytes     delta
Template           382k ± 0%         382k ± 0%    ~     (all equal)
Unicode            203k ± 0%         203k ± 0%    ~     (all equal)
GoTypes           1.18M ± 0%        1.18M ± 0%    ~     (all equal)
Compiler          3.98M ± 0%        3.98M ± 0%    ~     (all equal)
SSA               8.28M ± 0%        8.28M ± 0%    ~     (all equal)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           287k ± 0%         287k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                190k ± 0%         190k ± 0%    ~     (all equal)
XML                416k ± 0%         416k ± 0%    ~     (all equal)
[Geo mean]         660k              660k       +0.00%

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 24, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
late in the development cycle,
we can simply have cmd/go set c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  Since funcsyms may now be added in any order,
  we must sort them to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is the only heavily contended mutex.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1 has almost no impact.

name        old time/op       new time/op       delta
Template          195ms ± 3%        194ms ± 5%    ~     (p=0.370 n=30+29)
Unicode          86.6ms ± 3%       87.0ms ± 7%    ~     (p=0.958 n=29+30)
GoTypes           548ms ± 3%        555ms ± 4%  +1.35%  (p=0.001 n=30+28)
Compiler          2.51s ± 2%        2.54s ± 2%  +1.17%  (p=0.000 n=28+30)
SSA               5.16s ± 3%        5.16s ± 2%    ~     (p=0.910 n=30+29)
Flate             124ms ± 5%        124ms ± 4%    ~     (p=0.947 n=30+30)
GoParser          146ms ± 3%        146ms ± 3%    ~     (p=0.150 n=29+28)
Reflect           354ms ± 3%        352ms ± 4%    ~     (p=0.096 n=29+29)
Tar               107ms ± 5%        106ms ± 3%    ~     (p=0.370 n=30+29)
XML               200ms ± 4%        201ms ± 4%    ~     (p=0.313 n=29+28)
[Geo mean]        332ms             333ms       +0.10%

name        old user-time/op  new user-time/op  delta
Template          227ms ± 5%        225ms ± 5%    ~     (p=0.457 n=28+27)
Unicode           109ms ± 4%        109ms ± 5%    ~     (p=0.758 n=29+29)
GoTypes           713ms ± 4%        721ms ± 5%    ~     (p=0.051 n=30+29)
Compiler          3.36s ± 2%        3.38s ± 3%    ~     (p=0.146 n=30+30)
SSA               7.46s ± 3%        7.47s ± 3%    ~     (p=0.804 n=30+29)
Flate             146ms ± 7%        147ms ± 3%    ~     (p=0.833 n=29+27)
GoParser          179ms ± 5%        179ms ± 5%    ~     (p=0.866 n=30+30)
Reflect           431ms ± 4%        429ms ± 4%    ~     (p=0.593 n=29+30)
Tar               124ms ± 5%        123ms ± 5%    ~     (p=0.140 n=29+29)
XML               243ms ± 4%        242ms ± 7%    ~     (p=0.404 n=29+29)
[Geo mean]        415ms             415ms       +0.02%

name        old obj-bytes     new obj-bytes     delta
Template           382k ± 0%         382k ± 0%    ~     (all equal)
Unicode            203k ± 0%         203k ± 0%    ~     (all equal)
GoTypes           1.18M ± 0%        1.18M ± 0%    ~     (all equal)
Compiler          3.98M ± 0%        3.98M ± 0%    ~     (all equal)
SSA               8.28M ± 0%        8.28M ± 0%    ~     (all equal)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           287k ± 0%         287k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                190k ± 0%         190k ± 0%    ~     (all equal)
XML                416k ± 0%         416k ± 0%    ~     (all equal)
[Geo mean]         660k              660k       +0.00%

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 24, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
late in the development cycle,
we can simply have cmd/go set c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  Since funcsyms may now be added in any order,
  we must sort them to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is the only heavily contended mutex.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1 has almost no impact.

name        old time/op       new time/op       delta
Template          195ms ± 3%        194ms ± 5%    ~     (p=0.370 n=30+29)
Unicode          86.6ms ± 3%       87.0ms ± 7%    ~     (p=0.958 n=29+30)
GoTypes           548ms ± 3%        555ms ± 4%  +1.35%  (p=0.001 n=30+28)
Compiler          2.51s ± 2%        2.54s ± 2%  +1.17%  (p=0.000 n=28+30)
SSA               5.16s ± 3%        5.16s ± 2%    ~     (p=0.910 n=30+29)
Flate             124ms ± 5%        124ms ± 4%    ~     (p=0.947 n=30+30)
GoParser          146ms ± 3%        146ms ± 3%    ~     (p=0.150 n=29+28)
Reflect           354ms ± 3%        352ms ± 4%    ~     (p=0.096 n=29+29)
Tar               107ms ± 5%        106ms ± 3%    ~     (p=0.370 n=30+29)
XML               200ms ± 4%        201ms ± 4%    ~     (p=0.313 n=29+28)
[Geo mean]        332ms             333ms       +0.10%

name        old user-time/op  new user-time/op  delta
Template          227ms ± 5%        225ms ± 5%    ~     (p=0.457 n=28+27)
Unicode           109ms ± 4%        109ms ± 5%    ~     (p=0.758 n=29+29)
GoTypes           713ms ± 4%        721ms ± 5%    ~     (p=0.051 n=30+29)
Compiler          3.36s ± 2%        3.38s ± 3%    ~     (p=0.146 n=30+30)
SSA               7.46s ± 3%        7.47s ± 3%    ~     (p=0.804 n=30+29)
Flate             146ms ± 7%        147ms ± 3%    ~     (p=0.833 n=29+27)
GoParser          179ms ± 5%        179ms ± 5%    ~     (p=0.866 n=30+30)
Reflect           431ms ± 4%        429ms ± 4%    ~     (p=0.593 n=29+30)
Tar               124ms ± 5%        123ms ± 5%    ~     (p=0.140 n=29+29)
XML               243ms ± 4%        242ms ± 7%    ~     (p=0.404 n=29+29)
[Geo mean]        415ms             415ms       +0.02%

name        old obj-bytes     new obj-bytes     delta
Template           382k ± 0%         382k ± 0%    ~     (all equal)
Unicode            203k ± 0%         203k ± 0%    ~     (all equal)
GoTypes           1.18M ± 0%        1.18M ± 0%    ~     (all equal)
Compiler          3.98M ± 0%        3.98M ± 0%    ~     (all equal)
SSA               8.28M ± 0%        8.28M ± 0%    ~     (all equal)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           287k ± 0%         287k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                190k ± 0%         190k ± 0%    ~     (all equal)
XML                416k ± 0%         416k ± 0%    ~     (all equal)
[Geo mean]         660k              660k       +0.00%

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
josharian added a commit to josharian/go that referenced this issue Apr 26, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
late in the development cycle,
we can simply have cmd/go set c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  Since funcsyms may now be added in any order,
  we must sort them to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is the only heavily contended mutex.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed golang#19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is golang#19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1 has almost no impact.

name        old time/op       new time/op       delta
Template          195ms ± 3%        194ms ± 5%    ~     (p=0.370 n=30+29)
Unicode          86.6ms ± 3%       87.0ms ± 7%    ~     (p=0.958 n=29+30)
GoTypes           548ms ± 3%        555ms ± 4%  +1.35%  (p=0.001 n=30+28)
Compiler          2.51s ± 2%        2.54s ± 2%  +1.17%  (p=0.000 n=28+30)
SSA               5.16s ± 3%        5.16s ± 2%    ~     (p=0.910 n=30+29)
Flate             124ms ± 5%        124ms ± 4%    ~     (p=0.947 n=30+30)
GoParser          146ms ± 3%        146ms ± 3%    ~     (p=0.150 n=29+28)
Reflect           354ms ± 3%        352ms ± 4%    ~     (p=0.096 n=29+29)
Tar               107ms ± 5%        106ms ± 3%    ~     (p=0.370 n=30+29)
XML               200ms ± 4%        201ms ± 4%    ~     (p=0.313 n=29+28)
[Geo mean]        332ms             333ms       +0.10%

name        old user-time/op  new user-time/op  delta
Template          227ms ± 5%        225ms ± 5%    ~     (p=0.457 n=28+27)
Unicode           109ms ± 4%        109ms ± 5%    ~     (p=0.758 n=29+29)
GoTypes           713ms ± 4%        721ms ± 5%    ~     (p=0.051 n=30+29)
Compiler          3.36s ± 2%        3.38s ± 3%    ~     (p=0.146 n=30+30)
SSA               7.46s ± 3%        7.47s ± 3%    ~     (p=0.804 n=30+29)
Flate             146ms ± 7%        147ms ± 3%    ~     (p=0.833 n=29+27)
GoParser          179ms ± 5%        179ms ± 5%    ~     (p=0.866 n=30+30)
Reflect           431ms ± 4%        429ms ± 4%    ~     (p=0.593 n=29+30)
Tar               124ms ± 5%        123ms ± 5%    ~     (p=0.140 n=29+29)
XML               243ms ± 4%        242ms ± 7%    ~     (p=0.404 n=29+29)
[Geo mean]        415ms             415ms       +0.02%

name        old obj-bytes     new obj-bytes     delta
Template           382k ± 0%         382k ± 0%    ~     (all equal)
Unicode            203k ± 0%         203k ± 0%    ~     (all equal)
GoTypes           1.18M ± 0%        1.18M ± 0%    ~     (all equal)
Compiler          3.98M ± 0%        3.98M ± 0%    ~     (all equal)
SSA               8.28M ± 0%        8.28M ± 0%    ~     (all equal)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           287k ± 0%         287k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                190k ± 0%         190k ± 0%    ~     (all equal)
XML                416k ± 0%         416k ± 0%    ~     (all equal)
[Geo mean]         660k              660k       +0.00%

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates golang#15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
gopherbot pushed a commit that referenced this issue Apr 27, 2017
cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
late in the development cycle,
we can simply have cmd/go set c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  Since funcsyms may now be added in any order,
  we must sort them to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is the only heavily contended mutex.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed #19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is #19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1 has almost no impact.

name        old time/op       new time/op       delta
Template          195ms ± 3%        194ms ± 5%    ~     (p=0.370 n=30+29)
Unicode          86.6ms ± 3%       87.0ms ± 7%    ~     (p=0.958 n=29+30)
GoTypes           548ms ± 3%        555ms ± 4%  +1.35%  (p=0.001 n=30+28)
Compiler          2.51s ± 2%        2.54s ± 2%  +1.17%  (p=0.000 n=28+30)
SSA               5.16s ± 3%        5.16s ± 2%    ~     (p=0.910 n=30+29)
Flate             124ms ± 5%        124ms ± 4%    ~     (p=0.947 n=30+30)
GoParser          146ms ± 3%        146ms ± 3%    ~     (p=0.150 n=29+28)
Reflect           354ms ± 3%        352ms ± 4%    ~     (p=0.096 n=29+29)
Tar               107ms ± 5%        106ms ± 3%    ~     (p=0.370 n=30+29)
XML               200ms ± 4%        201ms ± 4%    ~     (p=0.313 n=29+28)
[Geo mean]        332ms             333ms       +0.10%

name        old user-time/op  new user-time/op  delta
Template          227ms ± 5%        225ms ± 5%    ~     (p=0.457 n=28+27)
Unicode           109ms ± 4%        109ms ± 5%    ~     (p=0.758 n=29+29)
GoTypes           713ms ± 4%        721ms ± 5%    ~     (p=0.051 n=30+29)
Compiler          3.36s ± 2%        3.38s ± 3%    ~     (p=0.146 n=30+30)
SSA               7.46s ± 3%        7.47s ± 3%    ~     (p=0.804 n=30+29)
Flate             146ms ± 7%        147ms ± 3%    ~     (p=0.833 n=29+27)
GoParser          179ms ± 5%        179ms ± 5%    ~     (p=0.866 n=30+30)
Reflect           431ms ± 4%        429ms ± 4%    ~     (p=0.593 n=29+30)
Tar               124ms ± 5%        123ms ± 5%    ~     (p=0.140 n=29+29)
XML               243ms ± 4%        242ms ± 7%    ~     (p=0.404 n=29+29)
[Geo mean]        415ms             415ms       +0.02%

name        old obj-bytes     new obj-bytes     delta
Template           382k ± 0%         382k ± 0%    ~     (all equal)
Unicode            203k ± 0%         203k ± 0%    ~     (all equal)
GoTypes           1.18M ± 0%        1.18M ± 0%    ~     (all equal)
Compiler          3.98M ± 0%        3.98M ± 0%    ~     (all equal)
SSA               8.28M ± 0%        8.28M ± 0%    ~     (all equal)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           287k ± 0%         287k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                190k ± 0%         190k ± 0%    ~     (all equal)
XML                416k ± 0%         416k ± 0%    ~     (all equal)
[Geo mean]         660k              660k       +0.00%

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates #15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
Reviewed-on: https://go-review.googlesource.com/40693
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot

This comment has been minimized.

Copy link

commented May 9, 2017

CL https://golang.org/cl/42955 mentions this issue.

gopherbot pushed a commit that referenced this issue May 9, 2017
cmd/compile: make builds reproducible in presence of **byte and **int8
CL 39915 introduced sorting of signats by ShortString
for reproducible builds. But ShortString treats types
byte and uint8 identically; same for rune and uint32.
CL 39915 attempted to compensate for this by only
adding the underlying type (uint8) to signats in addsignat.

This only works for byte and uint8. For e.g. *byte and *uint,
both get added, and their sort order is random,
leading to non-reproducible builds.

One fix would be to add yet another type printing mode
that doesn't eliminate byte and rune, and use it
for sorting signats. But the formatting routines
are complicated enough as it is.

Instead, just sort first by ShortString and then by String.
We can't just use String, because ShortString makes distinctions
that String doesn't. ShortString is really preferred here;
String is serving only as a backstop for handling of bytes and runes.

The long series of types in the test helps increase the odds of
failure, allowing a smaller number of iterations in the test.
On my machine, a full test takes 700ms.

Passes toolstash-check.

Updates #19961
Fixes #20272

name        old alloc/op      new alloc/op      delta
Template         37.9MB ± 0%       37.9MB ± 0%  +0.12%  (p=0.032 n=5+5)
Unicode          28.9MB ± 0%       28.9MB ± 0%    ~     (p=0.841 n=5+5)
GoTypes           110MB ± 0%        110MB ± 0%    ~     (p=0.841 n=5+5)
Compiler          463MB ± 0%        463MB ± 0%    ~     (p=0.056 n=5+5)
SSA              1.11GB ± 0%       1.11GB ± 0%  +0.02%  (p=0.016 n=5+5)
Flate            24.7MB ± 0%       24.8MB ± 0%  +0.14%  (p=0.032 n=5+5)
GoParser         31.1MB ± 0%       31.1MB ± 0%    ~     (p=0.421 n=5+5)
Reflect          73.9MB ± 0%       73.9MB ± 0%    ~     (p=1.000 n=5+5)
Tar              25.8MB ± 0%       25.8MB ± 0%  +0.15%  (p=0.016 n=5+5)
XML              41.2MB ± 0%       41.2MB ± 0%    ~     (p=0.310 n=5+5)
[Geo mean]       72.0MB            72.0MB       +0.07%

name        old allocs/op     new allocs/op     delta
Template           384k ± 0%         385k ± 1%    ~     (p=0.056 n=5+5)
Unicode            343k ± 0%         344k ± 0%    ~     (p=0.548 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%    ~     (p=0.421 n=5+5)
Compiler          4.43M ± 0%        4.44M ± 0%  +0.26%  (p=0.032 n=5+5)
SSA               9.86M ± 0%        9.87M ± 0%  +0.10%  (p=0.032 n=5+5)
Flate              237k ± 1%         238k ± 0%  +0.49%  (p=0.032 n=5+5)
GoParser           319k ± 1%         320k ± 1%    ~     (p=0.151 n=5+5)
Reflect            957k ± 0%         957k ± 0%    ~     (p=1.000 n=5+5)
Tar                251k ± 0%         252k ± 1%  +0.49%  (p=0.016 n=5+5)
XML                399k ± 0%         401k ± 1%    ~     (p=0.310 n=5+5)
[Geo mean]         739k              741k       +0.26%

Change-Id: Ic27995a8d374d012b8aca14546b1df9d28d30df7
Reviewed-on: https://go-review.googlesource.com/42955
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Apr 13, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.