sync: add a Map to replace RWLock+map usage #18177

Closed
bcmills opened this Issue Dec 3, 2016 · 45 comments

Comments

Projects
None yet
9 participants
Member

bcmills commented Dec 3, 2016

Per #17973, RWMutex has some scaling issues. One option is to fix RWMutex, but for maps with high read:write ratios we can often do better with a carefully-managed atomic.Value anyway. The standard library contains many such maps, especially in the reflect package.

The actual details of such a map are a bit subtle to implement correctly and efficiently. We should provide a Map implementation in the standard library. Since the implementation may need to use both sync and sync/atomic and the former depends on the latter, it should either go in sync or in a new, separate sync subpackage.

(Presumably this should be targeted to 1.9 due to the 1.8 freeze, but I'd like to add a draft API to x/sync in the meantime.)

bradfitz added this to the Proposal milestone Dec 3, 2016

Owner

bradfitz commented Dec 3, 2016

The standard answer (https://golang.org/doc/faq#x_in_std) is always to prototype it elsewhere first.

So x/sync sounds good to start.

bradfitz added the Proposal label Dec 3, 2016

Contributor

rsc commented Dec 5, 2016

Yes, having such a thing would be nice. There are no details here but I think everyone agrees this comes up repeatedly and would be well served by a library in a standard place. When your code is ready please send a CL.

@rsc rsc added Proposal-Accepted and removed Proposal labels Dec 5, 2016

@rsc rsc modified the milestone: Go1.9Early, Proposal Dec 5, 2016

rsc changed the title from proposal: sync: add a Map to replace RWLock+map usage in the standard library to sync: add a Map to replace RWLock+map usage Dec 5, 2016

Contributor

josharian commented Dec 5, 2016

https://go-review.googlesource.com/c/33912/ has a prototype implementation and API discussion.

Here are some concrete use cases I know of, for use in API discussion:

CL https://golang.org/cl/33912 mentions this issue.

bcmills was assigned by Sajmani Feb 8, 2017

CL https://golang.org/cl/36717 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Feb 10, 2017

@bcmills @bcmills bcmills + bcmills expvar: parallelize BenchmarkMapAdd{Same,Different}
The other expvar tests are already parallelized, and this will help to
measure the impact of potential implementations for #18177.

updates #18177

Change-Id: I0f4f1a16a0285556cbcc8339855b6459af412675
Reviewed-on: https://go-review.googlesource.com/36717
Reviewed-by: Russ Cox <rsc@golang.org>
39651bb

CL https://golang.org/cl/36719 mentions this issue.

CL https://golang.org/cl/36722 mentions this issue.

CL https://golang.org/cl/36723 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Feb 10, 2017

@bcmills @bcmills bcmills + bcmills expvar: make BenchmarkAdd{Same,Different} comparable to 1.8
bradfitz noted in change 36717 that the new behavior was no longer
comparable with the old.  This change restores comparable behavior
for -cpu=1.

BenchmarkMapAddSame                 909           909           +0.00%
BenchmarkMapAddSame-6               1309          262           -79.98%
BenchmarkMapAddDifferent            2856          3030          +6.09%
BenchmarkMapAddDifferent-6          3803          581           -84.72%

updates #18177

Change-Id: Ifaff5a1f48be92002d86c296220313b7efdc81d6
Reviewed-on: https://go-review.googlesource.com/36723
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
aa15387

CL https://golang.org/cl/36724 mentions this issue.

CL https://golang.org/cl/36810 mentions this issue.

CL https://golang.org/cl/36831 mentions this issue.

CL https://golang.org/cl/36959 mentions this issue.

CL https://golang.org/cl/36964 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Feb 14, 2017

@bcmills @bcmills bcmills + bcmills expvar: add benchmarks for steady-state Map Add calls
Add a benchmark for setting a String value, which we may
want to treat differently from Int or Float due to the need to support
Add methods for the latter.

Update tests to use only the exported API instead of making (fragile)
assumptions about unexported fields.

The existing Map benchmarks construct a new Map for each iteration, which
focuses the benchmark results on the initial allocation costs for the
Map and its entries. This change adds variants of the benchmarks which
use a long-lived map in order to measure steady-state performance for
Map updates on existing keys.

Updates #18177

Change-Id: I62c920991d17d5898c592446af382cd5c04c528a
Reviewed-on: https://go-review.googlesource.com/36959
Reviewed-by: Ian Lance Taylor <iant@golang.org>
7ffdb75

CL https://golang.org/cl/36617 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Feb 14, 2017

@bcmills @bcmills bcmills + bcmills mime: add benchmarks for TypeByExtension and ExtensionsByType
These are possible use-cases for sync.Map.

Updates golang/go#18177

Change-Id: I5e2a3d1249967c37d3f89a41122bf4a90522db11
Reviewed-on: https://go-review.googlesource.com/36964
Reviewed-by: Ian Lance Taylor <iant@golang.org>
eebd8f5

@gopherbot gopherbot pushed a commit to golang/sync that referenced this issue Feb 15, 2017

@bcmills @bcmills bcmills + bcmills syncmap: add a synchronized map implementation.
This is a draft for the sync.Map API proposed in golang/go#18177.

It supports fast-path loads via an atomic variable, falling back to a
Mutex for stores. In order to keep stores amortized to O(1), loads
following a store follow the Mutex path until enough loads have
occurred to offset the cost of a deep copy.

For mostly-read loads, such as the maps in the reflect package in the
standard library, this significantly reduces cache-line contention vs.
a plain RWMutex with a map.

goos: linux
goarch: amd64
pkg: golang.org/x/sync/syncmap
BenchmarkLoadMostlyHits/*syncmap_test.DeepCopyMap               20000000                73.1 ns/op
BenchmarkLoadMostlyHits/*syncmap_test.DeepCopyMap-48            100000000               13.8 ns/op
BenchmarkLoadMostlyHits/*syncmap_test.RWMutexMap                20000000                87.7 ns/op
BenchmarkLoadMostlyHits/*syncmap_test.RWMutexMap-48             10000000               154 ns/op
BenchmarkLoadMostlyHits/*syncmap.Map                            20000000                72.1 ns/op
BenchmarkLoadMostlyHits/*syncmap.Map-48                         100000000               11.2 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.DeepCopyMap             20000000                63.2 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.DeepCopyMap-48          200000000               14.2 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.RWMutexMap              20000000                72.7 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.RWMutexMap-48           10000000               150 ns/op
BenchmarkLoadMostlyMisses/*syncmap.Map                          30000000                56.4 ns/op
BenchmarkLoadMostlyMisses/*syncmap.Map-48                       200000000                9.77 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap_test.RWMutexMap            2000000               683 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap_test.RWMutexMap-48         1000000              1394 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap.Map                        2000000               645 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap.Map-48                     1000000              1253 ns/op
BenchmarkLoadOrStoreUnique/*syncmap_test.RWMutexMap              1000000              1015 ns/op
BenchmarkLoadOrStoreUnique/*syncmap_test.RWMutexMap-48           1000000              1911 ns/op
BenchmarkLoadOrStoreUnique/*syncmap.Map                          1000000              1018 ns/op
BenchmarkLoadOrStoreUnique/*syncmap.Map-48                       1000000              1776 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.DeepCopyMap         50000000                30.2 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.DeepCopyMap-48      2000000000               1.24 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.RWMutexMap          30000000                50.1 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.RWMutexMap-48        5000000               451 ns/op
BenchmarkLoadOrStoreCollision/*syncmap.Map                      30000000                36.8 ns/op
BenchmarkLoadOrStoreCollision/*syncmap.Map-48                   2000000000               1.24 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.DeepCopyMap             10000000               213 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.DeepCopyMap-48           1000000              5012 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.RWMutexMap              20000000                68.8 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.RWMutexMap-48            5000000               429 ns/op
BenchmarkAdversarialAlloc/*syncmap.Map                           5000000               229 ns/op
BenchmarkAdversarialAlloc/*syncmap.Map-48                        2000000               600 ns/op
BenchmarkAdversarialDelete/*syncmap_test.DeepCopyMap             5000000               314 ns/op
BenchmarkAdversarialDelete/*syncmap_test.DeepCopyMap-48          2000000               726 ns/op
BenchmarkAdversarialDelete/*syncmap_test.RWMutexMap             20000000                63.2 ns/op
BenchmarkAdversarialDelete/*syncmap_test.RWMutexMap-48           5000000               469 ns/op
BenchmarkAdversarialDelete/*syncmap.Map                         10000000               203 ns/op
BenchmarkAdversarialDelete/*syncmap.Map-48                      10000000               253 ns/op

goos: linux
goarch: ppc64le
pkg: golang.org/x/sync/syncmap
BenchmarkLoadMostlyHits/*syncmap_test.DeepCopyMap                5000000               253 ns/op
BenchmarkLoadMostlyHits/*syncmap_test.DeepCopyMap-48            50000000                26.2 ns/op
BenchmarkLoadMostlyHits/*syncmap_test.RWMutexMap                 5000000               505 ns/op
BenchmarkLoadMostlyHits/*syncmap_test.RWMutexMap-48              3000000               443 ns/op
BenchmarkLoadMostlyHits/*syncmap.Map                            10000000               200 ns/op
BenchmarkLoadMostlyHits/*syncmap.Map-48                         100000000               18.1 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.DeepCopyMap             10000000               162 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.DeepCopyMap-48          100000000               23.8 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.RWMutexMap              10000000               195 ns/op
BenchmarkLoadMostlyMisses/*syncmap_test.RWMutexMap-48            3000000               531 ns/op
BenchmarkLoadMostlyMisses/*syncmap.Map                          10000000               182 ns/op
BenchmarkLoadMostlyMisses/*syncmap.Map-48                       100000000               15.8 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap_test.RWMutexMap            1000000              1664 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap_test.RWMutexMap-48         1000000              1768 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap.Map                        1000000              2128 ns/op
BenchmarkLoadOrStoreBalanced/*syncmap.Map-48                     1000000              1903 ns/op
BenchmarkLoadOrStoreUnique/*syncmap_test.RWMutexMap              1000000              2657 ns/op
BenchmarkLoadOrStoreUnique/*syncmap_test.RWMutexMap-48           1000000              2577 ns/op
BenchmarkLoadOrStoreUnique/*syncmap.Map                          1000000              1714 ns/op
BenchmarkLoadOrStoreUnique/*syncmap.Map-48                       1000000              2484 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.DeepCopyMap         10000000               130 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.DeepCopyMap-48      100000000               11.3 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.RWMutexMap           3000000               426 ns/op
BenchmarkLoadOrStoreCollision/*syncmap_test.RWMutexMap-48        2000000               930 ns/op
BenchmarkLoadOrStoreCollision/*syncmap.Map                      10000000               131 ns/op
BenchmarkLoadOrStoreCollision/*syncmap.Map-48                   300000000                4.07 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.DeepCopyMap              3000000               447 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.DeepCopyMap-48            300000              4159 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.RWMutexMap              10000000               191 ns/op
BenchmarkAdversarialAlloc/*syncmap_test.RWMutexMap-48            3000000               535 ns/op
BenchmarkAdversarialAlloc/*syncmap.Map                           2000000               525 ns/op
BenchmarkAdversarialAlloc/*syncmap.Map-48                        1000000              1000 ns/op
BenchmarkAdversarialDelete/*syncmap_test.DeepCopyMap             2000000               711 ns/op
BenchmarkAdversarialDelete/*syncmap_test.DeepCopyMap-48          2000000               900 ns/op
BenchmarkAdversarialDelete/*syncmap_test.RWMutexMap              3000000               354 ns/op
BenchmarkAdversarialDelete/*syncmap_test.RWMutexMap-48           3000000               473 ns/op
BenchmarkAdversarialDelete/*syncmap.Map                          2000000              1357 ns/op
BenchmarkAdversarialDelete/*syncmap.Map-48                       5000000               334 ns/op

Updates golang/go#18177

Change-Id: I8d561b617b1cd2ca03a8e68a5d5a28a519a0ce38
Reviewed-on: https://go-review.googlesource.com/33912
Reviewed-by: Russ Cox <rsc@golang.org>
54b13b0

thwd commented Feb 16, 2017

A humble request from a bystander: could you kindly document the abstract and internal workings of syncmap somewhere, perhaps in the package's documentation? I read the comment blocks inside the Map struct definition and get the bare mechanics of it. But now I'd love to understand the fast-path vs slow-path and/or clean/dirty rationales.

Member

bcmills commented Feb 16, 2017

@thwd The workings are still in flux. The next version will have fewer slow paths to understand. :)

CL https://golang.org/cl/37151 mentions this issue.

@gopherbot gopherbot pushed a commit to golang/sync that referenced this issue Feb 16, 2017

@bcmills @bcmills bcmills + bcmills syncmap: add benchmark for Range
adjust BenchmarkAdversarialDelete to be somewhat more adversarial.

updates golang/go#18177

Change-Id: Id01ed1077a0447dcfc6ea3929c22baaddbc9d6ee
Reviewed-on: https://go-review.googlesource.com/37151
Reviewed-by: Russ Cox <rsc@golang.org>
37569ff

@gopherbot gopherbot pushed a commit to golang/sync that referenced this issue Feb 16, 2017

@bcmills @bcmills bcmills + bcmills syncmap: make quick-check output more readable
Add a test for Range with concurrent Loads and Stores.

The previous quick-check tests generated long, mostly-non-ASCII
strings that were hard to debug on failure; this change makes the keys
and values short and human-readable, which also tends to produce more
key collisions in the test as a side-effect.

updates golang/go#18177

Change-Id: Ie56a64ec9fe295435682b90c3e6466ed5b349bf9
Reviewed-on: https://go-review.googlesource.com/37150
Reviewed-by: Russ Cox <rsc@golang.org>
86ddc85

CL https://golang.org/cl/37153 mentions this issue.

Member

bcmills commented Feb 16, 2017

I have pending changes for several standard-library packages.

There are a few additional uses of RWMutex which could potentially be
replaced but would involve subtle or invasive changes:

syscall/fd_nacl.go uses an RWMutex to guard a map of reference-counted
files. Use of sync.Map would require care to avoid races with
refcount updates.

syscal/env_unix.go uses an RWMutex to guard a map[string]int which
indexes into a []string. sync.Map could potentially be used to reduce
contention in Getenv, but it would require either a parallel map
dedicated to that purpose or a significant change to the existing data
structures.

net/interface.go uses an RWMutex to guard a bidirectional map with
frequent writes. It appears to acquire a read-lock along every path,
so it isn't obvious to me why it's using RWMutex in the first place.

text/template/template.go uses an RWMutex to guard a pair of maps
which are passed to parse.Parse. Changing the underlying data
structure (without introducing needless copying on every Parse) would
require changes to the parse package API.

go/token/position.go uses an RWMutex to guard three variables with
fairly subtle invariants among them. It is plausible that the RWMutex
could be replaced with a sync.Map and one or more atomic variables, but the
changes to do so are not readily apparent.

Contributor

rsc commented Feb 17, 2017

Let's only replace the ones for which sync.Map makes the code clearer or that show up in profiles and are fixed by use of sync.Map. None of these seem worth doing.

Please consider adding a LoadOrStoreLazy(key, factory func() interface{}) (actual interface{}, loaded bool) function:

LoadOrStoreLazy("foo", func() interface{} {
    return heavystuff.New("foo")  
})

This is critical for cases when it's expensive to create a new value.

Member

bcmills commented Feb 17, 2017 edited

@akrylysov What would the semantics of LoadOrStoreLazy be?

If it reserves the slot for the first caller, that's easy enough to do on the caller side with a little extra indirection:

func lazyLoad(m *sync.Map, key interface{}, f func() interface{}) interface{} {
  if g, ok := m.Load(key); ok {
    return g.(func() interface{})()
  }

  var (
    once sync.Once
    x interface{}
  )
  g, _ := m.LoadOrStore(key, func() interface{} {
    once.Do(func() {
      x = f()
      m.Store(key, func() interface{} { return x })
    })
    return x
  })
  return g.(func() interface{})()
})

If it does not reserve the slot, then you just need an extra Load:

x, ok := m.Load(key)
if ok {
  return x
}
x, _ = m.LoadOrStore(key, heavystuff.New(key))
return x

akrylysov commented Feb 17, 2017 edited

@bcmills it should reserve the slot. My idea is that the factory function (the second argument of LoadOrStoreLazy) should be called only when map is missing the key. So basically m.dirty[key] = f() instead of m.dirty[key] = value in https://github.com/golang/sync/blob/master/syncmap/map.go#L119.

Member

bcmills commented Feb 17, 2017 edited

That seems worse than the caller-side lazyLoad function I sketched above. It cuts out a little bit of indirection for clean Load calls, but at the cost of locking the map for all of the other keys that miss while the callback executes.

Member

bcmills commented Feb 17, 2017

I have a couple more use-cases which, unlike the examples I've found in the standard library so far, are not "append-only" with respect to the set of keys in the map.

They fall into the general category of "handles":

  1. Maps from integer handle to Go pointer, so that the handles can be passed to C functions or C callers and stored without violating the cgo pointer-passing rules.
  2. Maps from session-ID or user-ID to per-session or per-user data in long-lived servers without long-lived connections.

CL https://golang.org/cl/37342 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Mar 15, 2017

@bcmills @bcmills bcmills + bcmills archive/zip: parallelize benchmarks
Add subbenchmarks for BenchmarkZip64Test with different sizes to tease
apart construction costs vs. steady-state throughput.

Results remain comparable with the non-parallel version with -cpu=1:

benchmark                           old ns/op     new ns/op     delta
BenchmarkCompressedZipGarbage       26832835      27506953      +2.51%
BenchmarkCompressedZipGarbage-6     27172377      4321534       -84.10%
BenchmarkZip64Test                  196758732     197765510     +0.51%
BenchmarkZip64Test-6                193850605     192625458     -0.63%

benchmark                           old allocs     new allocs     delta
BenchmarkCompressedZipGarbage       44             44             +0.00%
BenchmarkCompressedZipGarbage-6     44             44             +0.00%

benchmark                           old bytes     new bytes     delta
BenchmarkCompressedZipGarbage       5592          5664          +1.29%
BenchmarkCompressedZipGarbage-6     5592          21946         +292.45%

updates #18177

Change-Id: Icfa359d9b1a8df5e085dacc07d2b9221b284764c
Reviewed-on: https://go-review.googlesource.com/36719
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
d0a045d

@gopherbot gopherbot pushed a commit to golang/sync that referenced this issue Mar 15, 2017

@bcmills @bcmills bcmills + bcmills syncmap: remove blocking for all operations on existing keys
In the previous version, Loads of keys with unchanged values would
block on the Mutex if the map was dirty, leading to potentially long
stalls while the read-only map is being copied to the dirty map.

This change adds a double pointer indirection to each entry and an
extra atomic load on the fast path. In exchange, it removes the need
to lock the map for nearly all operations on existing keys.

Each entry is represented by an atomically-updated pointer to an
interface{} value. The same entries are stored in both the read-only
and dirty maps. Keys deleted before the dirty map was last copied are
marked as "expunged" and omitted from the dirty map until the next
Store to that key.

Newly-stored values exist only in the dirty map. The existence of new
keys in the dirty map is indicated by an "amended" bit attached to the
read-only map, allowing Load or Delete calls for nonexistent keys to
sometimes return immediately (without acquiring the Mutex).

This trades off a bit of steady-state throughput in the "mostly hits"
case in exchange for less contention (and lower Load tail latencies)
for mixed Load/Store/Delete usage.

Unfortunately, the latency impact is currently difficult to measure
(#19128).

updates golang/go#19128
updates golang/go#18177

https://perf.golang.org/search?q=upload:20170315.5

name                                               old time/op    new time/op    delta
LoadMostlyHits/*syncmap_test.DeepCopyMap             70.3ns ± 6%    70.2ns ± 4%      ~     (p=0.886 n=4+4)
LoadMostlyHits/*syncmap_test.DeepCopyMap-48          10.9ns ± 4%    13.6ns ±24%      ~     (p=0.314 n=4+4)
LoadMostlyHits/*syncmap_test.RWMutexMap              86.0ns ± 9%    86.0ns ± 4%      ~     (p=1.000 n=4+4)
LoadMostlyHits/*syncmap_test.RWMutexMap-48            140ns ± 3%     139ns ± 2%      ~     (p=0.686 n=4+4)
LoadMostlyHits/*syncmap.Map                          70.6ns ± 8%    70.3ns ± 1%      ~     (p=0.571 n=4+4)
LoadMostlyHits/*syncmap.Map-48                       10.3ns ± 7%    11.2ns ± 5%      ~     (p=0.114 n=4+4)
LoadMostlyMisses/*syncmap_test.DeepCopyMap           59.4ns ± 4%    59.4ns ± 4%      ~     (p=1.000 n=4+4)
LoadMostlyMisses/*syncmap_test.DeepCopyMap-48        10.4ns ±30%    12.6ns ±29%      ~     (p=0.486 n=4+4)
LoadMostlyMisses/*syncmap_test.RWMutexMap            64.8ns ± 6%    63.8ns ± 4%      ~     (p=0.686 n=4+4)
LoadMostlyMisses/*syncmap_test.RWMutexMap-48          138ns ± 3%     139ns ± 2%      ~     (p=0.800 n=4+4)
LoadMostlyMisses/*syncmap.Map                        51.5ns ± 7%    50.4ns ± 2%      ~     (p=0.371 n=4+4)
LoadMostlyMisses/*syncmap.Map-48                     9.37ns ± 3%    9.40ns ± 3%      ~     (p=0.886 n=4+4)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap          820ns ±17%     834ns ±14%      ~     (p=1.000 n=4+4)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap-48      1.38µs ± 4%    1.38µs ± 3%      ~     (p=0.886 n=4+4)
LoadOrStoreBalanced/*syncmap.Map                      709ns ±13%    1085ns ± 9%   +53.09%  (p=0.029 n=4+4)
LoadOrStoreBalanced/*syncmap.Map-48                  1.30µs ±13%    1.08µs ± 8%   -17.40%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap           1.26µs ±14%    1.16µs ±29%      ~     (p=0.343 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap-48        1.82µs ±15%    1.98µs ± 6%      ~     (p=0.143 n=4+4)
LoadOrStoreUnique/*syncmap.Map                       1.06µs ± 7%    1.86µs ±11%   +76.09%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap.Map-48                    2.00µs ± 4%    1.62µs ± 4%   -19.20%  (p=0.029 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap       32.9ns ± 8%    32.6ns ± 9%      ~     (p=0.686 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap-48   2.26ns ±136%    3.41ns ±62%      ~     (p=0.114 n=4+4)
LoadOrStoreCollision/*syncmap_test.RWMutexMap        58.0ns ±20%    54.6ns ±17%      ~     (p=0.343 n=4+4)
LoadOrStoreCollision/*syncmap_test.RWMutexMap-48      458ns ± 2%     445ns ± 6%      ~     (p=0.229 n=4+4)
LoadOrStoreCollision/*syncmap.Map                    35.8ns ± 5%    39.6ns ± 9%      ~     (p=0.114 n=4+4)
LoadOrStoreCollision/*syncmap.Map-48                 1.24ns ± 2%    1.58ns ± 4%   +27.16%  (p=0.029 n=4+4)
Range/*syncmap_test.DeepCopyMap                      19.7µs ± 4%    19.8µs ± 3%      ~     (p=0.686 n=4+4)
Range/*syncmap_test.DeepCopyMap-48                    763ns ± 1%     864ns ± 3%   +13.24%  (p=0.029 n=4+4)
Range/*syncmap_test.RWMutexMap                       20.9µs ± 3%    20.4µs ± 4%      ~     (p=0.343 n=4+4)
Range/*syncmap_test.RWMutexMap-48                     764ns ± 1%     870ns ± 1%   +13.77%  (p=0.029 n=4+4)
Range/*syncmap.Map                                   20.4µs ± 5%    22.7µs ± 5%   +10.98%  (p=0.029 n=4+4)
Range/*syncmap.Map-48                                 776ns ± 3%     954ns ± 4%   +22.95%  (p=0.029 n=4+4)
AdversarialAlloc/*syncmap_test.DeepCopyMap            206ns ± 5%     199ns ± 2%      ~     (p=0.200 n=4+4)
AdversarialAlloc/*syncmap_test.DeepCopyMap-48        8.94µs ± 5%    9.21µs ± 4%      ~     (p=0.200 n=4+4)
AdversarialAlloc/*syncmap_test.RWMutexMap            63.4ns ± 4%    63.8ns ± 3%      ~     (p=0.686 n=4+4)
AdversarialAlloc/*syncmap_test.RWMutexMap-48          184ns ±10%     198ns ±11%      ~     (p=0.343 n=4+4)
AdversarialAlloc/*syncmap.Map                         213ns ± 3%     264ns ± 3%   +23.97%  (p=0.029 n=4+4)
AdversarialAlloc/*syncmap.Map-48                      556ns ± 5%    1389ns ± 8%  +149.93%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap_test.DeepCopyMap           300ns ± 6%     304ns ± 7%      ~     (p=0.886 n=4+4)
AdversarialDelete/*syncmap_test.DeepCopyMap-48        647ns ± 3%     646ns ± 3%      ~     (p=0.943 n=4+4)
AdversarialDelete/*syncmap_test.RWMutexMap           69.1ns ± 1%    69.2ns ± 6%      ~     (p=0.686 n=4+4)
AdversarialDelete/*syncmap_test.RWMutexMap-48         289ns ±15%     300ns ±17%      ~     (p=0.829 n=4+4)
AdversarialDelete/*syncmap.Map                        198ns ± 5%     264ns ± 2%   +33.17%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap.Map-48                     291ns ± 9%     173ns ± 8%   -40.50%  (p=0.029 n=4+4)

name                                               old alloc/op   new alloc/op   delta
LoadMostlyHits/*syncmap_test.DeepCopyMap              7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap_test.DeepCopyMap-48           7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap               7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap-48            7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap.Map                           7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyHits/*syncmap.Map-48                        7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap            7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap-48         7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap             7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap-48          7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap.Map                         7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadMostlyMisses/*syncmap.Map-48                      7.00B ± 0%     7.00B ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap          95.0B ± 0%     95.0B ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap-48       95.0B ± 0%     95.0B ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap.Map                      95.0B ± 0%     88.0B ± 0%    -7.37%  (p=0.029 n=4+4)
LoadOrStoreBalanced/*syncmap.Map-48                   95.0B ± 0%     88.0B ± 0%    -7.37%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap             175B ± 0%      175B ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap_test.RWMutexMap-48          175B ± 0%      175B ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap.Map                         175B ± 0%      161B ± 0%    -8.00%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap.Map-48                      175B ± 0%      161B ± 0%    -8.00%  (p=0.029 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap        0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap-48     0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap         0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap-48      0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map                     0.00B          0.00B           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map-48                  0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.DeepCopyMap                       0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.DeepCopyMap-48                    0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.RWMutexMap                        0.00B          0.00B           ~     (all equal)
Range/*syncmap_test.RWMutexMap-48                     0.00B          0.00B           ~     (all equal)
Range/*syncmap.Map                                    0.00B          0.00B           ~     (all equal)
Range/*syncmap.Map-48                                 0.00B          0.00B           ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap            74.0B ± 0%     74.0B ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap-48        3.15kB ± 0%    3.15kB ± 0%      ~     (p=1.000 n=4+4)
AdversarialAlloc/*syncmap_test.RWMutexMap             8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.RWMutexMap-48          8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap.Map                         74.0B ± 0%     55.0B ± 0%   -25.68%  (p=0.029 n=4+4)
AdversarialAlloc/*syncmap.Map-48                      8.00B ± 0%    56.25B ± 1%  +603.12%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap_test.DeepCopyMap            155B ± 0%      155B ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.DeepCopyMap-48         156B ± 0%      156B ± 0%      ~     (p=1.000 n=4+4)
AdversarialDelete/*syncmap_test.RWMutexMap            8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.RWMutexMap-48         8.00B ± 0%     8.00B ± 0%      ~     (all equal)
AdversarialDelete/*syncmap.Map                        81.0B ± 0%     65.0B ± 0%   -19.75%  (p=0.029 n=4+4)
AdversarialDelete/*syncmap.Map-48                     23.0B ± 9%     15.2B ± 5%   -33.70%  (p=0.029 n=4+4)

name                                               old allocs/op  new allocs/op  delta
LoadMostlyHits/*syncmap_test.DeepCopyMap               0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap_test.DeepCopyMap-48            0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap                0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap_test.RWMutexMap-48             0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap.Map                            0.00           0.00           ~     (all equal)
LoadMostlyHits/*syncmap.Map-48                         0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap             0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.DeepCopyMap-48          0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap              0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap_test.RWMutexMap-48           0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap.Map                          0.00           0.00           ~     (all equal)
LoadMostlyMisses/*syncmap.Map-48                       0.00           0.00           ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap_test.RWMutexMap-48        2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreBalanced/*syncmap.Map                       2.00 ± 0%      3.00 ± 0%   +50.00%  (p=0.029 n=4+4)
LoadOrStoreBalanced/*syncmap.Map-48                    2.00 ± 0%      3.00 ± 0%   +50.00%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap_test.RWMutexMap             2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap_test.RWMutexMap-48          2.00 ± 0%      2.00 ± 0%      ~     (all equal)
LoadOrStoreUnique/*syncmap.Map                         2.00 ± 0%      4.00 ± 0%  +100.00%  (p=0.029 n=4+4)
LoadOrStoreUnique/*syncmap.Map-48                      2.00 ± 0%      4.00 ± 0%  +100.00%  (p=0.029 n=4+4)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap         0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.DeepCopyMap-48      0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap          0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap_test.RWMutexMap-48       0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map                      0.00           0.00           ~     (all equal)
LoadOrStoreCollision/*syncmap.Map-48                   0.00           0.00           ~     (all equal)
Range/*syncmap_test.DeepCopyMap                        0.00           0.00           ~     (all equal)
Range/*syncmap_test.DeepCopyMap-48                     0.00           0.00           ~     (all equal)
Range/*syncmap_test.RWMutexMap                         0.00           0.00           ~     (all equal)
Range/*syncmap_test.RWMutexMap-48                      0.00           0.00           ~     (all equal)
Range/*syncmap.Map                                     0.00           0.00           ~     (all equal)
Range/*syncmap.Map-48                                  0.00           0.00           ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.DeepCopyMap-48          1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.RWMutexMap              1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap_test.RWMutexMap-48           1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap.Map                          1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialAlloc/*syncmap.Map-48                       1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.DeepCopyMap            1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.DeepCopyMap-48         1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.RWMutexMap             1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap_test.RWMutexMap-48          1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap.Map                         1.00 ± 0%      1.00 ± 0%      ~     (all equal)
AdversarialDelete/*syncmap.Map-48                      1.00 ± 0%      1.00 ± 0%      ~     (all equal)

Change-Id: I93c9458505d84238cc8e9016a7dfd285868af236
Reviewed-on: https://go-review.googlesource.com/37342
Reviewed-by: Russ Cox <rsc@golang.org>
a60ad46
Contributor

josharian commented Apr 8, 2017

Is the plan to try to move this to the standard library for 1.9, or wait for 1.10 (or beyond)?

Member

bcmills commented Apr 8, 2017

I'm still planning to try to get it in for 1.9. (I need to address some code review comments — got bogged down with internal work.)

CL https://golang.org/cl/40693 mentions this issue.

@josharian josharian added a commit to josharian/go that referenced this issue Apr 19, 2017

@josharian josharian cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in #18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed #19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is #19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 1-2% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         193ms ± 4%        195ms ± 5%  +0.99%  (p=0.007 n=50+50)
Unicode         82.0ms ± 3%       84.7ms ± 4%  +3.26%  (p=0.000 n=47+48)
GoTypes          539ms ± 4%        549ms ± 3%  +1.90%  (p=0.000 n=46+46)
SSA              5.92s ± 2%        5.95s ± 3%  +0.49%  (p=0.018 n=42+48)
Flate            121ms ± 4%        122ms ± 3%    ~     (p=0.783 n=48+46)
GoParser         143ms ± 6%        144ms ± 3%  +0.90%  (p=0.001 n=49+47)
Reflect          342ms ± 4%        347ms ± 3%  +1.47%  (p=0.000 n=48+49)
Tar              104ms ± 5%        105ms ± 4%  +1.06%  (p=0.003 n=47+47)
XML              197ms ± 4%        199ms ± 5%  +0.85%  (p=0.014 n=48+49)

name       old user-time/op  new user-time/op  delta
Template         238ms ±10%        241ms ±10%    ~     (p=0.066 n=50+50)
Unicode          104ms ± 4%        106ms ± 5%  +2.23%  (p=0.000 n=46+49)
GoTypes          706ms ± 3%        714ms ± 5%  +1.15%  (p=0.002 n=45+49)
SSA              8.21s ± 3%        8.27s ± 2%  +0.78%  (p=0.002 n=43+47)
Flate            144ms ± 7%        145ms ± 5%    ~     (p=0.098 n=49+49)
GoParser         175ms ± 4%        177ms ± 4%  +1.35%  (p=0.003 n=47+50)
Reflect          435ms ± 6%        433ms ± 7%    ~     (p=0.852 n=50+50)
Tar              121ms ± 6%        122ms ± 7%    ~     (p=0.143 n=48+50)
XML              240ms ± 4%        241ms ± 4%    ~     (p=0.231 n=50+49)

name       old alloc/op      new alloc/op      delta
Template        38.7MB ± 0%       38.7MB ± 0%    ~     (p=0.690 n=5+5)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.222 n=5+5)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.310 n=5+5)
SSA             1.24GB ± 0%       1.24GB ± 0%  +0.04%  (p=0.008 n=5+5)
Flate           25.2MB ± 0%       25.2MB ± 0%    ~     (p=0.841 n=5+5)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.151 n=5+5)
Reflect         77.5MB ± 0%       77.5MB ± 0%    ~     (p=0.690 n=5+5)
Tar             26.4MB ± 0%       26.4MB ± 0%    ~     (p=0.690 n=5+5)
XML             42.0MB ± 0%       42.0MB ± 0%    ~     (p=0.690 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          378k ± 0%         377k ± 1%    ~     (p=0.841 n=5+5)
Unicode           321k ± 0%         322k ± 0%    ~     (p=0.310 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.310 n=5+5)
SSA              9.67M ± 0%        9.69M ± 0%  +0.20%  (p=0.008 n=5+5)
Flate             233k ± 0%         233k ± 1%    ~     (p=1.000 n=5+5)
GoParser          315k ± 0%         314k ± 1%    ~     (p=0.151 n=5+5)
Reflect           971k ± 0%         970k ± 0%    ~     (p=0.690 n=5+5)
Tar               249k ± 0%         249k ± 0%    ~     (p=0.310 n=5+5)
XML               391k ± 0%         390k ± 0%    ~     (p=0.841 n=5+5)

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates #15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
0b456c1

@josharian josharian added a commit to josharian/go that referenced this issue Apr 19, 2017

@josharian josharian cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.


CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.


BACKGROUND

The compiler current consists (very roughly) of a handful of phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.


USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.


MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.Hashmu. ctxt.Hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in #18177 may
  provide some free speed-ups here.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* types.Pkg.Symsmu. Looking up symbols in a package
  happens a fair amount during backend compilation,
  including the construction of gcargs/gclocals symbols,
  and types, particularly via ngotype.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.


TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed #19962 for this.


REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is #19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.


NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.


PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 3% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         194ms ± 4%        195ms ± 4%  +0.91%  (p=0.002 n=49+47)
Unicode         82.9ms ± 4%       85.2ms ± 3%  +2.68%  (p=0.000 n=47+48)
GoTypes          518ms ± 3%        527ms ± 2%  +1.81%  (p=0.000 n=46+46)
SSA              5.59s ± 2%        5.77s ± 2%  +3.12%  (p=0.000 n=48+50)
Flate            120ms ± 3%        122ms ± 3%  +1.54%  (p=0.000 n=49+48)
GoParser         140ms ± 3%        143ms ± 4%  +2.24%  (p=0.000 n=47+49)
Reflect          333ms ± 3%        342ms ± 3%  +2.67%  (p=0.000 n=47+49)
Tar              102ms ± 6%        104ms ± 4%  +2.37%  (p=0.000 n=50+47)
XML              202ms ±13%        196ms ± 4%  -2.63%  (p=0.036 n=50+48)

name       old user-time/op  new user-time/op  delta
Template         236ms ± 9%        237ms ±10%    ~     (p=0.750 n=50+50)
Unicode          104ms ± 7%        107ms ± 4%  +2.09%  (p=0.000 n=49+47)
GoTypes          691ms ± 3%        701ms ± 3%  +1.40%  (p=0.000 n=50+49)
SSA              7.91s ± 3%        8.07s ± 4%  +1.98%  (p=0.000 n=48+49)
Flate            142ms ± 4%        145ms ± 5%  +2.12%  (p=0.000 n=47+48)
GoParser         172ms ± 6%        175ms ± 5%  +1.76%  (p=0.002 n=50+49)
Reflect          425ms ± 9%        436ms ± 8%  +2.55%  (p=0.001 n=50+50)
Tar              119ms ± 6%        121ms ± 5%  +2.52%  (p=0.000 n=49+49)
XML              242ms ± 8%        239ms ± 6%  -1.54%  (p=0.039 n=49+49)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       38.8MB ± 0%    ~     (p=0.247 n=10+10)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.631 n=10+10)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.218 n=10+10)
SSA             1.25GB ± 0%       1.25GB ± 0%  +0.02%  (p=0.000 n=10+10)
Flate           25.3MB ± 0%       25.3MB ± 0%    ~     (p=0.315 n=9+10)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.089 n=10+10)
Reflect         78.2MB ± 0%       78.2MB ± 0%  +0.07%  (p=0.019 n=10+10)
Tar             26.5MB ± 0%       26.6MB ± 0%    ~     (p=0.165 n=10+10)
XML             42.4MB ± 0%       42.4MB ± 0%    ~     (p=0.497 n=9+10)

name       old allocs/op     new allocs/op     delta
Template          378k ± 1%         379k ± 1%    ~     (p=0.353 n=10+10)
Unicode           321k ± 0%         321k ± 1%    ~     (p=0.684 n=10+10)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.247 n=10+10)
SSA              9.71M ± 0%        9.72M ± 0%  +0.08%  (p=0.000 n=10+10)
Flate             234k ± 1%         234k ± 1%    ~     (p=0.280 n=10+10)
GoParser          316k ± 0%         315k ± 1%  -0.26%  (p=0.040 n=9+9)
Reflect           980k ± 0%         981k ± 0%    ~     (p=0.052 n=10+10)
Tar               249k ± 1%         250k ± 1%    ~     (p=0.190 n=10+10)
XML               391k ± 1%         391k ± 1%    ~     (p=0.829 n=8+10)


Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)


From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)


Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.


Updates #15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
e0d27ea

@josharian josharian added a commit to josharian/go that referenced this issue Apr 20, 2017

@josharian josharian cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in #18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed #19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is #19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 1-2% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         193ms ± 4%        195ms ± 5%  +0.99%  (p=0.007 n=50+50)
Unicode         82.0ms ± 3%       84.7ms ± 4%  +3.26%  (p=0.000 n=47+48)
GoTypes          539ms ± 4%        549ms ± 3%  +1.90%  (p=0.000 n=46+46)
SSA              5.92s ± 2%        5.95s ± 3%  +0.49%  (p=0.018 n=42+48)
Flate            121ms ± 4%        122ms ± 3%    ~     (p=0.783 n=48+46)
GoParser         143ms ± 6%        144ms ± 3%  +0.90%  (p=0.001 n=49+47)
Reflect          342ms ± 4%        347ms ± 3%  +1.47%  (p=0.000 n=48+49)
Tar              104ms ± 5%        105ms ± 4%  +1.06%  (p=0.003 n=47+47)
XML              197ms ± 4%        199ms ± 5%  +0.85%  (p=0.014 n=48+49)

name       old user-time/op  new user-time/op  delta
Template         238ms ±10%        241ms ±10%    ~     (p=0.066 n=50+50)
Unicode          104ms ± 4%        106ms ± 5%  +2.23%  (p=0.000 n=46+49)
GoTypes          706ms ± 3%        714ms ± 5%  +1.15%  (p=0.002 n=45+49)
SSA              8.21s ± 3%        8.27s ± 2%  +0.78%  (p=0.002 n=43+47)
Flate            144ms ± 7%        145ms ± 5%    ~     (p=0.098 n=49+49)
GoParser         175ms ± 4%        177ms ± 4%  +1.35%  (p=0.003 n=47+50)
Reflect          435ms ± 6%        433ms ± 7%    ~     (p=0.852 n=50+50)
Tar              121ms ± 6%        122ms ± 7%    ~     (p=0.143 n=48+50)
XML              240ms ± 4%        241ms ± 4%    ~     (p=0.231 n=50+49)

name       old alloc/op      new alloc/op      delta
Template        38.7MB ± 0%       38.7MB ± 0%    ~     (p=0.690 n=5+5)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.222 n=5+5)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.310 n=5+5)
SSA             1.24GB ± 0%       1.24GB ± 0%  +0.04%  (p=0.008 n=5+5)
Flate           25.2MB ± 0%       25.2MB ± 0%    ~     (p=0.841 n=5+5)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.151 n=5+5)
Reflect         77.5MB ± 0%       77.5MB ± 0%    ~     (p=0.690 n=5+5)
Tar             26.4MB ± 0%       26.4MB ± 0%    ~     (p=0.690 n=5+5)
XML             42.0MB ± 0%       42.0MB ± 0%    ~     (p=0.690 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          378k ± 0%         377k ± 1%    ~     (p=0.841 n=5+5)
Unicode           321k ± 0%         322k ± 0%    ~     (p=0.310 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.310 n=5+5)
SSA              9.67M ± 0%        9.69M ± 0%  +0.20%  (p=0.008 n=5+5)
Flate             233k ± 0%         233k ± 1%    ~     (p=1.000 n=5+5)
GoParser          315k ± 0%         314k ± 1%    ~     (p=0.151 n=5+5)
Reflect           971k ± 0%         970k ± 0%    ~     (p=0.690 n=5+5)
Tar               249k ± 0%         249k ± 0%    ~     (p=0.310 n=5+5)
XML               391k ± 0%         390k ± 0%    ~     (p=0.841 n=5+5)

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates #15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
225ba0f

@josharian josharian added a commit to josharian/go that referenced this issue Apr 21, 2017

@josharian josharian cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

CAVEATS

I suspect it's going to end up on Twitter.
If you're coming here from the internet:

* Don't believe the hype.
* If you're going to try it out, please also try with
  the race detector enabled and report an issue
  at golang.org/issue/new if you see a race report.
  Just run 'go install -race cmd/compile' and
  'go build -a yourpackages'. And then run make.bash
  again to return a reasonable compiler.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
let in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  It requires additional sorting to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in #18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed #19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is #19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1
costs about 1-2% CPU and has almost no memory impact.

name       old time/op       new time/op       delta
Template         193ms ± 4%        195ms ± 5%  +0.99%  (p=0.007 n=50+50)
Unicode         82.0ms ± 3%       84.7ms ± 4%  +3.26%  (p=0.000 n=47+48)
GoTypes          539ms ± 4%        549ms ± 3%  +1.90%  (p=0.000 n=46+46)
SSA              5.92s ± 2%        5.95s ± 3%  +0.49%  (p=0.018 n=42+48)
Flate            121ms ± 4%        122ms ± 3%    ~     (p=0.783 n=48+46)
GoParser         143ms ± 6%        144ms ± 3%  +0.90%  (p=0.001 n=49+47)
Reflect          342ms ± 4%        347ms ± 3%  +1.47%  (p=0.000 n=48+49)
Tar              104ms ± 5%        105ms ± 4%  +1.06%  (p=0.003 n=47+47)
XML              197ms ± 4%        199ms ± 5%  +0.85%  (p=0.014 n=48+49)

name       old user-time/op  new user-time/op  delta
Template         238ms ±10%        241ms ±10%    ~     (p=0.066 n=50+50)
Unicode          104ms ± 4%        106ms ± 5%  +2.23%  (p=0.000 n=46+49)
GoTypes          706ms ± 3%        714ms ± 5%  +1.15%  (p=0.002 n=45+49)
SSA              8.21s ± 3%        8.27s ± 2%  +0.78%  (p=0.002 n=43+47)
Flate            144ms ± 7%        145ms ± 5%    ~     (p=0.098 n=49+49)
GoParser         175ms ± 4%        177ms ± 4%  +1.35%  (p=0.003 n=47+50)
Reflect          435ms ± 6%        433ms ± 7%    ~     (p=0.852 n=50+50)
Tar              121ms ± 6%        122ms ± 7%    ~     (p=0.143 n=48+50)
XML              240ms ± 4%        241ms ± 4%    ~     (p=0.231 n=50+49)

name       old alloc/op      new alloc/op      delta
Template        38.7MB ± 0%       38.7MB ± 0%    ~     (p=0.690 n=5+5)
Unicode         29.8MB ± 0%       29.8MB ± 0%    ~     (p=0.222 n=5+5)
GoTypes          113MB ± 0%        113MB ± 0%    ~     (p=0.310 n=5+5)
SSA             1.24GB ± 0%       1.24GB ± 0%  +0.04%  (p=0.008 n=5+5)
Flate           25.2MB ± 0%       25.2MB ± 0%    ~     (p=0.841 n=5+5)
GoParser        31.7MB ± 0%       31.7MB ± 0%    ~     (p=0.151 n=5+5)
Reflect         77.5MB ± 0%       77.5MB ± 0%    ~     (p=0.690 n=5+5)
Tar             26.4MB ± 0%       26.4MB ± 0%    ~     (p=0.690 n=5+5)
XML             42.0MB ± 0%       42.0MB ± 0%    ~     (p=0.690 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          378k ± 0%         377k ± 1%    ~     (p=0.841 n=5+5)
Unicode           321k ± 0%         322k ± 0%    ~     (p=0.310 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%    ~     (p=0.310 n=5+5)
SSA              9.67M ± 0%        9.69M ± 0%  +0.20%  (p=0.008 n=5+5)
Flate             233k ± 0%         233k ± 1%    ~     (p=1.000 n=5+5)
GoParser          315k ± 0%         314k ± 1%    ~     (p=0.151 n=5+5)
Reflect           971k ± 0%         970k ± 0%    ~     (p=0.690 n=5+5)
Tar               249k ± 0%         249k ± 0%    ~     (p=0.310 n=5+5)
XML               391k ± 0%         390k ± 0%    ~     (p=0.841 n=5+5)

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates #15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
b7a429d

@josharian josharian added a commit to josharian/go that referenced this issue Apr 22, 2017

@josharian josharian cmd/compile: add initial backend concurrency support
This CL adds initial support for concurrent backend compilation.

BACKGROUND

The compiler currently consists (very roughly) of the following phases:

1. Initialization.
2. Lexing and parsing into the cmd/compile/internal/syntax AST.
3. Translation into the cmd/compile/internal/gc AST.
4. Some gc AST passes: typechecking, escape analysis, inlining,
   closure handling, expression evaluation ordering (order.go),
   and some lowering and optimization (walk.go).
5. Translation into the cmd/compile/internal/ssa SSA form.
6. Optimization and lowering of SSA form.
7. Translation from SSA form to assembler instructions.
8. Translation from assembler instructions to machine code.
9. Writing lots of output: machine code, DWARF symbols,
   type and reflection info, export data.

Phase 2 was already concurrent as of Go 1.8.

Phase 3 is planned for eventual removal;
we hope to go straight from syntax AST to SSA.

Phases 5–8 are per-function; this CL adds support for
processing multiple functions concurrently.
The slowest phases in the compiler are 5 and 6,
so this offers the opportunity for some good speed-ups.

Unfortunately, it's not quite that straightforward.
In the current compiler, the latter parts of phase 4
(order, walk) are done function-at-a-time as needed.
Making order and walk concurrency-safe proved hard,
and they're not particularly slow, so there wasn't much reward.
To enable phases 5–8 to be done concurrently,
when concurrent backend compilation is requested,
we complete phase 4 for all functions
before starting later phases for any functions.

Also, in reality, we automatically generate new
functions in phase 9, such as method wrappers
and equality and has routines.
Those new functions then go through phases 4–8.
This CL disables concurrent backend compilation
after the first, big, user-provided batch of
functions has been compiled.
This is done to keep things simple,
and because the autogenerated functions
tend to be small, few, simple, and fast to compile.

USAGE

Concurrent backend compilation still defaults to off.
To set the number of functions that may be backend-compiled
concurrently, use the compiler flag -c.
In future work, cmd/go will automatically set -c.

Furthermore, this CL has been intentionally written
so that the c=1 path has no backend concurrency whatsoever,
not even spawning any goroutines.
This helps ensure that, should problems arise
late in the development cycle,
we can simply have cmd/go set -c=1 always,
and revert to the original compiler behavior.

MUTEXES

Most of the work required to make concurrent backend
compilation safe has occurred over the past month.
This CL adds a handful of mutexes to get the rest of the way there;
they are the mutexes that I didn't see a clean way to avoid.
Some of them may still be eliminable in future work.

In no particular order:

* gc.funcsymsmu. The global funcsyms slice is populated
  lazily when we need function symbols for closures.
  This occurs during gc AST to SSA translation.
  The function funcsym also does a package lookup,
  which is a source of races on types.Pkg.Syms;
  funcsymsmu also covers that package lookup.
  This mutex is low priority: it adds a single global,
  it is in an infrequently used code path, and it is low contention.
  Since funcsyms may now be added in any order,
  we must sort them to preserve reproducible builds.

* gc.largeStackFramesMu. We don't discover until after SSA compilation
  that a function's stack frame is gigantic.
  Recording that error happens basically never,
  but it does happen concurrently.
  Fix with a low priority mutex and sorting.

* obj.Link.hashmu. ctxt.hash stores the mapping from
  types.Syms (compiler symbols) to obj.LSyms (linker symbols).
  It is accessed fairly heavily through all the phases.
  This is easily the most heavily contended mutex.
  I hope that the syncmap proposed in #18177 may
  provide some free speed-ups here.
  Some lookups may also be removable.

* gc.signatlistmu. The global signatlist map is
  populated with types through several of the concurrent phases,
  including notably via ngotype during DWARF generation.
  It is low priority for removal, aside from some mild
  awkwardness to avoid deadlocks due to recursive calls.

* gc.typepkgmu. Looking up symbols in the types package
  happens a fair amount during backend compilation
  and DWARF generation, particularly via ngotype.
  This mutex helps us to avoid a broader mutex on types.Pkg.Syms.
  It has low-to-moderate contention.

* types.internedStringsmu. gc AST to SSA conversion and
  some SSA work introduce new autotmps.
  Those autotmps have their names interned to reduce allocations.
  That interning requires protecting types.internedStrings.
  The autotmp names are heavily re-used, and the mutex
  overhead and contention here are low, so it is probably
  a worthwhile performance optimization to keep this mutex.

* types.Sym.Lsymmu. Syms keep a cache of their associated LSym,
  to reduce lookups in ctxt.Hash. This cache itself
  needs concurrency protection. This mutex adds to the size
  of a moderately important data structure, but the alloc benchmarks
  below show that this doesn't hurt much in practice.
  It is moderately contended, mostly because when lookups fail,
  the lock is held while vying for the contended ctxt.Hash mutex.
  The fact that it keeps load off the ctxt.Hash mutex, though,
  makes this mutex worth keeping.
  Also, a fair number of calls to Linksym could be avoided
  by judicious addition of a local variable.

TESTING

I have been testing this code locally by running
'go install -race cmd/compile'
and then doing
'go build -a -gcflags=-c=128 std cmd'
for all architectures and a variety of compiler flags.
This obviously needs to be made part of the builders,
but it is too expensive to make part of all.bash.
I have filed #19962 for this.

REPRODUCIBLE BUILDS

This version of the compiler generates reproducible builds.
Testing reproducible builds also needs automation, however,
and is also too expensive for all.bash.
This is #19961.

Also of note is that some of the compiler flags used by 'toolstash -cmp'
are currently incompatible with concurrent backend compilation.
They still work fine with c=1.
Time will tell whether this is a problem.

NEXT STEPS

* Continue to find and fix races and bugs,
  using a combination of code inspection, fuzzing,
  and hopefully some community experimentation.
  I do not know of any outstanding races,
  but there probably are some.
* Improve testing.
* Improve performance, for many values of c.
* Integrate with cmd/go and fine tune.
* Support concurrent compilation with the -race flag.
  It is a sad irony that it does not yet work.
* Minor code cleanup that has been deferred during
  the last month due to uncertainty about the
  ultimate shape of this CL.

PERFORMANCE

Here's the buried lede, at last. :)

All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.

First, going from tip to this CL with c=1 has almost no impact.

name        old time/op       new time/op       delta
Template          195ms ± 3%        194ms ± 5%    ~     (p=0.370 n=30+29)
Unicode          86.6ms ± 3%       87.0ms ± 7%    ~     (p=0.958 n=29+30)
GoTypes           548ms ± 3%        555ms ± 4%  +1.35%  (p=0.001 n=30+28)
Compiler          2.51s ± 2%        2.54s ± 2%  +1.17%  (p=0.000 n=28+30)
SSA               5.16s ± 3%        5.16s ± 2%    ~     (p=0.910 n=30+29)
Flate             124ms ± 5%        124ms ± 4%    ~     (p=0.947 n=30+30)
GoParser          146ms ± 3%        146ms ± 3%    ~     (p=0.150 n=29+28)
Reflect           354ms ± 3%        352ms ± 4%    ~     (p=0.096 n=29+29)
Tar               107ms ± 5%        106ms ± 3%    ~     (p=0.370 n=30+29)
XML               200ms ± 4%        201ms ± 4%    ~     (p=0.313 n=29+28)
[Geo mean]        332ms             333ms       +0.10%

name        old user-time/op  new user-time/op  delta
Template          227ms ± 5%        225ms ± 5%    ~     (p=0.457 n=28+27)
Unicode           109ms ± 4%        109ms ± 5%    ~     (p=0.758 n=29+29)
GoTypes           713ms ± 4%        721ms ± 5%    ~     (p=0.051 n=30+29)
Compiler          3.36s ± 2%        3.38s ± 3%    ~     (p=0.146 n=30+30)
SSA               7.46s ± 3%        7.47s ± 3%    ~     (p=0.804 n=30+29)
Flate             146ms ± 7%        147ms ± 3%    ~     (p=0.833 n=29+27)
GoParser          179ms ± 5%        179ms ± 5%    ~     (p=0.866 n=30+30)
Reflect           431ms ± 4%        429ms ± 4%    ~     (p=0.593 n=29+30)
Tar               124ms ± 5%        123ms ± 5%    ~     (p=0.140 n=29+29)
XML               243ms ± 4%        242ms ± 7%    ~     (p=0.404 n=29+29)
[Geo mean]        415ms             415ms       +0.02%

name        old obj-bytes     new obj-bytes     delta
Template           382k ± 0%         382k ± 0%    ~     (all equal)
Unicode            203k ± 0%         203k ± 0%    ~     (all equal)
GoTypes           1.18M ± 0%        1.18M ± 0%    ~     (all equal)
Compiler          3.98M ± 0%        3.98M ± 0%    ~     (all equal)
SSA               8.28M ± 0%        8.28M ± 0%    ~     (all equal)
Flate              230k ± 0%         230k ± 0%    ~     (all equal)
GoParser           287k ± 0%         287k ± 0%    ~     (all equal)
Reflect           1.00M ± 0%        1.00M ± 0%    ~     (all equal)
Tar                190k ± 0%         190k ± 0%    ~     (all equal)
XML                416k ± 0%         416k ± 0%    ~     (all equal)
[Geo mean]         660k              660k       +0.00%

Comparing this CL to itself, from c=1 to c=2
improves real times 20-30%, costs 5-10% more CPU time,
and adds about 2% alloc.
The allocation increase comes from allocating more ssa.Caches.

name       old time/op       new time/op       delta
Template         202ms ± 3%        149ms ± 3%  -26.15%  (p=0.000 n=49+49)
Unicode         87.4ms ± 4%       84.2ms ± 3%   -3.68%  (p=0.000 n=48+48)
GoTypes          560ms ± 2%        398ms ± 2%  -28.96%  (p=0.000 n=49+49)
Compiler         2.46s ± 3%        1.76s ± 2%  -28.61%  (p=0.000 n=48+46)
SSA              6.17s ± 2%        4.04s ± 1%  -34.52%  (p=0.000 n=49+49)
Flate            126ms ± 3%         92ms ± 2%  -26.81%  (p=0.000 n=49+48)
GoParser         148ms ± 4%        107ms ± 2%  -27.78%  (p=0.000 n=49+48)
Reflect          361ms ± 3%        281ms ± 3%  -22.10%  (p=0.000 n=49+49)
Tar              109ms ± 4%         86ms ± 3%  -20.81%  (p=0.000 n=49+47)
XML              204ms ± 3%        144ms ± 2%  -29.53%  (p=0.000 n=48+45)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        246ms ± 4%     ~     (p=0.401 n=50+48)
Unicode          109ms ± 4%        111ms ± 4%   +1.47%  (p=0.000 n=44+50)
GoTypes          728ms ± 3%        765ms ± 3%   +5.04%  (p=0.000 n=46+50)
Compiler         3.33s ± 3%        3.41s ± 2%   +2.31%  (p=0.000 n=49+48)
SSA              8.52s ± 2%        9.11s ± 2%   +6.93%  (p=0.000 n=49+47)
Flate            149ms ± 4%        161ms ± 3%   +8.13%  (p=0.000 n=50+47)
GoParser         181ms ± 5%        192ms ± 2%   +6.40%  (p=0.000 n=49+46)
Reflect          452ms ± 9%        474ms ± 2%   +4.99%  (p=0.000 n=50+48)
Tar              126ms ± 6%        136ms ± 4%   +7.95%  (p=0.000 n=50+49)
XML              247ms ± 5%        264ms ± 3%   +6.94%  (p=0.000 n=48+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       39.3MB ± 0%   +1.48%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.2MB ± 0%   +1.19%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        114MB ± 0%   +0.69%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        447MB ± 0%   +0.95%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.26GB ± 0%   +0.89%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       25.9MB ± 1%   +2.35%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       32.2MB ± 0%   +1.59%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       78.9MB ± 0%   +0.91%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.0MB ± 0%   +1.80%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       43.4MB ± 0%   +2.35%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          379k ± 0%         378k ± 0%     ~     (p=0.421 n=5+5)
Unicode           322k ± 0%         321k ± 0%     ~     (p=0.222 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.548 n=5+5)
Compiler         4.12M ± 0%        4.11M ± 0%   -0.14%  (p=0.032 n=5+5)
SSA              9.72M ± 0%        9.72M ± 0%     ~     (p=0.421 n=5+5)
Flate             234k ± 1%         234k ± 0%     ~     (p=0.421 n=5+5)
GoParser          316k ± 1%         315k ± 0%     ~     (p=0.222 n=5+5)
Reflect           980k ± 0%         979k ± 0%     ~     (p=0.095 n=5+5)
Tar               249k ± 1%         249k ± 1%     ~     (p=0.841 n=5+5)
XML               392k ± 0%         391k ± 0%     ~     (p=0.095 n=5+5)

From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%:

name       old time/op       new time/op       delta
Template         203ms ± 3%        131ms ± 5%  -35.45%  (p=0.000 n=50+50)
Unicode         87.2ms ± 4%       84.1ms ± 2%   -3.61%  (p=0.000 n=48+47)
GoTypes          560ms ± 4%        310ms ± 2%  -44.65%  (p=0.000 n=50+49)
Compiler         2.47s ± 3%        1.41s ± 2%  -43.10%  (p=0.000 n=50+46)
SSA              6.17s ± 2%        3.20s ± 2%  -48.06%  (p=0.000 n=49+49)
Flate            126ms ± 4%         74ms ± 2%  -41.06%  (p=0.000 n=49+48)
GoParser         148ms ± 4%         89ms ± 3%  -39.97%  (p=0.000 n=49+50)
Reflect          360ms ± 3%        242ms ± 3%  -32.81%  (p=0.000 n=49+49)
Tar              108ms ± 4%         73ms ± 4%  -32.48%  (p=0.000 n=50+49)
XML              203ms ± 3%        119ms ± 3%  -41.56%  (p=0.000 n=49+48)

name       old user-time/op  new user-time/op  delta
Template         246ms ± 9%        287ms ± 9%  +16.98%  (p=0.000 n=50+50)
Unicode          109ms ± 4%        118ms ± 5%   +7.56%  (p=0.000 n=46+50)
GoTypes          735ms ± 4%        806ms ± 2%   +9.62%  (p=0.000 n=50+50)
Compiler         3.34s ± 4%        3.56s ± 2%   +6.78%  (p=0.000 n=49+49)
SSA              8.54s ± 3%       10.04s ± 3%  +17.55%  (p=0.000 n=50+50)
Flate            149ms ± 6%        176ms ± 3%  +17.82%  (p=0.000 n=50+48)
GoParser         181ms ± 5%        213ms ± 3%  +17.47%  (p=0.000 n=50+50)
Reflect          453ms ± 6%        499ms ± 2%  +10.11%  (p=0.000 n=50+48)
Tar              126ms ± 5%        149ms ±11%  +18.76%  (p=0.000 n=50+50)
XML              246ms ± 5%        287ms ± 4%  +16.53%  (p=0.000 n=49+50)

name       old alloc/op      new alloc/op      delta
Template        38.8MB ± 0%       40.4MB ± 0%   +4.21%  (p=0.008 n=5+5)
Unicode         29.8MB ± 0%       30.9MB ± 0%   +3.68%  (p=0.008 n=5+5)
GoTypes          113MB ± 0%        116MB ± 0%   +2.71%  (p=0.008 n=5+5)
Compiler         443MB ± 0%        455MB ± 0%   +2.75%  (p=0.008 n=5+5)
SSA             1.25GB ± 0%       1.27GB ± 0%   +1.84%  (p=0.008 n=5+5)
Flate           25.3MB ± 0%       26.9MB ± 1%   +6.31%  (p=0.008 n=5+5)
GoParser        31.7MB ± 0%       33.2MB ± 0%   +4.61%  (p=0.008 n=5+5)
Reflect         78.2MB ± 0%       80.2MB ± 0%   +2.53%  (p=0.008 n=5+5)
Tar             26.6MB ± 0%       27.9MB ± 0%   +5.19%  (p=0.008 n=5+5)
XML             42.4MB ± 0%       44.6MB ± 0%   +5.20%  (p=0.008 n=5+5)

name       old allocs/op     new allocs/op     delta
Template          380k ± 0%         379k ± 0%   -0.39%  (p=0.032 n=5+5)
Unicode           321k ± 0%         321k ± 0%     ~     (p=0.841 n=5+5)
GoTypes          1.14M ± 0%        1.14M ± 0%     ~     (p=0.421 n=5+5)
Compiler         4.12M ± 0%        4.14M ± 0%   +0.52%  (p=0.008 n=5+5)
SSA              9.72M ± 0%        9.76M ± 0%   +0.37%  (p=0.008 n=5+5)
Flate             234k ± 1%         234k ± 1%     ~     (p=0.690 n=5+5)
GoParser          316k ± 0%         317k ± 1%     ~     (p=0.841 n=5+5)
Reflect           981k ± 0%         981k ± 0%     ~     (p=1.000 n=5+5)
Tar               250k ± 0%         249k ± 1%     ~     (p=0.151 n=5+5)
XML               393k ± 0%         392k ± 0%     ~     (p=0.056 n=5+5)

Going beyond c=4 on my machine tends to increase CPU time and allocs
without impacting real time.

The CPU time numbers matter, because when there are many concurrent
compilation processes, that will impact the overall throughput.

The numbers above are in many ways the best case scenario;
we can take full advantage of all cores.
Fortunately, the most common compilation scenario is incremental
re-compilation of a single package during a build/test cycle.

Updates #15756

Change-Id: I6725558ca2069edec0ac5b0d1683105a9fff6bea
060897b

@gopherbot gopherbot pushed a commit that referenced this issue Apr 26, 2017

@bcmills @bcmills bcmills + bcmills sync: import Map from x/sync/syncmap
This is a direct port of the version from
commit a60ad46e0ed33d02e09bda439efaf9c9727dbc6c
(https://go-review.googlesource.com/c/37342/).

updates #17973
updates #18177

Change-Id: I63fa5ef6951b1edd39f84927d1181a4df9b15385
Reviewed-on: https://go-review.googlesource.com/36617
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
959025c

@gopherbot gopherbot pushed a commit that referenced this issue Apr 26, 2017

@bcmills @bcmills bcmills + bcmills encoding/xml: parallelize benchmarks
Results remain comparable with the non-parallel version with -cpu=1:
benchmark                old ns/op     new ns/op     delta
BenchmarkMarshal         31220         28618         -8.33%
BenchmarkMarshal-6       37181         7658          -79.40%
BenchmarkUnmarshal       81837         83522         +2.06%
BenchmarkUnmarshal-6     96339         18244         -81.06%

benchmark                old allocs     new allocs     delta
BenchmarkMarshal         23             23             +0.00%
BenchmarkMarshal-6       23             23             +0.00%
BenchmarkUnmarshal       189            189            +0.00%
BenchmarkUnmarshal-6     189            189            +0.00%

benchmark                old bytes     new bytes     delta
BenchmarkMarshal         5776          5776          +0.00%
BenchmarkMarshal-6       5776          5776          +0.00%
BenchmarkUnmarshal       8576          8576          +0.00%
BenchmarkUnmarshal-6     8576          8576          +0.00%

updates #18177

Change-Id: I7e7055a11d18896bd54d7d773f2ec64767cdb4c8
Reviewed-on: https://go-review.googlesource.com/36810
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
9d37d4c

@gopherbot gopherbot pushed a commit that referenced this issue Apr 26, 2017

@bcmills @bcmills bcmills + bcmills encoding/gob: parallelize Encode/Decode benchmarks
Results remain comparable with the non-parallel version with -cpu=1:

benchmark                              old ns/op     new ns/op     delta
BenchmarkEndToEndPipe                  6200          6171          -0.47%
BenchmarkEndToEndPipe-6                1073          1024          -4.57%
BenchmarkEndToEndByteBuffer            2925          2664          -8.92%
BenchmarkEndToEndByteBuffer-6          516           560           +8.53%
BenchmarkEndToEndSliceByteBuffer       231683        237450        +2.49%
BenchmarkEndToEndSliceByteBuffer-6     59080         59452         +0.63%
BenchmarkEncodeComplex128Slice         67541         66003         -2.28%
BenchmarkEncodeComplex128Slice-6       72740         11316         -84.44%
BenchmarkEncodeFloat64Slice            25769         27899         +8.27%
BenchmarkEncodeFloat64Slice-6          26655         4557          -82.90%
BenchmarkEncodeInt32Slice              18685         18845         +0.86%
BenchmarkEncodeInt32Slice-6            18389         3462          -81.17%
BenchmarkEncodeStringSlice             19089         19354         +1.39%
BenchmarkEncodeStringSlice-6           20155         3237          -83.94%
BenchmarkEncodeInterfaceSlice          659601        677129        +2.66%
BenchmarkEncodeInterfaceSlice-6        640974        251621        -60.74%
BenchmarkDecodeComplex128Slice         117130        129955        +10.95%
BenchmarkDecodeComplex128Slice-6       155447        24924         -83.97%
BenchmarkDecodeFloat64Slice            67695         68776         +1.60%
BenchmarkDecodeFloat64Slice-6          82966         15225         -81.65%
BenchmarkDecodeInt32Slice              63102         62733         -0.58%
BenchmarkDecodeInt32Slice-6            77857         13003         -83.30%
BenchmarkDecodeStringSlice             130240        129562        -0.52%
BenchmarkDecodeStringSlice-6           165500        31507         -80.96%
BenchmarkDecodeInterfaceSlice          937637        1060835       +13.14%
BenchmarkDecodeInterfaceSlice-6        973495        270613        -72.20%

updates #18177

Change-Id: Ib3579010faa70827d5cbd02a826dbbb66ca13eb7
Reviewed-on: https://go-review.googlesource.com/36722
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
3058b1f

@gopherbot gopherbot pushed a commit that referenced this issue Apr 26, 2017

@bcmills @bcmills bcmills + bcmills reflect: parallelize benchmarks
Add a benchmark for PtrTo: it's the motivation for #17973, which is
the motivation for #18177.

Results remain comparable with the non-parallel version with -cpu=1:

benchmark                             old ns/op     new ns/op     delta
BenchmarkCall                         357           360           +0.84%
BenchmarkCall-6                       90.3          90.7          +0.44%
BenchmarkCallArgCopy/size=128         319           323           +1.25%
BenchmarkCallArgCopy/size=128-6       329           82.2          -75.02%
BenchmarkCallArgCopy/size=256         354           335           -5.37%
BenchmarkCallArgCopy/size=256-6       340           85.2          -74.94%
BenchmarkCallArgCopy/size=1024        374           703           +87.97%
BenchmarkCallArgCopy/size=1024-6      378           95.8          -74.66%
BenchmarkCallArgCopy/size=4096        627           631           +0.64%
BenchmarkCallArgCopy/size=4096-6      643           120           -81.34%
BenchmarkCallArgCopy/size=65536       10502         10169         -3.17%
BenchmarkCallArgCopy/size=65536-6     10298         2240          -78.25%
BenchmarkFieldByName1                 139           132           -5.04%
BenchmarkFieldByName1-6               144           24.9          -82.71%
BenchmarkFieldByName2                 2721          2778          +2.09%
BenchmarkFieldByName2-6               3953          578           -85.38%
BenchmarkFieldByName3                 19136         18357         -4.07%
BenchmarkFieldByName3-6               23072         3850          -83.31%
BenchmarkInterfaceBig                 12.7          15.5          +22.05%
BenchmarkInterfaceBig-6               14.2          2.48          -82.54%
BenchmarkInterfaceSmall               13.1          15.1          +15.27%
BenchmarkInterfaceSmall-6             13.0          2.54          -80.46%
BenchmarkNew                          43.8          43.0          -1.83%
BenchmarkNew-6                        40.5          6.67          -83.53%

benchmark                             old MB/s     new MB/s     speedup
BenchmarkCallArgCopy/size=128         400.24       395.15       0.99x
BenchmarkCallArgCopy/size=128-6       388.74       1557.76      4.01x
BenchmarkCallArgCopy/size=256         722.44       762.44       1.06x
BenchmarkCallArgCopy/size=256-6       751.98       3003.83      3.99x
BenchmarkCallArgCopy/size=1024        2733.22      1455.50      0.53x
BenchmarkCallArgCopy/size=1024-6      2706.40      10687.53     3.95x
BenchmarkCallArgCopy/size=4096        6523.32      6488.25      0.99x
BenchmarkCallArgCopy/size=4096-6      6363.85      34003.09     5.34x
BenchmarkCallArgCopy/size=65536       6239.88      6444.46      1.03x
BenchmarkCallArgCopy/size=65536-6     6363.83      29255.26     4.60x

benchmark           old allocs     new allocs     delta
BenchmarkCall       0              0              +0.00%
BenchmarkCall-6     0              0              +0.00%

benchmark           old bytes     new bytes     delta
BenchmarkCall       0             0             +0.00%
BenchmarkCall-6     0             0             +0.00%

updates #17973
updates #18177

Change-Id: If70c5c742e8d1b138347f4963ad7cff38fffc018
Reviewed-on: https://go-review.googlesource.com/36831
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
f5f5a00

@gopherbot gopherbot pushed a commit that referenced this issue Apr 26, 2017

@bcmills @bcmills bcmills + bcmills encoding/json: parallelize most benchmarks
Don't bother with BenchmarkDecoderStream — it's doing something subtle
with the input buffer that isn't easy to replicate in a parallel test.

Results remain comparable with the non-parallel version with -cpu=1:

benchmark                          old ns/op     new ns/op     delta
BenchmarkCodeEncoder               22815832      21058729      -7.70%
BenchmarkCodeEncoder-6             22190561      3579757       -83.87%
BenchmarkCodeMarshal               25356621      25396429      +0.16%
BenchmarkCodeMarshal-6             25359813      4944908       -80.50%
BenchmarkCodeDecoder               94794556      88016360      -7.15%
BenchmarkCodeDecoder-6             93795028      16726283      -82.17%
BenchmarkDecoderStream             532           583           +9.59%
BenchmarkDecoderStream-6           598           550           -8.03%
BenchmarkCodeUnmarshal             97644168      89162504      -8.69%
BenchmarkCodeUnmarshal-6           96615302      17036419      -82.37%
BenchmarkCodeUnmarshalReuse        91747073      90298479      -1.58%
BenchmarkCodeUnmarshalReuse-6      89397165      15518005      -82.64%
BenchmarkUnmarshalString           808           843           +4.33%
BenchmarkUnmarshalString-6         912           220           -75.88%
BenchmarkUnmarshalFloat64          695           732           +5.32%
BenchmarkUnmarshalFloat64-6        710           191           -73.10%
BenchmarkUnmarshalInt64            635           640           +0.79%
BenchmarkUnmarshalInt64-6          618           185           -70.06%
BenchmarkIssue10335                916           947           +3.38%
BenchmarkIssue10335-6              879           216           -75.43%
BenchmarkNumberIsValid             34.7          34.3          -1.15%
BenchmarkNumberIsValid-6           34.9          36.7          +5.16%
BenchmarkNumberIsValidRegexp       1174          1121          -4.51%
BenchmarkNumberIsValidRegexp-6     1134          1119          -1.32%
BenchmarkSkipValue                 20506938      20708060      +0.98%
BenchmarkSkipValue-6               21627665      22375630      +3.46%
BenchmarkEncoderEncode             690           726           +5.22%
BenchmarkEncoderEncode-6           649           157           -75.81%

benchmark                    old MB/s     new MB/s     speedup
BenchmarkCodeEncoder         85.05        92.15        1.08x
BenchmarkCodeEncoder-6       87.45        542.07       6.20x
BenchmarkCodeMarshal         76.53        76.41        1.00x
BenchmarkCodeMarshal-6       76.52        392.42       5.13x
BenchmarkCodeDecoder         20.47        22.05        1.08x
BenchmarkCodeDecoder-6       20.69        116.01       5.61x
BenchmarkCodeUnmarshal       19.87        21.76        1.10x
BenchmarkCodeUnmarshal-6     20.08        113.90       5.67x
BenchmarkSkipValue           90.55        89.67        0.99x
BenchmarkSkipValue-6         90.83        87.80        0.97x

benchmark                    old allocs     new allocs     delta
BenchmarkIssue10335          4              4              +0.00%
BenchmarkIssue10335-6        4              4              +0.00%
BenchmarkEncoderEncode       1              1              +0.00%
BenchmarkEncoderEncode-6     1              1              +0.00%

benchmark                    old bytes     new bytes     delta
BenchmarkIssue10335          320           320           +0.00%
BenchmarkIssue10335-6        320           320           +0.00%
BenchmarkEncoderEncode       8             8             +0.00%
BenchmarkEncoderEncode-6     8             8             +0.00%

updates #18177

Change-Id: Ia4f5bf5ac0afbadb1705ed9f9e1b39dabba67b40
Reviewed-on: https://go-review.googlesource.com/36724
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
c5b6c2a

CL https://golang.org/cl/41871 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Apr 27, 2017

@bcmills @bcmills bcmills + bcmills reflect: use sync.Map instead of RWMutex for type caches
This provides a significant speedup when using reflection-heavy code
on many CPU cores, such as when marshaling or unmarshaling protocol
buffers.

updates #17973
updates #18177

name                       old time/op    new time/op     delta
Call                          239ns ±10%      245ns ± 7%       ~     (p=0.562 n=10+9)
Call-6                        201ns ±38%       48ns ±29%    -76.39%  (p=0.000 n=10+9)
Call-48                       133ns ± 8%       12ns ± 2%    -90.92%  (p=0.000 n=10+8)
CallArgCopy/size=128          169ns ±12%      197ns ± 2%    +16.35%  (p=0.000 n=10+7)
CallArgCopy/size=128-6        142ns ± 9%       34ns ± 7%    -76.10%  (p=0.000 n=10+9)
CallArgCopy/size=128-48       125ns ± 3%        9ns ± 7%    -93.01%  (p=0.000 n=8+8)
CallArgCopy/size=256          177ns ± 8%      197ns ± 5%    +11.24%  (p=0.000 n=10+9)
CallArgCopy/size=256-6        148ns ±11%       35ns ± 6%    -76.23%  (p=0.000 n=10+9)
CallArgCopy/size=256-48       127ns ± 4%        9ns ± 9%    -92.66%  (p=0.000 n=10+9)
CallArgCopy/size=1024         196ns ± 6%      228ns ± 7%    +16.09%  (p=0.000 n=10+9)
CallArgCopy/size=1024-6       143ns ± 6%       42ns ± 5%    -70.39%  (p=0.000 n=8+8)
CallArgCopy/size=1024-48      130ns ± 7%       10ns ± 1%    -91.99%  (p=0.000 n=10+8)
CallArgCopy/size=4096         330ns ± 9%      351ns ± 5%     +6.20%  (p=0.004 n=10+9)
CallArgCopy/size=4096-6       173ns ±14%       62ns ± 6%    -63.83%  (p=0.000 n=10+8)
CallArgCopy/size=4096-48      141ns ± 6%       15ns ± 6%    -89.59%  (p=0.000 n=10+8)
CallArgCopy/size=65536       7.71µs ±10%     7.74µs ±10%       ~     (p=0.859 n=10+9)
CallArgCopy/size=65536-6     1.33µs ± 4%     1.34µs ± 6%       ~     (p=0.720 n=10+9)
CallArgCopy/size=65536-48     347ns ± 2%      344ns ± 2%       ~     (p=0.202 n=10+9)
PtrTo                        30.2ns ±10%     41.3ns ±11%    +36.97%  (p=0.000 n=10+9)
PtrTo-6                       126ns ± 6%        7ns ±10%    -94.47%  (p=0.000 n=9+9)
PtrTo-48                     86.9ns ± 9%      1.7ns ± 9%    -98.08%  (p=0.000 n=10+9)
FieldByName1                 86.6ns ± 5%     87.3ns ± 7%       ~     (p=0.737 n=10+9)
FieldByName1-6               19.8ns ±10%     18.7ns ±10%       ~     (p=0.073 n=9+9)
FieldByName1-48              7.54ns ± 4%     7.74ns ± 5%     +2.55%  (p=0.023 n=9+9)
FieldByName2                 1.63µs ± 8%     1.70µs ± 4%     +4.13%  (p=0.020 n=9+9)
FieldByName2-6                481ns ± 6%      490ns ±10%       ~     (p=0.474 n=9+9)
FieldByName2-48               723ns ± 3%      736ns ± 2%     +1.76%  (p=0.045 n=8+8)
FieldByName3                 10.5µs ± 7%     10.8µs ± 7%       ~     (p=0.234 n=8+8)
FieldByName3-6               2.78µs ± 3%     2.94µs ±10%     +5.87%  (p=0.031 n=9+9)
FieldByName3-48              3.72µs ± 2%     3.91µs ± 5%     +4.91%  (p=0.003 n=9+9)
InterfaceBig                 10.8ns ± 5%     10.7ns ± 5%       ~     (p=0.849 n=9+9)
InterfaceBig-6               9.62ns ±81%     1.79ns ± 4%    -81.38%  (p=0.003 n=9+9)
InterfaceBig-48              0.48ns ±34%     0.50ns ± 7%       ~     (p=0.071 n=8+9)
InterfaceSmall               10.7ns ± 5%     10.9ns ± 4%       ~     (p=0.243 n=9+9)
InterfaceSmall-6             1.85ns ± 5%     1.79ns ± 1%     -2.97%  (p=0.006 n=7+8)
InterfaceSmall-48            0.49ns ±20%     0.48ns ± 5%       ~     (p=0.740 n=7+9)
New                          28.2ns ±20%     26.6ns ± 3%       ~     (p=0.617 n=9+9)
New-6                        4.69ns ± 4%     4.44ns ± 3%     -5.33%  (p=0.001 n=9+9)
New-48                       1.10ns ± 9%     1.08ns ± 6%       ~     (p=0.285 n=9+8)

name                       old alloc/op   new alloc/op    delta
Call                          0.00B           0.00B            ~     (all equal)
Call-6                        0.00B           0.00B            ~     (all equal)
Call-48                       0.00B           0.00B            ~     (all equal)

name                       old allocs/op  new allocs/op   delta
Call                           0.00            0.00            ~     (all equal)
Call-6                         0.00            0.00            ~     (all equal)
Call-48                        0.00            0.00            ~     (all equal)

name                       old speed      new speed       delta
CallArgCopy/size=128        757MB/s ±11%    649MB/s ± 1%    -14.33%  (p=0.000 n=10+7)
CallArgCopy/size=128-6      901MB/s ± 9%   3781MB/s ± 7%   +319.69%  (p=0.000 n=10+9)
CallArgCopy/size=128-48    1.02GB/s ± 2%  14.63GB/s ± 6%  +1337.98%  (p=0.000 n=8+8)
CallArgCopy/size=256       1.45GB/s ± 9%   1.30GB/s ± 5%    -10.17%  (p=0.000 n=10+9)
CallArgCopy/size=256-6     1.73GB/s ±11%   7.28GB/s ± 7%   +320.76%  (p=0.000 n=10+9)
CallArgCopy/size=256-48    2.00GB/s ± 4%  27.46GB/s ± 9%  +1270.85%  (p=0.000 n=10+9)
CallArgCopy/size=1024      5.21GB/s ± 6%   4.49GB/s ± 8%    -13.74%  (p=0.000 n=10+9)
CallArgCopy/size=1024-6    7.18GB/s ± 7%  24.17GB/s ± 5%   +236.64%  (p=0.000 n=9+8)
CallArgCopy/size=1024-48   7.87GB/s ± 7%  98.43GB/s ± 1%  +1150.99%  (p=0.000 n=10+8)
CallArgCopy/size=4096      12.3GB/s ± 6%   11.7GB/s ± 5%     -5.00%  (p=0.008 n=9+9)
CallArgCopy/size=4096-6    23.8GB/s ±16%   65.6GB/s ± 5%   +175.02%  (p=0.000 n=10+8)
CallArgCopy/size=4096-48   29.0GB/s ± 7%  279.6GB/s ± 6%   +862.87%  (p=0.000 n=10+8)
CallArgCopy/size=65536     8.52GB/s ±11%   8.49GB/s ± 9%       ~     (p=0.842 n=10+9)
CallArgCopy/size=65536-6   49.3GB/s ± 4%   49.0GB/s ± 6%       ~     (p=0.720 n=10+9)
CallArgCopy/size=65536-48   189GB/s ± 2%    190GB/s ± 2%       ~     (p=0.211 n=10+9)

https://perf.golang.org/search?q=upload:20170426.3

Change-Id: Iff68f18ef69defb7f30962e21736ac7685a48a27
Reviewed-on: https://go-review.googlesource.com/41871
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
33b92cd

CL https://golang.org/cl/41930 mentions this issue.

CL https://golang.org/cl/41931 mentions this issue.

CL https://golang.org/cl/41990 mentions this issue.

CL https://golang.org/cl/41991 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Apr 28, 2017

@bcmills @bcmills bcmills + bcmills encoding/xml: replace tinfoMap RWMutex with sync.Map
This simplifies the code a bit and provides a modest speedup for
Marshal with many CPUs.

updates #17973
updates #18177

name          old time/op    new time/op    delta
Marshal         15.8µs ± 1%    15.9µs ± 1%   +0.67%  (p=0.021 n=8+7)
Marshal-6       5.76µs ±11%    5.17µs ± 2%  -10.36%  (p=0.002 n=8+8)
Marshal-48      9.88µs ± 5%    7.31µs ± 6%  -26.04%  (p=0.000 n=8+8)
Unmarshal       44.7µs ± 3%    45.1µs ± 5%     ~     (p=0.645 n=8+8)
Unmarshal-6     12.1µs ± 7%    11.8µs ± 8%     ~     (p=0.442 n=8+8)
Unmarshal-48    18.7µs ± 3%    18.2µs ± 4%     ~     (p=0.054 n=7+8)

name          old alloc/op   new alloc/op   delta
Marshal         5.78kB ± 0%    5.78kB ± 0%     ~     (all equal)
Marshal-6       5.78kB ± 0%    5.78kB ± 0%     ~     (all equal)
Marshal-48      5.78kB ± 0%    5.78kB ± 0%     ~     (all equal)
Unmarshal       8.58kB ± 0%    8.58kB ± 0%     ~     (all equal)
Unmarshal-6     8.58kB ± 0%    8.58kB ± 0%     ~     (all equal)
Unmarshal-48    8.58kB ± 0%    8.58kB ± 0%     ~     (p=1.000 n=8+8)

name          old allocs/op  new allocs/op  delta
Marshal           23.0 ± 0%      23.0 ± 0%     ~     (all equal)
Marshal-6         23.0 ± 0%      23.0 ± 0%     ~     (all equal)
Marshal-48        23.0 ± 0%      23.0 ± 0%     ~     (all equal)
Unmarshal          189 ± 0%       189 ± 0%     ~     (all equal)
Unmarshal-6        189 ± 0%       189 ± 0%     ~     (all equal)
Unmarshal-48       189 ± 0%       189 ± 0%     ~     (all equal)

https://perf.golang.org/search?q=upload:20170427.5

Change-Id: I4ee95a99540d3e4e47e056fff18357efd2cd340a
Reviewed-on: https://go-review.googlesource.com/41991
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
eb6adc2

CL https://golang.org/cl/42110 mentions this issue.

CL https://golang.org/cl/42112 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Apr 28, 2017

@bcmills @bcmills bcmills + bcmills encoding/json: replace encoderCache RWMutex with a sync.Map
This provides a moderate speedup for encoding when using many CPU cores.

name                    old time/op    new time/op    delta
CodeEncoder               14.1ms ±10%    13.5ms ± 4%      ~     (p=0.867 n=8+7)
CodeEncoder-6             2.58ms ± 8%    2.72ms ± 6%      ~     (p=0.065 n=8+8)
CodeEncoder-48             629µs ± 1%     629µs ± 1%      ~     (p=0.867 n=8+7)
CodeMarshal               14.9ms ± 5%    14.9ms ± 5%      ~     (p=0.721 n=8+8)
CodeMarshal-6             3.28ms ±11%    3.24ms ±12%      ~     (p=0.798 n=8+8)
CodeMarshal-48             739µs ± 1%     745µs ± 2%      ~     (p=0.328 n=8+8)
CodeDecoder               49.7ms ± 4%    49.2ms ± 4%      ~     (p=0.463 n=7+8)
CodeDecoder-6             10.1ms ± 8%    10.4ms ± 3%      ~     (p=0.232 n=7+8)
CodeDecoder-48            2.60ms ± 3%    2.61ms ± 2%      ~     (p=1.000 n=8+8)
DecoderStream              352ns ± 5%     344ns ± 4%      ~     (p=0.077 n=8+8)
DecoderStream-6            485ns ± 8%     503ns ± 6%      ~     (p=0.123 n=8+8)
DecoderStream-48           522ns ± 7%     520ns ± 5%      ~     (p=0.959 n=8+8)
CodeUnmarshal             52.2ms ± 5%    54.4ms ±18%      ~     (p=0.955 n=7+8)
CodeUnmarshal-6           12.4ms ± 6%    12.3ms ± 6%      ~     (p=0.878 n=8+8)
CodeUnmarshal-48          3.46ms ± 7%    3.40ms ± 9%      ~     (p=0.442 n=8+8)
CodeUnmarshalReuse        48.9ms ± 6%    50.3ms ± 7%      ~     (p=0.279 n=8+8)
CodeUnmarshalReuse-6      10.3ms ±11%    10.3ms ±10%      ~     (p=0.959 n=8+8)
CodeUnmarshalReuse-48     2.68ms ± 3%    2.67ms ± 4%      ~     (p=0.878 n=8+8)
UnmarshalString            476ns ± 7%     474ns ± 7%      ~     (p=0.644 n=8+8)
UnmarshalString-6          164ns ± 9%     160ns ±10%      ~     (p=0.556 n=8+8)
UnmarshalString-48         181ns ± 0%     177ns ± 2%    -2.36%  (p=0.001 n=7+7)
UnmarshalFloat64           414ns ± 4%     418ns ± 4%      ~     (p=0.382 n=8+8)
UnmarshalFloat64-6         147ns ± 9%     143ns ±16%      ~     (p=0.457 n=8+8)
UnmarshalFloat64-48        176ns ± 2%     174ns ± 2%      ~     (p=0.118 n=8+8)
UnmarshalInt64             369ns ± 4%     354ns ± 1%    -3.85%  (p=0.005 n=8+7)
UnmarshalInt64-6           132ns ±11%     132ns ±10%      ~     (p=0.982 n=8+8)
UnmarshalInt64-48          177ns ± 3%     174ns ± 2%    -1.84%  (p=0.028 n=8+7)
Issue10335                 540ns ± 5%     535ns ± 0%      ~     (p=0.330 n=7+7)
Issue10335-6               159ns ± 8%     164ns ± 8%      ~     (p=0.246 n=8+8)
Issue10335-48              186ns ± 1%     182ns ± 2%    -1.89%  (p=0.010 n=8+8)
Unmapped                  1.74µs ± 2%    1.76µs ± 6%      ~     (p=0.181 n=6+8)
Unmapped-6                 414ns ± 5%     402ns ±10%      ~     (p=0.244 n=7+8)
Unmapped-48                226ns ± 2%     224ns ± 2%      ~     (p=0.144 n=7+8)
NumberIsValid             20.1ns ± 4%    19.7ns ± 3%      ~     (p=0.204 n=8+8)
NumberIsValid-6           20.4ns ± 8%    22.2ns ±16%      ~     (p=0.129 n=7+8)
NumberIsValid-48          23.1ns ±12%    23.8ns ± 8%      ~     (p=0.104 n=8+8)
NumberIsValidRegexp        629ns ± 5%     622ns ± 0%      ~     (p=0.148 n=7+7)
NumberIsValidRegexp-6      757ns ± 2%     725ns ±14%      ~     (p=0.351 n=8+7)
NumberIsValidRegexp-48     757ns ± 2%     723ns ±13%      ~     (p=0.521 n=8+8)
SkipValue                 13.2ms ± 9%    13.3ms ± 1%      ~     (p=0.130 n=8+8)
SkipValue-6               15.1ms ±10%    14.8ms ± 2%      ~     (p=0.397 n=7+8)
SkipValue-48              13.9ms ±12%    14.3ms ± 1%      ~     (p=0.694 n=8+7)
EncoderEncode              433ns ± 4%     410ns ± 3%    -5.48%  (p=0.001 n=8+8)
EncoderEncode-6            221ns ±15%      75ns ± 5%   -66.15%  (p=0.000 n=7+8)
EncoderEncode-48           161ns ± 4%      19ns ± 7%   -88.29%  (p=0.000 n=7+8)

name                    old speed      new speed      delta
CodeEncoder              139MB/s ±10%   144MB/s ± 4%      ~     (p=0.844 n=8+7)
CodeEncoder-6            756MB/s ± 8%   714MB/s ± 6%      ~     (p=0.065 n=8+8)
CodeEncoder-48          3.08GB/s ± 1%  3.09GB/s ± 1%      ~     (p=0.867 n=8+7)
CodeMarshal              130MB/s ± 5%   130MB/s ± 5%      ~     (p=0.721 n=8+8)
CodeMarshal-6            594MB/s ±10%   601MB/s ±11%      ~     (p=0.798 n=8+8)
CodeMarshal-48          2.62GB/s ± 1%  2.60GB/s ± 2%      ~     (p=0.328 n=8+8)
CodeDecoder             39.0MB/s ± 4%  39.5MB/s ± 4%      ~     (p=0.463 n=7+8)
CodeDecoder-6            189MB/s ±13%   187MB/s ± 3%      ~     (p=0.505 n=8+8)
CodeDecoder-48           746MB/s ± 2%   745MB/s ± 2%      ~     (p=1.000 n=8+8)
CodeUnmarshal           37.2MB/s ± 5%  35.9MB/s ±16%      ~     (p=0.955 n=7+8)
CodeUnmarshal-6          157MB/s ± 6%   158MB/s ± 6%      ~     (p=0.878 n=8+8)
CodeUnmarshal-48         561MB/s ± 7%   572MB/s ±10%      ~     (p=0.442 n=8+8)
SkipValue                141MB/s ±10%   139MB/s ± 1%      ~     (p=0.130 n=8+8)
SkipValue-6              131MB/s ± 3%   133MB/s ± 2%      ~     (p=0.662 n=6+8)
SkipValue-48             138MB/s ±11%   132MB/s ± 1%      ~     (p=0.281 n=8+7)

name                    old alloc/op   new alloc/op   delta
CodeEncoder               45.9kB ± 0%    45.9kB ± 0%    -0.02%  (p=0.002 n=7+8)
CodeEncoder-6             55.1kB ± 0%    55.1kB ± 0%    -0.01%  (p=0.002 n=7+8)
CodeEncoder-48             110kB ± 0%     110kB ± 0%    -0.00%  (p=0.030 n=7+8)
CodeMarshal               4.59MB ± 0%    4.59MB ± 0%    -0.00%  (p=0.000 n=8+8)
CodeMarshal-6             4.59MB ± 0%    4.59MB ± 0%    -0.00%  (p=0.000 n=8+8)
CodeMarshal-48            4.59MB ± 0%    4.59MB ± 0%    -0.00%  (p=0.001 n=7+8)
CodeDecoder               2.28MB ± 5%    2.21MB ± 0%      ~     (p=0.257 n=8+7)
CodeDecoder-6             2.43MB ±11%    2.51MB ± 0%      ~     (p=0.473 n=8+8)
CodeDecoder-48            2.93MB ± 0%    2.93MB ± 0%      ~     (p=0.554 n=7+8)
DecoderStream              16.0B ± 0%     16.0B ± 0%      ~     (all equal)
DecoderStream-6            16.0B ± 0%     16.0B ± 0%      ~     (all equal)
DecoderStream-48           16.0B ± 0%     16.0B ± 0%      ~     (all equal)
CodeUnmarshal             3.28MB ± 0%    3.28MB ± 0%      ~     (p=1.000 n=7+7)
CodeUnmarshal-6           3.28MB ± 0%    3.28MB ± 0%      ~     (p=0.593 n=8+8)
CodeUnmarshal-48          3.28MB ± 0%    3.28MB ± 0%      ~     (p=0.670 n=8+8)
CodeUnmarshalReuse        1.87MB ± 0%    1.88MB ± 1%    +0.48%  (p=0.011 n=7+8)
CodeUnmarshalReuse-6      1.90MB ± 1%    1.90MB ± 1%      ~     (p=0.589 n=8+8)
CodeUnmarshalReuse-48     1.96MB ± 0%    1.96MB ± 0%    +0.00%  (p=0.002 n=7+8)
UnmarshalString             304B ± 0%      304B ± 0%      ~     (all equal)
UnmarshalString-6           304B ± 0%      304B ± 0%      ~     (all equal)
UnmarshalString-48          304B ± 0%      304B ± 0%      ~     (all equal)
UnmarshalFloat64            292B ± 0%      292B ± 0%      ~     (all equal)
UnmarshalFloat64-6          292B ± 0%      292B ± 0%      ~     (all equal)
UnmarshalFloat64-48         292B ± 0%      292B ± 0%      ~     (all equal)
UnmarshalInt64              289B ± 0%      289B ± 0%      ~     (all equal)
UnmarshalInt64-6            289B ± 0%      289B ± 0%      ~     (all equal)
UnmarshalInt64-48           289B ± 0%      289B ± 0%      ~     (all equal)
Issue10335                  312B ± 0%      312B ± 0%      ~     (all equal)
Issue10335-6                312B ± 0%      312B ± 0%      ~     (all equal)
Issue10335-48               312B ± 0%      312B ± 0%      ~     (all equal)
Unmapped                    344B ± 0%      344B ± 0%      ~     (all equal)
Unmapped-6                  344B ± 0%      344B ± 0%      ~     (all equal)
Unmapped-48                 344B ± 0%      344B ± 0%      ~     (all equal)
NumberIsValid              0.00B          0.00B           ~     (all equal)
NumberIsValid-6            0.00B          0.00B           ~     (all equal)
NumberIsValid-48           0.00B          0.00B           ~     (all equal)
NumberIsValidRegexp        0.00B          0.00B           ~     (all equal)
NumberIsValidRegexp-6      0.00B          0.00B           ~     (all equal)
NumberIsValidRegexp-48     0.00B          0.00B           ~     (all equal)
SkipValue                  0.00B          0.00B           ~     (all equal)
SkipValue-6                0.00B          0.00B           ~     (all equal)
SkipValue-48              15.0B ±167%      0.0B           ~     (p=0.200 n=8+8)
EncoderEncode              8.00B ± 0%     0.00B       -100.00%  (p=0.000 n=8+8)
EncoderEncode-6            8.00B ± 0%     0.00B       -100.00%  (p=0.000 n=8+8)
EncoderEncode-48           8.00B ± 0%     0.00B       -100.00%  (p=0.000 n=8+8)

name                    old allocs/op  new allocs/op  delta
CodeEncoder                 1.00 ± 0%      0.00       -100.00%  (p=0.000 n=8+8)
CodeEncoder-6               1.00 ± 0%      0.00       -100.00%  (p=0.000 n=8+8)
CodeEncoder-48              1.00 ± 0%      0.00       -100.00%  (p=0.000 n=8+8)
CodeMarshal                 17.0 ± 0%      16.0 ± 0%    -5.88%  (p=0.000 n=8+8)
CodeMarshal-6               17.0 ± 0%      16.0 ± 0%    -5.88%  (p=0.000 n=8+8)
CodeMarshal-48              17.0 ± 0%      16.0 ± 0%    -5.88%  (p=0.000 n=8+8)
CodeDecoder                89.6k ± 0%     89.5k ± 0%      ~     (p=0.154 n=8+7)
CodeDecoder-6              89.8k ± 0%     89.9k ± 0%      ~     (p=0.467 n=8+8)
CodeDecoder-48             90.5k ± 0%     90.5k ± 0%      ~     (p=0.533 n=8+7)
DecoderStream               2.00 ± 0%      2.00 ± 0%      ~     (all equal)
DecoderStream-6             2.00 ± 0%      2.00 ± 0%      ~     (all equal)
DecoderStream-48            2.00 ± 0%      2.00 ± 0%      ~     (all equal)
CodeUnmarshal               105k ± 0%      105k ± 0%      ~     (all equal)
CodeUnmarshal-6             105k ± 0%      105k ± 0%      ~     (all equal)
CodeUnmarshal-48            105k ± 0%      105k ± 0%      ~     (all equal)
CodeUnmarshalReuse         89.5k ± 0%     89.6k ± 0%      ~     (p=0.246 n=7+8)
CodeUnmarshalReuse-6       89.8k ± 0%     89.8k ± 0%      ~     (p=1.000 n=8+8)
CodeUnmarshalReuse-48      90.5k ± 0%     90.5k ± 0%      ~     (all equal)
UnmarshalString             2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalString-6           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalString-48          2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalFloat64            2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalFloat64-6          2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalFloat64-48         2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalInt64              2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalInt64-6            2.00 ± 0%      2.00 ± 0%      ~     (all equal)
UnmarshalInt64-48           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
Issue10335                  3.00 ± 0%      3.00 ± 0%      ~     (all equal)
Issue10335-6                3.00 ± 0%      3.00 ± 0%      ~     (all equal)
Issue10335-48               3.00 ± 0%      3.00 ± 0%      ~     (all equal)
Unmapped                    4.00 ± 0%      4.00 ± 0%      ~     (all equal)
Unmapped-6                  4.00 ± 0%      4.00 ± 0%      ~     (all equal)
Unmapped-48                 4.00 ± 0%      4.00 ± 0%      ~     (all equal)
NumberIsValid               0.00           0.00           ~     (all equal)
NumberIsValid-6             0.00           0.00           ~     (all equal)
NumberIsValid-48            0.00           0.00           ~     (all equal)
NumberIsValidRegexp         0.00           0.00           ~     (all equal)
NumberIsValidRegexp-6       0.00           0.00           ~     (all equal)
NumberIsValidRegexp-48      0.00           0.00           ~     (all equal)
SkipValue                   0.00           0.00           ~     (all equal)
SkipValue-6                 0.00           0.00           ~     (all equal)
SkipValue-48                0.00           0.00           ~     (all equal)
EncoderEncode               1.00 ± 0%      0.00       -100.00%  (p=0.000 n=8+8)
EncoderEncode-6             1.00 ± 0%      0.00       -100.00%  (p=0.000 n=8+8)
EncoderEncode-48            1.00 ± 0%      0.00       -100.00%  (p=0.000 n=8+8)

https://perf.golang.org/search?q=upload:20170427.2

updates #17973
updates #18177

Change-Id: I5881c7a2bfad1766e6aa3444bb630883e0be467b
Reviewed-on: https://go-review.googlesource.com/41931
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
d6ce7e4

@gopherbot gopherbot pushed a commit that referenced this issue Apr 28, 2017

@bcmills @bcmills bcmills + bcmills net/rpc: use a sync.Map for serviceMap instead of RWMutex
This has no measurable impact on performance, but somewhat simplifies
the code.

updates #18177

name                  old time/op    new time/op    delta
EndToEnd                54.3µs ±10%    55.7µs ±12%    ~     (p=0.505 n=8+8)
EndToEnd-6              31.4µs ± 9%    32.7µs ± 6%    ~     (p=0.130 n=8+8)
EndToEnd-48             25.5µs ±12%    26.4µs ± 6%    ~     (p=0.195 n=8+8)
EndToEndHTTP            53.7µs ± 8%    51.2µs ±15%    ~     (p=0.463 n=7+8)
EndToEndHTTP-6          30.9µs ±18%    31.2µs ±14%    ~     (p=0.959 n=8+8)
EndToEndHTTP-48         24.9µs ±11%    25.7µs ± 6%    ~     (p=0.382 n=8+8)
EndToEndAsync           23.6µs ± 7%    24.2µs ± 6%    ~     (p=0.383 n=7+7)
EndToEndAsync-6         21.0µs ±23%    22.0µs ±20%    ~     (p=0.574 n=8+8)
EndToEndAsync-48        22.8µs ±16%    23.3µs ±13%    ~     (p=0.721 n=8+8)
EndToEndAsyncHTTP       25.8µs ± 7%    24.7µs ±14%    ~     (p=0.161 n=8+8)
EndToEndAsyncHTTP-6     22.1µs ±19%    22.6µs ±12%    ~     (p=0.645 n=8+8)
EndToEndAsyncHTTP-48    22.9µs ±13%    22.1µs ±20%    ~     (p=0.574 n=8+8)

name                  old alloc/op   new alloc/op   delta
EndToEnd                  320B ± 0%      321B ± 0%    ~     (p=1.000 n=8+8)
EndToEnd-6                320B ± 0%      321B ± 0%  +0.20%  (p=0.037 n=8+7)
EndToEnd-48               326B ± 0%      326B ± 0%    ~     (p=0.124 n=8+8)
EndToEndHTTP              320B ± 0%      320B ± 0%    ~     (all equal)
EndToEndHTTP-6            320B ± 0%      321B ± 0%    ~     (p=0.077 n=8+8)
EndToEndHTTP-48           324B ± 0%      324B ± 0%    ~     (p=1.000 n=8+8)
EndToEndAsync             227B ± 0%      227B ± 0%    ~     (p=0.154 n=8+7)
EndToEndAsync-6           226B ± 0%      226B ± 0%    ~     (all equal)
EndToEndAsync-48          230B ± 1%      229B ± 1%    ~     (p=0.072 n=8+8)
EndToEndAsyncHTTP         227B ± 0%      227B ± 0%    ~     (all equal)
EndToEndAsyncHTTP-6       226B ± 0%      226B ± 0%    ~     (p=0.400 n=8+7)
EndToEndAsyncHTTP-48      228B ± 0%      228B ± 0%    ~     (p=0.949 n=8+6)

name                  old allocs/op  new allocs/op  delta
EndToEnd                  9.00 ± 0%      9.00 ± 0%    ~     (all equal)
EndToEnd-6                9.00 ± 0%      9.00 ± 0%    ~     (all equal)
EndToEnd-48               9.00 ± 0%      9.00 ± 0%    ~     (all equal)
EndToEndHTTP              9.00 ± 0%      9.00 ± 0%    ~     (all equal)
EndToEndHTTP-6            9.00 ± 0%      9.00 ± 0%    ~     (all equal)
EndToEndHTTP-48           9.00 ± 0%      9.00 ± 0%    ~     (all equal)
EndToEndAsync             8.00 ± 0%      8.00 ± 0%    ~     (all equal)
EndToEndAsync-6           8.00 ± 0%      8.00 ± 0%    ~     (all equal)
EndToEndAsync-48          8.00 ± 0%      8.00 ± 0%    ~     (all equal)
EndToEndAsyncHTTP         8.00 ± 0%      8.00 ± 0%    ~     (all equal)
EndToEndAsyncHTTP-6       8.00 ± 0%      8.00 ± 0%    ~     (all equal)
EndToEndAsyncHTTP-48      8.00 ± 0%      8.00 ± 0%    ~     (all equal)

https://perf.golang.org/search?q=upload:20170428.2

Change-Id: I8ef7f71a7602302aa78c144327270dfce9211539
Reviewed-on: https://go-review.googlesource.com/42112
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
ce5263f

@gopherbot gopherbot pushed a commit that referenced this issue Apr 28, 2017

@bcmills @bcmills bcmills + bcmills mime: use sync.Map instead of RWMutex for type lookups
This provides a significant speedup for TypeByExtension and
ExtensionsByType when using many CPU cores.

updates #17973
updates #18177

name                                          old time/op    new time/op    delta
QEncodeWord                                      526ns ± 3%     525ns ± 3%     ~     (p=0.990 n=15+28)
QEncodeWord-6                                    945ns ± 7%     913ns ±20%     ~     (p=0.220 n=14+28)
QEncodeWord-48                                  1.02µs ± 2%    1.00µs ± 6%   -2.22%  (p=0.036 n=13+27)
QDecodeWord                                      311ns ±18%     323ns ±20%     ~     (p=0.107 n=16+28)
QDecodeWord-6                                    595ns ±12%     612ns ±11%     ~     (p=0.093 n=15+27)
QDecodeWord-48                                   592ns ± 6%     606ns ± 8%   +2.39%  (p=0.045 n=16+26)
QDecodeHeader                                    389ns ± 4%     394ns ± 8%     ~     (p=0.161 n=12+26)
QDecodeHeader-6                                  685ns ±12%     674ns ±20%     ~     (p=0.773 n=14+27)
QDecodeHeader-48                                 658ns ±13%     669ns ±14%     ~     (p=0.457 n=16+28)
TypeByExtension/.html                           77.4ns ±15%    55.5ns ±13%  -28.35%  (p=0.000 n=8+8)
TypeByExtension/.html-6                          263ns ± 9%      10ns ±21%  -96.29%  (p=0.000 n=8+8)
TypeByExtension/.html-48                         175ns ± 5%       2ns ±16%  -98.88%  (p=0.000 n=8+8)
TypeByExtension/.HTML                            113ns ± 6%      97ns ± 6%  -14.37%  (p=0.000 n=8+8)
TypeByExtension/.HTML-6                          273ns ± 7%      17ns ± 4%  -93.93%  (p=0.000 n=7+8)
TypeByExtension/.HTML-48                         175ns ± 4%       4ns ± 4%  -97.73%  (p=0.000 n=8+8)
TypeByExtension/.unused                          116ns ± 4%      90ns ± 4%  -22.89%  (p=0.001 n=7+7)
TypeByExtension/.unused-6                        262ns ± 5%      15ns ± 4%  -94.17%  (p=0.000 n=8+8)
TypeByExtension/.unused-48                       176ns ± 4%       3ns ±10%  -98.10%  (p=0.000 n=8+8)
ExtensionsByType/text/html                       630ns ± 5%     522ns ± 5%  -17.19%  (p=0.000 n=8+7)
ExtensionsByType/text/html-6                     314ns ±20%     136ns ± 6%  -56.80%  (p=0.000 n=8+8)
ExtensionsByType/text/html-48                    298ns ± 4%     104ns ± 6%  -65.06%  (p=0.000 n=8+8)
ExtensionsByType/text/html;_charset=utf-8       1.12µs ± 3%    1.05µs ± 7%   -6.19%  (p=0.004 n=8+7)
ExtensionsByType/text/html;_charset=utf-8-6      402ns ±11%     307ns ± 4%  -23.77%  (p=0.000 n=8+8)
ExtensionsByType/text/html;_charset=utf-8-48     422ns ± 3%     309ns ± 4%  -26.86%  (p=0.000 n=8+8)
ExtensionsByType/application/octet-stream        810ns ± 2%     747ns ± 5%   -7.74%  (p=0.000 n=8+8)
ExtensionsByType/application/octet-stream-6      289ns ± 9%     185ns ± 8%  -36.15%  (p=0.000 n=7+8)
ExtensionsByType/application/octet-stream-48     267ns ± 6%      94ns ± 2%  -64.91%  (p=0.000 n=8+7)

name                                          old alloc/op   new alloc/op   delta
QEncodeWord                                      48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QEncodeWord-6                                    48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QEncodeWord-48                                   48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QDecodeWord                                      48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QDecodeWord-6                                    48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QDecodeWord-48                                   48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QDecodeHeader                                    48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QDecodeHeader-6                                  48.0B ± 0%     48.0B ± 0%     ~     (all equal)
QDecodeHeader-48                                 48.0B ± 0%     48.0B ± 0%     ~     (all equal)
TypeByExtension/.html                            0.00B          0.00B          ~     (all equal)
TypeByExtension/.html-6                          0.00B          0.00B          ~     (all equal)
TypeByExtension/.html-48                         0.00B          0.00B          ~     (all equal)
TypeByExtension/.HTML                            0.00B          0.00B          ~     (all equal)
TypeByExtension/.HTML-6                          0.00B          0.00B          ~     (all equal)
TypeByExtension/.HTML-48                         0.00B          0.00B          ~     (all equal)
TypeByExtension/.unused                          0.00B          0.00B          ~     (all equal)
TypeByExtension/.unused-6                        0.00B          0.00B          ~     (all equal)
TypeByExtension/.unused-48                       0.00B          0.00B          ~     (all equal)
ExtensionsByType/text/html                        192B ± 0%      176B ± 0%   -8.33%  (p=0.000 n=8+8)
ExtensionsByType/text/html-6                      192B ± 0%      176B ± 0%   -8.33%  (p=0.000 n=8+8)
ExtensionsByType/text/html-48                     192B ± 0%      176B ± 0%   -8.33%  (p=0.000 n=8+8)
ExtensionsByType/text/html;_charset=utf-8         480B ± 0%      464B ± 0%   -3.33%  (p=0.000 n=8+8)
ExtensionsByType/text/html;_charset=utf-8-6       480B ± 0%      464B ± 0%   -3.33%  (p=0.000 n=8+8)
ExtensionsByType/text/html;_charset=utf-8-48      480B ± 0%      464B ± 0%   -3.33%  (p=0.000 n=8+8)
ExtensionsByType/application/octet-stream         160B ± 0%      160B ± 0%     ~     (all equal)
ExtensionsByType/application/octet-stream-6       160B ± 0%      160B ± 0%     ~     (all equal)
ExtensionsByType/application/octet-stream-48      160B ± 0%      160B ± 0%     ~     (all equal)

name                                          old allocs/op  new allocs/op  delta
QEncodeWord                                       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
QEncodeWord-6                                     1.00 ± 0%      1.00 ± 0%     ~     (all equal)
QEncodeWord-48                                    1.00 ± 0%      1.00 ± 0%     ~     (all equal)
QDecodeWord                                       2.00 ± 0%      2.00 ± 0%     ~     (all equal)
QDecodeWord-6                                     2.00 ± 0%      2.00 ± 0%     ~     (all equal)
QDecodeWord-48                                    2.00 ± 0%      2.00 ± 0%     ~     (all equal)
QDecodeHeader                                     2.00 ± 0%      2.00 ± 0%     ~     (all equal)
QDecodeHeader-6                                   2.00 ± 0%      2.00 ± 0%     ~     (all equal)
QDecodeHeader-48                                  2.00 ± 0%      2.00 ± 0%     ~     (all equal)
TypeByExtension/.html                             0.00           0.00          ~     (all equal)
TypeByExtension/.html-6                           0.00           0.00          ~     (all equal)
TypeByExtension/.html-48                          0.00           0.00          ~     (all equal)
TypeByExtension/.HTML                             0.00           0.00          ~     (all equal)
TypeByExtension/.HTML-6                           0.00           0.00          ~     (all equal)
TypeByExtension/.HTML-48                          0.00           0.00          ~     (all equal)
TypeByExtension/.unused                           0.00           0.00          ~     (all equal)
TypeByExtension/.unused-6                         0.00           0.00          ~     (all equal)
TypeByExtension/.unused-48                        0.00           0.00          ~     (all equal)
ExtensionsByType/text/html                        3.00 ± 0%      3.00 ± 0%     ~     (all equal)
ExtensionsByType/text/html-6                      3.00 ± 0%      3.00 ± 0%     ~     (all equal)
ExtensionsByType/text/html-48                     3.00 ± 0%      3.00 ± 0%     ~     (all equal)
ExtensionsByType/text/html;_charset=utf-8         4.00 ± 0%      4.00 ± 0%     ~     (all equal)
ExtensionsByType/text/html;_charset=utf-8-6       4.00 ± 0%      4.00 ± 0%     ~     (all equal)
ExtensionsByType/text/html;_charset=utf-8-48      4.00 ± 0%      4.00 ± 0%     ~     (all equal)
ExtensionsByType/application/octet-stream         2.00 ± 0%      2.00 ± 0%     ~     (all equal)
ExtensionsByType/application/octet-stream-6       2.00 ± 0%      2.00 ± 0%     ~     (all equal)
ExtensionsByType/application/octet-stream-48      2.00 ± 0%      2.00 ± 0%     ~     (all equal)

https://perf.golang.org/search?q=upload:20170427.4

Change-Id: I35438be087ad6eb3d5da9119b395723ea5babaf6
Reviewed-on: https://go-review.googlesource.com/41990
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
e8d7e5d

@gopherbot gopherbot pushed a commit that referenced this issue Apr 28, 2017

@bcmills @bcmills bcmills + bcmills expvar: replace RWMutex usage with sync.Map and atomics
Int and Float already used atomics.

When many goroutines on many CPUs concurrently update a StringSet or a
Map with different keys per goroutine, this change results in dramatic
steady-state speedups.

This change does add some overhead for single-CPU and ephemeral maps.
I believe that is mostly due to an increase in allocations per call
(to pack the map keys and values into interface{} values that may
escape into the heap). With better inlining and/or escape analysis,
the single-CPU penalty may decline somewhat.

There are still two RWMutexes in the package: one for the keys in the
global "vars" map, and one for the keys in individual Map variables.

Those RWMutexes could also be eliminated, but avoiding excessive
allocations when adding new keys would require care. The remaining
RWMutexes are only acquired in Do functions, which I believe are not
typically on the fast path.

updates #17973
updates #18177

name             old time/op    new time/op    delta
StringSet          65.9ns ± 8%    55.7ns ± 1%   -15.46%  (p=0.000 n=8+7)
StringSet-6         416ns ±22%     127ns ±19%   -69.37%  (p=0.000 n=8+8)
StringSet-48        309ns ± 8%      94ns ± 3%   -69.43%  (p=0.001 n=7+7)

name             old alloc/op   new alloc/op   delta
StringSet           0.00B         16.00B ± 0%     +Inf%  (p=0.000 n=8+8)
StringSet-6         0.00B         16.00B ± 0%     +Inf%  (p=0.000 n=8+8)
StringSet-48        0.00B         16.00B ± 0%     +Inf%  (p=0.000 n=8+8)

name             old allocs/op  new allocs/op  delta
StringSet            0.00           1.00 ± 0%     +Inf%  (p=0.000 n=8+8)
StringSet-6          0.00           1.00 ± 0%     +Inf%  (p=0.000 n=8+8)
StringSet-48         0.00           1.00 ± 0%     +Inf%  (p=0.000 n=8+8)

https://perf.golang.org/search?q=upload:20170427.3

name                           old time/op    new time/op    delta
IntAdd                           5.64ns ± 3%    5.58ns ± 1%      ~     (p=0.185 n=8+8)
IntAdd-6                         18.6ns ±32%    21.4ns ±21%      ~     (p=0.078 n=8+8)
IntAdd-48                        19.6ns ±13%    20.6ns ±19%      ~     (p=0.702 n=8+8)
IntSet                           5.50ns ± 1%    5.48ns ± 0%      ~     (p=0.222 n=7+8)
IntSet-6                         18.5ns ±16%    20.4ns ±30%      ~     (p=0.314 n=8+8)
IntSet-48                        19.7ns ±12%    20.4ns ±16%      ~     (p=0.522 n=8+8)
FloatAdd                         14.5ns ± 1%    14.6ns ± 2%      ~     (p=0.237 n=7+8)
FloatAdd-6                       69.9ns ±13%    68.4ns ± 7%      ~     (p=0.557 n=7+7)
FloatAdd-48                       110ns ± 9%     109ns ± 6%      ~     (p=0.667 n=8+8)
FloatSet                         7.62ns ± 3%    7.64ns ± 5%      ~     (p=0.939 n=8+8)
FloatSet-6                       20.7ns ±22%    21.0ns ±23%      ~     (p=0.959 n=8+8)
FloatSet-48                      20.4ns ±24%    20.8ns ±19%      ~     (p=0.899 n=8+8)
MapSet                           88.1ns ±15%   200.9ns ± 7%  +128.11%  (p=0.000 n=8+8)
MapSet-6                          453ns ±12%     202ns ± 8%   -55.43%  (p=0.000 n=8+8)
MapSet-48                         432ns ±12%     240ns ±15%   -44.49%  (p=0.000 n=8+8)
MapSetDifferent                   349ns ± 1%     876ns ± 2%  +151.08%  (p=0.001 n=6+7)
MapSetDifferent-6                1.74µs ±32%    0.25µs ±17%   -85.71%  (p=0.000 n=8+8)
MapSetDifferent-48               1.77µs ±10%    0.14µs ± 2%   -91.84%  (p=0.000 n=8+8)
MapSetString                     88.1ns ± 7%   205.3ns ± 5%  +132.98%  (p=0.001 n=7+7)
MapSetString-6                    438ns ±30%     205ns ± 9%   -53.15%  (p=0.000 n=8+8)
MapSetString-48                   419ns ±14%     241ns ±15%   -42.39%  (p=0.000 n=8+8)
MapAddSame                        686ns ± 9%    1010ns ± 5%   +47.41%  (p=0.000 n=8+8)
MapAddSame-6                      238ns ±10%     300ns ±11%   +26.22%  (p=0.000 n=8+8)
MapAddSame-48                     366ns ± 4%     483ns ± 3%   +32.06%  (p=0.000 n=8+8)
MapAddDifferent                  1.96µs ± 4%    3.24µs ± 6%   +65.58%  (p=0.000 n=8+8)
MapAddDifferent-6                 553ns ± 3%     948ns ± 8%   +71.43%  (p=0.000 n=7+8)
MapAddDifferent-48                548ns ± 4%    1242ns ±10%  +126.81%  (p=0.000 n=8+8)
MapAddSameSteadyState            31.5ns ± 7%    41.7ns ± 6%   +32.61%  (p=0.000 n=8+8)
MapAddSameSteadyState-6           239ns ± 7%     101ns ±30%   -57.53%  (p=0.000 n=7+8)
MapAddSameSteadyState-48          152ns ± 4%      85ns ±13%   -43.84%  (p=0.000 n=8+7)
MapAddDifferentSteadyState        151ns ± 5%     177ns ± 1%   +17.32%  (p=0.001 n=8+6)
MapAddDifferentSteadyState-6      861ns ±15%      62ns ±23%   -92.85%  (p=0.000 n=8+8)
MapAddDifferentSteadyState-48     617ns ± 2%      20ns ±14%   -96.75%  (p=0.000 n=8+8)
RealworldExpvarUsage             4.33µs ± 4%    4.48µs ± 6%      ~     (p=0.336 n=8+7)
RealworldExpvarUsage-6           2.12µs ±20%    2.28µs ±10%      ~     (p=0.228 n=8+6)
RealworldExpvarUsage-48          1.23µs ±19%    1.36µs ±16%      ~     (p=0.152 n=7+8)

name                           old alloc/op   new alloc/op   delta
IntAdd                            0.00B          0.00B           ~     (all equal)
IntAdd-6                          0.00B          0.00B           ~     (all equal)
IntAdd-48                         0.00B          0.00B           ~     (all equal)
IntSet                            0.00B          0.00B           ~     (all equal)
IntSet-6                          0.00B          0.00B           ~     (all equal)
IntSet-48                         0.00B          0.00B           ~     (all equal)
FloatAdd                          0.00B          0.00B           ~     (all equal)
FloatAdd-6                        0.00B          0.00B           ~     (all equal)
FloatAdd-48                       0.00B          0.00B           ~     (all equal)
FloatSet                          0.00B          0.00B           ~     (all equal)
FloatSet-6                        0.00B          0.00B           ~     (all equal)
FloatSet-48                       0.00B          0.00B           ~     (all equal)
MapSet                            0.00B         48.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSet-6                          0.00B         48.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSet-48                         0.00B         48.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetDifferent                   0.00B        192.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetDifferent-6                 0.00B        192.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetDifferent-48                0.00B        192.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetString                      0.00B         48.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetString-6                    0.00B         48.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetString-48                   0.00B         48.00B ± 0%     +Inf%  (p=0.000 n=8+8)
MapAddSame                         456B ± 0%      480B ± 0%    +5.26%  (p=0.000 n=8+8)
MapAddSame-6                       456B ± 0%      480B ± 0%    +5.26%  (p=0.000 n=8+8)
MapAddSame-48                      456B ± 0%      480B ± 0%    +5.26%  (p=0.000 n=8+8)
MapAddDifferent                    672B ± 0%     1088B ± 0%   +61.90%  (p=0.000 n=8+8)
MapAddDifferent-6                  672B ± 0%     1088B ± 0%   +61.90%  (p=0.000 n=8+8)
MapAddDifferent-48                 672B ± 0%     1088B ± 0%   +61.90%  (p=0.000 n=8+8)
MapAddSameSteadyState             0.00B          0.00B           ~     (all equal)
MapAddSameSteadyState-6           0.00B          0.00B           ~     (all equal)
MapAddSameSteadyState-48          0.00B          0.00B           ~     (all equal)
MapAddDifferentSteadyState        0.00B          0.00B           ~     (all equal)
MapAddDifferentSteadyState-6      0.00B          0.00B           ~     (all equal)
MapAddDifferentSteadyState-48     0.00B          0.00B           ~     (all equal)
RealworldExpvarUsage              0.00B          0.00B           ~     (all equal)
RealworldExpvarUsage-6            0.00B          0.00B           ~     (all equal)
RealworldExpvarUsage-48           0.00B          0.00B           ~     (all equal)

name                           old allocs/op  new allocs/op  delta
IntAdd                             0.00           0.00           ~     (all equal)
IntAdd-6                           0.00           0.00           ~     (all equal)
IntAdd-48                          0.00           0.00           ~     (all equal)
IntSet                             0.00           0.00           ~     (all equal)
IntSet-6                           0.00           0.00           ~     (all equal)
IntSet-48                          0.00           0.00           ~     (all equal)
FloatAdd                           0.00           0.00           ~     (all equal)
FloatAdd-6                         0.00           0.00           ~     (all equal)
FloatAdd-48                        0.00           0.00           ~     (all equal)
FloatSet                           0.00           0.00           ~     (all equal)
FloatSet-6                         0.00           0.00           ~     (all equal)
FloatSet-48                        0.00           0.00           ~     (all equal)
MapSet                             0.00           3.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSet-6                           0.00           3.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSet-48                          0.00           3.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetDifferent                    0.00          12.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetDifferent-6                  0.00          12.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetDifferent-48                 0.00          12.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetString                       0.00           3.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetString-6                     0.00           3.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapSetString-48                    0.00           3.00 ± 0%     +Inf%  (p=0.000 n=8+8)
MapAddSame                         6.00 ± 0%     11.00 ± 0%   +83.33%  (p=0.000 n=8+8)
MapAddSame-6                       6.00 ± 0%     11.00 ± 0%   +83.33%  (p=0.000 n=8+8)
MapAddSame-48                      6.00 ± 0%     11.00 ± 0%   +83.33%  (p=0.000 n=8+8)
MapAddDifferent                    14.0 ± 0%      31.0 ± 0%  +121.43%  (p=0.000 n=8+8)
MapAddDifferent-6                  14.0 ± 0%      31.0 ± 0%  +121.43%  (p=0.000 n=8+8)
MapAddDifferent-48                 14.0 ± 0%      31.0 ± 0%  +121.43%  (p=0.000 n=8+8)
MapAddSameSteadyState              0.00           0.00           ~     (all equal)
MapAddSameSteadyState-6            0.00           0.00           ~     (all equal)
MapAddSameSteadyState-48           0.00           0.00           ~     (all equal)
MapAddDifferentSteadyState         0.00           0.00           ~     (all equal)
MapAddDifferentSteadyState-6       0.00           0.00           ~     (all equal)
MapAddDifferentSteadyState-48      0.00           0.00           ~     (all equal)
RealworldExpvarUsage               0.00           0.00           ~     (all equal)
RealworldExpvarUsage-6             0.00           0.00           ~     (all equal)
RealworldExpvarUsage-48            0.00           0.00           ~     (all equal)

https://perf.golang.org/search?q=upload:20170427.1

Change-Id: I388b2e8a3cadb84fc1418af8acfc27338f799273
Reviewed-on: https://go-review.googlesource.com/41930
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
fb0fe42

CL https://golang.org/cl/42113 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Apr 29, 2017

@bcmills @bradfitz bcmills + bradfitz archive/zip: replace RWMutex with sync.Map
This change replaces the compressors and decompressors maps with
instances of sync.Map, eliminating the need for Mutex locking in
NewReader and NewWriter.

The impact for encoding large payloads is miniscule, but as the
payload size decreases, the reduction in setup costs becomes
measurable.

updates #17973
updates #18177

name                        old time/op    new time/op    delta
CompressedZipGarbage          13.6ms ± 3%    13.8ms ± 4%    ~     (p=0.275 n=14+16)
CompressedZipGarbage-6        2.81ms ±10%    2.80ms ± 9%    ~     (p=0.616 n=16+16)
CompressedZipGarbage-48        606µs ± 4%     600µs ± 3%    ~     (p=0.110 n=16+15)
Zip64Test                     88.7ms ± 5%    87.5ms ± 5%    ~     (p=0.150 n=14+14)
Zip64Test-6                   88.6ms ± 8%    94.5ms ±13%    ~     (p=0.070 n=14+16)
Zip64Test-48                   102ms ±19%     101ms ±19%    ~     (p=0.599 n=16+15)
Zip64TestSizes/4096           21.7µs ±10%    23.0µs ± 2%    ~     (p=0.076 n=14+12)
Zip64TestSizes/4096-6         7.58µs ±13%    7.49µs ±18%    ~     (p=0.752 n=16+16)
Zip64TestSizes/4096-48        19.5µs ± 8%    18.0µs ± 4%  -7.74%  (p=0.000 n=16+15)
Zip64TestSizes/1048576        1.36ms ± 9%    1.40ms ± 8%  +2.79%  (p=0.029 n=24+25)
Zip64TestSizes/1048576-6       262µs ±11%     260µs ±10%    ~     (p=0.506 n=24+24)
Zip64TestSizes/1048576-48      120µs ± 7%     116µs ± 7%  -3.05%  (p=0.006 n=24+25)
Zip64TestSizes/67108864       86.8ms ± 6%    85.1ms ± 5%    ~     (p=0.149 n=14+17)
Zip64TestSizes/67108864-6     15.9ms ± 2%    16.1ms ± 6%    ~     (p=0.279 n=14+17)
Zip64TestSizes/67108864-48    4.51ms ± 5%    4.53ms ± 4%    ~     (p=0.766 n=15+17)

name                        old alloc/op   new alloc/op   delta
CompressedZipGarbage          5.63kB ± 0%    5.63kB ± 0%    ~     (all equal)
CompressedZipGarbage-6        15.4kB ± 0%    15.4kB ± 0%    ~     (all equal)
CompressedZipGarbage-48       25.5kB ± 3%    25.6kB ± 2%    ~     (p=0.450 n=16+16)
Zip64Test                     20.0kB ± 0%    20.0kB ± 0%    ~     (p=0.060 n=16+13)
Zip64Test-6                   20.0kB ± 0%    20.0kB ± 0%    ~     (p=0.136 n=16+14)
Zip64Test-48                  20.0kB ± 0%    20.0kB ± 0%    ~     (p=1.000 n=16+16)
Zip64TestSizes/4096           20.0kB ± 0%    20.0kB ± 0%    ~     (all equal)
Zip64TestSizes/4096-6         20.0kB ± 0%    20.0kB ± 0%    ~     (all equal)
Zip64TestSizes/4096-48        20.0kB ± 0%    20.0kB ± 0%  -0.00%  (p=0.002 n=16+13)
Zip64TestSizes/1048576        20.0kB ± 0%    20.0kB ± 0%    ~     (all equal)
Zip64TestSizes/1048576-6      20.0kB ± 0%    20.0kB ± 0%    ~     (all equal)
Zip64TestSizes/1048576-48     20.1kB ± 0%    20.1kB ± 0%    ~     (p=0.775 n=24+25)
Zip64TestSizes/67108864       20.0kB ± 0%    20.0kB ± 0%    ~     (all equal)
Zip64TestSizes/67108864-6     20.0kB ± 0%    20.0kB ± 0%    ~     (p=0.272 n=16+17)
Zip64TestSizes/67108864-48    20.1kB ± 0%    20.1kB ± 0%    ~     (p=0.098 n=14+15)

name                        old allocs/op  new allocs/op  delta
CompressedZipGarbage            44.0 ± 0%      44.0 ± 0%    ~     (all equal)
CompressedZipGarbage-6          44.0 ± 0%      44.0 ± 0%    ~     (all equal)
CompressedZipGarbage-48         44.0 ± 0%      44.0 ± 0%    ~     (all equal)
Zip64Test                       53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64Test-6                     53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64Test-48                    53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/4096             53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/4096-6           53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/4096-48          53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/1048576          53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/1048576-6        53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/1048576-48       53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/67108864         53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/67108864-6       53.0 ± 0%      53.0 ± 0%    ~     (all equal)
Zip64TestSizes/67108864-48      53.0 ± 0%      53.0 ± 0%    ~     (all equal)

https://perf.golang.org/search?q=upload:20170428.4

Change-Id: Idb7bec091a210aba833066f8d083d66e27788286
Reviewed-on: https://go-review.googlesource.com/42113
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
34fd5db
Owner

bradfitz commented May 3, 2017

This happened. Closing.

CLs can continue to reference this, but no reason to keep this open.

bradfitz closed this May 3, 2017

tmm1 commented May 3, 2017 edited

Very happy to see this land in stdlib.

I wonder though if Delete() should return the deleted value from the Map? Am I missing some simpler way to atomically pull an entry out of the map to operate on it?

Member

bcmills commented May 4, 2017

There is currently no way to atomically remove and return a value from the map. There wasn't a use-case for it in the standard library, and it would make the synchronization a bit trickier. If you have a concrete use for it, though, we should by all means consider it.

(Also note that the delete operation for built-in maps doesn't return the deleted element: when possible, I was aiming for consistency with the built-in map API.)

tmm1 commented May 4, 2017

If you have a concrete use for it, though, we should by all means consider it.

In my case I need to cleanup some resources associated with the object in the map. Currently I have the following, but to make it atomic I would need to add a lock or sync.Once inside my Cleanup() method:

v, ok := streams.Load(key)
if ok {
  v.(*Stream).Cleanup()
  streams.Delete(key)
}

Something like this would be preferable:

v, ok := streams.Delete(key)
if ok {
  v.(*Stream).Cleanup()
}
Member

bcmills commented May 4, 2017

I've seen that use-case before.

On the other hand, having Delete return the value would potentially constrain future optimization of sync.Map. The current implementation is reasonably scalable, but it's also very simple. I do not pretend to claim that it is optimal.

I suspect we would want to address this use-case with a separate method. In particular, I think a CompareAndDelete would allow you to do your cleanup without a separate lock:

v, ok := streams.Load(key)
if ok && streams.CompareAndDelete(key, v) {
  v.(*Stream).Cleanup()
}

Since we're after the Go 1.9 feature freeze, let's consider that as a possible API extension for 1.10.

tmm1 commented May 4, 2017

I like the CompareAndDelete proposal. The additional method would prevent any backwards compat issues.

I'm still on go 1.8, but I imported x/sync/syncmap and it's made my application code significantly cleaner and simpler. Thanks again for this great new type.

I'm also using your lazyLoad technique above to create entries in my map, and I have to say it's quite an elegant implementation. I would echo @akrylysov's sentiment for an API like this in stdlib, as there is quite a lot of boilerplate required at the moment. Maybe a new wrapper type like sync.LazyMap?

Contributor

seh commented May 5, 2017

@bcmills, it looks like Load followed by the hypothetical CompareAndDelete still requires two lookup operations into the map. I understood @tmm1's original request to imply seeking the key's position in the map once, and taking action once there in right place.

Is the problem that you can't atomically remove an entry from that position and capture the value corresponding to the key?

Member

bcmills commented May 9, 2017 edited

it looks like Load followed by the hypothetical CompareAndDelete still requires two lookup operations into the map.

Indeed, although a Sufficiently Clever Compiler† could convert them to a single lookup as long as there are no other reads that could introduce a happens-before relationship with some other atomic store.

I understood @tmm1's original request to imply seeking the key's position in the map once, and taking action once there in right place.

The main point of sync.Map is to reduce cross-CPU contention, and it only really needs to be in the standard library at all so that we can use it to reduce contention elsewhere in the standard library. Reducing the number of atomic operations is only a secondary goal. Even with the two atomic operations, CompareAndDelete would avoid contention for @tmm1's use-case.

That said, if you're deleting from the map at all, your use-case is a bit different from the ones in the standard library and you'll likely get better performance from a different data structure, which does not necessarily need to live in the Go standard library.

That's not to say that we can't consider other possible optimizations to the API, of course. A Remove that combines the sequence of Load and CompareAndDelete into a single operation would be worth adding to sync.Map if the use-case turns out to be common. (It's probably worth filing a separate issue to track that use-case, since the immediate goal of this issue, "replace sync.RWMutex usage in the standard library", has been addressed.)


† Constructing a Sufficiently Clever Compiler for this usage is left as an exercise to the reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment