runtime: enhance map cacheline efficiency #48687

Open
simonhf opened this issue Sep 29, 2021 · 23 comments
@simonhf commented Sep 29, 2021

Looking at the Golang source code for how map is implemented [1], there seems to be a possible opportunity to enhance cacheline efficiency:

The current algorithm appears to work as follows for looking up a key in a map:

  • Hash the key [2] and find the associated bucket [3].
  • Search the bucket's .tophash array [4] for a key candidate <-- cacheline fetch 1
  • Binary-compare the key in the key array [5] <-- cacheline fetch 2
  • Grab the associated element address in the element array [6] <-- cacheline fetch 3 (when the caller uses the element address)

So there are 3 arrays in a bucket: the .tophash array, the key array, and the element (value) array. Each array has the same hard-coded bucketCnt length of 8 items [7]. A key array item is typically 16 bytes, so a typical total key array size is 8 items * 16 bytes = 128 bytes, or 2x (typically sized) 64-byte cachelines. Because the key array alone spans two cachelines, the associated element array item is usually located in a different cacheline from its key.

This can also be seen (added padding and comments to make it easier to read) from the way the key and element addresses are calculated in the source code:

k  :=  add(unsafe.Pointer(b), dataOffset+i        *uintptr(t.keysize)                      ) // <-- [8]
e  :=  add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize)) // <-- [9]
//                                                                    ^^^^^^^^^^^^^^^^^^^^^ element array
//                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ key array
//                            ^^^^^^^^^^ skip .tophash array

Possible optimization:

If the key and element arrays were interleaved (instead of kept as separate arrays), this would not guarantee that the associated key[n] and element[n] are always in the same cacheline, but often they would be :-)

As an example, the interleaved version of the above k and e assignments would look something like this:

k  :=  add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize+t.elemsize)+uintptr(0)        )
e  :=  add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize+t.elemsize)+uintptr(t.keysize))

In theory, performance would more often be better due to fewer CPU cacheline fetches, and otherwise be the same as before?

[1] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L532
[2] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L515
[3] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L517
[4] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L532
[5] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L542
[6] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L543
[7] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L66
[8] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L538
[9] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L543

@gopherbot added this to the Proposal milestone Sep 29, 2021
@ianlancetaylor (Contributor) commented Sep 29, 2021

This is not an API change so taking it out of the proposal process.

CC @randall77

@ianlancetaylor changed the title from "Proposal: Enhance Golang map cacheline efficiency" to "runtime: enhance map cacheline efficiency" Sep 29, 2021
@ianlancetaylor removed this from the Proposal milestone Sep 29, 2021
@ianlancetaylor added this to the Backlog milestone Sep 29, 2021
@ianlancetaylor (Contributor) commented Sep 29, 2021

I'm not sure you are considering the cache line advantages when the first key comparison fails but the second or third one succeeds. The current layout makes it more likely that the second and third key comparisons will be in the same cacheline.

@randall77 (Contributor) commented Sep 29, 2021

The trouble with interleaving key/value in the bucket is that it may require more alignment padding.
I experimented a while back with doing the optimal packing (e.g. map[int16]int64 would use k0 k1 k2 k3 v0 v1 v2 v3 k4 k5 k6 k7 v4 v5 v6 v7). It required a more complicated indexing scheme and ended up not helping performance-wise, especially for small maps. I'm open to changing the current scheme if someone can demonstrate a performance improvement.

@randall77 (Contributor) commented Sep 29, 2021

I'm not sure you are considering the cache line advantages when the first key comparison fails but the second or third one succeeds. The current layout makes it more likely that the second and third key comparisons will be in the same cacheline.

The hope is that only one tophash comparison matches, so you only have to look at a single key.

@simonhf (Author) commented Sep 29, 2021

cache line advantages when the first key comparison fails but the second or third one succeeds

@ianlancetaylor, you mean because the .tophash array and the key array can share the same cache line, and .tophash is not necessarily unique?

If so, I assume non-unique .tophash values occur in a tiny minority of cases. In the example experiment below, 80 million keys are hashed into the equivalent of the .tophash array with 8 items, and the number of times a 'collision' occurs (where the tophash byte is non-unique) is counted. And yes, I know this is not exactly the same as the Golang map algorithm, which does not use SHA256 to calculate the hash value (a non-cryptographic hashing algorithm might result in more collisions), and the .tophash arrays are not necessarily completely full either, like they are in the experiment.

The experimental results show that non-unique values in the .tophash array only occur 1,070,210 + 8,433 + 28 = 1,078,671 times, or for ~1.35% of all keys. However, the chance of the .tophash array and the collided keys all being in the same cacheline is even less than 1.35%, because only the first 3 of 8 possible keys can possibly be in the same cacheline as the .tophash array itself, e.g. 8 bytes for .tophash plus 3 x 16 = 48 bytes for the first 3 keys is 56 bytes... nearly the 64 bytes of the cacheline.

Whereas, with the interleaving of keys and elements, presumably most if not all (assuming good alignment to cacheline boundaries) associated keys and elements would end up in the same cacheline, but not necessarily the same cacheline as .tophash?

$ perl -e 'use Digest::SHA; $keys=10_000_000; $tophash=8; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 2); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 1 time(s) occurred 77834169 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 2 time(s) occurred  1070210 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 3 time(s) occurred     8433 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 4 time(s) occurred       28 times

The hope is that only one tophash comparison matches, so you only have to look at a single key.

@randall77, the experiment above seems to validate that ~ 98.65% of the time only one .tophash value does match :-)

Interestingly, the same experiment shows that if .tophash were extended to 16 items (and I'm not suggesting it should be) then the 1.35% turns into 2.83%.

$ perl -e 'use Digest::SHA; $keys=5_000_000; $tophash=16; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 2); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 1 time(s) occurred 75437834 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 2 time(s) occurred  2218711 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 3 time(s) occurred    40900 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 4 time(s) occurred      506 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 5 time(s) occurred        4 times

Perhaps more interesting would be to extend .tophash items from 8 bits to 16 bits? This also adds 8 bytes to the size of every map bucket. The experiment below shows that the 1.35% turns into 0.00535%, i.e. there is now very much hope that only one .tophash comparison matches :-)

$ perl -e 'use Digest::SHA; $keys=10_000_000; $tophash=8; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 4); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 1 time(s) occurred 79991440 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 2 time(s) occurred     4280 times

And just for completeness, extending .tophash items from 8 bits to 24 bits? This also adds 16 bytes to the size of every map bucket. The experiment below shows that the 1.35% turns into 0.00002625%. Doesn't seem to be worth it for the advantage over 16 bits?

$ perl -e 'use Digest::SHA; $keys=10_000_000; $tophash=8; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 6); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 1 time(s) occurred 79999958 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 2 time(s) occurred       21 times

I experimented a while back with doing the optimal packing (e.g. map[int16]int64 would use k0 k1 k2 k3 v0 v1 v2 v3 k4 k5 k6 k7 v4 v5 v6 v7). It required a more complicated indexing scheme and ended up not helping performance-wise, especially for small maps.

Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ?

Or, maybe opportunistically use only the simple interleaving (i.e. k0 v0 k1 v1 ...; the k0 k1 k2 k3 v0 v1 v2 v3 scheme above being the complicated interleaving) for map schemes which don't require special packing, which would still be a large share of all maps, or?

I'm open to changing the current scheme if someone can demonstrate a performance improvement.

Hmmm... I might have a crack myself at demonstrating an improvement. Have never compiled Golang from scratch (in order to change map.go) but there's a first time for everything :-)

[1] https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

@randall77 (Contributor) commented Sep 29, 2021

Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ?

No, I didn't. It is kind of a moot point because there are architectures we run on that do not allow misaligned accesses at all. And we do not want to have different implementations on different architectures, if we can avoid it.

Or, maybe opportunistically only use the simple interleaving (i.e. where k0 k1 k2 k3 v0 v1 v2 v3 is complicated interleaving) for map schemes which don't require special packing, which would still be a larger number of all maps, or?

map[T]bool is a relatively common map type.

We could simply interleave only for types which require no padding, but then we need to somehow fork the code based on what packing would happen. Again, not ideal code size and maintenance-wise.

We could interleave always, with padding if needed, which would waste space but allow a single implementation. Maybe the space wasted wouldn't be too bad (map[int64]bool is probably the worst common case, which is a 63% expansion).

@randall77 (Contributor) commented Sep 29, 2021

Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ?

Also the garbage collector requires aligned pointers.

@simonhf (Author) commented Oct 2, 2021

I'm open to changing the current scheme if someone can demonstrate a performance improvement.

I'm going to have a go at such a demonstration this weekend :-)

As a pre-step, today, I built Golang from scratch for the first time... yay! :-)

Quick question though: If I modify runtime/map.go and rebuild Golang [1] then the fastest rebuild takes ~ 276 seconds on my recent MacBook Pro. Is there a way to make it do a minimal / incremental build faster, or is this just the status quo?

$ time ./make.bash
Building Go cmd/dist using /usr/local/Cellar/go/1.16.6/libexec. (go1.16.6 darwin/amd64)
Building Go toolchain1 using /usr/local/Cellar/go/1.16.6/libexec.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for darwin/amd64.
---
Installed Go for darwin/amd64 in /Users/simonhf/20211001-golang/goroot
Installed commands in /Users/simonhf/20211001-golang/goroot/bin
./make.bash  276.81s user 51.73s system 369% cpu 1:28.96 total

[1] https://github.com/cloudflare/go/wiki/Starting-out

@josharian (Contributor) commented Oct 2, 2021

If you modify only the runtime (not the rest of the toolchain), you don't need to run make.bash again. You can do go test runtime or go test strings or whatever work you want. The toolchain can see that package runtime's sources have changed, and that package runtime is a dependency of your tests, and thus recompile whatever is necessary.

@josharian (Contributor) commented Oct 2, 2021

And if you do modify the compiler, you can run go install cmd/compile to install the modified compiler. However, if you've introduced a bug, it'll be present the next time you do go install cmd/compile, so when modifying the toolchain it's a good idea to either rebuild from scratch occasionally or hide all your work behind a feature flag.

Also, that build sounds really slow. Are you running all.bash or make.bash? (EDIT: I see you are running make.bash.) The former runs all the tests, but the latter is all you need to be able to use your installation. If you're hacking on maps I'd guess that you'll find most bugs pretty quickly, without the full test suite.

@andig (Contributor) commented Oct 2, 2021

gotip download, which does a full build, imho takes 150s on a Mac M1. It might help to start with 1.17 as the base Go, since it has the register-based ABI enabled.

@josharian (Contributor) commented Oct 2, 2021

@andig I see darwin/amd64 in the snippet above. The M1 is really fast. :) Using 1.17 as a bootstrap toolchain is a fine idea, but probably unlikely to speed things up more than 5-10%. Fortunately, a full rebuild isn't necessary to play with the runtime in isolation.

@simonhf (Author) commented Oct 2, 2021

Thanks for the tips about iterating faster. go test runtime does work out a little faster; 214 versus 276 seconds:

$ time go test runtime
ok      runtime 132.649s
go test runtime  214.00s user 34.22s system 177% cpu 2:19.89 total

But if I only run my new test, and map.go is updated and needs to be rebuilt, then everything happens in under 10 seconds :-)

$ time go test -v runtime/map_bench_test.go
=== RUN   TestMapPerformance
...
go test -v runtime/map_bench_test.go  8.62s user 1.55s system 275% cpu 3.698 total

And @andig: The M1 is fast :-)

@ALTree (Member) commented Oct 2, 2021

The M1 is really fast.

Is it? I get this on my 2014 laptop with a Haswell Mobile processor and a spinning hard disk:

$ time ./make.bash
Building Go cmd/dist using /usr/local/go. (go1.17 linux/amd64)
Building Go toolchain1 using /usr/local/go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /home/alberto/go
Installed commands in /home/alberto/go/bin

real	2m29.378s
user	7m38.069s
sys	0m28.770s

$ lscpu | grep "Model name"
Model name:                      Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz

Are you sure you don't have a nosy antivirus meddling with make.bash or something like that?

@simonhf (Author) commented Oct 2, 2021

@ALTree admittedly the laptop is full of corporate security software, plus the standard Mac software doing stuff in the background, all chewing CPU :-) I guess the M1 would still have the standard Mac background software doing its thing too. But the Mac WindowServer daemon is the worst and constantly chews ~ 16% CPU :-( I have done some research: this daemon uses more CPU depending upon settings and how many external peripherals you have attached, like monitors (I'm using a 4k external monitor) and keyboards & mice (I had to turn down the reporting Hz on mine to stop WindowServer using tons of CPU every time I moved the mouse!). So it might be gobbling more CPU on the M1 too :-)

Maybe a 'spinning hard disk' is not such a big issue if there's plenty of RAM?

@ALTree (Member) commented Oct 2, 2021

Yeah, then it's probably OS related. In my experience make.bash is quite sensitive to the presence of other cpu- or HardDisk- hungry processes. My 2m30s measurement was on an idle machine, and my Linux machine is very very quiet when idle. You should be able to run a ~2 minutes make.bash on your hardware, but not if you have a bunch of other processes interfering with the build. I guess your best option is to just run tests and/or go install the compiler or runtime as Josh suggested, and only run make.bash once in a while.

@simonhf (Author) commented Oct 4, 2021

As a side note, I switched to a GCP Linux box to get more accurate performance test runs due to little or no background processes running on Linux versus MacOS. Linux also ran make.bash much faster :-)

$ time ./make.bash
Building Go cmd/dist using /usr/lib/go-1.15. (go1.15.9 linux/amd64)
Building Go toolchain1 using /usr/lib/go-1.15.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /home/simon_hardy_francis_dapperlabs_c/20211002-golang/goroot
Installed commands in /home/simon_hardy_francis_dapperlabs_c/20211002-golang/goroot/bin

real    0m58.456s

@josharian (Contributor) commented Oct 4, 2021

This might be caused by #48496. I'm poking at a simple fix for that now.

@simonhf (Author) commented Oct 4, 2021

I'm open to changing the current scheme if someone can demonstrate a performance improvement.

@randall77 below are the results of my weekend performance experiments and how I conducted them :-) Looks like a performance improvement is demonstrated but only for larger map keys.

Unfortunately, in this experiment the interleaving slightly decreases performance for smaller map keys.

So future work might be to repeat the experiment after introducing dynamic interleaving of map keys and values depending upon the map key size. If that can be achieved, this experiment hints at faster access times for maps with bigger keys: slightly faster map read times, and ~ +15% faster map write times.

Build Golang from scratch on MacOS and hack in experimental changes

  • With a recent Golang binary already installed, follow instructions here [1] & here [2] to build a new Golang from scratch.
  • Surf to [3] and fork repo privately to get [4].
$ git clone https://github.com/simonhf/go ; cd go/src
$ git checkout -b map-cacheline-perf-experiments
$ git log -1 | head -5
commit cc5e3de593afca73cf1b4d732ddceffb2837b390 (HEAD -> map-cacheline-perf-experiments, origin/master, origin/HEAD, master)
Author: Filippo Valsorda <filippo@golang.org>
Date:   Sat May 8 01:07:30 2021 -0400

    crypto/tls: use cryptobyte.NewFixedBuilder
$ time ./all.bash
...
ALL TESTS PASSED
---
Installed Go for darwin/amd64 in /Users/simonhardy-francis/work/20211004-golang/go
Installed commands in /Users/simonhardy-francis/work/20211004-golang/go/bin
./all.bash  1770.38s user 970.98s system 354% cpu 12:52.57 total

$ go version
go version go1.16.6 darwin/amd64

$ export PATH=/Users/simonhardy-francis/work/20211004-golang/go/bin:$PATH 

$ go version
go version devel go1.18-cc5e3de593 Mon Oct 4 17:17:11 2021 +0000 darwin/amd64

[1] https://golang.org/doc/install/source
[2] https://github-wiki-see.page/m/cloudflare/go/wiki/Starting-out
[3] https://github.com/golang/go
[4] https://github.com/simonhf/go

Hack runtime/map*.go

  • Strategy: For quicker dev iteration, create runtime/map_bench_*_test.go files, which re-compile changes to runtime/map*.go faster than running make.bash: < 10 seconds versus ~ 58 seconds.
  • Strategy: Make the fewest changes to prove interleaving bucket keys and values more efficiently uses CPU cachelines.
  • Strategy: Create perf test for map with a shorter key (e.g. int64; 8 bytes) and longer key (e.g. string; 16 byte string ref).
  • Strategy: Use a lot of keys and values to ensure the map is much bigger than the CPU caches. Using a smaller map will mess up the results because all cachelines will be cached.
  • Strategy: A map bucket currently uses 3 arrays -- 8 byte tophash[8], followed by 8 keys, followed by 8 values -- and the idea is to modify the code to change this to 2 arrays: 8 byte tophash[8] followed by 8 key/value pairs. If a key is an int64 then the 2nd array in the existing production code will be 8 items x 8 bytes = 64 bytes, and if a key is a string then the 2nd array in the existing production code will be 8 items x 16 ref bytes = 128 bytes.
  • Strategy: Initially create two perf test files -- map_bench_m_64_64_test.go and map_bench_m_str_64_test.go -- and consider merging them later, or throwing them away.
  • Strategy: Add some conditional debugging code which can be enabled and disabled via bools runtime.MapMakeDebug and runtime.MapIterDebug.
  • Strategy: For fairer results and easier testing, use the same map function and test function to test unmodified and modified algorithm, e.g. the test program manipulates bool runtime.MapCacheFriendly dynamically (this mechanism would be removed if the code ever made it to production) to select the algorithm at runtime:
func mapaccess1_fast64(t *maptype, h *hmap, key uint64) unsafe.Pointer {
...
	for ; b != nil; b = b.overflow(t) {
		if MapCacheFriendly {
			...
		} else {
			...
		}
...
}
  • Push the changes:
$ git add runtime/map_bench_m_64_64_test.go runtime/map_bench_m_str_64_test.go
$ git add runtime/map.go runtime/map_fast64.go runtime/map_faststr.go
$ git commit -m "Performance experiment interleaving map bucket keys and values"
$ git push origin map-cacheline-perf-experiments

Build Golang from scratch on Linux

  • Why build and run the performance tests on Linux? Because there are little to no background processes running, giving more accurate results.
$ # build master branch first because map_bench_*_test.go files are not yet real test files
$ git clone https://github.com/simonhf/go ; cd go/src
$ time ./all.bash
...
ALL TESTS PASSED
---
Installed Go for linux/amd64 in /home/simonhf/20211004-golang/go
Installed commands in /home/simonhf/20211004-golang/go/bin
*** You need to add /home/simonhf/20211004-golang/go/bin to your PATH.
real    4m32.349s

$ go version
go version go1.15.9 linux/amd64

$ export PATH=/home/simonhf/20211004-golang/go/bin:$PATH 

$ go version
go version devel go1.18-cc5e3de593 Mon Oct 4 17:17:11 2021 +0000 linux/amd64

$ git checkout map-cacheline-perf-experiments


Running tests on Linux

Test map[int64]int64

$ go clean -testcache ; time go test -v runtime/map_bench_m_64_64_test.go | tee /tmp/map_bench_m_64_64_test.go.txt
=== RUN   TestMapPerformance
- put 50000000 map keys in  4.564 seconds or   10954753 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.147 seconds or   15888315 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.826 seconds or   10360881 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.281 seconds or   15238174 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.603 seconds or   10861454 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.191 seconds or   15667847 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.436 seconds or   11270405 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.310 seconds or   15106715 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.214 seconds or   11865911 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.193 seconds or   15659193 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.437 seconds or   11269932 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.307 seconds or   15121559 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.198 seconds or   11909539 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.190 seconds or   15674869 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.416 seconds or   11321484 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.276 seconds or   15260414 keys per second; ir=1250000025000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (64.22s)
PASS
ok      command-line-arguments  64.234s

real    1m4.393s
$ cat /tmp/map_bench_m_64_64_test.go.txt | perl -lane 'if(m~(got|put).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) keys per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true to false\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
- put 50000000 map keys in  4.564 seconds or   10954753 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.603 seconds or   10861454 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.214 seconds or   11865911 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.198 seconds or   11909539 keys per second; ir=1250000025000000 mapCacheFriendly=false
- 45591657 total keys per second
- put 50000000 map keys in  4.826 seconds or   10360881 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.436 seconds or   11270405 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.437 seconds or   11269932 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.416 seconds or   11321484 keys per second; ir=1250000025000000 mapCacheFriendly=true
- 44222702 total keys per second
- -3.0% diff; true to false
- got 50000000 map vals in  3.147 seconds or   15888315 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.191 seconds or   15667847 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.193 seconds or   15659193 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.190 seconds or   15674869 keys per second; ir=1250000025000000 mapCacheFriendly=false
- 62890224 total keys per second
- got 50000000 map vals in  3.281 seconds or   15238174 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.310 seconds or   15106715 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.307 seconds or   15121559 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.276 seconds or   15260414 keys per second; ir=1250000025000000 mapCacheFriendly=true
- 60726862 total keys per second
- -3.4% diff; true to false

Test map[string]int64

$ go clean -testcache ; time go test -v runtime/map_bench_m_str_64_test.go | tee /tmp/map_bench_m_str_64_test.go.txt
=== RUN   TestMapPerformance
- put 10000000 map keys in  3.611 seconds or    2769477 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  3.113 seconds or    3212054 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.118 seconds or    3207083 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.965 seconds or    3373103 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.418 seconds or    2925681 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.951 seconds or    3388747 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.017 seconds or    3314097 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.951 seconds or    3388157 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.240 seconds or    3086267 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.943 seconds or    3397923 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.007 seconds or    3325893 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.959 seconds or    3379344 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.737 seconds or    2675626 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.950 seconds or    3389673 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  2.991 seconds or    3342883 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.971 seconds or    3365750 keys per second; ir=50000005000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (51.32s)
PASS
ok      command-line-arguments  51.334s

real    0m51.508s
$ cat /tmp/map_bench_m_str_64_test.go.txt | perl -lane 'if(m~(got|put).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) keys per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true to false\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
- put 10000000 map keys in  3.611 seconds or    2769477 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.418 seconds or    2925681 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.240 seconds or    3086267 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.737 seconds or    2675626 keys per second; ir=50000005000000 mapCacheFriendly=false
- 11457051 total keys per second
- put 10000000 map keys in  3.118 seconds or    3207083 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.017 seconds or    3314097 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.007 seconds or    3325893 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  2.991 seconds or    3342883 keys per second; ir=50000005000000 mapCacheFriendly=true
- 13189956 total keys per second
- 15.1% diff; true to false
- got 10000000 map vals in  3.113 seconds or    3212054 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.951 seconds or    3388747 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.943 seconds or    3397923 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.950 seconds or    3389673 keys per second; ir=50000005000000 mapCacheFriendly=false
- 13388397 total keys per second
- got 10000000 map vals in  2.965 seconds or    3373103 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.951 seconds or    3388157 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.959 seconds or    3379344 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.971 seconds or    3365750 keys per second; ir=50000005000000 mapCacheFriendly=true
- 13506354 total keys per second
- 0.9% diff; true to false

Performance observations / comments

| new algorithm    | writes faster | reads faster |
|------------------|---------------|--------------|
| map[int64]int64  |         -3.0% |        -3.4% |
| map[string]int64 |        +15.1% |        +0.9% |
  • Looks like when the key is small enough (e.g. 8 bytes with map[int64]int64) the map write and read times get worse if the key and values are interleaved.
  • However, when the key is big enough (e.g. 16 bytes with map[string]int64) the map write and read times get better if the keys and values are interleaved; write times by +15.1% and read times by +0.9%.
  • Future work: Performance test more combinations of key and value sizes, and on more CPU types.
  • Future work: If the map code could be changed to dynamically (and efficiently) interleave the keys and values -- i.e. do not interleave for e.g. map[int64]int64, but interleave for e.g. map[string]int64 -- then map performance could enjoy the best of both worlds? :-)
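To illustrate why key size matters, here is a hedged sketch (not runtime code; `dataOffset` and the bucket constants are simplified stand-ins for the real `bmap` layout) computing which cacheline slot `i`'s key and element land on under the current separated layout versus an interleaved one:

```go
package main

import "fmt"

const (
	bucketCnt     = 8  // keys/elems per bucket, as in runtime/map.go
	dataOffset    = 8  // the tophash array precedes the data (simplified to 8 bytes)
	cacheLineSize = 64 // typical cacheline size
)

// separatedOffsets models the current layout: tophash[8], keys[8], elems[8].
func separatedOffsets(i, keySize, elemSize uintptr) (k, e uintptr) {
	k = dataOffset + i*keySize
	e = dataOffset + bucketCnt*keySize + i*elemSize
	return
}

// interleavedOffsets models the proposed layout: tophash[8], then key/elem pairs.
func interleavedOffsets(i, keySize, elemSize uintptr) (k, e uintptr) {
	pair := keySize + elemSize
	k = dataOffset + i*pair
	e = k + keySize
	return
}

func main() {
	// map[string]int64: 16-byte key, 8-byte elem
	for i := uintptr(0); i < bucketCnt; i++ {
		ks, es := separatedOffsets(i, 16, 8)
		ki, ei := interleavedOffsets(i, 16, 8)
		fmt.Printf("i=%d separated: key line %d, elem line %d | interleaved: key line %d, elem line %d\n",
			i, ks/cacheLineSize, es/cacheLineSize, ki/cacheLineSize, ei/cacheLineSize)
	}
}
```

With 16-byte keys the separated layout puts every element at least two cachelines away from its key, while the interleaved layout keeps each key/elem pair adjacent, which matches the map[string]int64 numbers above.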


@randall77 randall77 commented Oct 5, 2021

Future work: If the map code could be changed to dynamically (and efficiently) interleave the keys and values -- i.e. do not interleave for e.g. map[int64]int64, but interleave for e.g. map[string]int64 -- then map performance could enjoy the best of both worlds? :-)

That's the tricky part, because then there's a cost of dynamically choosing which interleaving you're going to use.

My experiment a while back added some fields to runtime.maptype.

    keyOffset [8]uint16
    elemOffset [8]uint16

So you could find a key given its index i by just doing b + t.keyOffset[i]. Then the compiler could choose whatever interleaving it wanted and put the offsets in the above arrays.
Didn't help though. I think the extra dependent load on every access was the problem.


@simonhf simonhf commented Oct 5, 2021

@randall77 thanks for the comment and info. That's interesting! Yes, I guess even if *Offset[] is in an already fetched cacheline, there is still a non-trivial cost for reading and writing it.

But I'm wondering if I can implement an opportunistic conditional interleave (without an extra data structure) purely based on the size of the key... which in theory wouldn't change the memory footprint of the data structures? I was thinking generic code like this:

Create a few simple variables dependent upon having a suitably sized t.keysize and t.elemsize:

var maybeSkipKeys uintptr
var maybeSkipKey uintptr
var maybeKeySize uintptr
var maybeElemSize uintptr
if t.SuitsInterleave() {
	maybeSkipKeys = 0
	maybeSkipKey = uintptr(t.keysize)
	maybeKeySize = uintptr(t.keysize) + uintptr(t.elemsize)
	maybeElemSize = uintptr(t.keysize) + uintptr(t.elemsize)
} else {
	maybeSkipKeys = bucketCnt * uintptr(t.keysize)
	maybeSkipKey = 0
	maybeKeySize = uintptr(t.keysize)
	maybeElemSize = uintptr(t.elemsize)
}

Then later, a single branch-free expression (very similar to the current code) calculates the k and e addresses irrespective of whether the keys and values are interleaved:

k = add(unsafe.Pointer(b), dataOffset+uintptr(offi)*maybeKeySize)
e = add(unsafe.Pointer(b), dataOffset+maybeSkipKeys+uintptr(offi)*maybeElemSize+maybeSkipKey)
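As a sanity check on the arithmetic, here is a hedged, self-contained sketch (with literal key/elem sizes standing in for `t.keysize`/`t.elemsize`, and a simplified `dataOffset`) showing that the single k/e expressions reduce to the correct offsets in both modes:

```go
package main

import "fmt"

const (
	bucketCnt  = 8
	dataOffset = 8 // placeholder; the real value depends on bmap's tophash array
)

// offsets returns the key and elem byte offsets for slot i using the
// branch-free scheme above: the maybe* variables fold the layout choice
// into the multiplication and addition constants.
func offsets(i, keySize, elemSize uintptr, interleave bool) (k, e uintptr) {
	var maybeSkipKeys, maybeSkipKey, maybeKeySize, maybeElemSize uintptr
	if interleave {
		maybeSkipKeys = 0
		maybeSkipKey = keySize
		maybeKeySize = keySize + elemSize
		maybeElemSize = keySize + elemSize
	} else {
		maybeSkipKeys = bucketCnt * keySize
		maybeSkipKey = 0
		maybeKeySize = keySize
		maybeElemSize = elemSize
	}
	k = dataOffset + i*maybeKeySize
	e = dataOffset + maybeSkipKeys + i*maybeElemSize + maybeSkipKey
	return
}

func main() {
	// 16-byte key, 8-byte elem, as in map[string]int64
	for i := uintptr(0); i < bucketCnt; i++ {
		k0, e0 := offsets(i, 16, 8, false)
		k1, e1 := offsets(i, 16, 8, true)
		fmt.Printf("i=%d separated k=%d e=%d | interleaved k=%d e=%d\n", i, k0, e0, k1, e1)
	}
}
```

In the interleaved case each element lands exactly `keySize` bytes after its key, so the pair shares a cacheline whenever it fits in one.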

In theory this mechanism would work for the maps I performance tested before: map[int64]int64 would end up not interleaved and see no decrease in performance, while map[string]int64 would end up interleaved and get the performance benefits. The generic iterator functions would use the same t.SuitsInterleave() to determine the k and e addresses at runtime, and so would be compatible with both layouts. What do you think? Do you foresee any issues before I try this out? :-)


@randall77 randall77 commented Oct 5, 2021

@simonhf seems like a reasonable thing to try. Thanks.


@josharian josharian commented Oct 5, 2021

Using 1.17 as a bootstrap toolchain is a fine idea, but probably unlikely to speed things up more than 5-10%.

Follow-up: I was wrong. Using 1.17.1 as a bootstrap toolchain on an M1 is much faster.
