Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: enhance map cacheline efficiency #48687

Open
simonhf opened this issue Sep 29, 2021 · 26 comments
Open

runtime: enhance map cacheline efficiency #48687

simonhf opened this issue Sep 29, 2021 · 26 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@simonhf
Copy link

simonhf commented Sep 29, 2021

Looking at the Golang source code for how map is implemented [1], there seems to be a possible opportunity to enhance cacheline efficiency:

The current algorithm appears to work as follows for looking up a key in a map:

  • Hash the key [2] and find the associated bucket [3].
  • Search along bucket .tophash array [4] for a key candidate <-- cacheline fetch 1
  • Binary compare key in key array [5] <-- cacheline fetch 2
  • Grab associated element address in element array [6] <-- cacheline fetch 3 (when caller uses element address)

So there are 3 arrays in a bucket; .tophash array, key array, and element (value) array. Each array has the same hard-coded bucketCnt length of 8 elements [7]. A key array item size is typically 16 bytes. This means a typical total key array size is 8 items * 16 byte size = 128 bytes long, or 2x (typically sized) 64 byte cachelines. The total key array size means the associated element array item is usually located in a different cacheline.

This can also be seen (added padding and comments to make it easier to read) from the way the key and element addresses are calculated in the source code:

k  :=  add(unsafe.Pointer(b), dataOffset+i        *uintptr(t.keysize)                      ) // <-- [8]
e  :=  add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize)) // <-- [9]
//                                                                    ^^^^^^^^^^^^^^^^^^^^^ element array
//                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ key array
//                            ^^^^^^^^^^ skip .tophash array

Possible optimization:

If the key and element arrays would be interleaved (instead of separate arrays) then this would not guarantee that the associated key[n] and element[n] are always in the same cacheline, but often they would be :-)

As an example, the interleaved version of the above k and e assignments would look something like this:

k  :=  add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize+t.elemsize)+uintptr(0)        )
e  :=  add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize+t.elemsize)+uintptr(t.keysize))

In theory, performance would more often be better due to less CPU cacheline fetches, and otherwise the same performance as before?

[1] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L532
[2] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L515
[3] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L517
[4] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L532
[5] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L542
[6] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L543
[7] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L66
[8] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L538
[9] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L543

@gopherbot gopherbot added this to the Proposal milestone Sep 29, 2021
@ianlancetaylor
Copy link
Contributor

This is not an API change so taking it out of the proposal process.

CC @randall77

@ianlancetaylor ianlancetaylor changed the title Proposal: Enhance Golang map cacheline efficiency runtime: enhance map cacheline efficiency Sep 29, 2021
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Sep 29, 2021
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog Sep 29, 2021
@ianlancetaylor
Copy link
Contributor

I'm not sure you are considering the cache line advantages when the first key comparison fails but the second or third one succeeds. The current layout makes it more likely that the second and third key comparisons will be in the same cacheline.

@randall77
Copy link
Contributor

The trouble with interleaving key/value in the bucket is it may require more alignment padding.
I experimented a while back with doing the optimal packing (e.g. map[int16]int64 would use k0 k1 k2 k3 v0 v1 v2 v3 k4 k5 k6 k7 v4 v5 v6 v7). It required a more complicated indexing scheme and ended up not helping performance-wise, especially for small maps. I'm open to changing the current scheme if someone can demonstrate a performance improvement.

@randall77
Copy link
Contributor

I'm not sure you are considering the cache line advantages when the first key comparison fails but the second or third one succeeds. The current layout makes it more likely that the second and third key comparisons will be in the same cacheline.

The hope is that only one tophash comparison matches, so you only have to look at a single key.

@simonhf
Copy link
Author

simonhf commented Sep 29, 2021

cache line advantages when the first key comparison fails but the second or third one succeeds

@ianlancetaylor, you mean because the .tophash array and the key array can share the same cache line, and .tophash is not necessarily unique?

If so, I assume non-unique .tophash values occur in a tiny minority of cases. As an example experiment below, 80 million keys are hashed into the equivalent of the .tophash array with 8 elements. The number of times a 'collision' occurs (where the tophash byte is non-unique) is counted. And yes, I know is not exactly the same as the Golang map algorithm; which does not use SHA256 to calculate the hash value (a non-cryptographic hashing algorithm might result in more collisions), and the .tophash arrays are not necessarily completely full either like they are in the experiment.

The experimental results show that non-unique values in the .tophash array only occur 1,070,210 + 8,433 + 23 = 1,078,666 times or for 1.3% of all keys. However, the chance of the .tophash array and the collided keys all being in the same cacheline is even less than 1.35%, because only the first 3 of 8 possible keys can possibly be in the same cacheline as the .tophash array itself, e.g. 8 bytes for .tophash plus 3 x 16 = 48 bytes for the first 3 keys is 56 bytes... nearly the 64 bytes of the cacheline.

Whereas, with the interleaving of keys and elements, presumably most if not all (assuming good alignment to cacheline boundaries) associated keys and elements would end up in the same cacheline, but not necessarily the same cacheline as .tophash?

$ perl -e 'use Digest::SHA; $keys=10_000_000; $tophash=8; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 2); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 1 time(s) occurred 77834169 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 2 time(s) occurred  1070210 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 3 time(s) occurred     8433 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 4 time(s) occurred       28 times

The hope is that only one tophash comparison matches, so you only have to look at a single key.

@randall77, the experiment above seems to validate that ~ 98.65% of the time only one .tophash value does match :-)

Interestingly, the same experiment shows that if .tophash would be extended to 16 items (and I'm not suggesting it should be) then the the 1.35% turns into 2.83%.

$ perl -e 'use Digest::SHA; $keys=5_000_000; $tophash=16; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 2); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 1 time(s) occurred 75437834 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 2 time(s) occurred  2218711 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 3 time(s) occurred    40900 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 4 time(s) occurred      506 times
- 80000000 keys hashed into n x tophash[16]; 8 bits exists in any array 5 time(s) occurred        4 times

Perhaps more interesting would be to extend .tophash items from 8 bits to 16 bits? Which also adds 8 bytes to the size of every map. The experiment below shows that the 1.35% turns into 0.00535%, i.e. there is now very much hope that only one .tophash comparison matches :-)

$ perl -e 'use Digest::SHA; $keys=10_000_000; $tophash=8; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 4); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 1 time(s) occurred 79991440 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 2 time(s) occurred     4280 times

And just for completeness, extending .tophash items from 8 bits to 24 bits? Which also adds 16 bytes to the size of every map. The experiment below shows that the 1.35% turns into 0.00002625%. Doesn't seem to be worth it for the advantage over 16 bits?

$ perl -e 'use Digest::SHA; $keys=10_000_000; $tophash=8; foreach $k(1..$keys){ undef $h; foreach $i(1..$tophash){ $hex = Digest::SHA::sha256_hex($k.$i); $top = substr($hex, 0, 6); $h->{$top}++; if(0){ printf qq[- k=%d i=%d hex=%s top=%s\n], $k, $i, $hex, $top; } } foreach $top(keys %{$h}){ $f=$h->{$top}; $hf->{$f}++; } } sub END{ foreach $f(sort keys %{$hf}){ printf qq[- %d keys hashed into n x tophash[%d]; 8 bits exists in any array %d time(s) occurred %8d times\n], $keys * $tophash, $tophash, $f, $hf->{$f}; } }'
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 1 time(s) occurred 79999958 times
- 80000000 keys hashed into n x tophash[8]; 8 bits exists in any array 2 time(s) occurred       21 times

I experimented a while back with doing the optimal packing (e.g. map[int16]int64 would use k0 k1 k2 k3 v0 v1 v2 v3 k4 k5 k6 k7 v4 v5 v6 v7). It required a more complicated indexing scheme and ended up not helping performance-wise, especially for small maps.

Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ?

Or, maybe opportunistically only use the simple interleaving (i.e. where k0 k1 k2 k3 v0 v1 v2 v3 is complicated interleaving) for map schemes which don't require special packing, which would still be a larger number of all maps, or?

I'm open to changing the current scheme if someone can demonstrate a performance improvement.

Hmmm... I might have a crack myself at demonstrating an improvement. Have never compiled Golang from scratch (in order to change map.go) but there's a first time for everything :-)

[1] https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

@randall77
Copy link
Contributor

Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ?

No, I didn't. It is kind of a moot point because there are architectures we run on that do not allow misaligned accesses at all. And we do not want to have different implementations on different architectures, if we can avoid it.

Or, maybe opportunistically only use the simple interleaving (i.e. where k0 k1 k2 k3 v0 v1 v2 v3 is complicated interleaving) for map schemes which don't require special packing, which would still be a larger number of all maps, or?

map[T]bool is a relatively common map type.

We could simply interleave only for types which require no padding, but then we need to somehow fork the code based on what packing would happen. Again, not ideal code size and maintenance-wise.

We could interleave always, with padding if needed, which would waste space but allow a single implementation. Maybe the space wasted wouldn't be too bad (map[int64]bool is probably the worst common case, which is a 63% expansion).

@randall77
Copy link
Contributor

Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ?

Also the garbage collector requires aligned pointers.

@simonhf
Copy link
Author

simonhf commented Oct 2, 2021

I'm open to changing the current scheme if someone can demonstrate a performance improvement.

I'm going to have a go at such a demonstration this weekend :-)

As a pre-step, today, I built Golang from scratch for the first time... yay! :-)

Quick question though: If I modify runtime/map.go and rebuild Golang [1] then the fastest rebuild takes ~ 276 seconds on my recent MacBook Pro. Is there a way to make it do a minimal / incremental build faster, or is this just the status quo?

$ time ./make.bash
Building Go cmd/dist using /usr/local/Cellar/go/1.16.6/libexec. (go1.16.6 darwin/amd64)
Building Go toolchain1 using /usr/local/Cellar/go/1.16.6/libexec.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for darwin/amd64.
---
Installed Go for darwin/amd64 in /Users/simonhf/20211001-golang/goroot
Installed commands in /Users/simonhf/20211001-golang/goroot/bin
./make.bash  276.81s user 51.73s system 369% cpu 1:28.96 total

[1] https://github.com/cloudflare/go/wiki/Starting-out

@josharian
Copy link
Contributor

If you modify only the runtime (not the rest of the toolchain), you don't need to run make.bash again. You can do go test runtime or go test strings or whatever work you want. The toolchain can see that package runtime's sources have changed, and that package runtime is a dependency of your tests, and thus recompile whatever is necessary.

@josharian
Copy link
Contributor

josharian commented Oct 2, 2021

And if you do modify the compiler, you can run go install cmd/compile to install the modified compiler. However, if you've introduced a bug, it'll be present the next time you do go install cmd/compile, so when modifying the toolchain it's a good idea to either rebuild from scratch occasionally or hide all your work behind a feature flag.

Also, that build sounds really slow. Are you running all.bash or make.bash? (EDIT: I see you are running make.bash.) The former runs all the tests, but the latter is all you need to be able to use your installation. If you're hacking on maps I'd guess that you'll find most bugs pretty quickly, without the full test suite.

@andig
Copy link
Contributor

andig commented Oct 2, 2021

Gotip download which does a full build imho takes 150s on Mac M1. It might help to start with 1.17. as base go since it has the register-based ABI enabled.

@josharian
Copy link
Contributor

@andig I see darwin/amd64 in the snippet above. The M1 is really fast. :) Using 1.17 as a bootstrap toolchain is a fine idea, but probably unlikely to speed things up more than 5-10%. Fortunately, a full rebuild isn't necessary to play with the runtime in isolation.

@simonhf
Copy link
Author

simonhf commented Oct 2, 2021

Thanks for the tips about iterating faster. go test runtime does work out a little faster; 214 versus 276 seconds:

$ time go test runtime
ok      runtime 132.649s
go test runtime  214.00s user 34.22s system 177% cpu 2:19.89 total

But if I only run my new test and map.go is updated and needs to be rebuild, then everything happens in under 10 seconds :-)

$ time go test -v runtime/map_bench_test.go
=== RUN   TestMapPerformance
...
go test -v runtime/map_bench_test.go  8.62s user 1.55s system 275% cpu 3.698 total

And @andig: The M1 is fast :-)

@ALTree
Copy link
Member

ALTree commented Oct 2, 2021

The M1 is really fast.

Is it? I get this on my 2014 laptop with an Haswell Mobile processor and a spinning hard disk:

$ time ./make.bash
Building Go cmd/dist using /usr/local/go. (go1.17 linux/amd64)
Building Go toolchain1 using /usr/local/go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /home/alberto/go
Installed commands in /home/alberto/go/bin

real	2m29.378s
user	7m38.069s
sys	0m28.770s

$ lscpu | grep "Model name"
Model name:                      Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz

Are you sure you don't have a nosy antivirus meddling with make.bash or something like that?

@simonhf
Copy link
Author

simonhf commented Oct 2, 2021

@ALTree admittedly the laptop is full of corporate security software, and also standard Mac software doing stuff in the background... all chewing CPU :-) I guess the M1 would still have the standard Mac software in the background doing its thing too. But the Mac WindowServer daemon is the worst and constantly chews ~ 16% CPU :-( I have done some research and this daemon uses more CPU depending upon settings and how many external peripherals like monitors (I'm using a 4k external monitor), keyboard & mice (had to turn down the reporting hz on mine to stop WindowServer using tons of CPU every time I moved the mouse!) you have attached. So it might be gobbling more CPU and an the M1 :-)

Maybe a 'spinning hard disk' is not such a big issue if there's plenty of RAM?

@ALTree
Copy link
Member

ALTree commented Oct 2, 2021

Yeah, then it's probably OS related. In my experience make.bash is quite sensitive to the presence of other cpu- or HardDisk- hungry processes. My 2m30s measurement was on an idle machine, and my Linux machine is very very quiet when idle. You should be able to run a ~2 minutes make.bash on your hardware, but not if you have a bunch of other processes interfering with the build. I guess your best option is to just run tests and/or go install the compiler or runtime as Josh suggested, and only run make.bash once in a while.

@simonhf
Copy link
Author

simonhf commented Oct 4, 2021

As a side note, I switched to a GCP Linux box to get more accurate performance test runs due to little or no background processes running on Linux versus MacOS. Linux also ran make.bash much faster :-)

$ time ./make.bash
Building Go cmd/dist using /usr/lib/go-1.15. (go1.15.9 linux/amd64)
Building Go toolchain1 using /usr/lib/go-1.15.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /home/simon_hardy_francis_dapperlabs_c/20211002-golang/goroot
Installed commands in /home/simon_hardy_francis_dapperlabs_c/20211002-golang/goroot/bin

real    0m58.456s

@josharian
Copy link
Contributor

This might be caused by #48496. I'm poking at a simple fix for that now.

@simonhf
Copy link
Author

simonhf commented Oct 4, 2021

I'm open to changing the current scheme if someone can demonstrate a performance improvement.

@randall77 below are the results of my weekend performance experiments and how I conducted them :-) Looks like a performance improvement is demonstrated but only for larger map keys.

Unfortunately in this experiment then the interleaved keys slightly decreases performance for smaller map keys.

So future work might be to repeat the experiment but after introducing dynamic interleaving of map keys and values depending upon the map key size? If this can be achieved then this experiment hints at faster map access times for maps with bigger keys; slightly faster map read times, and ~ +15% faster map write times.

Build Golang from scratch on MacOS and hack in experimental changes

  • With a recent Golang binary already installed, follow instructions here [1] & here [2] to build a new Golang from scratch.
  • Surf to [3] and fork repo privately to get [4].
$ git clone https://github.com/simonhf/go ; cd go/src
$ git checkout -b map-cacheline-perf-experiments
$ git log -1 | head -5
commit cc5e3de593afca73cf1b4d732ddceffb2837b390 (HEAD -> map-cacheline-perf-experiments, origin/master, origin/HEAD, master)
Author: Filippo Valsorda <filippo@golang.org>
Date:   Sat May 8 01:07:30 2021 -0400

    crypto/tls: use cryptobyte.NewFixedBuilder
$ time ./all.bash
...
ALL TESTS PASSED
---
Installed Go for darwin/amd64 in /Users/simonhardy-francis/work/20211004-golang/go
Installed commands in /Users/simonhardy-francis/work/20211004-golang/go/bin
./all.bash  1770.38s user 970.98s system 354% cpu 12:52.57 total

$ go version
go version go1.16.6 darwin/amd64

$ export PATH=/Users/simonhardy-francis/work/20211004-golang/go/bin:$PATH 

$ go version
go version devel go1.18-cc5e3de593 Mon Oct 4 17:17:11 2021 +0000 darwin/amd64

[1] https://golang.org/doc/install/source
[2] https://github-wiki-see.page/m/cloudflare/go/wiki/Starting-out
[3] https://github.com/golang/go
[4] https://github.com/simonhf/go

Hack runtime/map*.go

  • Strategy: For quicker dev iteration, create runtime/map_bench_*_test.go files; which re-compiles changes to runtime/map*.go faster than running make.bash; < 10 seconds versus ~ 58 seconds.
  • Strategy: Make the fewest changes to prove interleaving bucket keys and values more efficiently uses CPU cachelines.
  • Strategy: Create perf test for map with a shorter key (e.g. int64; 8 bytes) and longer key (e.g. string; 16 byte string ref).
  • Strategy: Use a lot of keys and values to ensure the map is much bigger than the CPU cacheline cache. Using a smaller map will mess up the results because all cachelines will be cached.
  • Strategy: A map bucket currently uses 3 arrays -- 8 byte tophash[8] followed by 8 keys, followed by 8 values -- and the idea is modify the code to change this to 2 arrays; 8 byte tophash[8] followed by 8 key, value pairs. If a key is an int64 then the 2nd array in the existing production code will be 8 elements x 8 bytes = 64 bytes, and if a key is a string then the 2nd array in the existing production code will be 8 elements x 16 ref bytes = 128 bytes.
  • Strategy: Initially create two perf test files -- map_bench_m_64_64_test.go and map_bench_m_str_64_test.go -- and consider merging them later, or throwing them away.
  • Strategy: Add some conditional debugging code which can be enabled and disabled via bools runtime.MapMakeDebug and runtime.MapIterDebug.
  • Strategy: For fairer results and easier testing, use the same map function and test function to test unmodified and modified algorithm, e.g. the test program manipulates bool runtime.MapCacheFriendly dynamically (this mechanism would be removed if the code ever made it to production) to select the algorithm at runtime:
func mapaccess1_fast64(t *maptype, h *hmap, key uint64) unsafe.Pointer {
...
	for ; b != nil; b = b.overflow(t) {
		if MapCacheFriendly {
			...
		} else {
			...
		}
...
}
  • Push the changes:
$ git add runtime/map_bench_m_64_64_test.go runtime/map_bench_m_str_64_test.go
$ git add runtime/map.go runtime/map_fast64.go runtime/map_faststr.go
$ git commit -m "Performance experiment interleaving map bucket keys and values"
$ git push origin map-cacheline-perf-experiments

Build Golang from scratch on Linux

  • Why build and run the performance tests on Linux? Due to little to no background processes running, therefore getting more accurate results.
$ # build master branch first because map_bench_*_test.go files are not yet real test files
$ git clone https://github.com/simonhf/go ; cd go/src
$ time ./all.bash
...
ALL TESTS PASSED
---
Installed Go for linux/amd64 in /home/simonhf/20211004-golang/go
Installed commands in /home/simonhf/20211004-golang/go/bin
*** You need to add /home/simonhf/20211004-golang/go/bin to your PATH.
real    4m32.349s

$ go version
go version go1.15.9 linux/amd64

$ export PATH=/home/simonhf/20211004-golang/go/bin:$PATH 

$ go version
go version devel go1.18-cc5e3de593 Mon Oct 4 17:17:11 2021 +0000 linux/amd64

$ git checkout map-cacheline-perf-experiments

[1] https://golang.org/doc/install/source

Runnings tests on Linux

Test map[int64]int64

$ go clean -testcache ; time go test -v runtime/map_bench_m_64_64_test.go | tee /tmp/map_bench_m_64_64_test.go.txt
=== RUN   TestMapPerformance
- put 50000000 map keys in  4.564 seconds or   10954753 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.147 seconds or   15888315 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.826 seconds or   10360881 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.281 seconds or   15238174 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.603 seconds or   10861454 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.191 seconds or   15667847 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.436 seconds or   11270405 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.310 seconds or   15106715 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.214 seconds or   11865911 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.193 seconds or   15659193 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.437 seconds or   11269932 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.307 seconds or   15121559 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.198 seconds or   11909539 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.190 seconds or   15674869 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.416 seconds or   11321484 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.276 seconds or   15260414 keys per second; ir=1250000025000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (64.22s)
PASS
ok      command-line-arguments  64.234s

real    1m4.393s
$ cat /tmp/map_bench_m_64_64_test.go.txt | perl -lane 'if(m~(got|put).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) keys per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true to false\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
- put 50000000 map keys in  4.564 seconds or   10954753 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.603 seconds or   10861454 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.214 seconds or   11865911 keys per second; ir=1250000025000000 mapCacheFriendly=false
- put 50000000 map keys in  4.198 seconds or   11909539 keys per second; ir=1250000025000000 mapCacheFriendly=false
- 45591657 total keys per second
- put 50000000 map keys in  4.826 seconds or   10360881 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.436 seconds or   11270405 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.437 seconds or   11269932 keys per second; ir=1250000025000000 mapCacheFriendly=true
- put 50000000 map keys in  4.416 seconds or   11321484 keys per second; ir=1250000025000000 mapCacheFriendly=true
- 44222702 total keys per second
- -3.0% diff; true to false
- got 50000000 map vals in  3.147 seconds or   15888315 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.191 seconds or   15667847 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.193 seconds or   15659193 keys per second; ir=1250000025000000 mapCacheFriendly=false
- got 50000000 map vals in  3.190 seconds or   15674869 keys per second; ir=1250000025000000 mapCacheFriendly=false
- 62890224 total keys per second
- got 50000000 map vals in  3.281 seconds or   15238174 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.310 seconds or   15106715 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.307 seconds or   15121559 keys per second; ir=1250000025000000 mapCacheFriendly=true
- got 50000000 map vals in  3.276 seconds or   15260414 keys per second; ir=1250000025000000 mapCacheFriendly=true
- 60726862 total keys per second
- -3.4% diff; true to false

Test map[string]int64

$ go clean -testcache ; time go test -v runtime/map_bench_m_str_64_test.go | tee /tmp/map_bench_m_str_64_test.go.txt
=== RUN   TestMapPerformance
- put 10000000 map keys in  3.611 seconds or    2769477 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  3.113 seconds or    3212054 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.118 seconds or    3207083 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.965 seconds or    3373103 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.418 seconds or    2925681 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.951 seconds or    3388747 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.017 seconds or    3314097 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.951 seconds or    3388157 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.240 seconds or    3086267 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.943 seconds or    3397923 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.007 seconds or    3325893 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.959 seconds or    3379344 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.737 seconds or    2675626 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.950 seconds or    3389673 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  2.991 seconds or    3342883 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.971 seconds or    3365750 keys per second; ir=50000005000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (51.32s)
PASS
ok      command-line-arguments  51.334s

real    0m51.508s
$ cat /tmp/map_bench_m_str_64_test.go.txt | perl -lane 'if(m~(got|put).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) keys per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true to false\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
- put 10000000 map keys in  3.611 seconds or    2769477 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.418 seconds or    2925681 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.240 seconds or    3086267 keys per second; ir=50000005000000 mapCacheFriendly=false
- put 10000000 map keys in  3.737 seconds or    2675626 keys per second; ir=50000005000000 mapCacheFriendly=false
- 11457051 total keys per second
- put 10000000 map keys in  3.118 seconds or    3207083 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.017 seconds or    3314097 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  3.007 seconds or    3325893 keys per second; ir=50000005000000 mapCacheFriendly=true
- put 10000000 map keys in  2.991 seconds or    3342883 keys per second; ir=50000005000000 mapCacheFriendly=true
- 13189956 total keys per second
- 15.1% diff; true to false
- got 10000000 map vals in  3.113 seconds or    3212054 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.951 seconds or    3388747 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.943 seconds or    3397923 keys per second; ir=50000005000000 mapCacheFriendly=false
- got 10000000 map vals in  2.950 seconds or    3389673 keys per second; ir=50000005000000 mapCacheFriendly=false
- 13388397 total keys per second
- got 10000000 map vals in  2.965 seconds or    3373103 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.951 seconds or    3388157 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.959 seconds or    3379344 keys per second; ir=50000005000000 mapCacheFriendly=true
- got 10000000 map vals in  2.971 seconds or    3365750 keys per second; ir=50000005000000 mapCacheFriendly=true
- 13506354 total keys per second
- 0.9% diff; true to false

Performance observations / comments

| new algorithm    | writes faster | reads faster |
| map[int64]int64  |         -3.0% |        -3.4% |
| map[string]int64 |        +15.1% |        +0.9% |
  • Looks like when the key is small enough (e.g. 8 bytes with map[int64]int64) the map write and read times get worse if the key and values are interleaved.
  • However, when the key is big enough (e.g. 16 bytes with map[string]int64) the map write and read times get better if the key and values are interleaved; writes times by +15.1% and read times by +0.9%.
  • Future work: Performance test more combinations of key and value sizes, and on more CPU types.
  • Future work: If the map code could be changed to dynamically (and efficiently) interleave the keys and values -- i.e. do not interleave for e.g. map[int64]int64, but interleave for e.g. map[string]int64 -- then map performance could enjoy the best of both worlds? :-)

@randall77
Copy link
Contributor

Future work: If the map code could be changed to dynamically (and efficiently) interleave the keys and values -- i.e. do not interleave for e.g. map[int64]int64, but interleave for e.g. map[string]int64 -- then map performance could enjoy the best of both worlds? :-)

That's the tricky part, because then there's a cost of dynamically choosing which interleaving you're going to use.

My experiment a while back added some fields to runtime.maptype.

    keyOffset [8]uint16
    elemOffset [8]uint16

So you could find a key given its index i by just doing b + t.keyOffset[i]. Then the compiler could choose whatever interleaving it wanted and put the offsets in the above arrays.
DIdn't help though. I think the extra dependent load on every access was the problem.

@simonhf
Copy link
Author

simonhf commented Oct 5, 2021

@randall77 thanks for the comment and info. That's interesting! Yes, I guess even if *Offset[] is in an already fetched cacheline, there is still an expense for reading and writing which is not insignificant.

But I'm wondering if I can implement an opportunistic conditional interleave (without an extra data structure) purely based on the size of the key... which in theory wouldn't change the memory footprint of the data structures? I was thinking generic code like this:

Create a few simple variables dependent upon having a suitably sized t.keysize and t.elemsize:

var maybeSkipKeys uintptr
var maybeSkipKey uintptr
var maybeKeySize uintptr
var maybeElemSize uintptr
if (t.SuitsInterleave()) {
	maybeSkipKeys = 0
	maybeSkipKey = uintptr(t.keysize)
	maybeKeySize = uintptr(t.keysize)+uintptr(t.elemsize)
	maybeElemSize = uintptr(t.keysize)+uintptr(t.elemsize)
} else {
	maybeSkipKeys = bucketCnt*uintptr(t.keysize)
	maybeSkipKey = 0
	maybeKeySize = uintptr(t.keysize)
	maybeElemSize = uintptr(t.elemsize)
}

Then later, there's a single expression (very similar to the current code) without an if() to calculate k and e addresses irrespective of whether the keys and values are interleaved or not?

k = add(unsafe.Pointer(b), dataOffset+uintptr(offi)*maybeKeySize)
e = add(unsafe.Pointer(b), dataOffset+maybeSkipKeys+uintptr(offi)*maybeElemSize+maybeSkipKey)

In theory this mechanism would work for the maps I performance tested before. The map[int64]int64 would end up not interleaved and have not decrease in performance. And map[string]int64 would end up interleaved and get the performance benefits. The generic iterator functions would use the same t.SuitsInterleave() to determine k and e addresses at runtime and be compatible with both not interleaved and interleaved. What do you think? Do you foresee any issues before I try this out? :-)

@randall77
Copy link
Contributor

@simonhf seems like a reasonable thing to try. Thanks.

@josharian
Copy link
Contributor

Using 1.17 as a bootstrap toolchain is a fine idea, but probably unlikely to speed things up more than 5-10%.

Follow-up: I was wrong. Using 1.17.1 as a bootstrap toolchain on an M1 is much faster.

@simonhf
Copy link
Author

simonhf commented Nov 13, 2021

@randall77 so I tried out opportunistic conditional interleave idea. Made the following changes:

  • Added a 3rd performance test script to test the performance of a byte array as a key.
  • Added a 3rd metric to each performance test for testing iterative sequential key access.
  • Turned off GC which each performance test is running to minimize any GC influence on the tests.
  • Ignore the first hash table creation of a test process because the heap gets grown (costly) the first time.
  • Push the changes:
$ git add runtime/map_bench_m_64___64_test.go runtime/map_bench_m_byte_64_test.go runtime/map_bench_m_str__64_test.go
$ git add runtime/map.go runtime/map_fast64.go runtime/map_faststr.go
$ git commit -m "Performance experiment interleaving map bucket keys and values part 2"
$ git push origin map-cacheline-perf-experiments

Runnings tests on Linux

Test map[int64]int64

$ export MAP_BENCH_SCRIPT="map_bench_m_64___64_test.go" ; export MAP_BENCH_DEBUG=false ; cd ../src ; go clean -testcache ; time go test -v runtime/$MAP_BENCH_SCRIPT | tee ../../$MAP_BENCH_SCRIPT.txt
=== RUN   TestMapPerformance
-   put 10000000 map keys in  0.786 seconds or   12721484 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.622 seconds or   16071264 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.187 seconds or   53343902 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.788 seconds or   12691436 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.618 seconds or   16188136 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.184 seconds or   54444683 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  0.783 seconds or   12765836 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.613 seconds or   16316242 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.180 seconds or   55647042 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.778 seconds or   12846807 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.616 seconds or   16243227 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.189 seconds or   52892966 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  0.780 seconds or   12823769 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.612 seconds or   16344321 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.188 seconds or   53268103 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.788 seconds or   12686142 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.624 seconds or   16026752 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.184 seconds or   54385577 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  0.786 seconds or   12727807 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.618 seconds or   16185869 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.188 seconds or   53155414 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.788 seconds or   12697704 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.625 seconds or   16010335 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.190 seconds or   52748784 per second; ir=50000005000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (14.91s)
PASS
ok      command-line-arguments  14.917s

real    0m15.110s
user    0m15.161s
sys     0m0.398s
$ export MAP_BENCH_SCRIPT="map_bench_m_64___64_test.go" ; cat ../../$MAP_BENCH_SCRIPT.txt | perl -lane 'if(m~(got|put|range).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got range)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true better than false if %% > 0\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
-   put 10000000 map keys in  0.786 seconds or   12721484 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.783 seconds or   12765836 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.780 seconds or   12823769 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  0.786 seconds or   12727807 per second; ir=50000005000000 mapCacheFriendly=false
- 51038896 total keys per second
-   put 10000000 map keys in  0.788 seconds or   12691436 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  0.778 seconds or   12846807 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  0.788 seconds or   12686142 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  0.788 seconds or   12697704 per second; ir=50000005000000 mapCacheFriendly=true
- 50922089 total keys per second
- -0.2% diff; true better than false if % > 0
-   got 10000000 map vals in  0.622 seconds or   16071264 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.613 seconds or   16316242 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.612 seconds or   16344321 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  0.618 seconds or   16185869 per second; ir=50000005000000 mapCacheFriendly=false
- 64917696 total keys per second
-   got 10000000 map vals in  0.618 seconds or   16188136 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.616 seconds or   16243227 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.624 seconds or   16026752 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  0.625 seconds or   16010335 per second; ir=50000005000000 mapCacheFriendly=true
- 64468450 total keys per second
- -0.7% diff; true better than false if % > 0
- range 10000000 map vals in  0.187 seconds or   53343902 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.180 seconds or   55647042 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.188 seconds or   53268103 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.188 seconds or   53155414 per second; ir=50000005000000 mapCacheFriendly=false
- 215414461 total keys per second
- range 10000000 map vals in  0.184 seconds or   54444683 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.189 seconds or   52892966 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.184 seconds or   54385577 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.190 seconds or   52748784 per second; ir=50000005000000 mapCacheFriendly=true
- 214472010 total keys per second
- -0.4% diff; true better than false if % > 0

Test map[string]int64

$ export MAP_BENCH_SCRIPT="map_bench_m_str__64_test.go" ; export MAP_BENCH_DEBUG=false ; cd ../src ; go clean -testcache ; time go test -v runtime/$MAP_BENCH_SCRIPT | tee ../../$MAP_BENCH_SCRIPT.txt
=== RUN   TestMapPerformance
-   put 10000000 map keys in  3.333 seconds or    3000263 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.951 seconds or    3388583 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.204 seconds or   48948296 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  2.929 seconds or    3414093 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.902 seconds or    3445911 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.208 seconds or   48111947 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  3.351 seconds or    2984277 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.957 seconds or    3382247 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.206 seconds or   48641376 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  3.058 seconds or    3269834 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.895 seconds or    3454491 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.209 seconds or   47903624 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  3.726 seconds or    2683484 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.961 seconds or    3377756 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.201 seconds or   49693907 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  2.933 seconds or    3409214 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.892 seconds or    3457710 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.207 seconds or   48353113 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  3.354 seconds or    2981505 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.950 seconds or    3389792 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.201 seconds or   49647510 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  2.941 seconds or    3400772 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.898 seconds or    3450356 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.202 seconds or   49616359 per second; ir=50000005000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (60.04s)
PASS
ok      command-line-arguments  60.046s

real    1m0.232s
user    1m0.718s
sys     0m0.619s
$ export MAP_BENCH_SCRIPT="map_bench_m_str__64_test.go" ; cat ../../$MAP_BENCH_SCRIPT.txt | perl -lane 'if(m~(got|put|range).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got range)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true better than false if %% > 0\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
-   put 10000000 map keys in  3.333 seconds or    3000263 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  3.351 seconds or    2984277 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  3.726 seconds or    2683484 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  3.354 seconds or    2981505 per second; ir=50000005000000 mapCacheFriendly=false
- 11649529 total keys per second
-   put 10000000 map keys in  2.929 seconds or    3414093 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  3.058 seconds or    3269834 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  2.933 seconds or    3409214 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  2.941 seconds or    3400772 per second; ir=50000005000000 mapCacheFriendly=true
- 13493913 total keys per second
- 15.8% diff; true better than false if % > 0
-   got 10000000 map vals in  2.951 seconds or    3388583 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.957 seconds or    3382247 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.961 seconds or    3377756 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  2.950 seconds or    3389792 per second; ir=50000005000000 mapCacheFriendly=false
- 13538378 total keys per second
-   got 10000000 map vals in  2.902 seconds or    3445911 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.895 seconds or    3454491 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.892 seconds or    3457710 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  2.898 seconds or    3450356 per second; ir=50000005000000 mapCacheFriendly=true
- 13808468 total keys per second
- 2.0% diff; true better than false if % > 0
- range 10000000 map vals in  0.204 seconds or   48948296 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.206 seconds or   48641376 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.201 seconds or   49693907 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.201 seconds or   49647510 per second; ir=50000005000000 mapCacheFriendly=false
- 196931089 total keys per second
- range 10000000 map vals in  0.208 seconds or   48111947 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.209 seconds or   47903624 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.207 seconds or   48353113 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.202 seconds or   49616359 per second; ir=50000005000000 mapCacheFriendly=true
- 193985043 total keys per second
- -1.5% diff; true better than false if % > 0

Test map[[32]byte]int64

$ export MAP_BENCH_SCRIPT="map_bench_m_byte_64_test.go" ; export MAP_BENCH_DEBUG=false ; cd ../src ; go clean -testcache ; time go test -v runtime/$MAP_BENCH_SCRIPT | tee ../../$MAP_BENCH_SCRIPT.txt
=== RUN   TestMapPerformance
-   put 10000000 map keys in  1.762 seconds or    5675426 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.582 seconds or    6321965 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.250 seconds or   40014726 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.670 seconds or    5987124 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.498 seconds or    6673523 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.239 seconds or   41886167 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  1.768 seconds or    5655742 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.590 seconds or    6287516 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.240 seconds or   41731550 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.690 seconds or    5917336 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.493 seconds or    6697574 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.252 seconds or   39659551 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  1.763 seconds or    5672468 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.586 seconds or    6304264 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.239 seconds or   41786329 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.669 seconds or    5989906 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.506 seconds or    6640452 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.253 seconds or   39580246 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  1.758 seconds or    5688813 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.587 seconds or    6300570 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.240 seconds or   41672412 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.670 seconds or    5989299 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.499 seconds or    6671009 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.253 seconds or   39464253 per second; ir=50000005000000 mapCacheFriendly=true
--- PASS: TestMapPerformance (33.01s)
PASS
ok      command-line-arguments  33.019s

real    0m33.203s
user    0m33.134s
sys     0m0.488s
$ export MAP_BENCH_SCRIPT="map_bench_m_byte_64_test.go" ; cat ../../$MAP_BENCH_SCRIPT.txt | perl -lane 'if(m~(got|put|range).*mapCacheFriendly=(true|false)~){ push @{$h->{$1}{$2}}, $_; } sub END{ foreach $pg(qw(put got range)){ foreach $ft(qw(false true)){ @a=@{$h->{$pg}{$ft}}; foreach(@a){ if(m~(\d+) per sec~){ $r->{$pg}{$ft}+=$1; } printf qq[%s\n], $_; } printf qq[- %d total keys per second\n], $r->{$pg}{$ft}; } printf qq[- %.1f%% diff; true better than false if %% > 0\n], ($r->{$pg}{true} - $r->{$pg}{false}) / $r->{$pg}{false} * 100; } }'
-   put 10000000 map keys in  1.762 seconds or    5675426 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.768 seconds or    5655742 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.763 seconds or    5672468 per second; ir=50000005000000 mapCacheFriendly=false
-   put 10000000 map keys in  1.758 seconds or    5688813 per second; ir=50000005000000 mapCacheFriendly=false
- 22692449 total keys per second
-   put 10000000 map keys in  1.670 seconds or    5987124 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  1.690 seconds or    5917336 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  1.669 seconds or    5989906 per second; ir=50000005000000 mapCacheFriendly=true
-   put 10000000 map keys in  1.670 seconds or    5989299 per second; ir=50000005000000 mapCacheFriendly=true
- 23883665 total keys per second
- 5.2% diff; true better than false if % > 0
-   got 10000000 map vals in  1.582 seconds or    6321965 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.590 seconds or    6287516 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.586 seconds or    6304264 per second; ir=50000005000000 mapCacheFriendly=false
-   got 10000000 map vals in  1.587 seconds or    6300570 per second; ir=50000005000000 mapCacheFriendly=false
- 25214315 total keys per second
-   got 10000000 map vals in  1.498 seconds or    6673523 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.493 seconds or    6697574 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.506 seconds or    6640452 per second; ir=50000005000000 mapCacheFriendly=true
-   got 10000000 map vals in  1.499 seconds or    6671009 per second; ir=50000005000000 mapCacheFriendly=true
- 26682558 total keys per second
- 5.8% diff; true better than false if % > 0
- range 10000000 map vals in  0.250 seconds or   40014726 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.240 seconds or   41731550 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.239 seconds or   41786329 per second; ir=50000005000000 mapCacheFriendly=false
- range 10000000 map vals in  0.240 seconds or   41672412 per second; ir=50000005000000 mapCacheFriendly=false
- 165205017 total keys per second
- range 10000000 map vals in  0.239 seconds or   41886167 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.252 seconds or   39659551 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.253 seconds or   39580246 per second; ir=50000005000000 mapCacheFriendly=true
- range 10000000 map vals in  0.253 seconds or   39464253 per second; ir=50000005000000 mapCacheFriendly=true
- 160590217 total keys per second
- -2.8% diff; true better than false if % > 0

Performance observations / comments

| new algorithm      | writes faster | reads faster | ranges faster |
| map[int64]int64    |         -0.2% |        -0.7% |         -0.4% |
| map[string]int64   |        +15.8% |        +2.0% |         -1.5% |
| map[[32]byte]int64 |         +5.2% |        +5.8% |         -2.8% |
  • The map[int64]int64 is actually running the same code with the same parameters twice now -- because the whole point of this latest experiment was to not change the code or performance for shorter keys :-) -- so the worst case change of -0.7% must within the margin of error when comparing the performance of the same code run with the same parameters?
  • The previous experiment only modified 'fast' assigning and accessing keys for map_faststr.go and map_fast64.go, but now the assign and access for regular keys in map.go is also modified, and for the example 32 byte key, we see an improvement in both reads and writes.

@randall77
Copy link
Contributor

Interesting. What do your benchmarks look like? Sounds good enough to pursue some more, but we always have to be wary of overfitting the benchmark.

So what does the SuitsInterleave function look like? And maybe more importantly, what do the key and element locating functions now look like? Currently they are something like:

add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))

Are there now branches when computing each of these, or are you doing something smarter? I'm imagining some x + i*y formula for each, with the coefficients x,y stored in the maptype.

@simonhf
Copy link
Author

simonhf commented Nov 15, 2021

What do your benchmarks look like?

@randall77 the code for the original experiments was put in a github branch with the steps needs to make and run it detailed above. In this latest set of experiments I just updated that branch here [1]. In theory you can just grab the code and replicate my experiments, but on your different system.

The benchmark for the 3 different types of hash implementation can be found here [2], [3], and [4]. Sorry, I was thinking about trying to combine all 3 benchmarks files into a single file. They are mostly duplicated code and very simple.

Sounds good enough to pursue some more, but we always have to be wary of overfitting the benchmark.

Yes, I agree. And if I am guilty of doing that then it is only subconsciously so far :-) But it would be good to have more benchmarks, for example perhaps benchmarking with different sizes of values too?

So what does the SuitsInterleave function look like?

In the end I just kept it simple stupid for this experiment, as can be seen here [5]:

Instead of SuitsInterleave() I just hard-coded MapCacheFriendly && (t.keysize > 8). MapCacheFriendly is the bool that let's me switch the experimental version of map on and off. And t.keysize > 8 means interleave keys unless the keys are 8 bytes or shorter.

I'm guessing that shorter uint64 keys do not show an improvement because 8 keys @ 8 bytes is 64 bytes or one cache line. Which means the values are in the very next cache line, and the CPU does cache line read-ahead anyway. Whereas, with a bigger key such as 16 bytes, then the values will be multiple cache lines ahead and the CPU read-ahead mechanism doesn't help?

It may be interesting to experiment with shorter 8 byte keys with longer values? Perhaps SuitsInterleave() should be something more like MapCacheFriendly && (((t.keysize + t.elemsize) * bucketCnt) > 128)... e.g. if values are longer than interleaving becomes beneficial again?

And maybe more importantly, what do the key and element locating functions now look like?

There are blocks of code like this [5] in the various map functions:

var maybeHead4E uintptr
var maybeTail4E uintptr
var maybeSize4K uintptr
var maybeSize4E uintptr
if (MapCacheFriendly && (t.keysize > 8)) { // come here to interleave KVs in one array
    maybeHead4E = 0
    maybeTail4E = uintptr(t.keysize)
    maybeSize4K = uintptr(t.keysize)+uintptr(t.elemsize)
    maybeSize4E = uintptr(t.keysize)+uintptr(t.elemsize)
} else { // come here to separate KVs in two arrays
    maybeHead4E = bucketCnt*uintptr(t.keysize)
    maybeTail4E = 0
    maybeSize4K = uintptr(t.keysize)
    maybeSize4E = uintptr(t.elemsize)
}
// e.g. k = add(unsafe.Pointer(b), dataOffset+            (i*maybeSize4K)            )
// e.g. e = add(unsafe.Pointer(b), dataOffset+maybeHead4E+(i*maybeSize4E)+maybeTail4E)

This allows us to have a very similar element locating calculations to before, but without any associated if() as in the first experiment. The if() is now located at the start of each function and creates the building blocks for the if()-less element locating calculations.

Are there now branches when computing each of these, or are you doing something smarter? I'm imagining some x + i*y formula for each, with the coefficients x,y stored in the maptype.

Yep, so the if() branches are gone from the looping part of the map functions doing the element locating calculations. There is just one if() at the start of each function which calculates the building blocks for those if()-less element locating calculations. But you are right... those building blocks could be calculated only once when the map is created and stored in the maptype itself... :-)

Looking forward to your further comments and suggestions.

[1] simonhf@9f0ed4b
[2] https://github.com/simonhf/go/blob/map-cacheline-perf-experiments/src/runtime/map_bench_m_64___64_test.go
[3] https://github.com/simonhf/go/blob/map-cacheline-perf-experiments/src/runtime/map_bench_m_byte_64_test.go
[4] https://github.com/simonhf/go/blob/map-cacheline-perf-experiments/src/runtime/map_bench_m_str__64_test.go
[5] https://github.com/simonhf/go/blob/map-cacheline-perf-experiments/src/runtime/map.go#L666

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

7 participants