runtime: working with small maps is 4x-10x slower than in nodejs #19495

Open
navytux opened this Issue Mar 10, 2017 · 40 comments

@navytux
Contributor

navytux commented Mar 10, 2017

Please answer these questions before submitting your issue. Thanks!

What did you do?

Hello up there. Map performance was already discussed some time ago in #3885 and improved a bit. It was also said there that the map algorithm was chosen to work very well with very large maps. However, maps are not always very large; in many practical cases they are small or medium-sized.

So please consider the following 3 programs:

package main
//import "fmt"

func main() {
    a := make(map[int]int)

    for i := 0; i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
    }

    //fmt.Println(a)
}

(https://play.golang.org/p/rPH1pSM1Xk)

#!/usr/bin/env nodejs

function main() {
    var a = {};

    for (var i = 0; i < 100000000; i++) {
        a[i & 0xffff] = i;
        //a[i & 0x7f] = i;
        //a[i] = i;
    }

    //console.log(a)
}

main()
#!/usr/bin/env pypy

def main():
    a = {}

    for i in range(100000000):
        a[i & 0xffff] = i
        #a[i & 0x7f] = i
        #a[i] = i

    #print(a)

if __name__ == '__main__':
    main()

The time it takes to run them on an i7-6600U is as follows:

Program  Time (seconds, best of 5)
map.go   3.668
map.js   0.385
map.py   1.988

The Go version is ~9.5x slower than the JavaScript one, and ~1.8x slower than the PyPy one.

If we reduce the actual map size from 64K elements to 128 elements, activating the a[i & 0x7f] = i case via e.g. the following patch:

--- a/map.go.kirr
+++ b/map.go
@@ -5,8 +5,8 @@ func main() {
     a := make(map[int]int)
 
     for i := 0; i < 100000000; i++ {
-       a[i & 0xffff] = i
-       //a[i & 0x7f] = i
+       //a[i & 0xffff] = i
+       a[i & 0x7f] = i
        //a[i] = i
     }

timings become:

Program  Time (seconds, best of 5)
map.go   1.571
map.js   0.377
map.py   0.896

JavaScript becomes only a bit faster here, while Go and PyPy improve ~2.3x and ~2.2x respectively. Still, Go is 4x slower than JavaScript and 1.7x slower than PyPy.

We can also test what happens if we do not limit the map size and let it grow on every iteration. JavaScript and PyPy are more memory-hungry, and with the original niter=1E8 they run out of memory on my small laptop, so let's test with e.g. niter=1E7 (diff against the original program):

--- a/map.go.kirr
+++ b/map.go
@@ -4,10 +4,10 @@ package main
 func main() {
     a := make(map[int]int)
 
-    for i := 0; i < 100000000; i++ {
-       a[i & 0xffff] = i
+    for i := 0; i < 100000000 / 10; i++ {
+       //a[i & 0xffff] = i
        //a[i & 0x7f] = i
-       //a[i] = i
+       a[i] = i
     }
 
     //fmt.Println(a)

timings become:

Program  Time (seconds, best of 5)
map.go   2.877
map.js   0.438
map.py   1.277

So Go is ~6.5x slower than JavaScript and ~2.2x slower than PyPy.

The profile for original program (a[i & 0xffff] = i) is:

File: map
Type: cpu
Time: Mar 10, 2017 at 7:18pm (MSK)
Duration: 3.70s, Total samples = 36ms ( 0.97%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 36000us, 100% of 36000us total
      flat  flat%   sum%        cum   cum%
   27800us 77.22% 77.22%    33900us 94.17%  runtime.mapassign /home/kirr/src/tools/go/go/src/runtime/hashmap.go
    3100us  8.61% 85.83%     3100us  8.61%  runtime.aeshash64 /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
    3000us  8.33% 94.17%     3000us  8.33%  runtime.memequal64 /home/kirr/src/tools/go/go/src/runtime/alg.go
    1700us  4.72% 98.89%    36000us   100%  main.main /home/kirr/tmp/trashme/map/map.go
     400us  1.11%   100%      400us  1.11%  runtime.mapassign /home/kirr/src/tools/go/go/src/runtime/stubs.go
         0     0%   100%    36000us   100%  runtime.main /home/kirr/src/tools/go/go/src/runtime/proc.go

What did you expect to see?

That map operations for small and medium maps are as fast as, or faster than, in Node.js.

What did you see instead?

Map operations are 4x-10x slower than in JavaScript for map sizes that are common in many programs.

Does this issue reproduce with the latest release (go1.8)?

Yes.

System details

go version devel +d11a2184fb Fri Mar 10 01:39:09 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/kirr/go"
GORACE=""
GOROOT="/home/kirr/src/tools/go/go"
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build714926978=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +d11a2184fb Fri Mar 10 01:39:09 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +d11a2184fb Fri Mar 10 01:39:09 2017 +0000 X:framepointer
uname -sr: Linux 4.9.0-2-amd64
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.0 (stretch)
Release:	9.0
Codename:	stretch
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Debian GLIBC 2.24-9) stable release version 2.24, by Roland McGrath et al.
gdb --version: GNU gdb (Debian 7.12-6) 7.12.0.20161007-git

Thanks beforehand,
Kirill

/cc @rsc, @randall77

@ianlancetaylor


Contributor

ianlancetaylor commented Mar 10, 2017

In doing this kind of benchmarking, it's very important to remember that on amd64 the Go type int is 64 bits. Do your timings change if you rewrite your Go programs to use int32 rather than int?

@josharian


Contributor

josharian commented Mar 10, 2017

I'm surprised the pprof output shows mapaccess, not mapaccessfast*.

@navytux


Contributor

navytux commented Mar 10, 2017

Good point, thanks for noticing. So with the following program:

package main
//import "fmt"

func main() {
    //a := make(map[int]int)
    a := make(map[int32]int32)

    for i := int32(0); i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
    }

    //fmt.Println(a)
}
--- a/map.go.kirr
+++ b/map.go
@@ -2,9 +2,10 @@ package main
 //import "fmt"
 
 func main() {
-    a := make(map[int]int)
+    //a := make(map[int]int)
+    a := make(map[int32]int32)
 
-    for i := 0; i < 100000000; i++ {
+    for i := int32(0); i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i

map.go timings are:

map.go case          Time (seconds, best of 5)
a[i & 0xffff] = i    3.579
a[i & 0x7f] = i      1.596
a[i] = i; niter=1E7  2.616

In other words, the timings stay the same as with map[int]int, ±2%.

@navytux


Contributor

navytux commented Mar 10, 2017

@josharian it is mapassign. git grep shows there is no mapassign_* under go/src.

@josharian


Contributor

josharian commented Mar 10, 2017

Huh. We have mapaccess1_fast* and mapaccess2_fast*, but not mapassign_fast*. Maybe we should add that. (Perhaps in the process we should automate generating the fast versions, like we automate the generation of Slice* routines in package sort.) mapdelete_fast* too? Any others?

I'd suggest that as a first step for this issue. I have a few other things on my plate I'd like to do first, but if no one speaks up for it, I'll plan to eventually get to this.

@josharian josharian added this to the Go1.9 milestone Mar 10, 2017

@rasky


Member

rasky commented Mar 11, 2017

How do you suggest automating generation of the fast versions? I can imagine a Go file, run by go generate, doing the right sequence of printf or template substitutions; is that what you're thinking?

@josharian


Contributor

josharian commented Mar 11, 2017

@rasky yeah, something like that. I took a quick look at this earlier and there was more divergence between the regular and the _fast methods than I anticipated. It's probably worth doing one manually and benchmarking the improvements before investing too much time in automating it. Do you plan to work on this?

@rasky


Member

rasky commented Mar 11, 2017

@josharian I wish, but I don't have enough time in the next two weeks.

@josharian


Contributor

josharian commented Mar 11, 2017

@rasky cool. I think @martisch has also been thinking about this. If/when someone starts on this in earnest, just drop a note here to avoid duplicated work. Since it seems there's plenty of other folks interested, I will turn my attention elsewhere.

@gopherbot


gopherbot commented Mar 12, 2017

CL https://golang.org/cl/38091 mentions this issue.

@huguesb


Contributor

huguesb commented Mar 12, 2017

@josharian I have a patchset up that passes tests and shows some improvements on benchmarks. Are you the right person to review this?

name                old time/op  new time/op  delta
MapAssignInt32_255  24.6ns ± 0%  19.5ns ± 0%   ~     (p=1.000 n=1+1)
MapAssignInt32_64k  43.9ns ± 0%  37.3ns ± 0%   ~     (p=1.000 n=1+1)
MapAssignInt64_255  24.8ns ± 0%  19.7ns ± 0%   ~     (p=1.000 n=1+1)
MapAssignInt64_64k  45.5ns ± 0%  37.8ns ± 0%   ~     (p=1.000 n=1+1)
MapAssignStr_255    46.3ns ± 0%  25.1ns ± 0%   ~     (p=1.000 n=1+1)
MapAssignStr_64k    79.6ns ± 0%  56.8ns ± 0%   ~     (p=1.000 n=1+1)

Not quite sure why benchstat doesn't show a significant delta...

@mvdan


Member

mvdan commented Mar 12, 2017

@huguesb you need to run each benchmark multiple times for benchstat to show lower p values. Try with 5 or 10 runs before and after.

@huguesb


Contributor

huguesb commented Mar 12, 2017

@mvdan Yeah, I screwed up the result concatenation the first time. Here goes

name                  old time/op  new time/op  delta
MapAssignInt32_255-8  25.0ns ± 3%  19.0ns ± 4%  -24.20%  (p=0.000 n=10+10)
MapAssignInt32_64k-8  44.6ns ± 3%  37.6ns ± 3%  -15.72%  (p=0.000 n=10+9)
MapAssignInt64_255-8  24.8ns ± 4%  18.9ns ± 4%  -23.82%  (p=0.000 n=10+10)
MapAssignInt64_64k-8  45.2ns ± 1%  38.5ns ± 6%  -14.86%  (p=0.000 n=9+9)
MapAssignStr_255-8    46.9ns ± 1%  24.6ns ± 4%  -47.49%  (p=0.000 n=8+10)
MapAssignStr_64k-8    80.9ns ± 4%  53.3ns ± 3%  -34.10%  (p=0.000 n=10+9)
@mvdan


Member

mvdan commented Mar 12, 2017

Please add the numbers to the CL too. Thanks for working on this!

@huguesb


Contributor

huguesb commented Mar 12, 2017

@mvdan Done

gopherbot pushed a commit that referenced this issue Mar 13, 2017

runtime: add mapassign_fast*
Add benchmarks for map assignment with int32/int64/string key

Benchmark results on darwin/amd64

name                  old time/op  new time/op  delta
MapAssignInt32_255-8  24.7ns ± 3%  17.4ns ± 2%  -29.75%  (p=0.000 n=10+10)
MapAssignInt32_64k-8  45.5ns ± 4%  37.6ns ± 4%  -17.18%  (p=0.000 n=10+10)
MapAssignInt64_255-8  26.0ns ± 3%  17.9ns ± 4%  -31.03%  (p=0.000 n=10+10)
MapAssignInt64_64k-8  46.9ns ± 5%  38.7ns ± 2%  -17.53%  (p=0.000 n=9+10)
MapAssignStr_255-8    47.8ns ± 3%  24.8ns ± 4%  -48.01%  (p=0.000 n=10+10)
MapAssignStr_64k-8    83.0ns ± 3%  51.9ns ± 3%  -37.45%  (p=0.000 n=10+9)

name                     old time/op    new time/op    delta
BinaryTree17-8              3.11s ±19%     2.78s ± 3%    ~     (p=0.095 n=5+5)
Fannkuch11-8                3.26s ± 1%     3.21s ± 2%    ~     (p=0.056 n=5+5)
FmtFprintfEmpty-8          50.3ns ± 1%    50.8ns ± 2%    ~     (p=0.246 n=5+5)
FmtFprintfString-8         82.7ns ± 4%    80.1ns ± 5%    ~     (p=0.238 n=5+5)
FmtFprintfInt-8            82.6ns ± 2%    81.9ns ± 3%    ~     (p=0.508 n=5+5)
FmtFprintfIntInt-8          124ns ± 4%     121ns ± 3%    ~     (p=0.111 n=5+5)
FmtFprintfPrefixedInt-8     158ns ± 6%     160ns ± 2%    ~     (p=0.341 n=5+5)
FmtFprintfFloat-8           249ns ± 2%     245ns ± 2%    ~     (p=0.095 n=5+5)
FmtManyArgs-8               513ns ± 2%     519ns ± 3%    ~     (p=0.151 n=5+5)
GobDecode-8                7.48ms ±12%    7.11ms ± 2%    ~     (p=0.222 n=5+5)
GobEncode-8                6.25ms ± 1%    6.03ms ± 2%  -3.56%  (p=0.008 n=5+5)
Gzip-8                      252ms ± 4%     252ms ± 4%    ~     (p=1.000 n=5+5)
Gunzip-8                   38.4ms ± 3%    38.6ms ± 2%    ~     (p=0.690 n=5+5)
HTTPClientServer-8         76.9µs ±41%    66.4µs ± 6%    ~     (p=0.310 n=5+5)
JSONEncode-8               16.5ms ± 3%    16.7ms ± 3%    ~     (p=0.421 n=5+5)
JSONDecode-8               54.6ms ± 1%    54.3ms ± 2%    ~     (p=0.548 n=5+5)
Mandelbrot200-8            4.45ms ± 3%    4.47ms ± 1%    ~     (p=0.841 n=5+5)
GoParse-8                  3.43ms ± 1%    3.32ms ± 2%  -3.28%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-8      88.2ns ± 3%    89.4ns ± 2%    ~     (p=0.333 n=5+5)
RegexpMatchEasy0_1K-8       205ns ± 1%     206ns ± 1%    ~     (p=0.905 n=5+5)
RegexpMatchEasy1_32-8      85.1ns ± 1%    85.5ns ± 5%    ~     (p=0.690 n=5+5)
RegexpMatchEasy1_1K-8       365ns ± 1%     371ns ± 9%    ~     (p=1.000 n=5+5)
RegexpMatchMedium_32-8      129ns ± 2%     128ns ± 3%    ~     (p=0.730 n=5+5)
RegexpMatchMedium_1K-8     39.8µs ± 0%    39.7µs ± 4%    ~     (p=0.730 n=4+5)
RegexpMatchHard_32-8       1.99µs ± 3%    2.05µs ±16%    ~     (p=0.794 n=5+5)
RegexpMatchHard_1K-8       59.3µs ± 1%    60.3µs ± 7%    ~     (p=1.000 n=5+5)
Revcomp-8                   1.36s ±63%     0.52s ± 5%    ~     (p=0.095 n=5+5)
Template-8                 62.6ms ±14%    60.5ms ± 5%    ~     (p=0.690 n=5+5)
TimeParse-8                 330ns ± 2%     324ns ± 2%    ~     (p=0.087 n=5+5)
TimeFormat-8                350ns ± 3%     340ns ± 1%  -2.86%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
GobDecode-8               103MB/s ±11%   108MB/s ± 2%    ~     (p=0.222 n=5+5)
GobEncode-8               123MB/s ± 1%   127MB/s ± 2%  +3.71%  (p=0.008 n=5+5)
Gzip-8                   77.1MB/s ± 4%  76.9MB/s ± 3%    ~     (p=1.000 n=5+5)
Gunzip-8                  505MB/s ± 3%   503MB/s ± 2%    ~     (p=0.690 n=5+5)
JSONEncode-8              118MB/s ± 3%   116MB/s ± 3%    ~     (p=0.421 n=5+5)
JSONDecode-8             35.5MB/s ± 1%  35.8MB/s ± 2%    ~     (p=0.397 n=5+5)
GoParse-8                16.9MB/s ± 1%  17.4MB/s ± 2%  +3.45%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-8     363MB/s ± 3%   358MB/s ± 2%    ~     (p=0.421 n=5+5)
RegexpMatchEasy0_1K-8    4.98GB/s ± 1%  4.97GB/s ± 1%    ~     (p=0.548 n=5+5)
RegexpMatchEasy1_32-8     376MB/s ± 1%   375MB/s ± 5%    ~     (p=0.690 n=5+5)
RegexpMatchEasy1_1K-8    2.80GB/s ± 1%  2.76GB/s ± 9%    ~     (p=0.841 n=5+5)
RegexpMatchMedium_32-8   7.73MB/s ± 1%  7.76MB/s ± 3%    ~     (p=0.730 n=5+5)
RegexpMatchMedium_1K-8   25.8MB/s ± 0%  25.8MB/s ± 4%    ~     (p=0.651 n=4+5)
RegexpMatchHard_32-8     16.1MB/s ± 3%  15.7MB/s ±14%    ~     (p=0.794 n=5+5)
RegexpMatchHard_1K-8     17.3MB/s ± 1%  17.0MB/s ± 7%    ~     (p=0.984 n=5+5)
Revcomp-8                 273MB/s ±83%   488MB/s ± 5%    ~     (p=0.095 n=5+5)
Template-8               31.1MB/s ±13%  32.1MB/s ± 5%    ~     (p=0.690 n=5+5)

Updates #19495

Change-Id: I116e9a2a4594769318b22d736464de8a98499909
Reviewed-on: https://go-review.googlesource.com/38091
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@navytux


Contributor

navytux commented Mar 14, 2017

@huguesb, @josharian thanks for working on this. After the recent patch (ec091b6), timings improve a bit compared to #19495 (comment):

map.go case          Time (map[int32]int32)  Time (map[int]int)
a[i & 0xffff] = i    2.921                   2.991
a[i & 0x7f] = i      1.121                   1.124
a[i] = i; niter=1E7  2.438                   2.563

but Go is still 3x-7.5x slower than Node.js.

For the map[int32]int32 case the profile looks like this:

      flat  flat%   sum%        cum   cum%
   20.60ms 69.36% 69.36%       23ms 77.44%  runtime.mapassign_fast32 /home/kirr/src/tools/go/go/src/runtime/hashmap_fast.go
    3.60ms 12.12% 81.48%    29.70ms   100%  main.main /home/kirr/tmp/trashme/map/map.go
    3.10ms 10.44% 91.92%     3.10ms 10.44%  runtime.mapassign_fast32 /home/kirr/src/tools/go/go/src/runtime/stubs.go
    2.40ms  8.08%   100%     2.40ms  8.08%  runtime.aeshash32 /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
         0     0%   100%    29.70ms   100%  runtime.main /home/kirr/src/tools/go/go/src/runtime/proc.go
@davecheney


Contributor

davecheney commented Mar 14, 2017

@navytux are you sure the js benchmark is correct? The code does not reference any values added to the map during the loop, so a smart compiler will optimise away the whole loop.

@navytux


Contributor

navytux commented Mar 14, 2017

@davecheney thanks for asking. Yes, I checked it this way:

  • the time of executing map.js grows if I increase niter. This proves the loop is not eliminated as a whole, but it does not rule out that the compiler optimizes the assignments away.
  • there is console.log(a) at the end. If it is enabled, the compiler cannot optimize the map assignments away, at least unless it is smart enough to compute the whole result just from analyzing the loop structure. So e.g. for the a[i & 0x7f] = i case, with the following patch:
--- a/map.js.kirr
+++ b/map.js
@@ -4,12 +4,12 @@ function main() {
     var a = {};
 
     for (var i = 0; i < 100000000; i++) {
-        a[i & 0xffff] = i;
-        //a[i & 0x7f] = i;
+        //a[i & 0xffff] = i;
+        a[i & 0x7f] = i;
         //a[i] = i;
     }
 
-    //console.log(a)
+    console.log(a)
 }
 
 main()

it is:

kirr@deco:~/tmp/trashme/map$ time ./map.js
{ '0': 99999872,
  '1': 99999873,
  '2': 99999874,
  '3': 99999875,
...
  '125': 99999997,
  '126': 99999998,
  '127': 99999999 }

real    0m0,396s
user    0m0,384s
sys     0m0,012s

This is only a bit more than the 0.377s we had for this case without console.log(). To me this indicates the map assignments and the loop are not optimized away.

That said, I'm a complete newbie in JavaScript and could be missing something.

Thanks again,
Kirill

@rasky


Member

rasky commented Mar 14, 2017

Can you check how things change if you manually unroll the loop 8 times in both languages?

@navytux


Contributor

navytux commented Mar 14, 2017

I looked around a bit to see how nodejs compiles this (I used nodejs --print-opt-code --code-comments ... from http://stackoverflow.com/questions/10953173/dump-v8-jit-output-from-node). Showing only the loop itself:

--- Raw source ---
() {
    var a = {};

    for (var i = 0; i < 100000000; i++) {
        //a[i & 0xffff] = i;
        a[i & 0x7f] = i;
        //a[i] = i;
    }

    console.log(a)
}


--- Optimized code ---
optimization_id = 0
source_position = 77
kind = OPTIMIZED_FUNCTION
name = main
stack_slots = 7
compiler = crankshaft
Instructions (size = 857)
0x2c5a2f84bee0     0  55             push rbp
0x2c5a2f84bee1     1  4889e5         REX.W movq rbp,rsp
0x2c5a2f84bee4     4  56             push rsi
0x2c5a2f84bee5     5  57             push rdi
0x2c5a2f84bee6     6  4883ec38       REX.W subq rsp,0x38
                  ;;; <@0,#0> -------------------- B0 --------------------

...

                  ;;; <@66,#58> -------------------- B5 (loop header) --------------------
0x2c5a2f84c011   305  4c894db8       REX.W movq [rbp-0x48],r9    ;; debug: position 121
                  ;;; <@69,#62> compare-numeric-and-branch
0x2c5a2f84c015   309  4181f900e1f505 cmpl r9,0x5f5e100
0x2c5a2f84c01c   316  0f8d44000000   jge 390  (0x2c5a2f84c066)
                  ;;; <@70,#63> -------------------- B6 (unreachable/replaced) --------------------                                      
                  ;;; <@74,#69> -------------------- B7 --------------------                                                             
                  ;;; <@76,#71> stack-check
0x2c5a2f84c022   322  493ba548080000 REX.W cmpq rsp,[r13+0x848]                                                                          
0x2c5a2f84c029   329  0f8269010000   jc 696  (0x2c5a2f84c198)                                                                            
                  ;;; <@77,#71> gap  
0x2c5a2f84c02f   335  498bc1         REX.W movq rax,r9                                                                                   
                  ;;; <@78,#75> bit-i
0x2c5a2f84c032   338  83e07f         andl rax,0x7f           ;; debug: position 182                                                      
                  ;;; <@80,#100> smi-tag
0x2c5a2f84c035   341  418bd1         movl rdx,r9
0x2c5a2f84c038   344  48c1e220       REX.W shlq rdx, 32                                                                                  
                  ;;; <@82,#101> smi-tag
0x2c5a2f84c03c   348  8bc8           movl rcx,rax
0x2c5a2f84c03e   350  48c1e120       REX.W shlq rcx, 32                                                                                  
                  ;;; <@83,#101> gap 
0x2c5a2f84c042   354  488bf7         REX.W movq rsi,rdi                                                                                  
0x2c5a2f84c045   357  488bc2         REX.W movq rax,rdx                                                                                  
0x2c5a2f84c048   360  488bd3         REX.W movq rdx,rbx                                                                                  
                  ;;; <@84,#78> store-keyed-generic
0x2c5a2f84c04b   363  e81068eeff     call KeyedStoreIC_PreMonomorphic  (0x2c5a2f732860)    ;; code: KEYED_STORE_IC, PREMONOMORPHIC       
                  ;;; <@86,#79> lazy-bailout                                                                                             
                  ;;; <@87,#79> gap  
0x2c5a2f84c050   368  488b45b8       REX.W movq rax,[rbp-0x48]                                                                           
                  ;;; <@88,#82> add-i
0x2c5a2f84c054   372  448d4801       leal r9,[rax+0x1]       ;; debug: position 135                                                      
                  ;;; <@90,#85> gap  
0x2c5a2f84c058   376  4c8b45c0       REX.W movq r8,[rbp-0x40]
0x2c5a2f84c05c   380  488b7dc8       REX.W movq rdi,[rbp-0x38]                                                                           
0x2c5a2f84c060   384  488b5dd0       REX.W movq rbx,[rbp-0x30]                                                                           
                  ;;; <@91,#85> goto 
0x2c5a2f84c064   388  ebab           jmp 305  (0x2c5a2f84c011)
                  ;;; <@92,#66> -------------------- B8 (unreachable/replaced) --------------------                                      
                  ;;; <@96,#86> -------------------- B9 --------------------                                                             
                  ;;; <@97,#86> gap  
0x2c5a2f84c066   390  488b45c8       REX.W movq rax,[rbp-0x38]    ;; debug: position 234    

...

--- End code ---                                    

It can be seen that the loop is there (100000000 = 0x5f5e100), the & 0x7f is there, and the map assignment appears to be the call to KeyedStoreIC_PreMonomorphic. So to me this once again shows that things are not optimized away.

@navytux


Contributor

navytux commented Mar 14, 2017

@rasky, sure. For the following programs:

package main
//import "fmt"

func main() {
    //a := make(map[int]int)
    a := make(map[int32]int32)

    for i := int32(0); i < 100000000; {
        //a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i

        a[i & 0xffff] = i       // 0
        i++
        a[i & 0xffff] = i       // 1
        i++
        a[i & 0xffff] = i       // 2
        i++
        a[i & 0xffff] = i       // 3
        i++
        a[i & 0xffff] = i       // 4
        i++
        a[i & 0xffff] = i       // 5
        i++
        a[i & 0xffff] = i       // 6
        i++
        a[i & 0xffff] = i       // 7
        i++
    }

    //fmt.Println(a)
}
#!/usr/bin/env nodejs

function main() {
    var a = {};

    for (var i = 0; i < 100000000;) {
        //a[i & 0xffff] = i;
        //a[i & 0x7f] = i;
        //a[i] = i;

        a[i & 0xffff] = i;      // 0
        i++;
        a[i & 0xffff] = i;      // 1
        i++;
        a[i & 0xffff] = i;      // 2
        i++;
        a[i & 0xffff] = i;      // 3
        i++;
        a[i & 0xffff] = i;      // 4
        i++;
        a[i & 0xffff] = i;      // 5
        i++;
        a[i & 0xffff] = i;      // 6
        i++;
        a[i & 0xffff] = i;      // 7
        i++;
    }

    //console.log(a)
}

main()

timings are:

Program        Time (seconds, best of 5)
map-unroll.go  2.938
map-unroll.js  0.307

and just in case if I uncomment console.log(a) in map-unroll.js it becomes:

--- a/map-unroll.js.kirr
+++ b/map-unroll.js
@@ -26,7 +26,7 @@ function main() {
        i++;
     }
 
-    //console.log(a)
+    console.log(a)
 }
 
 main()
kirr@deco:~/tmp/trashme/map$ ./map-unroll.js |head
{ '0': 99942400,
  '1': 99942401,
  '2': 99942402,
  '3': 99942403,
  '4': 99942404,
  '5': 99942405,
  '6': 99942406,
  '7': 99942407,
  '8': 99942408,
  '9': 99942409,
events.js:141
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at exports._errnoException (util.js:907:11)
    at WriteWrap.afterWrite (net.js:787:14)
kirr@deco:~/tmp/trashme/map$ time ./map-unroll.js >/dev/null 

real    0m0,501s
user    0m0,496s
sys     0m0,008s

So it is 0.501s vs 0.307s, but there are 64K entries to print, and it is reasonable to expect that to add time. Still, even 0.5s vs 2.9s (and the Go case does not print anything) is a big difference.

Thanks,
Kirill

@gopherbot


gopherbot commented Mar 15, 2017

CL https://golang.org/cl/38172 mentions this issue.

@huguesb


Contributor

huguesb commented Mar 15, 2017

I've been doing some profiling and it looks like a good chunk of time is spent iterating inside buckets. My feeling is that the memory layout is distinctly suboptimal because every insert needs to check all cells of the matching bucket in case the matching entry is at the end of the bucket. This implies a constant factor increase up to 8x. Assuming keys are evenly distributed inside buckets, which seems reasonable in this test case, I would expect an average of 4 probes per insert, which seems roughly in line with the observed performance difference with other hash tables (which I suspect use some variant of open addressing and do much less probing in practice).

I'm going to experiment with a few micro-optimizations but I think eventually we'll have to use a different memory layout to get significant improvements.

@randall77


Contributor

randall77 commented Mar 15, 2017

In the common case you only need to probe the 8 tophash bytes; you don't need to look at the key and value slots themselves.
Is it really the case that

for i := 0; i < 8; i++ {
  if b.tophash[i] == x {
    .. unlikely if value is not already in map ..
  }
}

takes a long time?

@huguesb


Contributor

huguesb commented Mar 15, 2017

ROUTINE ======================== runtime.mapassign_fast64 in /Users/hugues/go/src/runtime/hashmap_fast.go
   19.60ms    23.20ms (flat, cum) 74.84% of Total
         .          .    508:	}
         .          .    509:	h.flags &^= hashWriting
         .          .    510:	return val
         .          .    511:}
         .          .    512:
     700us      700us    513:func mapassign_fast64(t *maptype, h *hmap, key uint64) unsafe.Pointer {

...

    2.40ms        6ms    524:	hash := t.key.alg.hash(noescape(unsafe.Pointer(&key)), uintptr(h.hash0))

...

    1.30ms     1.30ms    548:		for i := uintptr(0); i < bucketCnt; i++ {
    9.40ms     9.40ms    549:			if b.tophash[i] != top {
    1.20ms     1.20ms    550:				if b.tophash[i] == empty && inserti == nil {
@josharian


Contributor

josharian commented Mar 15, 2017

Maybe that branch is systematically mispredicted, leading to lots of stalls.

gopherbot pushed a commit that referenced this issue Mar 21, 2017

runtime: add mapdelete_fast*
Add benchmarks for map delete with int32/int64/string key

Benchmark results on darwin/amd64

name                 old time/op  new time/op  delta
MapDelete/Int32/1-8   151ns ± 8%    99ns ± 3%  -34.39%  (p=0.008 n=5+5)
MapDelete/Int32/2-8   128ns ± 2%   111ns ±15%  -13.40%  (p=0.040 n=5+5)
MapDelete/Int32/4-8   128ns ± 5%   114ns ± 2%  -10.82%  (p=0.008 n=5+5)
MapDelete/Int64/1-8   144ns ± 0%   104ns ± 3%  -27.53%  (p=0.016 n=4+5)
MapDelete/Int64/2-8   153ns ± 1%   126ns ± 3%  -17.17%  (p=0.008 n=5+5)
MapDelete/Int64/4-8   178ns ± 3%   136ns ± 2%  -23.60%  (p=0.008 n=5+5)
MapDelete/Str/1-8     187ns ± 3%   171ns ± 3%   -8.54%  (p=0.008 n=5+5)
MapDelete/Str/2-8     221ns ± 3%   206ns ± 4%   -7.18%  (p=0.016 n=5+4)
MapDelete/Str/4-8     256ns ± 5%   232ns ± 2%   -9.36%  (p=0.016 n=4+5)

name                     old time/op    new time/op    delta
BinaryTree17-8              2.78s ± 7%     2.70s ± 1%    ~     (p=0.151 n=5+5)
Fannkuch11-8                3.21s ± 2%     3.19s ± 1%    ~     (p=0.310 n=5+5)
FmtFprintfEmpty-8          49.1ns ± 3%    50.2ns ± 2%    ~     (p=0.095 n=5+5)
FmtFprintfString-8         78.6ns ± 4%    80.2ns ± 5%    ~     (p=0.460 n=5+5)
FmtFprintfInt-8            79.7ns ± 1%    81.0ns ± 3%    ~     (p=0.103 n=5+5)
FmtFprintfIntInt-8          117ns ± 2%     119ns ± 0%    ~     (p=0.079 n=5+4)
FmtFprintfPrefixedInt-8     153ns ± 1%     146ns ± 3%  -4.19%  (p=0.024 n=5+5)
FmtFprintfFloat-8           239ns ± 1%     237ns ± 1%    ~     (p=0.246 n=5+5)
FmtManyArgs-8               506ns ± 2%     509ns ± 2%    ~     (p=0.238 n=5+5)
GobDecode-8                7.06ms ± 4%    6.86ms ± 1%    ~     (p=0.222 n=5+5)
GobEncode-8                6.01ms ± 5%    5.87ms ± 2%    ~     (p=0.222 n=5+5)
Gzip-8                      246ms ± 4%     236ms ± 1%  -4.12%  (p=0.008 n=5+5)
Gunzip-8                   37.7ms ± 4%    37.3ms ± 1%    ~     (p=0.841 n=5+5)
HTTPClientServer-8         64.9µs ± 1%    64.4µs ± 0%  -0.80%  (p=0.032 n=5+4)
JSONEncode-8               16.0ms ± 2%    16.2ms ±11%    ~     (p=0.548 n=5+5)
JSONDecode-8               53.2ms ± 2%    53.1ms ± 4%    ~     (p=1.000 n=5+5)
Mandelbrot200-8            4.33ms ± 2%    4.32ms ± 2%    ~     (p=0.841 n=5+5)
GoParse-8                  3.24ms ± 2%    3.27ms ± 4%    ~     (p=0.690 n=5+5)
RegexpMatchEasy0_32-8      86.2ns ± 1%    85.2ns ± 3%    ~     (p=0.286 n=5+5)
RegexpMatchEasy0_1K-8       198ns ± 2%     199ns ± 1%    ~     (p=0.310 n=5+5)
RegexpMatchEasy1_32-8      82.6ns ± 2%    81.8ns ± 1%    ~     (p=0.294 n=5+5)
RegexpMatchEasy1_1K-8       359ns ± 2%     354ns ± 1%  -1.39%  (p=0.048 n=5+5)
RegexpMatchMedium_32-8      123ns ± 2%     123ns ± 1%    ~     (p=0.905 n=5+5)
RegexpMatchMedium_1K-8     38.2µs ± 2%    38.6µs ± 8%    ~     (p=0.690 n=5+5)
RegexpMatchHard_32-8       1.92µs ± 2%    1.91µs ± 5%    ~     (p=0.460 n=5+5)
RegexpMatchHard_1K-8       57.6µs ± 1%    57.0µs ± 2%    ~     (p=0.310 n=5+5)
Revcomp-8                   483ms ± 7%     441ms ± 1%  -8.79%  (p=0.016 n=5+4)
Template-8                 58.0ms ± 1%    58.2ms ± 7%    ~     (p=0.310 n=5+5)
TimeParse-8                 324ns ± 6%     312ns ± 2%    ~     (p=0.087 n=5+5)
TimeFormat-8                330ns ± 1%     329ns ± 1%    ~     (p=0.968 n=5+5)

name                     old speed      new speed      delta
GobDecode-8               109MB/s ± 4%   112MB/s ± 1%    ~     (p=0.222 n=5+5)
GobEncode-8               128MB/s ± 5%   131MB/s ± 2%    ~     (p=0.222 n=5+5)
Gzip-8                   78.9MB/s ± 4%  82.3MB/s ± 1%  +4.25%  (p=0.008 n=5+5)
Gunzip-8                  514MB/s ± 4%   521MB/s ± 1%    ~     (p=0.841 n=5+5)
JSONEncode-8              121MB/s ± 2%   120MB/s ±10%    ~     (p=0.548 n=5+5)
JSONDecode-8             36.5MB/s ± 2%  36.6MB/s ± 4%    ~     (p=1.000 n=5+5)
GoParse-8                17.9MB/s ± 2%  17.7MB/s ± 4%    ~     (p=0.730 n=5+5)
RegexpMatchEasy0_32-8     371MB/s ± 1%   375MB/s ± 3%    ~     (p=0.310 n=5+5)
RegexpMatchEasy0_1K-8    5.15GB/s ± 1%  5.13GB/s ± 1%    ~     (p=0.548 n=5+5)
RegexpMatchEasy1_32-8     387MB/s ± 2%   391MB/s ± 1%    ~     (p=0.310 n=5+5)
RegexpMatchEasy1_1K-8    2.85GB/s ± 2%  2.89GB/s ± 1%    ~     (p=0.056 n=5+5)
RegexpMatchMedium_32-8   8.07MB/s ± 2%  8.06MB/s ± 1%    ~     (p=0.730 n=5+5)
RegexpMatchMedium_1K-8   26.8MB/s ± 2%  26.6MB/s ± 7%    ~     (p=0.690 n=5+5)
RegexpMatchHard_32-8     16.7MB/s ± 2%  16.7MB/s ± 5%    ~     (p=0.421 n=5+5)
RegexpMatchHard_1K-8     17.8MB/s ± 1%  18.0MB/s ± 2%    ~     (p=0.310 n=5+5)
Revcomp-8                 527MB/s ± 6%   577MB/s ± 1%  +9.44%  (p=0.016 n=5+4)
Template-8               33.5MB/s ± 1%  33.4MB/s ± 7%    ~     (p=0.310 n=5+5)

Updates #19495

Change-Id: Ib9ece1690813d9b4788455db43d30891e2138df5
Reviewed-on: https://go-review.googlesource.com/38172
Reviewed-by: Hugues Bruant <hugues.bruant@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@egonelbre (Contributor) commented Mar 26, 2017

I suspect that node/python switch the underlying implementation based on the data being stored. For example, here alpha takes 0.4ms and beta takes 10.7ms:

function alpha() {
	var a = {};
	a[0xffff] = 10;
	for (var i = 0; i < 100000000; i++) {
		a[i & 0xffff] = i;
	}
}
function beta() {
	var a = {};
	a[0xfffffff] = 10;
	for (var i = 0; i < 100000000; i++) {
		a[i & 0xffff] = i;
	}
}
@egonelbre (Contributor) commented Mar 26, 2017

And as a proof of concept, here is a Go implementation that achieves similar characteristics: https://play.golang.org/p/7y3SVXAtA2. I'm not sure whether this is exactly what V8 does, but I'm guessing it is something similar.
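A sketch of that switching idea in Go (not the playground code above; the type names and the denseLimit threshold are made up for illustration): small non-negative integer keys live in a dense slice, and the first out-of-range key forces a one-time migration to a regular hash map, roughly like V8's fast-elements vs. dictionary modes.

```go
package main

import "fmt"

const denseLimit = 1 << 16 // keys in [0, denseLimit) stay in the array (illustrative threshold)

// intMap is a toy hybrid container: dense slice for small integer keys,
// hash-map fallback once any key falls outside the dense range.
type intMap struct {
	dense   []int
	present []bool
	slow    map[int]int // nil until we fall back
}

func newIntMap() *intMap {
	return &intMap{dense: make([]int, denseLimit), present: make([]bool, denseLimit)}
}

func (m *intMap) set(k, v int) {
	if m.slow == nil && k >= 0 && k < denseLimit {
		m.dense[k] = v
		m.present[k] = true
		return
	}
	if m.slow == nil { // first out-of-range key: migrate everything to a hash map
		m.slow = make(map[int]int)
		for i, ok := range m.present {
			if ok {
				m.slow[i] = m.dense[i]
			}
		}
		m.dense, m.present = nil, nil
	}
	m.slow[k] = v
}

func (m *intMap) get(k int) (int, bool) {
	if m.slow != nil {
		v, ok := m.slow[k]
		return v, ok
	}
	if k < 0 || k >= denseLimit || !m.present[k] {
		return 0, false
	}
	return m.dense[k], true
}

func main() {
	m := newIntMap()
	for i := 0; i < 1000; i++ {
		m.set(i&0xff, i) // all writes hit the dense path
	}
	v, ok := m.get(0x7f)
	fmt.Println(v, ok)
}
```

The dense path is a bounds check plus an indexed store, which is what makes the V8 "alpha" case so much faster than a general hash table.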

@navytux (Contributor) commented Mar 26, 2017

@egonelbre thanks for noticing this. Some relevant links about nodejs:

https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.h#L3519
https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.cc#L14934
https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.cc#L12440
https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.cc#L12499
https://github.com/nodejs/node/blob/v4.x/deps/v8/src/elements-kind.h#L13
(I've checked nodejs v4.x branch since it matches nodejs which is currently packaged by Debian)


PyPy does not switch its dict implementation this way: timing for the following program stays the same regardless of whether the high-index preset is enabled or not:

#!/usr/bin/env pypy

def main():
    a = {}

    if 1:
        # set high index to make sure dict implementation is hash-table, not array
        a[0x100000000] = 10
        a[1<<63 - 1] = 10


    for i in range(100000000):
        a[i & 0xffff] = i
        #a[i & 0x7f] = i
        #a[i] = i

    print(a)

if __name__ == '__main__':
    main()

Running it gives:

$ ./map.py |head -c80
{4294967296: 10, 4611686018427387904: 10, 0: 99942400, 1: 99942401, 2: 99942402,Traceback (most recent call last):
  File "./map.py", line 20, in <module>
    main()
  File "./map.py", line 17, in main
    print(a)
IOError: [Errno 32] Broken pipe: '<fdopen>'
$ time ./map.py >/dev/null 

real    0m2,005s
user    0m1,884s
sys     0m0,116s

The go version:

package main
//import "fmt"

func main() {
    a := make(map[int32]int32)

    for i := int32(0); i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
    }

    //fmt.Println(a)
}

Still takes ~1.4x - 1.5x longer compared to pypy:

$ go build map2.go
$ time ./map2             # best of 5

real    0m2,887s
user    0m2,884s
sys     0m0,004s

(and pypy/nodejs programs do printing, while go does not - with printing go time increases to ~ 3.0s)

Pypy dict related links:

https://bitbucket.org/pypy/pypy/src/9d2b5856f4c3b3ff1f6b2098afd4de299aaaa9cf/rpython/rtyper/lltypesystem/rdict.py?at=default&fileviewer=file-view-default#rdict.py-18
https://bitbucket.org/pypy/pypy/src/9d2b5856f4c3b3ff1f6b2098afd4de299aaaa9cf/pypy/objspace/std/dictmultiobject.py?at=default&fileviewer=file-view-default#dictmultiobject.py-1270
https://bitbucket.org/pypy/pypy/src/9d2b5856f4c3b3ff1f6b2098afd4de299aaaa9cf/pypy/objspace/std/dictmultiobject.py?at=default&fileviewer=file-view-default#dictmultiobject.py-676

@philhofer (Contributor) commented Mar 28, 2017

@huguesb @josharian Why guess when you can measure? :)

If you're on linux and have a relatively modern Intel chip (or something with a decent PMU), you can run perf record -e cache-misses:pp -e cycles:pp and perf record -e branch-misses:pp -e cycles:pp on your executable (followed by perf report) and see which event a) occurs more frequently, and b) which one consumes more cycles. May also be worth looking at dtlb_load_misses as well.

@josharian (Contributor) commented Mar 28, 2017

@philhofer if, say, you're typing on your phone or your only dev machine runs OS X. Or both, as it happens. :) But I'd love to hear the results!

@philhofer (Contributor) commented Mar 28, 2017

Yes, this appears to be a dcache effect, not a branch misprediction. (Surprisingly, the branch in question doesn't show up at all when measuring mispredictions.)

Event counts for cache-misses:pp (you have to squint a little bit in order to compensate for the interrupt skid). I see very similar results for dTLB-load-misses.

       │             for {                                                                                                                            ▒
       │                     for i := uintptr(0); i < bucketCnt; i++ {                                                                                ▒
  0.90 │ d4:   cmp    ip, #8                                                                                                                          ▒
       │     ↓ bcs    1a0                                                                                                                             ▒
       │                             if b.tophash[i] != top {                                                                                         ▒
       │       ldrsb  fp, [r2]                                                                                                                        ▒
       │       add    lr, r2, ip                                                                                                                      ◆
  2.23 │       ldrb   r0, [lr]                                                                                                                        ▒
       │       and    r3, r5, #255                                                                                                                    ▒
  2.51 │       cmp    r3, r0                                                                                                                          ▒
       │     ↓ beq    13c                                                                                                                             ▒
       │                                     if b.tophash[i] == empty && inserti == nil {                                                             ▒
       │       cmp    r0, #0                                                                                                                          ▒
 26.86 │     ↓ bne    104                                                                                                                             ▒
       │       cmp    r7, #0                                                                                                                          ▒
       │     ↓ beq    114                                                                                                                             ▒
       │                     for i := uintptr(0); i < bucketCnt; i++ {                                                                                ▒
       │104:   add    ip, ip, #1                                                                                                                      ▒
       │             bucket := hash & (uintptr(1)<<h.B - 1)                                                                                           ▒
  3.13 │       ldr    r0, [sp, #28]                                                                                                                   ▒
       │       mov    r3, #1                                                                                                                          ▒
       │                     for i := uintptr(0); i < bucketCnt; i++ {                                                                                ▒
  0.07 │     ↑ b      d4

The hardware I'm testing on has a particularly anemic TLB (12 entries!), so it tends to exaggerate these sorts of effects.

On an unrelated note, the instruction sequence

ldr r0, [sp, #28]
mov r3, #1

shouldn't be in that loop body at all.

@huguesb (Contributor) commented Mar 29, 2017

Relevant perf output for cache-misses on amd64 (Intel Core i7-3610QM from 2012) . Not quite sure what to make of it.

       │             var inserti *uint8                                                                                                                                                                                                           ▒
       │             var insertk unsafe.Pointer                                                                                                                                                                                                   ▒
       │             var val unsafe.Pointer                                                                                                                                                                                                       ▒
       │             for {                                                                                                                                                                                                                        ◆
       │                     for i := uintptr(0); i < bucketCnt; i++ {                                                                                                                                                                            ▒
  2.02 │ e5:   cmp    $0x8,%r11                                                                                                                                                                                                                   ▒
       │     ↓ jae    192                                                                                                                                                                                                                         ▒
       │                             if b.tophash[i] != top {                                                                                                                                                                                     ▒
  0.26 │       test   %al,(%rdx)                                                                                                                                                                                                                  ▒
 24.06 │       movzbl (%rdx,%r11,1),%r12d                                                                                                                                                                                                         ▒
  1.16 │       lea    (%rdx,%r11,1),%r13                                                                                                                                                                                                          ▒
  2.43 │       cmp    %r12b,%al                                                                                                                                                                                                                   ▒
       │     ↓ je     126                                                                                                                                                                                                                         ▒
       │                                     if b.tophash[i] == empty && inserti == nil {                                                                                                                                                         ▒
  0.90 │       test   %r12b,%r12b                                                                                                                                                                                                                 ▒
       │     ↓ jne    109                                                                                                                                                                                                                         ▒
       │       test   %rcx,%rcx                                                                                                                                                                                                                   ▒
       │     ↓ je     10e       
@philhofer (Contributor) commented Mar 29, 2017

That looks consistent with the theory that the load generated for b.tophash[i] != top is often expensive. Since tophash is only 8 bytes, I think it's fair to say that the cache miss occurs when walking from one bucket to its overflow bucket.

FWIW, I think this is more or less what you would expect: one mandatory cache miss per bucket scanned. Even worse is the fact that the buckets are far apart (for large maps), so you may take a TLB miss on top of your cache miss.
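A simplified sketch of that bucket walk (field names loosely follow runtime/hashmap.go, but the sizes and types here are illustrative): each overflow hop dereferences a pointer to a separately allocated bucket, which is where the mandatory cache miss per bucket comes from.

```go
package main

import "fmt"

const bucketCnt = 8 // slots per bucket, as in the runtime

// bucket mimics the runtime's layout: a byte of hash per slot up front,
// then keys and values, then a pointer to a separately allocated
// overflow bucket. Following that pointer lands on a different cache
// line (and possibly a different page, hence the TLB misses).
type bucket struct {
	tophash  [bucketCnt]uint8
	keys     [bucketCnt]uint64
	vals     [bucketCnt]uint64
	overflow *bucket
}

// lookup walks a bucket chain the way mapaccess does: compare the
// one-byte tophash first, and only touch the full key on a tophash match.
func lookup(b *bucket, hash uint64, key uint64) (uint64, bool) {
	top := uint8(hash >> 56)
	if top == 0 {
		top = 1 // zero is reserved to mean "empty slot" in this sketch
	}
	for ; b != nil; b = b.overflow { // each hop: a fresh cache line
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] != top {
				continue
			}
			if b.keys[i] == key {
				return b.vals[i], true
			}
		}
	}
	return 0, false
}

func main() {
	// The key lives in the overflow bucket, so finding it costs an
	// extra pointer chase beyond the primary bucket.
	b := &bucket{overflow: &bucket{}}
	b.overflow.tophash[0] = 0xAA
	b.overflow.keys[0] = 42
	b.overflow.vals[0] = 7
	v, ok := lookup(b, 0xAA<<56, 42)
	fmt.Println(v, ok)
}
```

The tophash comparison itself is cheap; the profile above shows the cost concentrating on the first load from each newly visited bucket.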

@josharian (Contributor) commented Mar 30, 2017

the cache miss occurs when walking from one bucket to its overflow bucket.

If that's the case, it might be interesting to experiment with altering loadFactor in runtime/hashmap.go. And also to instrument a little to understand how much of the time the map spends in a getting-bigger growth state and a same-size growth state. I doubt there are any silver bullets, but it might be worth looking.

@ALTree ALTree modified the milestones: Go1.10, Go1.9 Jun 6, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@IronCore864 commented Feb 5, 2018

Any solution or update to this issue?

Funnily enough, I ran into this while working on a simple LeetCode problem; it occurred to me that it might be interesting to compare the results of Python and Go, and I was scared by the result.

@randall77 (Contributor) commented Feb 5, 2018

I don't think there have been any significant changes since the mapassign_fast* functions went in (March 2017?).

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

@gopherbot commented May 31, 2018

Change https://golang.org/cl/115616 mentions this issue: runtime: use indexByte in mapassign_fast32

@gopherbot commented Jun 12, 2018

Change https://golang.org/cl/118358 mentions this issue: runtime: vectorized map bucket lookup

@gopherbot commented Oct 15, 2018

Change https://golang.org/cl/142437 mentions this issue: runtime: exit early when scanning map buckets

gopherbot pushed a commit that referenced this issue Oct 31, 2018

runtime: exit early when scanning map buckets
Divide the "empty" slot state into two, "emptyOne" and "emptyRest".
emptyOne means just that slot is empty. emptyRest means all subsequent
slots in that bucket are empty and the overflow pointer is nil.

When scanning a bucket, we can often stop at emptyRest, reducing
the total work we have to do. (This is similar to how tombstones
work in open addressing.)

Ideally, on delete we would figure out whether to zero the slot
with an emptyOne or emptyRest marker. For now, we make the safe
but non-optimal choice. (Fix in subsequent CL?)
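The early-exit scan described above can be sketched as follows (the marker values and bucket layout here are illustrative, not the runtime's actual definitions):

```go
package main

import "fmt"

const bucketCnt = 8 // slots per bucket

// Two kinds of "empty", as described above.
const (
	emptyRest = 0 // this slot is empty, and so is every later slot in this bucket and its overflows
	emptyOne  = 1 // only this slot is empty; later slots may be occupied
)

type bucket struct {
	tophash  [bucketCnt]uint8
	keys     [bucketCnt]int
	vals     [bucketCnt]int
	overflow *bucket
}

// lookup returns the value, the number of slots inspected, and whether the
// key was found. With emptyRest, a miss can stop at the first such marker
// instead of walking the entire bucket chain.
func lookup(b *bucket, top uint8, key int) (val int, scanned int, ok bool) {
	for ; b != nil; b = b.overflow {
		for i := 0; i < bucketCnt; i++ {
			scanned++
			switch b.tophash[i] {
			case emptyRest:
				return 0, scanned, false // nothing can follow: exit early
			case emptyOne:
				continue // a deleted slot; later slots may still be full
			}
			if b.tophash[i] == top && b.keys[i] == key {
				return b.vals[i], scanned, true
			}
		}
	}
	return 0, scanned, false
}

func main() {
	b := &bucket{overflow: &bucket{}}
	b.tophash[0], b.keys[0], b.vals[0] = 0xCC, 1, 100
	// Slots 1..7 and the whole overflow bucket are emptyRest (the zero value),
	// so a miss stops after 2 slots instead of scanning all 16.
	_, scanned, _ := lookup(b, 0xDD, 2)
	fmt.Println(scanned)
}
```

This is why the MegOneMap and NewSmallMap benchmarks below improve so much: misses and sparse buckets no longer pay for slots that cannot hold anything.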

This is a simpler CL than some others we've tried, including my
CL sequence 11835[5-8] and Ilya's CL 115616.

Update #19495

name                           old time/op    new time/op    delta
MegMap                           8.96ns ± 2%    8.74ns ± 6%   -2.44%  (p=0.020 n=10+10)
MegOneMap                        8.91ns ± 2%    5.53ns ± 2%  -37.99%  (p=0.000 n=10+10)
MegEqMap                         46.0µs ± 1%    45.8µs ± 3%     ~     (p=0.315 n=9+10)
MegEmptyMap                      2.50ns ± 0%    2.50ns ± 2%     ~     (p=0.957 n=8+10)
SmallStrMap                      8.54ns ± 1%    8.71ns ± 2%   +2.01%  (p=0.000 n=10+10)
MapStringKeysEight_16            8.61ns ± 3%    8.71ns ± 3%   +1.20%  (p=0.026 n=9+9)
MapStringKeysEight_32            8.54ns ± 2%    8.97ns ± 1%   +5.05%  (p=0.000 n=10+9)
MapStringKeysEight_64            8.66ns ± 2%    8.99ns ± 2%   +3.87%  (p=0.000 n=10+10)
MapStringKeysEight_1M            8.57ns ± 2%    8.95ns ± 2%   +4.51%  (p=0.000 n=10+9)
IntMap                           6.69ns ± 1%    7.46ns ± 1%  +11.60%  (p=0.000 n=9+9)
MapFirst/1                       3.69ns ± 1%    3.63ns ± 3%   -1.52%  (p=0.040 n=10+10)
MapFirst/2                       3.70ns ± 2%    3.63ns ± 2%   -1.95%  (p=0.001 n=9+9)
MapFirst/3                       3.74ns ± 2%    3.66ns ± 2%   -2.12%  (p=0.000 n=8+10)
MapFirst/4                       3.71ns ± 2%    3.66ns ± 4%     ~     (p=0.073 n=9+10)
MapFirst/5                       3.69ns ± 1%    3.62ns ± 2%   -1.88%  (p=0.000 n=9+10)
MapFirst/6                       3.68ns ± 2%    3.62ns ± 1%   -1.83%  (p=0.001 n=10+9)
MapFirst/7                       3.67ns ± 1%    3.60ns ± 1%   -1.98%  (p=0.000 n=10+8)
MapFirst/8                       3.68ns ± 2%    3.61ns ± 2%   -1.87%  (p=0.000 n=10+10)
MapFirst/9                       8.03ns ± 4%    7.89ns ± 2%   -1.76%  (p=0.007 n=10+10)
MapFirst/10                      7.99ns ± 2%    7.86ns ± 3%   -1.64%  (p=0.009 n=9+10)
MapFirst/11                      7.96ns ± 1%    7.80ns ± 2%   -2.01%  (p=0.000 n=10+10)
MapFirst/12                      7.96ns ± 1%    7.82ns ± 1%   -1.67%  (p=0.000 n=10+10)
MapFirst/13                      8.06ns ± 3%    7.92ns ± 3%     ~     (p=0.055 n=10+10)
MapFirst/14                      7.95ns ± 1%    7.80ns ± 1%   -1.88%  (p=0.000 n=10+9)
MapFirst/15                      8.01ns ± 2%    7.80ns ± 2%   -2.57%  (p=0.000 n=10+10)
MapFirst/16                      8.05ns ± 2%    7.90ns ± 2%   -1.84%  (p=0.005 n=9+10)
MapMid/1                         4.00ns ± 1%    3.94ns ± 2%   -1.30%  (p=0.021 n=8+9)
MapMid/2                         4.39ns ± 2%    4.32ns ± 4%     ~     (p=0.128 n=10+10)
MapMid/3                         4.40ns ± 2%    4.27ns ± 2%   -2.93%  (p=0.000 n=10+9)
MapMid/4                         4.76ns ± 2%    4.65ns ± 1%   -2.26%  (p=0.000 n=10+9)
MapMid/5                         4.76ns ± 1%    4.65ns ± 1%   -2.27%  (p=0.000 n=10+10)
MapMid/6                         5.11ns ± 2%    4.98ns ± 2%   -2.55%  (p=0.000 n=10+10)
MapMid/7                         5.12ns ± 1%    5.01ns ± 3%   -2.02%  (p=0.003 n=9+9)
MapMid/8                         5.71ns ± 3%    5.97ns ± 1%   +4.51%  (p=0.000 n=10+9)
MapMid/9                         8.72ns ±10%    8.89ns ±10%     ~     (p=0.458 n=9+10)
MapMid/10                        10.1ns ±15%     9.6ns ± 7%     ~     (p=0.080 n=9+10)
MapMid/11                        9.88ns ±10%    9.44ns ±11%     ~     (p=0.065 n=10+10)
MapMid/12                        9.90ns ±13%   10.04ns ± 9%     ~     (p=1.000 n=10+8)
MapMid/13                        9.67ns ±14%   10.23ns ±10%     ~     (p=0.209 n=10+9)
MapMid/14                        9.12ns ±14%    9.14ns ±13%     ~     (p=0.927 n=10+10)
MapMid/15                        9.16ns ±12%    9.15ns ±16%     ~     (p=0.955 n=10+10)
MapMid/16                        9.37ns ±11%    9.60ns ±23%     ~     (p=0.825 n=9+10)
MapLast/1                        4.08ns ± 1%    3.92ns ± 0%   -3.91%  (p=0.000 n=10+9)
MapLast/2                        4.37ns ± 1%    4.28ns ± 1%   -1.95%  (p=0.000 n=10+10)
MapLast/3                        4.94ns ± 2%    4.65ns ± 1%   -5.79%  (p=0.000 n=9+8)
MapLast/4                        5.40ns ± 3%    5.02ns ± 2%   -7.13%  (p=0.000 n=9+9)
MapLast/5                        5.88ns ± 2%    5.67ns ± 2%   -3.57%  (p=0.000 n=10+10)
MapLast/6                        6.48ns ± 3%    5.90ns ± 2%   -8.89%  (p=0.000 n=10+10)
MapLast/7                        7.01ns ± 2%    6.27ns ± 5%  -10.56%  (p=0.000 n=10+10)
MapLast/8                        7.60ns ± 2%    6.62ns ± 2%  -12.93%  (p=0.000 n=9+10)
MapLast/9                        10.6ns ± 9%    10.9ns ±15%     ~     (p=0.344 n=9+10)
MapLast/10                       11.0ns ±12%    10.9ns ±14%     ~     (p=0.985 n=10+10)
MapLast/11                       11.4ns ±12%    11.8ns ±22%     ~     (p=0.671 n=10+10)
MapLast/12                       11.6ns ±10%    12.1ns ±19%     ~     (p=0.617 n=10+10)
MapLast/13                       12.5ns ±23%    11.8ns ±13%     ~     (p=0.827 n=10+9)
MapLast/14                       10.5ns ±22%    10.4ns ± 5%     ~     (p=0.797 n=10+9)
MapLast/15                       10.0ns ±15%    10.3ns ±16%     ~     (p=0.565 n=10+10)
MapLast/16                       10.4ns ±12%    10.5ns ±13%     ~     (p=0.889 n=10+9)
MapCycle                         22.3ns ± 1%    22.0ns ± 2%   -1.43%  (p=0.002 n=9+10)
RepeatedLookupStrMapKey32        16.4ns ± 1%    16.6ns ± 1%   +1.24%  (p=0.000 n=10+9)
RepeatedLookupStrMapKey1M        35.6µs ± 0%    35.4µs ± 1%   -0.62%  (p=0.002 n=10+10)
NewEmptyMap                      5.36ns ± 1%    9.05ns ± 1%  +69.02%  (p=0.000 n=10+8)
NewSmallMap                      51.2ns ± 2%    33.7ns ± 1%  -34.22%  (p=0.000 n=10+9)
MapIter                          83.8ns ± 1%    88.4ns ± 1%   +5.55%  (p=0.000 n=10+10)
MapIterEmpty                     4.32ns ± 3%    5.54ns ± 3%  +28.12%  (p=0.000 n=10+10)
SameLengthMap                    4.31ns ± 1%    4.59ns ± 2%   +6.41%  (p=0.000 n=9+10)
BigKeyMap                        24.2ns ± 2%    24.3ns ± 1%     ~     (p=0.432 n=10+10)
BigValMap                        24.3ns ± 1%    24.4ns ± 2%     ~     (p=0.200 n=10+9)
SmallKeyMap                      17.5ns ± 1%    18.5ns ± 2%   +5.81%  (p=0.000 n=9+10)
MapPopulate/1                    29.0ns ± 4%    18.8ns ± 1%  -35.27%  (p=0.000 n=10+9)
MapPopulate/10                    736ns ± 5%     693ns ± 4%   -5.92%  (p=0.000 n=10+10)
MapPopulate/100                  11.3µs ± 2%    10.8µs ± 3%   -4.38%  (p=0.000 n=10+10)
MapPopulate/1000                  139µs ± 8%     132µs ± 4%   -5.10%  (p=0.002 n=10+10)
MapPopulate/10000                1.21ms ± 5%    1.16ms ± 5%   -4.56%  (p=0.002 n=10+10)
MapPopulate/100000               12.2ms ± 3%    11.8ms ± 5%     ~     (p=0.052 n=10+10)
ComplexAlgMap                    73.9ns ± 1%    74.4ns ± 2%     ~     (p=0.161 n=9+10)
GoMapClear/Reflexive/1           36.0ns ± 1%    26.9ns ± 2%  -25.31%  (p=0.000 n=10+10)
GoMapClear/Reflexive/10          35.2ns ± 1%    24.4ns ± 1%  -30.62%  (p=0.000 n=10+10)
GoMapClear/Reflexive/100         69.6ns ± 2%    59.2ns ± 1%  -14.92%  (p=0.000 n=10+10)
GoMapClear/Reflexive/1000        1.06µs ± 2%    1.05µs ± 1%   -1.16%  (p=0.013 n=10+9)
GoMapClear/Reflexive/10000       11.7µs ± 1%    11.7µs ± 1%     ~     (p=0.542 n=10+10)
GoMapClear/NonReflexive/1        96.3ns ± 1%    90.0ns ± 1%   -6.52%  (p=0.000 n=10+10)
GoMapClear/NonReflexive/10        110ns ± 2%     101ns ± 0%   -8.10%  (p=0.000 n=10+7)
GoMapClear/NonReflexive/100       270ns ± 2%     235ns ± 2%  -12.94%  (p=0.000 n=10+10)
GoMapClear/NonReflexive/1000     3.02µs ± 2%    2.48µs ± 1%  -17.92%  (p=0.000 n=10+10)
GoMapClear/NonReflexive/10000    23.7µs ± 1%    19.6µs ± 1%  -17.30%  (p=0.000 n=10+9)
MapPop100                        9.65µs ± 6%    9.18µs ± 8%   -4.82%  (p=0.008 n=9+10)
MapPop1000                        162µs ± 6%     148µs ± 4%   -8.67%  (p=0.000 n=9+9)
MapPop10000                      3.05ms ± 8%    2.82ms ±15%   -7.66%  (p=0.023 n=10+10)
MapAssign/Int32/256              15.7ns ± 4%    14.6ns ± 2%   -7.08%  (p=0.000 n=10+10)
MapAssign/Int32/65536            29.8ns ± 1%    30.4ns ± 0%   +2.04%  (p=0.000 n=10+8)
MapAssign/Int64/256              14.9ns ± 5%    14.8ns ± 4%     ~     (p=0.611 n=10+10)
MapAssign/Int64/65536            30.3ns ± 2%    30.4ns ± 1%   +0.54%  (p=0.046 n=10+9)
MapAssign/Str/256                17.8ns ± 3%    19.8ns ± 4%  +11.08%  (p=0.000 n=10+10)
MapAssign/Str/65536              35.7ns ± 1%    36.4ns ± 1%   +1.82%  (p=0.000 n=10+10)
MapOperatorAssign/Int32/256      18.8ns ± 5%    14.6ns ± 3%  -22.57%  (p=0.000 n=10+10)
MapOperatorAssign/Int32/65536    29.8ns ± 1%    30.5ns ± 1%   +2.39%  (p=0.000 n=10+10)
MapOperatorAssign/Int64/256      16.6ns ± 4%    15.0ns ± 6%   -9.34%  (p=0.000 n=10+10)
MapOperatorAssign/Int64/65536    30.1ns ± 1%    31.7ns ± 2%   +5.21%  (p=0.000 n=10+10)
MapOperatorAssign/Str/256        1.70µs ± 1%    1.61µs ± 2%   -5.55%  (p=0.000 n=10+8)
MapOperatorAssign/Str/65536       289ns ± 7%     294ns ± 4%     ~     (p=0.425 n=10+10)
MapAppendAssign/Int32/256        34.3ns ± 2%    31.0ns ± 3%   -9.59%  (p=0.000 n=9+9)
MapAppendAssign/Int32/65536      51.8ns ± 3%    47.1ns ±13%   -9.17%  (p=0.002 n=9+10)
MapAppendAssign/Int64/256        32.5ns ± 8%    31.2ns ± 6%     ~     (p=0.065 n=10+10)
MapAppendAssign/Int64/65536      51.4ns ± 4%    47.2ns ±10%   -8.07%  (p=0.005 n=9+10)
MapAppendAssign/Str/256           105ns ±12%     109ns ± 4%     ~     (p=0.138 n=10+8)
MapAppendAssign/Str/65536         101ns ±14%      81ns ± 8%  -19.82%  (p=0.000 n=10+9)
MapDelete/Int32/100              32.0ns ± 1%    35.0ns ± 2%   +9.59%  (p=0.000 n=9+10)
MapDelete/Int32/1000             27.0ns ± 3%    30.3ns ± 1%  +12.10%  (p=0.000 n=10+9)
MapDelete/Int32/10000            29.2ns ± 1%    32.9ns ± 2%  +12.80%  (p=0.000 n=10+10)
MapDelete/Int64/100              31.5ns ± 1%    35.7ns ± 2%  +13.16%  (p=0.000 n=10+10)
MapDelete/Int64/1000             27.0ns ± 2%    30.6ns ± 1%  +13.21%  (p=0.000 n=10+10)
MapDelete/Int64/10000            30.3ns ± 1%    34.4ns ± 3%  +13.47%  (p=0.000 n=10+10)
MapDelete/Str/100                23.4ns ± 8%    26.7ns ± 6%  +14.10%  (p=0.000 n=10+9)
MapDelete/Str/1000               31.0ns ± 2%    35.1ns ± 3%  +13.19%  (p=0.000 n=10+9)
MapDelete/Str/10000              38.8ns ± 1%    43.4ns ± 2%  +12.02%  (p=0.000 n=9+10)

Change-Id: I564ce0f40936589f0f9b837f7f2bbcca4c4a1070
Reviewed-on: https://go-review.googlesource.com/c/142437
Reviewed-by: Giovanni Bajo <rasky@develer.com>
Reviewed-by: Martin Möhrmann <martisch@uos.de>
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>