
cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering #18977

Open
cherrymui opened this Issue Feb 7, 2017 · 41 comments

@cherrymui
Contributor

cherrymui commented Feb 7, 2017

CL https://go-review.googlesource.com/c/36205/ deletes the assembler backend's instruction reordering. This makes Fannkuch-11 on AMD64 slow down by 6%. Figure out why and improve the compiler.
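For context, Fannkuch-11 is dominated by a tight permutation-and-flip kernel. The sketch below is a simplified re-implementation of that kind of kernel (not the exact go1 benchmark source); its flip loop and prefix-rotation loop are the sort of inner loops whose instruction scheduling the CL affects:

```go
package main

import "fmt"

// fannkuch returns the maximum number of prefix reversals ("flips")
// needed, over all permutations of 0..n-1, to bring 0 to the front.
// The hot code is the flip loop and the permutation-rotation loop.
func fannkuch(n int) int {
	perm := make([]int, n)  // working copy that gets flipped
	perm1 := make([]int, n) // current permutation
	count := make([]int, n) // rotation counters
	for i := range perm1 {
		perm1[i] = i
	}
	maxFlips := 0
	r := n
	for {
		for r != 1 {
			count[r-1] = r
			r--
		}
		if perm1[0] != 0 { // permutations starting with 0 need no flips
			copy(perm, perm1)
			flips := 0
			for k := perm[0]; k != 0; k = perm[0] {
				// reverse perm[0..k]
				for i, j := 0, k; i < j; i, j = i+1, j-1 {
					perm[i], perm[j] = perm[j], perm[i]
				}
				flips++
			}
			if flips > maxFlips {
				maxFlips = flips
			}
		}
		// generate the next permutation by rotating prefixes
		for ; r < n; r++ {
			p0 := perm1[0]
			copy(perm1, perm1[1:r+1]) // rotate perm1[0..r] left by one
			perm1[r] = p0
			count[r]--
			if count[r] > 0 {
				break
			}
		}
		if r == n {
			return maxFlips
		}
	}
}

func main() {
	fmt.Println(fannkuch(7))
}
```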

@cherrymui cherrymui self-assigned this Feb 7, 2017

@cherrymui cherrymui added this to the Go1.9 milestone Feb 7, 2017

@cherrymui cherrymui changed the title from cmd/compile: Fannkuch-11 on AMD64 slow down 6% after removing assembler backend instruction reordering to cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering Feb 7, 2017

@gopherbot

gopherbot commented Mar 21, 2017

CL https://golang.org/cl/38431 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 24, 2017

cmd/compile: rotate loops so conditional branch is at the end
Old loops look like this:
   loop:
     CMPQ ...
     JGE exit
     ...
     JMP loop
   exit:

New loops look like this:
    JMP entry
  loop:
    ...
  entry:
    CMPQ ...
    JLT loop

This removes one instruction (the unconditional jump) from
the inner loop.
Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.  I'm not sure if it is better or
worse to do that (peeling gets rid of the JMP but means more
code duplication), but this CL is certainly a much simpler
compiler change, so I'll try this way first.

The obj library used to do peeling before
CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only
part of that reordering that we'd like to keep.

Fixes #14758 (append loop)
name    old time/op    new time/op    delta
Foo-12     817ns ± 4%     538ns ± 0%  -34.08%   (p=0.000 n=10+9)
Bar-12     850ns ±11%     570ns ±13%  -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)
name                       old time/op  new time/op  delta
DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)
name    old time/op    new time/op    delta
Foo-12     810ns ± 1%     540ns ± 0%  -33.30%   (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)
name         old time/op    new time/op    delta
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
This one makes no sense.  There's strictly 1 less instruction in the
inner loop (17 instead of 18).  They are exactly the same instructions
except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive.  But the gains for the
specific issues above make this CL still probably worth it.
name                      old time/op    new time/op    delta
BinaryTree17-8              2.32s ± 0%     2.34s ± 0%  +1.14%    (p=0.000 n=9+7)
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
FmtFprintfEmpty-8          44.1ns ± 1%    46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
FmtFprintfString-8         67.8ns ± 0%    74.4ns ± 1%  +9.80%   (p=0.000 n=10+9)
FmtFprintfInt-8            74.9ns ± 0%    78.4ns ± 0%  +4.67%   (p=0.000 n=8+10)
FmtFprintfIntInt-8          117ns ± 1%     123ns ± 1%  +4.69%   (p=0.000 n=9+10)
FmtFprintfPrefixedInt-8     160ns ± 1%     146ns ± 0%  -8.22%   (p=0.000 n=8+10)
FmtFprintfFloat-8           214ns ± 0%     206ns ± 0%  -3.91%    (p=0.000 n=8+8)
FmtManyArgs-8               468ns ± 0%     497ns ± 1%  +6.09%   (p=0.000 n=8+10)
GobDecode-8                6.16ms ± 0%    6.21ms ± 1%  +0.76%   (p=0.000 n=9+10)
GobEncode-8                4.90ms ± 0%    4.92ms ± 1%  +0.37%   (p=0.028 n=9+10)
Gzip-8                      209ms ± 0%     212ms ± 0%  +1.33%  (p=0.000 n=10+10)
Gunzip-8                   36.6ms ± 0%    38.0ms ± 1%  +4.03%    (p=0.000 n=9+9)
HTTPClientServer-8         84.2µs ± 0%    86.0µs ± 1%  +2.14%    (p=0.000 n=9+9)
JSONEncode-8               13.6ms ± 3%    13.8ms ± 1%  +1.55%   (p=0.003 n=9+10)
JSONDecode-8               53.2ms ± 5%    52.9ms ± 0%    ~     (p=0.280 n=10+10)
Mandelbrot200-8            3.78ms ± 0%    3.78ms ± 1%    ~      (p=0.661 n=10+9)
GoParse-8                  2.89ms ± 0%    2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      68.5ns ± 2%    68.9ns ± 1%    ~     (p=0.136 n=10+10)
RegexpMatchEasy0_1K-8       220ns ± 1%     225ns ± 1%  +2.41%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-8      64.7ns ± 0%    64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
RegexpMatchEasy1_1K-8       348ns ± 1%     355ns ± 0%  +1.90%  (p=0.000 n=10+10)
RegexpMatchMedium_32-8      102ns ± 1%     105ns ± 1%  +2.95%  (p=0.000 n=10+10)
RegexpMatchMedium_1K-8     33.1µs ± 3%    32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.71µs ± 1%    1.70µs ± 1%  -0.84%   (p=0.002 n=10+9)
RegexpMatchHard_1K-8       51.1µs ± 0%    50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
Revcomp-8                   411ms ± 1%     402ms ± 0%  -2.22%   (p=0.000 n=10+9)
Template-8                 61.8ms ± 1%    59.7ms ± 0%  -3.44%    (p=0.000 n=9+9)
TimeParse-8                 306ns ± 0%     318ns ± 0%  +3.83%  (p=0.000 n=10+10)
TimeFormat-8                320ns ± 0%     318ns ± 1%  -0.53%   (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
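To make the transformation in the commit above concrete: a counted Go loop like the hypothetical one below is the shape that rotation targets. Before the CL, the compiled body carried the comparison at the top plus an unconditional JMP at the bottom; after it, control jumps once into the test at the bottom and each iteration then executes a single conditional backward branch.

```go
package main

import "fmt"

// sum contains the kind of inner loop the commit rotates. In the old
// layout the compiled loop checked i < len(xs) at the top and ended
// with an unconditional JMP back to the check; in the rotated layout
// the check sits at the bottom, removing the JMP from every iteration.
func sum(xs []int) int {
	total := 0
	for i := 0; i < len(xs); i++ {
		total += xs[i]
	}
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3, 4}))
}
```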
@josharian
Contributor

josharian commented May 11, 2017

Tip (6897030) looks like a net win vs 1.8.1. Code size is smaller:

  • 1.8.1: "".fannkuch t=1 size=844 args=0x10 locals=0x80
  • tip: "".fannkuch STEXT size=829 args=0x10 locals=0x80

Execution time is faster, at least on my machine:

name          old time/op  new time/op  delta
Fannkuch11-8   2.74s ± 2%   2.68s ± 2%  -2.32%  (p=0.000 n=20+19)

So whatever regression occurred has at least been compensated for, if not fixed, by CL 38431.

I do see one notable regression, though, which is in the number of panicindex calls. I've filed #20332 for that.

I'm going to close this as fixed. @cherrymui or @randall77, please reopen if you disagree.

@josharian josharian closed this May 11, 2017

@navytux
Contributor

navytux commented May 11, 2017

@josharian, sad to bring bad news but here is how it looks on my machine:

$ benchstat go18.txt gotip.txt
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.94s ± 0%  +9.88%  (p=0.000 n=20+19)

In other words, it is a ~10% regression.

$ go18 version
go version go1.8.1 linux/amd64
$ go version
go version devel +ee69c21747 Thu May 11 18:28:40 2017 +0000 linux/amd64

Linux deco 4.9.0-2-amd64 #1 SMP Debian 4.9.18-1 (2017-03-30) x86_64 GNU/Linux
Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

@josharian
Contributor

josharian commented May 11, 2017

@navytux will you re-measure including CL 43291?

@navytux
Contributor

navytux commented May 11, 2017

let me try

@navytux
Contributor

navytux commented May 11, 2017

Strangely, CL 43291 only makes things worse:

$ benchstat gotip.txt gotip+cl43291.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.94s ± 0%   3.01s ± 0%  +2.29%  (p=0.000 n=19+19)

and I've double-checked it is reproducible by recompiling (make.bash) and benchmarking tip, then tip+CL, then tip again, then tip+CL again.

@josharian
Contributor

josharian commented May 11, 2017

OK, re-opening.

One other thing I see in tip vs 1.8.1: in 1.8.1, all the panicindex calls are at the end of the function, so jumps to them are always unlikely (forward jumps). In tip, they're scattered throughout. Not sure whether that's responsible, but I'll see if I can fix it and check whether it helps.
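For illustration, the panicindex calls discussed here come from compiler-inserted bounds checks: every unproven slice index compiles to a check that branches to a call of runtime.panicindex. The sketch below (a hypothetical helper, not code from this issue) exercises such a check and observes the panic:

```go
package main

import "fmt"

// get triggers the compiler's bounds check on xs[i]: an out-of-range
// index branches to a runtime.panicindex call site. Where those call
// sites are placed within the compiled function is what the comment
// above is about. recover turns the panic into an ok=false result.
func get(xs []int, i int) (v int, ok bool) {
	defer func() {
		if recover() != nil {
			v, ok = 0, false
		}
	}()
	return xs[i], true // bounds check here may panic
}

func main() {
	fmt.Println(get([]int{10, 20, 30}, 1))
	fmt.Println(get([]int{10, 20, 30}, 5))
}
```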

@josharian josharian reopened this May 11, 2017

@navytux
Contributor

navytux commented May 11, 2017

@josharian, thanks for the feedback and for reopening.

@gopherbot

gopherbot commented May 11, 2017

CL https://golang.org/cl/43293 mentions this issue.

@josharian
Contributor

josharian commented May 11, 2017

@navytux will you see what effect CL 43293 has on your machine? Thanks.

@navytux

Contributor
navytux commented May 12, 2017

@josharian sure. Today's tip is go version devel +482da51803 Fri May 12 01:10:11 2017 +0000 linux/amd64 and it already includes CL 43291. We are using it as the base:

$ benchstat gotip.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.01s ± 0%   2.84s ± 1%  -5.66%  (p=0.000 n=17+19)

In other words, it is a partial improvement, and we are now back to a 6% regression over go18:

$ benchstat go18.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.84s ± 1%  +6.17%  (p=0.000 n=20+19)
@navytux
Contributor

navytux commented May 12, 2017

For the reference here is how Fannkuch11 looks under perf stat -d:

 Performance counter stats for './go1.test-18 -test.bench Fannkuch11' (10 runs):

       4008,809815      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,06% )
               834      context-switches          #    0,208 K/sec                    ( +-  1,04% )
                14      cpu-migrations            #    0,003 K/sec                    ( +- 10,89% )
            79 865      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 398 886 442      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
    29 180 837 581      instructions              #    2,18  insn per cycle           ( +-  0,06% )  (62,57%)
     5 974 722 394      branches                  # 1490,398 M/sec                    ( +-  0,06% )  (62,62%)
       162 930 510      branch-misses             #    2,73% of all branches          ( +-  0,09% )  (62,73%)
     6 602 849 508      L1-dcache-loads           # 1647,085 M/sec                    ( +-  0,11% )  (62,84%)
        73 480 588      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,24% )  (62,61%)
         3 604 844      LLC-loads                 #    0,899 M/sec                    ( +-  3,96% )  (49,89%)
           316 385      LLC-load-misses           #    8,78% of all LL-cache hits     ( +-  5,10% )  (49,85%)

       4,003714952 seconds time elapsed                                          ( +-  0,07% )
 Performance counter stats for './go1.test-tip -test.bench Fannkuch11' (10 runs):

       4259,019276      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,09% )
               822      context-switches          #    0,193 K/sec                    ( +-  1,31% )
                15      cpu-migrations            #    0,004 K/sec                    ( +-  7,89% )
            79 928      page-faults               #    0,019 M/sec                    ( +-  0,01% )
    14 228 793 806      cycles                    #    3,341 GHz                      ( +-  0,08% )  (49,92%)
    29 011 760 342      instructions              #    2,04  insn per cycle           ( +-  0,08% )  (62,54%)
     6 812 768 092      branches                  # 1599,610 M/sec                    ( +-  0,05% )  (62,56%)
       169 436 247      branch-misses             #    2,49% of all branches          ( +-  0,07% )  (62,68%)
     6 617 477 028      L1-dcache-loads           # 1553,756 M/sec                    ( +-  0,07% )  (62,84%)
        72 643 321      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,21% )  (62,77%)
         3 792 842      LLC-loads                 #    0,891 M/sec                    ( +-  8,00% )  (49,95%)
           324 234      LLC-load-misses           #    8,55% of all LL-cache hits     ( +-  6,96% )  (49,88%)

       4,255140742 seconds time elapsed                                          ( +-  0,09% )
 Performance counter stats for './go1.test-tip+cl43293 -test.bench Fannkuch11' (10 runs):

       4076,691573      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,18% )
               809      context-switches          #    0,198 K/sec                    ( +-  1,07% )
                14      cpu-migrations            #    0,003 K/sec                    ( +-  7,07% )
            79 926      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 622 118 634      cycles                    #    3,341 GHz                      ( +-  0,10% )  (49,99%)
    28 659 892 607      instructions              #    2,10  insn per cycle           ( +-  0,11% )  (62,64%)
     6 412 523 320      branches                  # 1572,972 M/sec                    ( +-  0,07% )  (62,67%)
       175 939 383      branch-misses             #    2,74% of all branches          ( +-  0,09% )  (62,74%)
     6 634 137 889      L1-dcache-loads           # 1627,334 M/sec                    ( +-  0,10% )  (62,81%)
        72 791 290      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,52% )  (62,68%)
         3 613 857      LLC-loads                 #    0,886 M/sec                    ( +-  3,46% )  (49,88%)
           315 113      LLC-load-misses           #    8,72% of all LL-cache hits     ( +-  5,94% )  (49,86%)

       4,072909978 seconds time elapsed                                          ( +-  0,19% )

highlights:

  1. The number of instructions executed is lower for tip (99.4%) and tip+cl43293 (98.2%) compared to go18.
  2. However, instructions per CPU cycle is also lower at tip (2.04 insn/cycle) and tip+cl43293 (2.10 insn/cycle) compared to go18 (2.18 insn/cycle). That means there are probably more stalls, due to instruction mis-scheduling or something similar, which results in more cycles being spent even though there are fewer instructions to execute.
  3. The total number of branches is higher at tip (114%) and tip+cl43293 (107%) compared to go18.
  4. L1d loads are only slightly higher for tip (0.2%) and tip+cl43293 (0.4%) compared to go18. The L1d miss ratio is also almost the same across the three versions.
  5. LLC (last-level cache - L3 in my case - the level checked before going to RAM) loads are higher for tip (105.2%) and almost back to the go18 level at tip+cl43293 (100.2%). The variation in LLC-loads is larger for tip than for go18 and tip+cl43293. The LLC miss ratio is also approximately back to the go18 level at tip+cl43293.

To me all this suggests that, after CL 43293 or something similar is applied, the remaining problems are that:

  • there are more branches compared to go18, and
  • instructions are less well scheduled by the compiler.

Hope this helps a bit,
Kirill

@josharian
Contributor

josharian commented May 12, 2017

Thanks, Kirill, that's great. My guess was that CL 43293 was making the critical sections of code more compact, thus keeping more in cache; nice to have that confirmed. I'm working on making CL 43293 submittable now, and then will look again.

@navytux
Contributor

navytux commented May 12, 2017

Josh, thanks for the feedback and patches. Offhand, I think that to make this regression go away we need to fix the "Kinda surprisingly" part of 39ce590 (CL 38431) - in other words, to understand the root cause and apply loop rotation only when it is actually a win, judging by the relevant factors. The current state, it seems, is that we apply loop rotation unconditionally, and it sometimes helps and sometimes hurts.

@randall77, @TocarIP please see #18977 (comment) for a bit of info about what is going on with Fannkuch11.

@josharian
Contributor

josharian commented May 12, 2017

I'm going to dig a bit more, although I'll be mostly AFK for a while. In the meantime, Kirill, maybe you could confirm that disabling looprotate on your machine fixes the problem by adding "return" at the very beginning of func loopRotate in cmd/compile/internal/ssa/looprotate.go, to disable the pass.

@navytux

Contributor

navytux commented May 12, 2017
Josh, thanks for the feedback. Sure. Staying at the same tip = 482da51, here is what we have:

  1. Going to CL 43293.2 (the second patch in that CL) brings more speedup:
$ benchstat gotip+cl43293.txt gotip+cl43293.2.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.84s ± 1%   2.78s ± 0%  -2.09%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip+cl43293.2 -test.bench Fannkuch11' (10 runs):

       4029,642462      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,11% )
               814      context-switches          #    0,202 K/sec                    ( +-  1,00% )
                11      cpu-migrations            #    0,003 K/sec                    ( +-  8,11% )
             79921      page-faults               #    0,020 M/sec                    ( +-  0,01% )
       13482684046      cycles                    #    3,346 GHz                      ( +-  0,10% )  (50,05%)
       28731337641      instructions              #    2,13  insn per cycle           ( +-  0,17% )  (62,66%)
        6429703132      branches                  # 1595,601 M/sec                    ( +-  0,14% )  (62,72%)
         168958851      branch-misses             #    2,63% of all branches          ( +-  0,18% )  (62,74%)
        6634192402      L1-dcache-loads           # 1646,348 M/sec                    ( +-  0,14% )  (62,80%)
          73264431      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,66% )  (62,57%)
           4233228      LLC-loads                 #    1,051 M/sec                    ( +-  3,12% )  (49,84%)
            376635      LLC-load-misses           #    8,90% of all LL-cache hits     ( +-  9,29% )  (49,89%)

       4,026348487 seconds time elapsed                                          ( +-  0,11% )

Some comments: there are slightly (0.2%) more instructions, but insn/cycle grows from 2.10 to 2.13, and as a result total CPU cycles are ~1% fewer (which correlates with the total run time reported by perf, which probably includes preparatory program actions that -test.bench itself excludes from accounting). The number of branches is likewise 0.2% higher; however, if I read the counters correctly, not all new instructions are branches, as δ(insn) = 71445034 but δ(branches) = 17179812 (4x less).

  2. Disabling loopRotate on top of CL 43293.2 makes it slower:
$ benchstat gotip+cl43293.2.txt gotip+cl43293.2-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.78s ± 0%   2.87s ± 0%  +2.93%  (p=0.000 n=16+15)
 Performance counter stats for './go1.test-tip+cl43293.2-nolooprot -test.bench Fannkuch11' (10 runs):

       4103,320043      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,12% )
               822      context-switches          #    0,200 K/sec                    ( +-  0,82% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,02% )
             79925      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       13711752541      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
       29498486031      instructions              #    2,15  insn per cycle           ( +-  0,07% )  (62,58%)
        7212426730      branches                  # 1757,705 M/sec                    ( +-  0,08% )  (62,64%)
         174290760      branch-misses             #    2,42% of all branches          ( +-  0,04% )  (62,73%)
        6609550227      L1-dcache-loads           # 1610,781 M/sec                    ( +-  0,07% )  (62,79%)
          73869349      L1-dcache-load-misses     #    1,12% of all L1-dcache hits    ( +-  0,28% )  (62,67%)
           3816460      LLC-loads                 #    0,930 M/sec                    ( +-  2,51% )  (49,91%)
            429992      LLC-load-misses           #   11,27% of all LL-cache hits     ( +-  8,17% )  (49,89%)

       4,098764012 seconds time elapsed                                          ( +-  0,12% )

Insn/cycle grows from 2.13 to 2.15 (good), but there are 2.7% more total instructions, so overall cycles grow by 1.7% (which correlates with the total run time reported by perf). There are 12% more branches, and it looks like they are the only visible factor contributing to the instruction growth: δ(branches) / δ(insn) ≈ 1.

  3. But disabling loopRotate on top of plain tip makes it faster:
$ benchstat gotip.txt gotip-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 1%   2.98s ± 0%  -1.29%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip-nolooprot -test.bench Fannkuch11' (10 runs):

       4215,739102      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,08% )
               827      context-switches          #    0,196 K/sec                    ( +-  0,81% )
                13      cpu-migrations            #    0,003 K/sec                    ( +- 10,58% )
             79932      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14108860946      cycles                    #    3,347 GHz                      ( +-  0,08% )  (49,91%)
       29508075225      instructions              #    2,09  insn per cycle           ( +-  0,14% )  (62,49%)
        7220934286      branches                  # 1712,851 M/sec                    ( +-  0,09% )  (62,58%)
         174301863      branch-misses             #    2,41% of all branches          ( +-  0,06% )  (62,77%)
        6621367485      L1-dcache-loads           # 1570,630 M/sec                    ( +-  0,07% )  (62,86%)
          73684219      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,45% )  (62,71%)
           3886187      LLC-loads                 #    0,922 M/sec                    ( +-  3,16% )  (49,91%)
            369954      LLC-load-misses           #    9,52% of all LL-cache hits     ( +-  6,86% )  (49,82%)

       4,211567986 seconds time elapsed                                          ( +-  0,08% )

Here the number of instructions increases by 1.7% which is mostly due to 6% N(branches) increase (δ(branches) / δ(insn) ≈ 0.82) however the insn/cycle ratio also grows from 2.04 to 2.09 which overcompensate all that and resulting N(cycles) decrease by ~ 1%.

My observation:

  • above everywhere disabling loop rotation makes insn/cycle ↑ but go18 2.18 level is never reached; closest was tip+cl43293.2-nolooprot (2.15 vs 2.18 for go18).
  • the task here seems to be to optimize target function (Ncycles) via variating Ninstructions⋅(insn/cycle). Code transformations which minimize Ninstructions (loop rotation, etc) explicitly change only one part of the formula, but I'm not sure insn/cycle is modelled or taken into account anywhere.

( I've rebenchmarked previous runs from todays morning and reverified all timings several times )

Contributor

navytux commented May 12, 2017

Josh, thanks for feedback. Sure, staying at the same tip = 482da51 here is what we have:

  1. going to CL 43293.2 (second patch in that CL) brings more speedup:
$ benchstat gotip+cl43293.txt gotip+cl43293.2.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.84s ± 1%   2.78s ± 0%  -2.09%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip+cl43293.2 -test.bench Fannkuch11' (10 runs):

       4029,642462      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,11% )
               814      context-switches          #    0,202 K/sec                    ( +-  1,00% )
                11      cpu-migrations            #    0,003 K/sec                    ( +-  8,11% )
             79921      page-faults               #    0,020 M/sec                    ( +-  0,01% )
       13482684046      cycles                    #    3,346 GHz                      ( +-  0,10% )  (50,05%)
       28731337641      instructions              #    2,13  insn per cycle           ( +-  0,17% )  (62,66%)
        6429703132      branches                  # 1595,601 M/sec                    ( +-  0,14% )  (62,72%)
         168958851      branch-misses             #    2,63% of all branches          ( +-  0,18% )  (62,74%)
        6634192402      L1-dcache-loads           # 1646,348 M/sec                    ( +-  0,14% )  (62,80%)
          73264431      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,66% )  (62,57%)
           4233228      LLC-loads                 #    1,051 M/sec                    ( +-  3,12% )  (49,84%)
            376635      LLC-load-misses           #    8,90% of all LL-cache hits     ( +-  9,29% )  (49,89%)

       4,026348487 seconds time elapsed                                          ( +-  0,11% )

some comments: there are slightly more (0.2%) instructions, but insn/cycle grows from 2.10 to 2.13, and as a result total cpu cycles are ~ 1% fewer (which correlates with the total run time reported by perf and probably includes preparatory program actions that -test.bench itself excludes from accounting). The number of branches is also 0.2% higher; however, if I read the counters correctly, not all new instructions are branches, as δ(insn) = 71445034 but δ(branches) = 17179812 (4x less).
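As a sanity check on that delta arithmetic, the following tiny program recomputes the ratio from the δ values quoted above:

```go
package main

import "fmt"

func main() {
	// deltas between the tip+CL 43293 and tip+CL 43293.2 runs, as quoted above
	dInsn := 71445034.0
	dBranches := 17179812.0
	// only about a quarter of the newly added instructions are branches
	fmt.Printf("δ(insn)/δ(branches) = %.2f\n", dInsn/dBranches) // ≈ 4.16
}
```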

  2. disabling loopRotate over CL 43293.2 makes it slower:
$ benchstat gotip+cl43293.2.txt gotip+cl43293.2-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.78s ± 0%   2.87s ± 0%  +2.93%  (p=0.000 n=16+15)
 Performance counter stats for './go1.test-tip+cl43293.2-nolooprot -test.bench Fannkuch11' (10 runs):

       4103,320043      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,12% )
               822      context-switches          #    0,200 K/sec                    ( +-  0,82% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,02% )
             79925      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       13711752541      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
       29498486031      instructions              #    2,15  insn per cycle           ( +-  0,07% )  (62,58%)
        7212426730      branches                  # 1757,705 M/sec                    ( +-  0,08% )  (62,64%)
         174290760      branch-misses             #    2,42% of all branches          ( +-  0,04% )  (62,73%)
        6609550227      L1-dcache-loads           # 1610,781 M/sec                    ( +-  0,07% )  (62,79%)
          73869349      L1-dcache-load-misses     #    1,12% of all L1-dcache hits    ( +-  0,28% )  (62,67%)
           3816460      LLC-loads                 #    0,930 M/sec                    ( +-  2,51% )  (49,91%)
            429992      LLC-load-misses           #   11,27% of all LL-cache hits     ( +-  8,17% )  (49,89%)

       4,098764012 seconds time elapsed                                          ( +-  0,12% )

insn/cycle grows from 2.13 to 2.15 (good), but there are 2.7% more total instructions, so overall cycles grow by 1.7% (which correlates with the total run time reported by perf). There are 12% more branches, and they look like the only visible factor contributing to the instruction growth: δ(branches) / δ(insn) ≈ 1.

  3. but disabling loopRotate over just tip makes it faster:
$ benchstat gotip.txt gotip-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 1%   2.98s ± 0%  -1.29%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip-nolooprot -test.bench Fannkuch11' (10 runs):

       4215,739102      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,08% )
               827      context-switches          #    0,196 K/sec                    ( +-  0,81% )
                13      cpu-migrations            #    0,003 K/sec                    ( +- 10,58% )
             79932      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14108860946      cycles                    #    3,347 GHz                      ( +-  0,08% )  (49,91%)
       29508075225      instructions              #    2,09  insn per cycle           ( +-  0,14% )  (62,49%)
        7220934286      branches                  # 1712,851 M/sec                    ( +-  0,09% )  (62,58%)
         174301863      branch-misses             #    2,41% of all branches          ( +-  0,06% )  (62,77%)
        6621367485      L1-dcache-loads           # 1570,630 M/sec                    ( +-  0,07% )  (62,86%)
          73684219      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,45% )  (62,71%)
           3886187      LLC-loads                 #    0,922 M/sec                    ( +-  3,16% )  (49,91%)
            369954      LLC-load-misses           #    9,52% of all LL-cache hits     ( +-  6,86% )  (49,82%)

       4,211567986 seconds time elapsed                                          ( +-  0,08% )

Here the number of instructions increases by 1.7%, mostly due to a 6% N(branches) increase (δ(branches) / δ(insn) ≈ 0.82); however, the insn/cycle ratio also grows from 2.04 to 2.09, which overcompensates for all that, and the resulting N(cycles) decreases by ~ 1%.

My observation:

  • above, disabling loop rotation everywhere makes insn/cycle ↑, but the go18 level of 2.18 is never reached; the closest was tip+cl43293.2-nolooprot (2.15 vs 2.18 for go18).
  • the task here seems to be to optimize the target function N(cycles) by varying N(instructions) / (insn/cycle). Code transformations which minimize N(instructions) (loop rotation, etc.) explicitly change only one part of the formula, but I'm not sure insn/cycle is modelled or taken into account anywhere.
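That relationship, N(cycles) = N(instructions) / (insn/cycle), can be checked against the counters from the gotip-nolooprot run quoted above (perf rounds IPC to two digits, hence the small residual):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// counters from the gotip-nolooprot perf run above
	insns := 29508075225.0
	cycles := 14108860946.0
	ipcReported := 2.09 // as printed by perf, rounded

	predicted := insns / ipcReported
	relErr := math.Abs(predicted-cycles) / cycles
	fmt.Printf("predicted cycles %.3e vs measured %.3e (rel err %.2f%%)\n",
		predicted, cycles, relErr*100)
}
```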

( I've re-benchmarked this morning's previous runs and re-verified all timings several times )

Contributor

navytux commented May 12, 2017

I'm not an x86 guy (probably @TocarIP could help here), but I imagined: if there are several execution units inside the cpu with somewhat different properties, the overall load of the units could depend on the order of instructions in the instruction stream. With this thinking in mind I googled for "x86 add nop speedup" and found e.g.

https://research.google.com/pubs/pub37077.html
https://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a/17906589
...

also, if we simplify and assume data loads/stores always hit the cache, what affects insn/cycle is

N(branch-misses) / N(instructions)

(note: not / N(branches)). The stackoverflow link and the MAO article above contain some information on what factors could affect branch prediction and other loop-related things.

Coincidence or not, the discussion there is usually also about a 5%–7% speedup or slowdown.
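To illustrate the proposed N(branch-misses) / N(instructions) metric, here it is computed (as mispredictions per kilo-instruction) from the perf runs quoted earlier in this thread:

```go
package main

import "fmt"

func main() {
	// branch-misses and instructions copied from the perf runs above
	runs := []struct {
		name          string
		misses, insns float64
	}{
		{"gotip+cl43293.2", 168958851, 28731337641},
		{"gotip+cl43293.2-nolooprot", 174290760, 29498486031},
		{"gotip-nolooprot", 174301863, 29508075225},
	}
	for _, r := range runs {
		// mispredictions per 1000 instructions, not per branch
		fmt.Printf("%-26s %.2f misses/kinsn\n", r.name, 1000*r.misses/r.insns)
	}
}
```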

Contributor

TocarIP commented May 12, 2017

I've played a bit with https://github.com/andikleen/pmu-tools
It suggests that this bench is bound by branch misprediction in various forms:
toplev.py -l=3 -d ./go1.test -test.run=NONE -test.bench=Fan:

FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:         17.59 % Clocks [ 16.66%]
        This metric represents cycles fraction the CPU was stalled
        due to Branch Resteers...
        Sampling events:  br_misp_retired.all_branches
BAD            Bad_Speculation:                                         22.39 % Slots [ 17.00%]
BAD            Bad_Speculation.Branch_Mispredicts:                      22.21 % Slots [ 17.00%] BN
        This metric represents slots fraction the CPU has wasted due
        to Branch Misprediction...
        Sampling events:  br_misp_retired.all_branches
               MUX:                                                     16.62 %

Running ocperf.py record -e br_misp_retired.all_branches_pebs
shows that the guilty branches are caused by index checks:

884511   75ef95:       4d 39 e7                cmp    %r12,%r15
884512   75ef98:       73 33                   jae    75efcd <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1ed>

and code layout:

75efcb:       eb 0e                   jmp    75efdb <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1fb>
884526   75efcd:       e8 9e c8 cc ff          callq  42b870 <runtime.panicindex>

Note that in this case the branch is unconditional!
With CL 43293, at least, all problems are with two conditional branches.
Playing with MaxLoopPad in cmd/internal/obj/x86/asm6 produced no improvement, so the problem is not in branch-target alignment.
I'd like to experiment with branch-source alignment, but this isn't as simple as changing one constant, and I will be away for a ~week.
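For context, the cmp/jae pair above is the bounds check the compiler inserts for slice indexing. A minimal sketch (function name and slice are mine, not from the benchmark) reproducing the fannkuch inner-loop pattern; building it with `go build -gcflags="-d=ssa/check_bce/debug=1"` reports which index expressions still carry a bounds check:

```go
package main

import "fmt"

// reverse swaps p[1..k-1] inwards, like the fannkuch inner loop.
// The compiler emits a CMP/JAE-to-panicindex pair for each indexing
// it cannot prove in-bounds.
func reverse(p []int, k int) {
	for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
		p[i], p[j] = p[j], p[i]
	}
}

func main() {
	p := []int{0, 1, 2, 3, 4}
	reverse(p, len(p))
	fmt.Println(p) // [0 4 3 2 1]
}
```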

Contributor

josharian commented May 13, 2017

I haven't had a chance to dig into this excellent data, but see also #20355 for something that seems definitely wrong with looprotate. It seems that looprotate is doing both good and bad work for the benchmark.

Contributor

josharian commented May 13, 2017

Also #20356.

Contributor

navytux commented May 16, 2017

Josh, thanks for the feedback and your observations. I agree that having an undocumented hash function mixed into branch prediction makes it harder to reason about. However, as you say, we can try to see how much this affects performance. Let me show you something before we begin:

[chart: Fannkuch11 run time for go18, gotip and gotip+CL 43491.3 across shifted fannkuch() entry addresses]

Details: as you suggested, I've cooked up a hacky way to check Fannkuch11 with a shifted entry address for fannkuch() for go18 and various tips:

https://lab.nexedi.com/kirr/go/blob/dd8c9753/test/bench/go1/fannkuch-entryshifted

it hooks something like

kirr@deco:~/src/tools/go/go$ cat src/testing/zzz.go 
package testing

// shifting by ~ 3·16
func zzz() {
        println()
        println()
        println()
}
--- a/src/testing/testing.go
+++ b/src/testing/testing.go
@@ -499,6 +499,7 @@ func (c *common) Fail() {
        defer c.mu.Unlock()
        // c.done needs to be locked to synchronize checks to c.done in parent tests.
        if c.done {
+               zzz()
                panic("Fail in goroutine after " + c.name + " has completed")
        }
        c.failed = true

into testing package which gets linked into go1 benchmark before benchmark objects and as the result entry point of fannkuch() changes, e.g.:

@0
  fannkuch_test.go:12   0x75fdd0                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@1
  fannkuch_test.go:12   0x75fdf0                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@2
  fannkuch_test.go:12   0x75fe00                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@3
  fannkuch_test.go:12   0x75fe10                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
@4
  fannkuch_test.go:12   0x75fe20                64488b0c25f8ffffff              MOVQ FS:0xfffffff8, CX
...

This way 3 go versions were tested: go18 (go1.8.1-3-gdd8c975372), gotip (4b0d74f ; already has CL 43293 applied) and gotip + CL 43491.3.

From the picture above it can be seen that go18 still clearly wins: even though there are occasional overlaps with tip, if we compare good luck to good luck and bad luck to bad luck, go18 is faster. It can also be seen that CL 43491.3 unfortunately makes things worse, at least on my cpu.

Summary: there is, it seems, a hash(IP) effect, and we can measure its range for cases where unrelated changes move the whole function entry. However, that's not all; the rest is all intra-function (the Fannkuch11 benchmark runs only one fannkuch function), for which the compiler can directly control code layout locally.

While generally agreeing that more high-level issues could bring more benefit (like e.g. teaching the compiler to statically prove and eliminate unnecessary bounds checks), from the above it is hard for me to say that we are done with this regression for Go1.9.

However, I confirm that CL 43491 overall speeds up the compiler:

$ benchstat -geomean 4b0d74f89d.txt 4b0d74f89d+cl43491.3.txt 
name        old time/op     new time/op     delta
Template        174ms ± 2%      171ms ± 2%  -1.27%  (p=0.000 n=38+37)
Unicode        80.4ms ± 4%     79.9ms ± 3%  -0.60%  (p=0.039 n=39+38)
GoTypes         565ms ± 1%      561ms ± 2%  -0.82%  (p=0.000 n=37+40)
Compiler        2.65s ± 1%      2.62s ± 1%  -1.13%  (p=0.000 n=37+40)
SSA             5.31s ± 1%      5.24s ± 1%  -1.47%  (p=0.000 n=40+38)
Flate           112ms ± 2%      111ms ± 2%  -1.21%  (p=0.000 n=39+37)
GoParser        141ms ± 2%      140ms ± 2%  -0.98%  (p=0.000 n=40+40)
Reflect         350ms ± 1%      345ms ± 2%  -1.17%  (p=0.000 n=34+38)
Tar            99.0ms ± 3%     97.9ms ± 2%  -1.10%  (p=0.000 n=40+39)
XML             196ms ± 2%      194ms ± 2%  -1.09%  (p=0.000 n=39+38)
[Geo mean]      322ms           318ms       -1.08%

name        old user-ns/op  new user-ns/op  delta
Template         221M ± 9%       219M ±10%    ~     (p=0.158 n=39+40)
Unicode          110M ± 9%       108M ±15%    ~     (p=0.334 n=38+39)
GoTypes          737M ± 5%       737M ± 3%    ~     (p=0.537 n=40+39)
Compiler        3.48G ± 3%      3.46G ± 2%  -0.59%  (p=0.017 n=40+40)
SSA             7.52G ± 2%      7.40G ± 2%  -1.53%  (p=0.000 n=37+39)
Flate            143M ± 8%       141M ± 9%  -1.89%  (p=0.023 n=40+40)
GoParser         182M ± 8%       180M ± 9%    ~     (p=0.072 n=40+40)
Reflect          448M ± 3%       439M ± 4%  -2.19%  (p=0.000 n=40+40)
Tar              127M ±12%       124M ±10%  -2.20%  (p=0.019 n=40+40)
XML              250M ± 6%       246M ± 4%  -1.54%  (p=0.002 n=40+39)
[Geo mean]       421M            415M       -1.40%

Speaking about the particular regression in this issue, I've made some observations which are imho better to share. First of all, @TocarIP thanks for mentioning pmu-tools, which gives at least some way to observe what happens inside the cpu. On my system it is branch mispredictions too. While profiling with ocperf.py record -e br_misp_retired.all_branches:pp I've noticed:

  • go18 has a different code structure for the innermost for i, j := 1, k-1; i < j; i, j = i+1, j-1 loop. In particular, before entering the loop it checks the loop condition i < j and, if it fails, skips the loop completely. The go19 version does an unconditional jump to after the loop, sets up registers there, and then jumps unconditionally inside the loop to the condition-check entry. (btw, why not set up the registers before the loop and jump directly to the loop entry?)

  • in go18 there are two jumps that are regularly mispredicted: the pre-loop entry and the loop jump-back. In go19 the only major mispredicted jump is the loop jump-back. That could maybe be making the difference (in go19 there is no pre-loop conditional-check jump at all, so in the profiler go18 looks like a molecule with 2 major spectral lines while go19 has only 1).

  • in go19 the inner-loop conditional jump instruction starts at a 0x...f address and occupies 2 bytes. That means it crosses a 16-byte line boundary, which I think even Agner Fog advises against. @TocarIP could we please check somehow whether this matters?
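The condition in the last bullet can be expressed mechanically (a sketch of the check, not compiler code):

```go
package main

import "fmt"

// spansLine reports whether an instruction at addr of the given size in
// bytes crosses a 16-byte fetch-line boundary.
func spansLine(addr, size uint64) bool {
	return addr/16 != (addr+size-1)/16
}

func main() {
	// a 2-byte conditional jump starting at an address ending in 0xf
	fmt.Println(spansLine(0x75efdf, 2)) // true: its bytes land in two lines
	fmt.Println(spansLine(0x75efde, 2)) // false: fits within one line
}
```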

Below are the relevant screenshots for go18 and go19 with Josh's CLs. The inner loop is marked with an arrow. The left column is the ratio of mispredicted events.

[perf screenshots: annotated disassembly of the fannkuch inner loop, go18 and go19]

Thanks,
Kirill

Contributor

josharian commented May 16, 2017

Still digesting the above.

In parallel, I just found that the linker already supports changing the text address:

$ go test -bench=. -ldflags="-T=0x20000"
$ go test -bench=. -ldflags="-T=0x21000"

Can you check whether this has a similar effect to your fuzzer?

Contributor

josharian commented May 16, 2017

@navytux beautiful.

That means it will overlap with 16 bytes line boundary which I think even Agner Fog advices against.

I'll try to hack in an experimental fix for this now. (Not sure how easy/hard it will be.) Do you have a pointer to where in the docs Agner Fog discusses this, so that I can get the scope of my tweak correct?

Contributor

josharian commented May 16, 2017

@navytux at the top of asm6.go is:

const (
	// Loop alignment constants:
	// want to align loop entry to LoopAlign-byte boundary,
	// and willing to insert at most MaxLoopPad bytes of NOP to do so.
	// We define a loop entry as the target of a backward jump.
	//
	// gcc uses MaxLoopPad = 10 for its 'generic x86-64' config,
	// and it aligns all jump targets, not just backward jump targets.
	//
	// As of 6/1/2012, the effect of setting MaxLoopPad = 10 here
	// is very slight but negative, so the alignment is disabled by
	// setting MaxLoopPad = 0. The code is here for reference and
	// for future experiments.
	//
	LoopAlign  = 16
	MaxLoopPad = 0
)

This indicates the goal is to align the target of the conditional jumps, not the jumps themselves. If that's right, I think you could easily experiment with this: setting MaxLoopPad to 1 should push elements at 0x...f up by a byte. If that's wrong, I could try fixing it. If it makes a difference, and the heuristic is wrong (we just want to avoid splitting an instruction in two), we could also make the heuristic better.

One other thought. It seems like there ought to be tools to allow you to insert NOPs into a binary, automatically adjusting all jump targets etc. If so, that would be another quick way to test the theory.
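What MaxLoopPad/LoopAlign control can be sketched as follows (my approximation of the decision, not the actual asm6 code):

```go
package main

import "fmt"

// padBytes returns how many NOP bytes to insert before a loop entry at
// addr to align it to a loopAlign boundary, but only if at most maxPad
// bytes are needed; otherwise no padding is inserted.
func padBytes(addr, loopAlign, maxPad int) int {
	pad := (loopAlign - addr%loopAlign) % loopAlign
	if pad > maxPad {
		return 0
	}
	return pad
}

func main() {
	// with MaxLoopPad = 1, an entry at an address ending in 0xf gets bumped
	fmt.Println(padBytes(0x75ef9f, 16, 1)) // 1: one NOP byte aligns it
	fmt.Println(padBytes(0x75efa0, 16, 1)) // 0: already aligned
	fmt.Println(padBytes(0x75ef98, 16, 1)) // 0: would need 8 bytes > maxPad
}
```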

Contributor

josharian commented May 16, 2017
go18 has different code structure for innermost for i, j := 1, k-1; i < j; i, j = i+1, j-1 loop.

Right. Keith mentioned that in the commit message for CL 38431:

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.

in go18 there are two jumps that are regularly mispredicted - pre-loop-entry and loop jmp-back. In go19 the only major jump that is mispredicted is loop jmp-back. That could be maybe making the difference

I'm not sure why going from 2 mispredicted jumps to 1 mispredicted jump would make things worse in and of itself. Anyway, the two mispredicted tests in 1.8 appear to be the same condition (i<j), with peeling.

With tip I see the commonly mispredicted jump laid out as a backward jump (jl) on my machine, as in your tip screenshot. With tip + CL 43491, I see it laid out as a forward jump (jge).

But I doubt it matters. I suspect that this branch is just hard to predict well. And if so, shuffling layout without adding 1.8-style peeling or something like it won't get us back to 1.8 levels.

You could try to confirm this thesis by manually reproducing the loop peeling from 1.8: Add if i >= j { ... } around the inner loop in the fannkuch function and see what affect it has.

If as it appears this is the decisive difference, we might just want to accept the regression for now and attempt to continue to improve things by other means. Or go ahead and reproduce the 1.8 loop peeling. That'd be a judgment call for @randall77. (Regardless, this has been a fruitful investigation.)

(Meanwhile, I'm trying yet again sigh to get my new UDOO x86 up and running, in the hopes that its CPU will allow me to locally reproduce some of this.)

Contributor

josharian commented May 16, 2017

go18 has different code structure for innermost for i, j := 1, k-1; i < j; i, j = i+1, j-1 loop.

Right. Keith mentioned that in the commit message for CL 38431:

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.

in go18 there are two jumps that are regularly mispredicted - pre-loop-entry and loop jmp-back. In go19 the only major jump that is mispredicted is loop jmp-back. That could be maybe making the difference

I'm not sure why going from 2 mispredicted jumps to 1 mispredicted jump would make things worse in and of itself. Anyway, the two mispredicted tests in 1.8 appear to be the same condition (i<j), with peeling.

With tip I see the commonly mispredicted jump laid out as a backward jump (jl) on my machine, as in your tip screenshot. With tip + CL 43491, I see it laid out as a forward jump (jge).

But I doubt it matters. I suspect that this branch is just hard to predict well. And if so, shuffling layout without adding 1.8-style peeling or something like it won't get us back to 1.8 levels.

You could try to confirm this thesis by manually reproducing the loop peeling from 1.8: Add if i >= j { ... } around the inner loop in the fannkuch function and see what effect it has.

If as it appears this is the decisive difference, we might just want to accept the regression for now and attempt to continue to improve things by other means. Or go ahead and reproduce the 1.8 loop peeling. That'd be a judgment call for @randall77. (Regardless, this has been a fruitful investigation.)

(Meanwhile, I'm trying yet again sigh to get my new UDOO x86 up and running, in the hopes that its CPU will allow me to locally reproduce some of this.)


Contributor

navytux commented May 16, 2017

Josh, thanks for the feedback. I'll try -ldflags=-T... in some time.

About branch targets: I think @TocarIP already tried playing with MaxLoopPad in #18977 (comment); however, I decided to also check it myself.

I'm taking tip = 4b0d74f as the base. If I just adjust MaxLoopPad, whole-program addresses change and, due to the hash(IP) effect, everything is disrupted. Likewise, if I adjust MaxLoopPad only for the fannkuch function, the loops preceding the inner loop change addresses and everything else is disrupted as well. So I cooked up the following patch to carefully (if hackily) shift the jmp instruction that was sitting at 0x...f by 1 byte:

diff --git a/src/cmd/internal/obj/x86/asm6.go b/src/cmd/internal/obj/x86/asm6.go
index bcf9318e2e..8b09bd58df 100644
--- a/src/cmd/internal/obj/x86/asm6.go
+++ b/src/cmd/internal/obj/x86/asm6.go
@@ -62,7 +62,7 @@ const (
        // for future experiments.
        //
        LoopAlign  = 16
-       MaxLoopPad = 0
+       MaxLoopPad = 6
 )
 
 type Optab struct {
@@ -1882,7 +1882,8 @@ func span6(ctxt *obj.Link, s *obj.LSym, newprog obj.ProgAlloc) {
                                // pad with NOPs
                                v := -c & (LoopAlign - 1)
 
-                               if v <= MaxLoopPad {
+                               if v <= MaxLoopPad && s.Name == `"".fannkuch` && c > 400 {
+                                       v = 1
                                        s.Grow(int64(c) + int64(v))
                                        fillnop(s.P[c:], int(v))
                                        c += v

This change shifts the whole inner loop (and possibly everything after it). The resulting program change is:

...
@@ -108,144 +108,151 @@ func fannkuch(n int) int {
   0x75ff62             4c8b2b                  MOVQ 0(BX), R13         
   0x75ff65             4531f6                  XORL R14, R14           
                        for {         // k!=0 ==> k>0
-  0x75ff68             eb6c                    JMP 0x75ffd6            
+  0x75ff68             eb6d                    JMP 0x75ffd7            
+  0x75ff6a             90                      NOPL                    
                                        perm[i], perm[j] = perm[j], perm[i]
-  0x75ff6a             4c39e7                  CMPQ R12, DI            
-  0x75ff6d             0f836f010000            JAE 0x7600e2            
-  0x75ff73             4d8b0cfb                MOVQ 0(R11)(DI*8), R9   
-  0x75ff77             4d39e7                  CMPQ R12, R15           
-  0x75ff7a             0f8362010000            JAE 0x7600e2            
-  0x75ff80             4b8b0cfb                MOVQ 0(R11)(R15*8), CX  
-  0x75ff84             49890cfb                MOVQ CX, 0(R11)(DI*8)   
-  0x75ff88             4f890cfb                MOVQ R9, 0(R11)(R15*8)  
+  0x75ff6b             4c39e7                  CMPQ R12, DI            
+  0x75ff6e             0f8372010000            JAE 0x7600e6            
+  0x75ff74             4d8b0cfb                MOVQ 0(R11)(DI*8), R9   
+  0x75ff78             4d39e7                  CMPQ R12, R15           
+  0x75ff7b             0f8365010000            JAE 0x7600e6            
+  0x75ff81             4b8b0cfb                MOVQ 0(R11)(R15*8), CX  
+  0x75ff85             49890cfb                MOVQ CX, 0(R11)(DI*8)   
+  0x75ff89             4f890cfb                MOVQ R9, 0(R11)(R15*8)  
                                for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-  0x75ff8c             488d4f01                LEAQ 0x1(DI), CX        
-  0x75ff90             49ffcf                  DECQ R15                
-  0x75ff93             488b7c2440              MOVQ 0x40(SP), DI       
-  0x75ff98             4c8b4c2450              MOVQ 0x50(SP), R9       
-  0x75ff9d             48894c2448              MOVQ CX, 0x48(SP)       
-  0x75ffa2             488b4c2470              MOVQ 0x70(SP), CX       
-  0x75ffa7             488b7c2448              MOVQ 0x48(SP), DI       
-  0x75ffac             4c39ff                  CMPQ R15, DI            
-  0x75ffaf             7cb9                    JL 0x75ff6a             
+  0x75ff8d             488d4f01                LEAQ 0x1(DI), CX        
+  0x75ff91             49ffcf                  DECQ R15                
+  0x75ff94             488b7c2440              MOVQ 0x40(SP), DI       
+  0x75ff99             4c8b4c2450              MOVQ 0x50(SP), R9       
+  0x75ff9e             48894c2448              MOVQ CX, 0x48(SP)       
+  0x75ffa3             488b4c2470              MOVQ 0x70(SP), CX       
+  0x75ffa8             488b7c2448              MOVQ 0x48(SP), DI       
+  0x75ffad             4c39ff                  CMPQ R15, DI            
+  0x75ffb0             7cb9                    JL 0x75ff6b             
...

and it gives a slight speedup:

$ benchstat tip.txt tip+mlp6fk400x1.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.79s ± 1%   2.77s ± 1%  -0.81%  (p=0.000 n=18+19)

I could not find a program to insert instructions directly into a binary: after a quick glance, neither ht [1] nor radare2 [2] seems able to actually insert (rather than overwrite) instructions (but maybe I did not look hard enough).

To analyze this properly, I would try adjusting the inner loop body and/or only the last jump, look into how the other jumps shift, and try to adjust them as well. However, my time for this issue is currently up, and I'm not sure I can spend anything non-trivial on it in the upcoming days (sorry about that).

[1] http://hte.sourceforge.net/
[2] https://www.radare.org/r/


You could try to confirm this thesis by manually reproducing the loop peeling from 1.8: Add if i >= j { ... } around the inner loop in the fannkuch function and see what effect it has

without and with the following patch

--- a/test/bench/go1/fannkuch_test.go
+++ b/test/bench/go1/fannkuch_test.go
@@ -41,8 +41,11 @@ func fannkuch(n int) int {
                        }
                        k := perm1[0] // cache perm[0] in k
                        for {         // k!=0 ==> k>0
-                               for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-                                       perm[i], perm[j] = perm[j], perm[i]
+                               if i, j := 1, k-1; i < j {
+                                       for ; i < j; i, j = i+1, j-1 {
+                                       //for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
+                                               perm[i], perm[j] = perm[j], perm[i]
+                                       }
                                }
                                flips++
                                // Now exchange k (caching perm[0]) and perm[k]... with care!

we have

(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1/loopalign$ benchstat tip.txt tip+peel.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.79s ± 1%   2.80s ± 1%  +0.24%  (p=0.003 n=18+19)

However, the code is not exactly the same as in go18, because the "peeling" still jumps to an entry point inside the loop:

@@ -100,144 +100,149 @@ pc392:
        MOVQ       R9, didpr-40(SP)
        MOVQ       (BX), R13              // fannkuch_test.go:42
        MOVL       $0, R14
-       JMP        pc518                  // fannkuch_test.go:43
-pc410:
-       CMPQ       DI, R12                // fannkuch_test.go:45
-       JCC        pc786
+       JMP        pc543                  // fannkuch_test.go:43
+pc413:
+       MOVQ       AX, _autotmp_40-32(SP)
+       MOVL       $1, AX                                               // <~~ peeling jumps inside loop
+       MOVQ       AX, i-48(SP)
+       MOVQ       _autotmp_40-32(SP), AX
+       JMP        pc496                  // fannkuch_test.go:45

+pc435:                                                                 // <-- loop
+       CMPQ       DI, R12                // fannkuch_test.go:47
+       JCC        pc801
        MOVQ       (R11)(DI*8), R9
        CMPQ       R15, R12
-       JCC        pc786
+       JCC        pc801
        MOVQ       (R11)(R15*8), CX
        MOVQ       CX, (R11)(DI*8)
        MOVQ       R9, (R11)(R15*8)
-       LEAQ       1(DI), CX              // fannkuch_test.go:44
+       LEAQ       1(DI), CX              // fannkuch_test.go:45
        DECQ       R15
        MOVQ       n1-56(SP), DI
        MOVQ       didpr-40(SP), R9
        MOVQ       CX, i-48(SP)
        MOVQ       count_ptr-8(SP), CX
-pc471:
+pc496:                                                                 // <-- cmp inside loop
        MOVQ       i-48(SP), DI
        CMPQ       DI, R15
-       JLT        pc410
-       CMPQ       R13, R12               // fannkuch_test.go:49
-       JCC        pc779
+       JLT        pc435
+pc506:
+       CMPQ       R13, R12               // fannkuch_test.go:52
+       JCC        pc794
        MOVQ       (R11)(R13*8), DI
-       MOVQ       R13, (R11)(R13*8)      // fannkuch_test.go:50
-       LEAQ       1(R14), R13            // fannkuch_test.go:47
-       TESTQ      DI, DI                 // fannkuch_test.go:52
-       JEQ        pc544
+       MOVQ       R13, (R11)(R13*8)      // fannkuch_test.go:53
+       LEAQ       1(R14), R13            // fannkuch_test.go:50
+       TESTQ      DI, DI                 // fannkuch_test.go:55
+       JEQ        pc559
        MOVQ       R13, R14
        MOVQ       DI, R13
        MOVQ       n1-56(SP), DI
...

As you say, I tend to believe this should not be very important (not 100% sure, though). I just wanted to mention, for completeness, that go18 and go19 generate different code structures.


Regarding the 16-byte line boundary:

I'm talking about the jump source address, because this address is fed into the branch predictor and could potentially be bad for decoding, etc.

My references:

  • https://patchwork.kernel.org/patch/6194771/ (search for "Intel CPUs can decode only up to 16 bytes at a time" and Linus answer)

  • http://www.agner.org/optimize/microarchitecture.pdf p. 28

    BTB organization
    The branch target buffer in Sandy Bridge is bigger than in Nehalem according to unofficial
    rumors. It is unknown whether it has one level, as in Core 2 and earlier processors, or two
    levels as in Nehalem. It can handle a maximum of four call instructions per 16 bytes of
    code. Conditional jumps are less efficient if there are more than 3 branch instructions per 16
    bytes of code.

    p. 29

    3.8 Branch prediction in Intel Haswell, Broadwell and Skylake
    ...
    The high throughput for taken branches of one per clock was observed for up to 128
    branches with no more than one branch per 16 bytes of code. If there is more than one
    branch per 16 bytes of code then the throughput is reduced to one jump per two clock
    cycles. If there are more than 128 branches in the critical part of the code, and if they are
    spaced by at least 16 bytes, then apparently the first 128 branches have the high
    throughput and the remaining have the low throughput.

I agree that's not exactly what I said, so sorry for the possible misinformation here. However, from the references above, the 16-byte lines and the effects associated with them do seem to exist, and from this point of view a jump instruction straddling two lines in a hot loop is suspicious to me.
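The straddling condition described here can be checked with simple address arithmetic. A sketch, assuming 16-byte fetch/decode lines; `crossesLine` is a hypothetical helper, and the addresses are the JL instruction from the diffs above:

```go
package main

import "fmt"

// crossesLine reports whether an instruction starting at addr with the
// given length in bytes straddles a 16-byte line boundary.
func crossesLine(addr, length uint64) bool {
	return addr&^15 != (addr+length-1)&^15
}

func main() {
	// The 2-byte JL at 0x75ffaf ends at 0x75ffb0: it straddles a line.
	fmt.Println(crossesLine(0x75ffaf, 2)) // true
	// After the 1-byte shift it sits at 0x75ffb0..0x75ffb1: same line.
	fmt.Println(crossesLine(0x75ffb0, 2)) // false
}
```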

Maybe I'm wrong here, but I would try to check it.
It would be good if @TocarIP could provide more feedback on this topic.

Thanks,
Kirill


Contributor

josharian commented May 16, 2017

However my time for this issue is currently up

Understood. Thanks for all the digging so far!

I'm just about out of ideas myself. I am very curious to know whether the ldflags approach serves as a suitable fuzzer for this, though, because if so, we should add it to our automated benchmarks.

BTB organization

IIUC, the BTB is used to predict the target of indirect jumps and calls. All jumps in fannkuch are direct, though, so the BTB should not be used. @TocarIP might correct me, though.

maybe sorry for misinformation here

We're all muddling through together. :)

And I just noticed that @TocarIP also said:

I'd like to experiment with branch source alignment, but this isn't as simple as changing 1 constant, and I will be away for a ~week.


Contributor

navytux commented May 17, 2017

Josh, thanks once again for the feedback and for your work on this issue.

I am very curious to know about whether the ldflags serves as a suitable fuzzer for this, though, because if so, we should add it to our auto benchmarks.

First, I have to say that via -ldflags we can shift only by a multiple of the page size (4K), because otherwise programs just segfault (probably because somewhere there is an assumption that certain addresses are page-aligned):

kirr@deco:~/tmp/trashme/t$ cat hello.go 
package main

func main() {
        println("hello")
}
kirr@deco:~/tmp/trashme/t$ go build -ldflags=-T=0x401000 hello.go 
kirr@deco:~/tmp/trashme/t$ ./hello 
hello
kirr@deco:~/tmp/trashme/t$ go build -ldflags=-T=0x401010 hello.go 
kirr@deco:~/tmp/trashme/t$ ./hello 
Segmentation fault

So it is not as thorough as my earlier shift-by-println approach. (0x401000 is the address regularly used for the start of the .text section.)
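The constraint can be stated as a one-line check. A minimal sketch, assuming 4K pages; `pageAligned` is a hypothetical helper, and the two addresses are the ones from the transcript above:

```go
package main

import "fmt"

// pageAligned reports whether addr is a multiple of the 4K page size,
// the only text-start values -ldflags=-T accepts without segfaulting.
func pageAligned(addr uint64) bool {
	return addr%0x1000 == 0
}

func main() {
	fmt.Println(pageAligned(0x401000)) // true: the usual .text start works
	fmt.Println(pageAligned(0x401010)) // false: shifted by 16 bytes, segfaults
}
```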

Here is what happens with -ldflags for the same go18 (go1.8.1-3-gdd8c975372), gotip (4b0d74f; already has CL 43293 applied) and gotip + CL 43491 (patch set 3), with 4K shifts:

[benchmark results image]

In other words, the effect is almost gone.
(benchmarked with https://lab.nexedi.com/kirr/go/commit/f10506bb)

I think what is needed is to teach go test to link a more fine-grained padding function into the binary before the user objects; then it will work.

/cc @dvyukov, @aclements, @quentinmit (please see how much difference shifting by 16 bytes can do here: #18977 (comment))


BTB and rest: you could be right, but what I feel is that instruction processing is heavily tied to 16-byte entries, probably due to how decoding is organized, how branch prediction info is stored, etc. Besides, the BTB was described for "Sandy Bridge"; for the newer Intel Haswell, Broadwell and Skylake (I have 6th-gen, so Skylake) it is about regular branch predictors.

Anyway, for now these are all guesses, and it would help to get some feedback from inside Intel (@TocarIP ?).

The documentation is also not always correct, and as you can see from the branch-predictor material in Agner Fog, there are many missing pieces about how it all works. So offhand, if the instruction processing unit is 16 bytes, I would say that should be at least somewhat relevant for the branch predictor as well.


Btw, about CL 43491 - maybe you could look into changing something there so that the inner loop is still organized with a backward jump, not a forward one. According to the Agner Fog docs, if I understand correctly, backward jumps are a bit better for loops: at least they are predicted taken the first time, loops are detected only via backward jumps, etc.

Hope this helps a bit,
Kirill


Member

dvyukov commented May 17, 2017

This reminds me of #8717.
We've constantly seen ~+/-3% fluctuations after completely unrelated changes (like deleting some code in a function that does not even execute). Russ and I were not able to get to the bottom of this. If you can shed some light on it, that would be great.


Contributor

navytux commented May 17, 2017

@dvyukov thanks for the feedback. I thought the light was already shed in #18977 (comment) and #18977 (comment). In short, we currently believe it is the hashing inside the CPU used for branch prediction, which takes the source jump IP into account; this can sometimes produce hash collisions, so the branch prediction state for two unrelated branches can overwrite each other.

Please see http://www.agner.org/optimize/microarchitecture.pdf p. 15 (agree predictor) for an overview.

The particular hash function used is not documented and seemingly differs (as do other predictor details) from CPU to CPU.

However, as Josh proposed, there is a way to mitigate at least the unrelated-change noise in benchmarks: perform the timings with the tested function starting at different addresses, and take e.g. the best or the average. On amd64, Go currently aligns functions to 16 bytes. Once again, in #18977 (comment) I hackily did a test for Fannkuch11, changing its start address in 16-byte steps. It could be done more generally and used to at least remove unrelated-change noise from the benchmarks.

I hope I understood Josh correctly,
Kirill

gopherbot pushed a commit that referenced this issue May 18, 2017

cmd/compile: don't break up contiguous blocks in looprotate
looprotate finds loop headers and arranges for them to be placed
after the body of the loop. This eliminates a jump from the body.

However, if the loop header is a series of contiguously laid out blocks,
the rotation introduces a new jump in that series.
This CL expands the "loop header" to move to be the entire
run of contiguously laid out blocks in the same loop.

This shrinks object files a little, and actually speeds up
the compiler noticeably. Numbers below.

Fannkuch performance seems to vary a lot by machine. On my laptop:

name          old time/op  new time/op  delta
Fannkuch11-8   2.89s ± 2%   2.85s ± 3%  -1.22%  (p=0.000 n=50+50)

This has a significant effect on the append benchmarks in #14758:

name   old time/op    new time/op    delta
Foo-8     312ns ± 3%     276ns ± 2%  -11.37%  (p=0.000 n=30+29)
Bar-8     565ns ± 2%     456ns ± 2%  -19.27%  (p=0.000 n=27+28)

Updates #18977
Fixes #20355

name        old time/op       new time/op       delta
Template          205ms ± 5%        204ms ± 8%    ~     (p=0.903 n=92+99)
Unicode          85.3ms ± 4%       85.1ms ± 3%    ~     (p=0.191 n=92+94)
GoTypes           512ms ± 4%        507ms ± 4%  -0.93%  (p=0.000 n=95+97)
Compiler          2.38s ± 3%        2.35s ± 3%  -1.27%  (p=0.000 n=98+95)
SSA               4.67s ± 3%        4.64s ± 3%  -0.62%  (p=0.000 n=95+96)
Flate             117ms ± 3%        117ms ± 3%    ~     (p=0.099 n=84+86)
GoParser          139ms ± 4%        137ms ± 4%  -0.90%  (p=0.000 n=97+98)
Reflect           329ms ± 5%        326ms ± 6%  -0.97%  (p=0.002 n=99+98)
Tar               102ms ± 6%        101ms ± 5%  -0.97%  (p=0.006 n=97+97)
XML               198ms ±10%        196ms ±13%    ~     (p=0.087 n=100+100)
[Geo mean]        318ms             316ms       -0.72%

name        old user-time/op  new user-time/op  delta
Template          250ms ± 7%        250ms ± 7%    ~     (p=0.850 n=94+92)
Unicode           107ms ± 8%        106ms ± 5%  -0.76%  (p=0.005 n=98+91)
GoTypes           665ms ± 5%        659ms ± 5%  -0.85%  (p=0.003 n=93+98)
Compiler          3.15s ± 3%        3.10s ± 3%  -1.60%  (p=0.000 n=99+98)
SSA               6.82s ± 3%        6.72s ± 4%  -1.55%  (p=0.000 n=94+98)
Flate             138ms ± 8%        138ms ± 6%    ~     (p=0.369 n=94+92)
GoParser          170ms ± 5%        168ms ± 6%  -1.13%  (p=0.002 n=96+98)
Reflect           412ms ± 8%        416ms ± 8%    ~     (p=0.169 n=100+100)
Tar               123ms ±18%        123ms ±14%    ~     (p=0.896 n=100+100)
XML               236ms ± 9%        234ms ±11%    ~     (p=0.124 n=100+100)
[Geo mean]        401ms             398ms       -0.63%

name        old alloc/op      new alloc/op      delta
Template         38.8MB ± 0%       38.8MB ± 0%    ~     (p=0.222 n=5+5)
Unicode          28.7MB ± 0%       28.7MB ± 0%    ~     (p=0.421 n=5+5)
GoTypes           109MB ± 0%        109MB ± 0%    ~     (p=0.056 n=5+5)
Compiler          457MB ± 0%        457MB ± 0%  +0.07%  (p=0.008 n=5+5)
SSA              1.10GB ± 0%       1.10GB ± 0%  +0.05%  (p=0.008 n=5+5)
Flate            24.5MB ± 0%       24.5MB ± 0%    ~     (p=0.222 n=5+5)
GoParser         30.9MB ± 0%       31.0MB ± 0%  +0.21%  (p=0.016 n=5+5)
Reflect          73.4MB ± 0%       73.4MB ± 0%    ~     (p=0.421 n=5+5)
Tar              25.5MB ± 0%       25.5MB ± 0%    ~     (p=0.548 n=5+5)
XML              40.9MB ± 0%       40.9MB ± 0%    ~     (p=0.151 n=5+5)
[Geo mean]       71.6MB            71.6MB       +0.07%

name        old allocs/op     new allocs/op     delta
Template           394k ± 0%         394k ± 0%    ~     (p=1.000 n=5+5)
Unicode            344k ± 0%         343k ± 0%    ~     (p=0.310 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%    ~     (p=1.000 n=5+5)
Compiler          4.42M ± 0%        4.42M ± 0%    ~     (p=1.000 n=5+5)
SSA               9.80M ± 0%        9.80M ± 0%    ~     (p=0.095 n=5+5)
Flate              237k ± 1%         238k ± 1%    ~     (p=0.310 n=5+5)
GoParser           320k ± 0%         322k ± 1%  +0.50%  (p=0.032 n=5+5)
Reflect            958k ± 0%         957k ± 0%    ~     (p=0.548 n=5+5)
Tar                252k ± 1%         252k ± 0%    ~     (p=1.000 n=5+5)
XML                400k ± 0%         400k ± 0%    ~     (p=0.841 n=5+5)
[Geo mean]         741k              742k       +0.06%

name        old object-bytes  new object-bytes  delta
Template           386k ± 0%         386k ± 0%  -0.05%  (p=0.008 n=5+5)
Unicode            202k ± 0%         202k ± 0%  -0.01%  (p=0.008 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%  -0.06%  (p=0.008 n=5+5)
Compiler          3.91M ± 0%        3.91M ± 0%  -0.06%  (p=0.008 n=5+5)
SSA               7.91M ± 0%        7.92M ± 0%  +0.01%  (p=0.008 n=5+5)
Flate              228k ± 0%         227k ± 0%  -0.04%  (p=0.008 n=5+5)
GoParser           283k ± 0%         283k ± 0%  -0.06%  (p=0.008 n=5+5)
Reflect            952k ± 0%         951k ± 0%  -0.02%  (p=0.008 n=5+5)
Tar                187k ± 0%         187k ± 0%  -0.04%  (p=0.008 n=5+5)
XML                406k ± 0%         406k ± 0%  -0.05%  (p=0.008 n=5+5)
[Geo mean]         648k              648k       -0.04%

Change-Id: I8630c4291a0eb2f7e7927bc04d7cc0efef181094
Reviewed-on: https://go-review.googlesource.com/43491
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

Contributor

navytux commented May 19, 2017

For reference: Fannkuch11 can be sped up by 33% via improving BCE:

#20393 (comment)


Contributor

josharian commented May 19, 2017

It can be sped up 33% with -B. That might or might not be the same thing.


Contributor

navytux commented May 20, 2017

Correct. However, all runtime bounds checks in fannkuch can be statically proven to be unnecessary:

#20393 (comment)

That means the compiler can theoretically reach -B level with just clever BCE.


Member

dvyukov commented May 22, 2017

@josharian Dmitry's performance dashboard was noisy and spammy and ultimately disabled.

It wasn't spammy and it wasn't particularly noisy. #8717 led to episodic noise, but otherwise it reliably detected changes in multiple metrics within fractions of a percent. The dashboard wasn't disabled; it was broken by changes to the app and then never fixed.


Member

dvyukov commented May 22, 2017

@navytux @josharian I thought the light was already shed in #18977 (comment) and #18977 (comment).

The analysis looks great! Thanks!


Contributor

navytux commented May 22, 2017

@dvyukov thanks for the feedback.


Contributor

randall77 commented Jun 6, 2017

We're punting more loop rotation work to 1.10.
