
cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering #18977

cherrymui opened this issue Feb 7, 2017 · 42 comments
@cherrymui cherrymui commented Feb 7, 2017

CL https://go-review.googlesource.com/c/36205/ deletes the assembler backend's instruction reordering. This makes Fannkuch-11 on AMD64 6% slower. Figure out why, and improve the compiler.

@cherrymui cherrymui self-assigned this Feb 7, 2017
@cherrymui cherrymui added this to the Go1.9 milestone Feb 7, 2017
@cherrymui cherrymui changed the title cmd/compile: Fannkuch-11 on AMD64 slow down 6% after removing assembler backend instruction reordering cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering Feb 7, 2017
@gopherbot gopherbot commented Mar 21, 2017

CL https://golang.org/cl/38431 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 24, 2017
Old loops look like this:
   loop:
     CMPQ ...
     JGE exit
     ...
     JMP loop
   exit:

New loops look like this:
    JMP entry
  loop:
    ...
  entry:
    CMPQ ...
    JLT loop

This removes one instruction (the unconditional jump) from
the inner loop.
Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.  I'm not sure if it is better or
worse to do that (peeling gets rid of the JMP but means more
code duplication), but this CL is certainly a much simpler
compiler change, so I'll try this way first.
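For reference, the rotated shape above is what the compiler emits for an ordinary counting loop like the one below (an illustrative sketch, not code from the CL):

```go
package main

import "fmt"

// sum is a minimal counting loop whose compiled inner loop has the shape
// discussed above: before this CL the body ended in an unconditional JMP
// back to the CMPQ at the top; after it, the exit test sits at the bottom
// and the body falls straight through to it.
func sum(xs []int) int {
	t := 0
	for i := 0; i < len(xs); i++ {
		t += xs[i]
	}
	return t
}

func main() {
	fmt.Println(sum([]int{1, 2, 3, 4})) // prints 10
}
```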

The obj library used to do peeling before
CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only
part of that reordering that we'd like to keep.

Fixes #14758 (append loop)
name    old time/op    new time/op    delta
Foo-12     817ns ± 4%     538ns ± 0%  -34.08%   (p=0.000 n=10+9)
Bar-12     850ns ±11%     570ns ±13%  -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)
name                       old time/op  new time/op  delta
DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)
name    old time/op    new time/op    delta
Foo-12     810ns ± 1%     540ns ± 0%  -33.30%   (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)
name         old time/op    new time/op    delta
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
This one makes no sense.  There's strictly 1 less instruction in the
inner loop (17 instead of 18).  They are exactly the same instructions
except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive.  But the gains for the
specific issues above make this CL still probably worth it.
name                      old time/op    new time/op    delta
BinaryTree17-8              2.32s ± 0%     2.34s ± 0%  +1.14%    (p=0.000 n=9+7)
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
FmtFprintfEmpty-8          44.1ns ± 1%    46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
FmtFprintfString-8         67.8ns ± 0%    74.4ns ± 1%  +9.80%   (p=0.000 n=10+9)
FmtFprintfInt-8            74.9ns ± 0%    78.4ns ± 0%  +4.67%   (p=0.000 n=8+10)
FmtFprintfIntInt-8          117ns ± 1%     123ns ± 1%  +4.69%   (p=0.000 n=9+10)
FmtFprintfPrefixedInt-8     160ns ± 1%     146ns ± 0%  -8.22%   (p=0.000 n=8+10)
FmtFprintfFloat-8           214ns ± 0%     206ns ± 0%  -3.91%    (p=0.000 n=8+8)
FmtManyArgs-8               468ns ± 0%     497ns ± 1%  +6.09%   (p=0.000 n=8+10)
GobDecode-8                6.16ms ± 0%    6.21ms ± 1%  +0.76%   (p=0.000 n=9+10)
GobEncode-8                4.90ms ± 0%    4.92ms ± 1%  +0.37%   (p=0.028 n=9+10)
Gzip-8                      209ms ± 0%     212ms ± 0%  +1.33%  (p=0.000 n=10+10)
Gunzip-8                   36.6ms ± 0%    38.0ms ± 1%  +4.03%    (p=0.000 n=9+9)
HTTPClientServer-8         84.2µs ± 0%    86.0µs ± 1%  +2.14%    (p=0.000 n=9+9)
JSONEncode-8               13.6ms ± 3%    13.8ms ± 1%  +1.55%   (p=0.003 n=9+10)
JSONDecode-8               53.2ms ± 5%    52.9ms ± 0%    ~     (p=0.280 n=10+10)
Mandelbrot200-8            3.78ms ± 0%    3.78ms ± 1%    ~      (p=0.661 n=10+9)
GoParse-8                  2.89ms ± 0%    2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      68.5ns ± 2%    68.9ns ± 1%    ~     (p=0.136 n=10+10)
RegexpMatchEasy0_1K-8       220ns ± 1%     225ns ± 1%  +2.41%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-8      64.7ns ± 0%    64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
RegexpMatchEasy1_1K-8       348ns ± 1%     355ns ± 0%  +1.90%  (p=0.000 n=10+10)
RegexpMatchMedium_32-8      102ns ± 1%     105ns ± 1%  +2.95%  (p=0.000 n=10+10)
RegexpMatchMedium_1K-8     33.1µs ± 3%    32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.71µs ± 1%    1.70µs ± 1%  -0.84%   (p=0.002 n=10+9)
RegexpMatchHard_1K-8       51.1µs ± 0%    50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
Revcomp-8                   411ms ± 1%     402ms ± 0%  -2.22%   (p=0.000 n=10+9)
Template-8                 61.8ms ± 1%    59.7ms ± 0%  -3.44%    (p=0.000 n=9+9)
TimeParse-8                 306ns ± 0%     318ns ± 0%  +3.83%  (p=0.000 n=10+10)
TimeFormat-8                320ns ± 0%     318ns ± 1%  -0.53%   (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@josharian josharian commented May 11, 2017

Tip (6897030) looks like a net win vs 1.8.1. Code size is smaller:

  • 1.8.1: "".fannkuch t=1 size=844 args=0x10 locals=0x80
  • tip: "".fannkuch STEXT size=829 args=0x10 locals=0x80

Execution time is faster, at least on my machine:

name          old time/op  new time/op  delta
Fannkuch11-8   2.74s ± 2%   2.68s ± 2%  -2.32%  (p=0.000 n=20+19)

So whatever regression occurred has at least been compensated for, if not outright fixed, by CL 38431.

I do see one notable regression, though, which is in the number of panicindex calls. I've filed #20332 for that.

I'm going to close this as fixed. @cherrymui or @randall77, please reopen if you disagree.

@josharian josharian closed this May 11, 2017
@navytux navytux commented May 11, 2017

@josharian, sad to bring bad news but here is how it looks on my machine:

$ benchstat go18.txt gotip.txt
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.94s ± 0%  +9.88%  (p=0.000 n=20+19)

in other words, a ~10% regression.

$ go18 version
go version go1.8.1 linux/amd64
$ go version
go version devel +ee69c21747 Thu May 11 18:28:40 2017 +0000 linux/amd64

Linux deco 4.9.0-2-amd64 #1 SMP Debian 4.9.18-1 (2017-03-30) x86_64 GNU/Linux
Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

@josharian josharian commented May 11, 2017

@navytux will you re-measure including CL 43291?

@navytux navytux commented May 11, 2017

let me try

@navytux navytux commented May 11, 2017

Strangely, CL 43291 only makes things worse:

$ benchstat gotip.txt gotip+cl43291.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.94s ± 0%   3.01s ± 0%  +2.29%  (p=0.000 n=19+19)

and I've double-checked that it is reproducible by recompiling (make.bash) and benchmarking tip, then tip+CL, then tip again, then tip+CL again.

@josharian josharian commented May 11, 2017

OK, re-opening.

One other thing I see in tip vs 1.8.1 is that in 1.8.1, all the panicindex calls are at the end of the function, so jumps to them are always statically predicted unlikely (forward jumps). In tip, they're scattered throughout. Not sure whether that's responsible, but I'll see if I can fix it and check whether it helps.

@josharian josharian reopened this May 11, 2017
@navytux navytux commented May 11, 2017

@josharian thanks for feedback and reopening.

@gopherbot gopherbot commented May 11, 2017

CL https://golang.org/cl/43293 mentions this issue.

@josharian josharian commented May 11, 2017

@navytux will you see what effect CL 43293 has on your machine? Thanks.

@navytux navytux commented May 12, 2017

@josharian sure. Today's tip is go version devel +482da51803 Fri May 12 01:10:11 2017 +0000 linux/amd64 and it already includes CL 43291. We are using it as the base:

$ benchstat gotip.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.01s ± 0%   2.84s ± 1%  -5.66%  (p=0.000 n=17+19)

in other words, a partial improvement: we are now back to a 6% regression over go18:

$ benchstat go18.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.84s ± 1%  +6.17%  (p=0.000 n=20+19)
@navytux navytux commented May 12, 2017

For reference, here is how Fannkuch11 looks under perf stat -d:

 Performance counter stats for './go1.test-18 -test.bench Fannkuch11' (10 runs):

       4008,809815      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,06% )
               834      context-switches          #    0,208 K/sec                    ( +-  1,04% )
                14      cpu-migrations            #    0,003 K/sec                    ( +- 10,89% )
            79 865      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 398 886 442      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
    29 180 837 581      instructions              #    2,18  insn per cycle           ( +-  0,06% )  (62,57%)
     5 974 722 394      branches                  # 1490,398 M/sec                    ( +-  0,06% )  (62,62%)
       162 930 510      branch-misses             #    2,73% of all branches          ( +-  0,09% )  (62,73%)
     6 602 849 508      L1-dcache-loads           # 1647,085 M/sec                    ( +-  0,11% )  (62,84%)
        73 480 588      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,24% )  (62,61%)
         3 604 844      LLC-loads                 #    0,899 M/sec                    ( +-  3,96% )  (49,89%)
           316 385      LLC-load-misses           #    8,78% of all LL-cache hits     ( +-  5,10% )  (49,85%)

       4,003714952 seconds time elapsed                                          ( +-  0,07% )
 Performance counter stats for './go1.test-tip -test.bench Fannkuch11' (10 runs):

       4259,019276      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,09% )
               822      context-switches          #    0,193 K/sec                    ( +-  1,31% )
                15      cpu-migrations            #    0,004 K/sec                    ( +-  7,89% )
            79 928      page-faults               #    0,019 M/sec                    ( +-  0,01% )
    14 228 793 806      cycles                    #    3,341 GHz                      ( +-  0,08% )  (49,92%)
    29 011 760 342      instructions              #    2,04  insn per cycle           ( +-  0,08% )  (62,54%)
     6 812 768 092      branches                  # 1599,610 M/sec                    ( +-  0,05% )  (62,56%)
       169 436 247      branch-misses             #    2,49% of all branches          ( +-  0,07% )  (62,68%)
     6 617 477 028      L1-dcache-loads           # 1553,756 M/sec                    ( +-  0,07% )  (62,84%)
        72 643 321      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,21% )  (62,77%)
         3 792 842      LLC-loads                 #    0,891 M/sec                    ( +-  8,00% )  (49,95%)
           324 234      LLC-load-misses           #    8,55% of all LL-cache hits     ( +-  6,96% )  (49,88%)

       4,255140742 seconds time elapsed                                          ( +-  0,09% )
 Performance counter stats for './go1.test-tip+cl43293 -test.bench Fannkuch11' (10 runs):

       4076,691573      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,18% )
               809      context-switches          #    0,198 K/sec                    ( +-  1,07% )
                14      cpu-migrations            #    0,003 K/sec                    ( +-  7,07% )
            79 926      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 622 118 634      cycles                    #    3,341 GHz                      ( +-  0,10% )  (49,99%)
    28 659 892 607      instructions              #    2,10  insn per cycle           ( +-  0,11% )  (62,64%)
     6 412 523 320      branches                  # 1572,972 M/sec                    ( +-  0,07% )  (62,67%)
       175 939 383      branch-misses             #    2,74% of all branches          ( +-  0,09% )  (62,74%)
     6 634 137 889      L1-dcache-loads           # 1627,334 M/sec                    ( +-  0,10% )  (62,81%)
        72 791 290      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,52% )  (62,68%)
         3 613 857      LLC-loads                 #    0,886 M/sec                    ( +-  3,46% )  (49,88%)
           315 113      LLC-load-misses           #    8,72% of all LL-cache hits     ( +-  5,94% )  (49,86%)

       4,072909978 seconds time elapsed                                          ( +-  0,19% )

highlights:

  1. The number of instructions executed is lower for tip (99.4%) and tip+cl43293 (98.2%) compared to go18.
  2. However, instructions per CPU cycle are lower at tip (2,04 insn/cycle) and tip+cl43293 (2,10 insn/cycle) compared to go18 (2,18 insn/cycle). That means there are probably more stalls, due to instruction mis-scheduling or something similar. The result is that more cycles are spent even though there are fewer instructions to execute.
  3. The total number of branches is higher at tip (114%) and tip+cl43293 (107%) compared to go18.
  4. L1d loads are only slightly higher for tip (0.2%) and tip+cl43293 (0.4%) compared to go18. The L1d miss ratio is also almost the same among the 3 versions.
  5. LLC (last-level cache - L3 in my case - the last stop before RAM) loads are higher for tip (105.2%) and go almost back to the go18 level at tip+cl43293 (100.2%). The variation in LLC-loads is larger for tip than for go18 and tip+cl43293. The LLC miss ratio is also approximately back to the go18 level at tip+cl43293.

To me all this suggests that, after CL 43293 or similar is applied, the remaining problems are that:

  • there are more branches compared to go18, and
  • instructions are less well scheduled by the compiler

Hope this helps a bit,
Kirill

@josharian josharian commented May 12, 2017

Thanks, Kirill, that's great. My guess was that CL 43293 was making the critical sections of code more compact, thus keeping more in cache; nice to have that confirmed. I'm working on making CL 43293 submittable now, and then will look again.

@navytux navytux commented May 12, 2017

Josh, thanks for the feedback and the patches. Offhand, I think that to get rid of this regression we need to fix the "Kinda surprisingly" in 39ce590 (CL 38431) - in other words, to understand the root cause and apply loop rotation only when it is actually beneficial, judging by the relevant factors. The current state, it seems, is that we always apply loop rotation, and it sometimes helps and sometimes hurts.

@randall77, @TocarIP please see #18977 (comment) for a bit of info about what is going on with Fannkuch11.

@josharian josharian commented May 12, 2017

I'm going to dig a bit more, although I'll be mostly AFK for a while. In the meantime, Kirill, perhaps you could confirm that disabling looprotate fixes the problem on your machine: add "return" at the very beginning of func loopRotate in cmd/compile/internal/ssa/looprotate.go to disable the pass.

@navytux navytux commented May 12, 2017

Josh, thanks for the feedback. Sure; staying at the same tip = 482da51, here is what we have:

  1. Going to CL 43293.2 (the second patch in that CL) brings more speedup:
$ benchstat gotip+cl43293.txt gotip+cl43293.2.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.84s ± 1%   2.78s ± 0%  -2.09%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip+cl43293.2 -test.bench Fannkuch11' (10 runs):

       4029,642462      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,11% )
               814      context-switches          #    0,202 K/sec                    ( +-  1,00% )
                11      cpu-migrations            #    0,003 K/sec                    ( +-  8,11% )
             79921      page-faults               #    0,020 M/sec                    ( +-  0,01% )
       13482684046      cycles                    #    3,346 GHz                      ( +-  0,10% )  (50,05%)
       28731337641      instructions              #    2,13  insn per cycle           ( +-  0,17% )  (62,66%)
        6429703132      branches                  # 1595,601 M/sec                    ( +-  0,14% )  (62,72%)
         168958851      branch-misses             #    2,63% of all branches          ( +-  0,18% )  (62,74%)
        6634192402      L1-dcache-loads           # 1646,348 M/sec                    ( +-  0,14% )  (62,80%)
          73264431      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,66% )  (62,57%)
           4233228      LLC-loads                 #    1,051 M/sec                    ( +-  3,12% )  (49,84%)
            376635      LLC-load-misses           #    8,90% of all LL-cache hits     ( +-  9,29% )  (49,89%)

       4,026348487 seconds time elapsed                                          ( +-  0,11% )

Some comments: there are slightly more (0.2%) instructions, but insn/cycle grows from 2.10 to 2.13, and as a result total CPU cycles are ~1% fewer (which correlates with the total run time reported by perf, which probably includes preparatory program actions that -test.bench itself excludes from accounting). The number of branches is also 0.2% higher; however, if I read the counters correctly, not all new instructions are branches, as δ(insn) = 71445034 but δ(branches) = 17179812 (4x less).

  2. Disabling loopRotate on top of CL 43293.2 makes it slower:
$ benchstat gotip+cl43293.2.txt gotip+cl43293.2-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.78s ± 0%   2.87s ± 0%  +2.93%  (p=0.000 n=16+15)
 Performance counter stats for './go1.test-tip+cl43293.2-nolooprot -test.bench Fannkuch11' (10 runs):

       4103,320043      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,12% )
               822      context-switches          #    0,200 K/sec                    ( +-  0,82% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,02% )
             79925      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       13711752541      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
       29498486031      instructions              #    2,15  insn per cycle           ( +-  0,07% )  (62,58%)
        7212426730      branches                  # 1757,705 M/sec                    ( +-  0,08% )  (62,64%)
         174290760      branch-misses             #    2,42% of all branches          ( +-  0,04% )  (62,73%)
        6609550227      L1-dcache-loads           # 1610,781 M/sec                    ( +-  0,07% )  (62,79%)
          73869349      L1-dcache-load-misses     #    1,12% of all L1-dcache hits    ( +-  0,28% )  (62,67%)
           3816460      LLC-loads                 #    0,930 M/sec                    ( +-  2,51% )  (49,91%)
            429992      LLC-load-misses           #   11,27% of all LL-cache hits     ( +-  8,17% )  (49,89%)

       4,098764012 seconds time elapsed                                          ( +-  0,12% )

Insn/cycle grows from 2.13 to 2.15 (good), but there are 2.7% more total instructions, so overall cycles grow by 1.7% (which correlates with the total run time reported by perf). There are 12% more branches, and it looks like they are the only visible factor contributing to the instruction growth: δ(branches) / δ(insn) ≈ 1.

  3. But disabling loopRotate on top of plain tip makes it faster:
$ benchstat gotip.txt gotip-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 1%   2.98s ± 0%  -1.29%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip-nolooprot -test.bench Fannkuch11' (10 runs):

       4215,739102      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,08% )
               827      context-switches          #    0,196 K/sec                    ( +-  0,81% )
                13      cpu-migrations            #    0,003 K/sec                    ( +- 10,58% )
             79932      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14108860946      cycles                    #    3,347 GHz                      ( +-  0,08% )  (49,91%)
       29508075225      instructions              #    2,09  insn per cycle           ( +-  0,14% )  (62,49%)
        7220934286      branches                  # 1712,851 M/sec                    ( +-  0,09% )  (62,58%)
         174301863      branch-misses             #    2,41% of all branches          ( +-  0,06% )  (62,77%)
        6621367485      L1-dcache-loads           # 1570,630 M/sec                    ( +-  0,07% )  (62,86%)
          73684219      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,45% )  (62,71%)
           3886187      LLC-loads                 #    0,922 M/sec                    ( +-  3,16% )  (49,91%)
            369954      LLC-load-misses           #    9,52% of all LL-cache hits     ( +-  6,86% )  (49,82%)

       4,211567986 seconds time elapsed                                          ( +-  0,08% )

Here the number of instructions increases by 1.7%, mostly due to a 6% increase in N(branches) (δ(branches) / δ(insn) ≈ 0.82); however, the insn/cycle ratio also grows from 2.04 to 2.09, which overcompensates for all of that, and the resulting N(cycles) decreases by ~1%.

My observations:

  • In all cases above, disabling loop rotation raises insn/cycle, but the go18 level of 2.18 is never reached; the closest was tip+cl43293.2-nolooprot (2.15 vs 2.18 for go18).
  • The task here seems to be to minimize the target function N(cycles) = N(instructions)⋅(cycles/insn). Code transformations that minimize N(instructions) (loop rotation, etc.) explicitly change only one factor of the formula, but I'm not sure cycles/insn is modelled or taken into account anywhere.

(I've re-benchmarked the previous runs from this morning and re-verified all timings several times.)
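As a sanity check on the formula above, the perf figures quoted earlier reproduce the measured cycle counts to within ~0.2% (a minimal sketch; the instruction counts and IPC values are the ones printed by perf stat in this thread):

```go
package main

import "fmt"

// cycles estimates total CPU cycles as N(instructions) * (cycles/insn),
// i.e. instructions divided by IPC.
func cycles(instructions, ipc float64) float64 {
	return instructions / ipc
}

func main() {
	// Constants taken from the perf stat runs quoted earlier in the thread.
	fmt.Printf("go18:        %.2fe9 cycles\n", cycles(29180837581, 2.18)/1e9) // measured: 13.40e9
	fmt.Printf("tip:         %.2fe9 cycles\n", cycles(29011760342, 2.04)/1e9) // measured: 14.23e9
	fmt.Printf("tip+cl43293: %.2fe9 cycles\n", cycles(28659892607, 2.10)/1e9) // measured: 13.62e9
}
```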

@navytux navytux commented May 12, 2017

I'm not an x86 guy (probably @TocarIP could help here), but my thinking was: if there are several execution units inside the CPU with somewhat different properties, the overall load on those units could depend on the order of instructions in the instruction stream. With this in mind I googled for "x86 add nop speedup" and found e.g.

https://research.google.com/pubs/pub37077.html
https://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a/17906589
...

Also, if we simplify and assume data loads/stores always hit the cache, what affects insn/cycle is

N(branch-misses) / N(instructions)

(note: not / N(branches)). The Stack Overflow link and the MAO article above contain some information on which factors can affect branch prediction and other loop-related behavior.

Coincidence or not, the discussion there is also usually about 5%-7% speedups or slowdowns.
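The suggested metric can be computed directly from the perf stat counters quoted earlier (a small sketch; the constants are the branch-miss and instruction counts from the go18 and tip runs above):

```go
package main

import "fmt"

// missRate computes branch mispredictions per executed instruction, the
// metric suggested above (branch-misses / instructions, not / branches).
func missRate(branchMisses, instructions float64) float64 {
	return branchMisses / instructions
}

func main() {
	// Counter values from the perf stat runs quoted earlier in the thread.
	fmt.Printf("go18: %.3f%%\n", 100*missRate(162930510, 29180837581)) // prints go18: 0.558%
	fmt.Printf("tip:  %.3f%%\n", 100*missRate(169436247, 29011760342)) // prints tip:  0.584%
}
```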

@TocarIP TocarIP commented May 12, 2017

I've played a bit with https://github.com/andikleen/pmu-tools. It suggests that this benchmark is bound by branch misprediction in various forms:
toplev.py -l=3 -d ./go1.test -test.run=NONE -test.bench=Fan:

FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:         17.59 % Clocks [ 16.66%]
        This metric represents cycles fraction the CPU was stalled
        due to Branch Resteers...
        Sampling events:  br_misp_retired.all_branches
BAD            Bad_Speculation:                                         22.39 % Slots [ 17.00%]
BAD            Bad_Speculation.Branch_Mispredicts:                      22.21 % Slots [ 17.00%] BN
        This metric represents slots fraction the CPU has wasted due
        to Branch Misprediction...
        Sampling events:  br_misp_retired.all_branches
               MUX:                                                     16.62 %

Running ocperf.py record -e br_misp_retired.all_branches_pebs shows that the guilty branches are caused by index checks:

884511   75ef95:       4d 39 e7                cmp    %r12,%r15
884512   75ef98:       73 33                   jae    75efcd <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1ed>

and code layout:

75efcb:       eb 0e                   jmp    75efdb <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1fb>
884526   75efcd:       e8 9e c8 cc ff          callq  42b870 <runtime.panicindex>

Note that in this case the branch is unconditional!
With CL 43293, at least, all remaining problems are with two conditional branches.
Playing with MaxLoopPad in cmd/internal/obj/x86/asm6 produces no improvement, so the problem is not branch-target alignment.
I'd like to experiment with branch-source alignment, but that isn't as simple as changing one constant, and I will be away for a ~week.
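For illustration, index checks of the kind implicated above can sometimes be removed at the source level with an explicit up-front bounds check. This is a hypothetical fannkuch-style sketch, not the benchmark's actual code; whether the compiler really drops the per-iteration cmp/jae checks depends on the compiler version and can be verified with -gcflags="-d=ssa/check_bce/debug=1":

```go
package main

import "fmt"

// flip reverses the first n elements of p, in the spirit of fannkuch's
// inner loop. The up-front `_ = p[n-1]` access is a common hint intended
// to let the compiler prove the loop's indexes in range and elide the
// per-iteration bounds checks that branch to runtime.panicindex.
func flip(p []int, n int) {
	_ = p[n-1] // one explicit bounds check up front
	for i, j := 0, n-1; i < j; i, j = i+1, j-1 {
		p[i], p[j] = p[j], p[i]
	}
}

func main() {
	p := []int{3, 1, 4, 1, 5}
	flip(p, 4)
	fmt.Println(p) // prints [1 4 1 3 5]
}
```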

@josharian josharian commented May 13, 2017

I haven't had a chance to dig into this excellent data, but see also #20355 for something that seems definitely wrong with looprotate. It seems that looprotate is doing both good and bad work for this benchmark.

@josharian josharian commented May 13, 2017

Also #20356.

@navytux navytux commented May 17, 2017

@dvyukov thanks for the feedback. I thought light was already shed in #18977 (comment) and #18977 (comment). In short, we currently believe the cause is the hashing the CPU uses for branch prediction, which takes the source jump IP into account; this can sometimes produce hash collisions, in which case the branch prediction state for two unrelated branches overwrites each other.

Please see

http://www.agner.org/optimize/microarchitecture.pdf, p. 15 (the agree predictor), for an overview.

The particular hash function used is not documented and, like other predictor details, seemingly differs from CPU to CPU.

However, as Josh proposed, there is a way to mitigate at least the effect of unrelated changes on benchmarks: perform the timings with the tested function starting at different addresses, and take e.g. the best or the average. On amd64, Go currently aligns functions to 16 bytes. Once again, in #18977 (comment) I hackily tested Fannkuch11 while shifting its start in 16-byte steps. This could be done more generally and used to at least remove the noise of unrelated changes from benchmarks.

I hope I understood Josh correctly,
Kirill

gopherbot pushed a commit that referenced this issue May 18, 2017
looprotate finds loop headers and arranges for them to be placed
after the body of the loop. This eliminates a jump from the body.

However, if the loop header is a series of contiguously laid out blocks,
the rotation introduces a new jump in that series.
This CL expands the "loop header" being moved to the entire
run of contiguously laid out blocks in the same loop.

This shrinks object files a little, and actually speeds up
the compiler noticeably. Numbers below.

Fannkuch performance seems to vary a lot by machine. On my laptop:

name          old time/op  new time/op  delta
Fannkuch11-8   2.89s ± 2%   2.85s ± 3%  -1.22%  (p=0.000 n=50+50)

This has a significant effect on the append benchmarks in #14758:

name   old time/op    new time/op    delta
Foo-8     312ns ± 3%     276ns ± 2%  -11.37%  (p=0.000 n=30+29)
Bar-8     565ns ± 2%     456ns ± 2%  -19.27%  (p=0.000 n=27+28)

Updates #18977
Fixes #20355

name        old time/op       new time/op       delta
Template          205ms ± 5%        204ms ± 8%    ~     (p=0.903 n=92+99)
Unicode          85.3ms ± 4%       85.1ms ± 3%    ~     (p=0.191 n=92+94)
GoTypes           512ms ± 4%        507ms ± 4%  -0.93%  (p=0.000 n=95+97)
Compiler          2.38s ± 3%        2.35s ± 3%  -1.27%  (p=0.000 n=98+95)
SSA               4.67s ± 3%        4.64s ± 3%  -0.62%  (p=0.000 n=95+96)
Flate             117ms ± 3%        117ms ± 3%    ~     (p=0.099 n=84+86)
GoParser          139ms ± 4%        137ms ± 4%  -0.90%  (p=0.000 n=97+98)
Reflect           329ms ± 5%        326ms ± 6%  -0.97%  (p=0.002 n=99+98)
Tar               102ms ± 6%        101ms ± 5%  -0.97%  (p=0.006 n=97+97)
XML               198ms ±10%        196ms ±13%    ~     (p=0.087 n=100+100)
[Geo mean]        318ms             316ms       -0.72%

name        old user-time/op  new user-time/op  delta
Template          250ms ± 7%        250ms ± 7%    ~     (p=0.850 n=94+92)
Unicode           107ms ± 8%        106ms ± 5%  -0.76%  (p=0.005 n=98+91)
GoTypes           665ms ± 5%        659ms ± 5%  -0.85%  (p=0.003 n=93+98)
Compiler          3.15s ± 3%        3.10s ± 3%  -1.60%  (p=0.000 n=99+98)
SSA               6.82s ± 3%        6.72s ± 4%  -1.55%  (p=0.000 n=94+98)
Flate             138ms ± 8%        138ms ± 6%    ~     (p=0.369 n=94+92)
GoParser          170ms ± 5%        168ms ± 6%  -1.13%  (p=0.002 n=96+98)
Reflect           412ms ± 8%        416ms ± 8%    ~     (p=0.169 n=100+100)
Tar               123ms ±18%        123ms ±14%    ~     (p=0.896 n=100+100)
XML               236ms ± 9%        234ms ±11%    ~     (p=0.124 n=100+100)
[Geo mean]        401ms             398ms       -0.63%

name        old alloc/op      new alloc/op      delta
Template         38.8MB ± 0%       38.8MB ± 0%    ~     (p=0.222 n=5+5)
Unicode          28.7MB ± 0%       28.7MB ± 0%    ~     (p=0.421 n=5+5)
GoTypes           109MB ± 0%        109MB ± 0%    ~     (p=0.056 n=5+5)
Compiler          457MB ± 0%        457MB ± 0%  +0.07%  (p=0.008 n=5+5)
SSA              1.10GB ± 0%       1.10GB ± 0%  +0.05%  (p=0.008 n=5+5)
Flate            24.5MB ± 0%       24.5MB ± 0%    ~     (p=0.222 n=5+5)
GoParser         30.9MB ± 0%       31.0MB ± 0%  +0.21%  (p=0.016 n=5+5)
Reflect          73.4MB ± 0%       73.4MB ± 0%    ~     (p=0.421 n=5+5)
Tar              25.5MB ± 0%       25.5MB ± 0%    ~     (p=0.548 n=5+5)
XML              40.9MB ± 0%       40.9MB ± 0%    ~     (p=0.151 n=5+5)
[Geo mean]       71.6MB            71.6MB       +0.07%

name        old allocs/op     new allocs/op     delta
Template           394k ± 0%         394k ± 0%    ~     (p=1.000 n=5+5)
Unicode            344k ± 0%         343k ± 0%    ~     (p=0.310 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%    ~     (p=1.000 n=5+5)
Compiler          4.42M ± 0%        4.42M ± 0%    ~     (p=1.000 n=5+5)
SSA               9.80M ± 0%        9.80M ± 0%    ~     (p=0.095 n=5+5)
Flate              237k ± 1%         238k ± 1%    ~     (p=0.310 n=5+5)
GoParser           320k ± 0%         322k ± 1%  +0.50%  (p=0.032 n=5+5)
Reflect            958k ± 0%         957k ± 0%    ~     (p=0.548 n=5+5)
Tar                252k ± 1%         252k ± 0%    ~     (p=1.000 n=5+5)
XML                400k ± 0%         400k ± 0%    ~     (p=0.841 n=5+5)
[Geo mean]         741k              742k       +0.06%

name        old object-bytes  new object-bytes  delta
Template           386k ± 0%         386k ± 0%  -0.05%  (p=0.008 n=5+5)
Unicode            202k ± 0%         202k ± 0%  -0.01%  (p=0.008 n=5+5)
GoTypes           1.16M ± 0%        1.16M ± 0%  -0.06%  (p=0.008 n=5+5)
Compiler          3.91M ± 0%        3.91M ± 0%  -0.06%  (p=0.008 n=5+5)
SSA               7.91M ± 0%        7.92M ± 0%  +0.01%  (p=0.008 n=5+5)
Flate              228k ± 0%         227k ± 0%  -0.04%  (p=0.008 n=5+5)
GoParser           283k ± 0%         283k ± 0%  -0.06%  (p=0.008 n=5+5)
Reflect            952k ± 0%         951k ± 0%  -0.02%  (p=0.008 n=5+5)
Tar                187k ± 0%         187k ± 0%  -0.04%  (p=0.008 n=5+5)
XML                406k ± 0%         406k ± 0%  -0.05%  (p=0.008 n=5+5)
[Geo mean]         648k              648k       -0.04%

Change-Id: I8630c4291a0eb2f7e7927bc04d7cc0efef181094
Reviewed-on: https://go-review.googlesource.com/43491
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@navytux navytux commented May 19, 2017

For reference: Fannkuch11 can be sped up 33% via improved BCE:

#20393 (comment)

@josharian josharian commented May 19, 2017

It can be sped up 33% with -B. That might or might not be the same thing.
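(For anyone reproducing this measurement: `-B` is the gc compiler flag that disables bounds-check insertion, passed through `-gcflags`. The invocation below is a sketch; it assumes a Go source checkout, where the Fannkuch11 benchmark lives under `test/bench/go1`.)

```shell
# Compare a default build against one with bounds checks disabled (-B)
# to estimate how much of the runtime the checks account for.
go test -run=NONE -bench=Fannkuch11 ./test/bench/go1/
go test -run=NONE -bench=Fannkuch11 -gcflags=-B ./test/bench/go1/
```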

@navytux navytux commented May 20, 2017

Correct. However, all runtime bounds checks in fannkuch can be statically proven unnecessary:

#20393 (comment)

That means the compiler could, in principle, match -B performance through bounds-check elimination alone.

@dvyukov dvyukov commented May 22, 2017

@josharian

> Dmitry's performance dashboard was noisy and spammy and ultimately disabled.

It wasn't spammy, and it wasn't particularly noisy. #8717 led to episodic noise, but otherwise it reliably detected changes in multiple metrics within fractions of a percent. The dashboard wasn't disabled: it was broken by changes to the app and then never fixed.

@dvyukov dvyukov commented May 22, 2017

@navytux @josharian I thought light had already been shed on this in #18977 (comment) and #18977 (comment).

The analysis looks great! Thanks!

@navytux navytux commented May 22, 2017

@dvyukov thanks for the feedback.

@randall77 randall77 commented Jun 6, 2017

We're punting more loop rotation work to 1.10.

@randall77 randall77 commented Dec 12, 2018

Punting to 1.13, too late for anything major in 1.12.
