
cmd/compile: SSA performance regression due to array zeroing #15925

Open
hydroflame opened this Issue Jun 1, 2016 · 14 comments

@hydroflame

hydroflame commented Jun 1, 2016

go version devel +bbd1dcd Wed Jun 1 21:32:46 2016 +0000 darwin/amd64

https://github.com/hydroflame/go1.7bench
This very simple program takes two 4-element vectors and performs component-wise addition.
If you run the benchmark with SSA and without, you will see that it is much faster without SSA.

$ go test -run NONE -bench . -gcflags -ssa=0 > nossa.txt
testing: warning: no tests to run
$ go test -run NONE -bench . -gcflags -ssa=1 > ssa.txt
testing: warning: no tests to run
$ benchcmp nossa.txt ssa.txt 
benchmark          old ns/op     new ns/op     delta
BenchmarkAdd-4     2.45          11.4          +365.31%

I expect that the binary generated with SSA would be at least the same speed as the old one.

This particular example is extracted from one of my bigger libraries: github.com/luxengine/glm
For more examples of poor SSA performance, run the benchmarks in

  • github.com/luxengine/glm
  • github.com/luxengine/glm/geo
@cespare


Contributor

cespare commented Jun 1, 2016

@randall77 randall77 added this to the Go1.8Early milestone Jun 1, 2016

@randall77


Contributor

randall77 commented Jun 1, 2016

The SSA backend doesn't handle arrays terribly well, as they are not SSAable. This leads to some extra copies that the old backend was able to avoid.

add is inlined in this test. The inner loop of BenchmarkAdd is (with SSA):

    0x0047 00071 (vec_test.go:12)   MOVQ    $0, "".~r1(SP)
    0x004f 00079 (vec_test.go:12)   MOVQ    $0, "".~r1+8(SP)
    0x0058 00088 (vec_test.go:12)   MOVQ    $0, "".autotmp_5+48(SP)
    0x0061 00097 (vec_test.go:12)   MOVQ    $0, "".autotmp_5+56(SP)
    0x006a 00106 (vec_test.go:12)   MOVSS   "".v0+32(SP), X0
    0x0070 00112 (vec_test.go:12)   MOVSS   "".v1+16(SP), X1
    0x0076 00118 (vec_test.go:12)   ADDSS   X1, X0
    0x007a 00122 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+48(SP)
    0x0080 00128 (vec_test.go:12)   MOVSS   "".v0+36(SP), X0
    0x0086 00134 (vec_test.go:12)   MOVSS   "".v1+20(SP), X1
    0x008c 00140 (vec_test.go:12)   ADDSS   X1, X0
    0x0090 00144 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+52(SP)
    0x0096 00150 (vec_test.go:12)   MOVSS   "".v0+40(SP), X0
    0x009c 00156 (vec_test.go:12)   MOVSS   "".v1+24(SP), X1
    0x00a2 00162 (vec_test.go:12)   ADDSS   X1, X0
    0x00a6 00166 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+56(SP)
    0x00ac 00172 (vec_test.go:12)   MOVSS   "".v0+44(SP), X0
    0x00b2 00178 (vec_test.go:12)   MOVSS   "".v1+28(SP), X1
    0x00b8 00184 (vec_test.go:12)   ADDSS   X1, X0
    0x00bc 00188 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+60(SP)
    0x00c2 00194 (vec_test.go:12)   MOVUPS  "".autotmp_5+48(SP), X0
    0x00c7 00199 (vec_test.go:12)   MOVUPS  X0, "".~r1(SP)
    0x00cb 00203 (vec_test.go:12)   MOVUPS  "".~r1(SP), X0
    0x00cf 00207 (vec_test.go:12)   MOVUPS  X0, "".tmp(SB)
    0x00d6 00214 (vec_test.go:11)   INCQ    CX
    0x00d9 00217 (vec_test.go:11)   MOVQ    184(AX), DX
    0x00e0 00224 (vec_test.go:11)   CMPQ    CX, DX
    0x00e3 00227 (vec_test.go:11)   JLT $0, 71

Note the unnecessary zeroings of ~r1 and autotmp_5, and the copy autotmp_5 -> ~r1 -> tmp. For comparison, the same inner loop without SSA:

    0x0090 00144 (vec_test.go:12)   LEAQ    "".v0+16(SP), BX
    0x0095 00149 (vec_test.go:12)   MOVQ    BX, CX
    0x0098 00152 (vec_test.go:12)   LEAQ    "".v1(SP), BX
    0x009c 00156 (vec_test.go:12)   MOVQ    BX, AX
    0x009f 00159 (vec_test.go:12)   MOVQ    $0, BX
    0x00a1 00161 (vec_test.go:12)   XORPS   X0, X0
    0x00a4 00164 (vec_test.go:12)   MOVQ    $0, BX
    0x00a6 00166 (vec_test.go:12)   MOVSS   (CX), X4
    0x00aa 00170 (vec_test.go:12)   MOVSS   (AX), X1
    0x00ae 00174 (vec_test.go:12)   ADDSS   X1, X4
    0x00b2 00178 (vec_test.go:12)   MOVSS   4(CX), X3
    0x00b7 00183 (vec_test.go:12)   MOVSS   4(AX), X1
    0x00bc 00188 (vec_test.go:12)   ADDSS   X1, X3
    0x00c0 00192 (vec_test.go:12)   MOVSS   8(CX), X2
    0x00c5 00197 (vec_test.go:12)   MOVSS   8(AX), X1
    0x00ca 00202 (vec_test.go:12)   ADDSS   X1, X2
    0x00ce 00206 (vec_test.go:12)   MOVSS   12(CX), X0
    0x00d3 00211 (vec_test.go:12)   MOVSS   12(AX), X1
    0x00d8 00216 (vec_test.go:12)   ADDSS   X1, X0
    0x00dc 00220 (vec_test.go:12)   MOVSS   X4, "".tmp(SB)
    0x00e4 00228 (vec_test.go:12)   MOVSS   X3, "".tmp+4(SB)
    0x00ec 00236 (vec_test.go:12)   MOVSS   X2, "".tmp+8(SB)
    0x00f4 00244 (vec_test.go:12)   MOVSS   X0, "".tmp+12(SB)
    0x00fc 00252 (vec_test.go:11)   INCQ    DX
    0x00ff 00255 (vec_test.go:11)   MOVQ    184(SI), BX
    0x0106 00262 (vec_test.go:11)   CMPQ    BX, DX
    0x0109 00265 (vec_test.go:11)   JGT $0, 144

The old compiler writes directly to the destination.

This is on my radar to fix, but it isn't easy. Hopefully for 1.8.
See also #14762, it applies to the uninlined version of add.

@hydroflame


hydroflame commented Jun 1, 2016

If I changed it to vec4{x, y, z, w float32}, would that solve the problem?

@randall77


Contributor

randall77 commented Jun 1, 2016

It might. Another workaround is to not pass vectors around by value: use []float32 or *[4]float32 everywhere. (It is a bit odd that your add takes *vec4 as args but returns a vec4. It might be better anyway to do func (r *vec4) add(v0, v1 *vec4).)

@hydroflame


hydroflame commented Jun 2, 2016

I tested it; it's definitely faster with structs.

@josharian josharian changed the title from cmd/compile: performance much worst with SSA to cmd/compile: SSA performance regression due to array zeroing Jun 3, 2016

@quentinmit quentinmit added the NeedsFix label Oct 6, 2016

@rsc


Contributor

rsc commented Oct 17, 2016

Why are structs and small arrays treated differently? Is blasting apart small arrays (the same way small structs are blasted apart) an easy optimization to add? I believe the old back ends did this for arrays of size <= 4 or something like that.

/cc @dr2chase @randall77 @cherrymui

@cherrymui


Contributor

cherrymui commented Oct 17, 2016

I believe it does not decompose arrays because it doesn't know how to handle non-constant indexing. There is a TODO in gc/ssa.go:

    case TARRAY:
        // We can't do arrays because dynamic indexing is
        // not supported on SSA variables.
        // TODO: maybe allow if length is <=1?  All indexes
        // are constant?  Might be good for the arrays
        // introduced by the compiler for variadic functions.

@rsc rsc modified the milestones: Go1.8, Go1.8Early Oct 20, 2016

@rsc


Contributor

rsc commented Oct 21, 2016

Interesting. So the old back end, because it just worked on stack words, saw that these words were being used, never had their address taken, and completely registerized them. And SSA isn't smart enough to do that because it decides about variable tracking early? The implication here, though, is that all the array refs are constant indexes. Maybe if you know the array is only ever accessed by constant index (and address not taken, but that's needed for structs already) then it can be split apart?

@randall77


Contributor

randall77 commented Oct 21, 2016

Yes, if we knew that all accesses were constant indexes we could registerize it.
Or if it is [1] we can registerize it, because the only index that can get past the bounds check is 0.

@egonelbre


Contributor

egonelbre commented Apr 29, 2017

I was measuring different approaches to implementing Vector.Add. Almost made a new issue, but eventually found this.

One specific case stood out: https://play.golang.org/p/vzC344lntX. There is a 20x performance difference.

type Struct struct{ X, Y, Z float32 }
func (a Struct) Add(b Struct) Struct { return Struct{a.X + b.X, a.Y + b.Y, a.Z + b.Z} }

type Array [3]float32
func (a Array) Add(b Array) Array { return Array{a[0] + b[0], a[1] + b[1], a[2] + b[2]} }

BenchmarkStructAdd-8               2000000000               0.98 ns/op
BenchmarkArrayAdd-8                100000000               20.7 ns/op

Exhaustive benchmark of different approaches: https://play.golang.org/p/6diXIPOWu5

        Struct         Array
PP      2.91 ns/op     3.00 ns/op
PN      2.93 ns/op     2.94 ns/op
PPP     4.89 ns/op     9.41 ns/op
NNN     0.98 ns/op    20.70 ns/op
PPN     2.91 ns/op    17.70 ns/op
PNP     4.89 ns/op     8.94 ns/op
NPP     2.93 ns/op    15.20 ns/op
NNP     2.93 ns/op    14.30 ns/op
NPN     1.00 ns/op    21.00 ns/op
PNN     2.93 ns/op    15.00 ns/op
PPZ     2.92 ns/op     9.18 ns/op
NPZ     2.91 ns/op    14.90 ns/op
PNZ     2.92 ns/op     8.85 ns/op
NNZ     2.93 ns/op    14.30 ns/op
PPY     2.96 ns/op     2.94 ns/op
NPY     2.92 ns/op     8.74 ns/op
PNY     2.92 ns/op     2.95 ns/op
NNY     2.93 ns/op     9.05 ns/op

(go version devel +11eaf42 Sat Apr 29 04:15:49 2017 +0000 windows/amd64)

@tzneal


Member

tzneal commented Apr 29, 2017

I'm working on a change that allows small arrays with constant indexes to be SSA'd. I'm not 100% happy with the approach (storing a flag on the array type, instead of on the node, to indicate that it was variably indexed), so I'm taking another look, but it does work and will solve this:

goos: linux
goarch: amd64
BenchmarkStruct-8   	2000000000	         0.78 ns/op
BenchmarkArray-8    	2000000000	         0.78 ns/op
PASS
ok  	_/tmp/foo	3.277s
@egonelbre


Contributor

egonelbre commented Apr 30, 2017

@tzneal wouldn't it be possible to rewrite variably indexed statements into switches? Just as a proof of concept: https://play.golang.org/p/m0n1E7Q2kg (I'm sure there is a way to avoid the performance hit introduced by the panic, when you have access to codegen)

BenchmarkStruct-8               2000000000               0.98 ns/op
BenchmarkArray-8                50000000                29.7 ns/op
BenchmarkArray2-8               100000000               22.0 ns/op
BenchmarkArray3-8               100000000               19.9 ns/op
BenchmarkArrayStruct-8          100000000               19.5 ns/op
BenchmarkArrayStruct3-8         200000000                6.91 ns/op

@randall77 randall77 modified the milestones: Go1.10, Go1.9 Jun 6, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

@stuartcarnie


stuartcarnie commented Feb 15, 2018

I tested a scenario using var acc [4]float64 vs. var acc0, acc1, acc2, acc3 float64, and with the 4-element array the accumulator values are not kept in registers.

In the following version, the accumulator values are always written back to the stack inside the loop:

func SumFloat64Unroll_array(buf []float64) float64 {
	var (
		acc [4]float64
	)
	for i := 0; i < len(buf); i += 4 {
		bb := (*[4]float64)(unsafe.Pointer(&buf[i]))
		acc[0] += bb[0]
		acc[1] += bb[1]
		acc[2] += bb[2]
		acc[3] += bb[3]
	}
	return acc[0] + acc[1] + acc[2] + acc[3]
}

which compiles to:

  movsd (CX)(DX*8), X0
  addsd "".acc(SP), X0
  movsd X0, "".acc(SP)
  movsd 8(CX)(DX*8), X0
  addsd "".acc+8(SP), X0
  movsd X0, "".acc+8(SP)
  movsd 16(CX)(DX*8), X0
  addsd "".acc+16(SP), X0
  movsd X0, "".acc+16(SP)
  movsd 24(CX)(DX*8), X0
  addsd "".acc+24(SP), X0
  movsd X0, "".acc+24(SP)

vs. this version, where each accumulator remains in a register for the duration of the loop:

func SumFloat64Unroll_vars(buf []float64) float64 {
	var (
		acc0, acc1, acc2, acc3 float64
	)
	for i := 0; i < len(buf); i += 4 {
		bb := (*[4]float64)(unsafe.Pointer(&buf[i]))
		acc0 += bb[0]
		acc1 += bb[1]
		acc2 += bb[2]
		acc3 += bb[3]
	}
	return acc0 + acc1 + acc2 + acc3
}

which compiles to:

  movsd (CX)(DX*8), X4
  addsd X4, X0
  movsd 8(CX)(DX*8), X4
  addsd X4, X1
  movsd 16(CX)(DX*8), X4
  addsd X4, X2
  movsd 24(CX)(DX*8), X4
  addsd X4, X3
@gopherbot


gopherbot commented Apr 29, 2018

Change https://golang.org/cl/106495 mentions this issue: cmd/compile: add some generic composite type optimizations

gopherbot pushed a commit that referenced this issue May 8, 2018

cmd/compile: add some generic composite type optimizations
Propagate values through some wide Zero/Move operations. Among
other things this allows us to optimize some kinds of array
initialization. For example, the following code no longer
requires a temporary be allocated on the stack. Instead it
writes the values directly into the return value.

func f(i uint32) [4]uint32 {
    return [4]uint32{i, i+1, i+2, i+3}
}

The return value is unnecessarily cleared but removing that is
probably a task for dead store analysis (I think it needs to
be able to match multiple Store ops to wide Zero ops).

In order to reliably remove stack variables that are rendered
unnecessary by these new rules I've added a new generic version
of the unread autos elimination pass.

These rules are triggered more than 5000 times when building and
testing the standard library.

Updates #15925 (fixes for arrays of up to 4 elements).
Updates #24386 (fixes for up to 4 kept elements).
Updates #24416.

compilebench results:

name       old time/op       new time/op       delta
Template         353ms ± 5%        359ms ± 3%    ~     (p=0.143 n=10+10)
Unicode          219ms ± 1%        217ms ± 4%    ~     (p=0.740 n=7+10)
GoTypes          1.26s ± 1%        1.26s ± 2%    ~     (p=0.549 n=9+10)
Compiler         6.00s ± 1%        6.08s ± 1%  +1.42%  (p=0.000 n=9+8)
SSA              15.3s ± 2%        15.6s ± 1%  +2.43%  (p=0.000 n=10+10)
Flate            237ms ± 2%        240ms ± 2%  +1.31%  (p=0.015 n=10+10)
GoParser         285ms ± 1%        285ms ± 1%    ~     (p=0.878 n=8+8)
Reflect          797ms ± 3%        807ms ± 2%    ~     (p=0.065 n=9+10)
Tar              334ms ± 0%        335ms ± 4%    ~     (p=0.460 n=8+10)
XML              419ms ± 0%        423ms ± 1%  +0.91%  (p=0.001 n=7+9)
StdCmd           46.0s ± 0%        46.4s ± 0%  +0.85%  (p=0.000 n=9+9)

name       old user-time/op  new user-time/op  delta
Template         337ms ± 3%        346ms ± 5%    ~     (p=0.053 n=9+10)
Unicode          205ms ±10%        205ms ± 8%    ~     (p=1.000 n=10+10)
GoTypes          1.22s ± 2%        1.21s ± 3%    ~     (p=0.436 n=10+10)
Compiler         5.85s ± 1%        5.93s ± 0%  +1.46%  (p=0.000 n=10+8)
SSA              14.9s ± 1%        15.3s ± 1%  +2.62%  (p=0.000 n=10+10)
Flate            229ms ± 4%        228ms ± 6%    ~     (p=0.796 n=10+10)
GoParser         271ms ± 3%        275ms ± 4%    ~     (p=0.165 n=10+10)
Reflect          779ms ± 5%        775ms ± 2%    ~     (p=0.971 n=10+10)
Tar              317ms ± 4%        319ms ± 5%    ~     (p=0.853 n=10+10)
XML              404ms ± 4%        409ms ± 5%    ~     (p=0.436 n=10+10)

name       old alloc/op      new alloc/op      delta
Template        34.9MB ± 0%       35.0MB ± 0%  +0.26%  (p=0.000 n=10+10)
Unicode         29.3MB ± 0%       29.3MB ± 0%  +0.02%  (p=0.000 n=10+10)
GoTypes          115MB ± 0%        115MB ± 0%  +0.30%  (p=0.000 n=10+10)
Compiler         519MB ± 0%        521MB ± 0%  +0.30%  (p=0.000 n=10+10)
SSA             1.55GB ± 0%       1.57GB ± 0%  +1.34%  (p=0.000 n=10+9)
Flate           24.1MB ± 0%       24.2MB ± 0%  +0.10%  (p=0.000 n=10+10)
GoParser        28.1MB ± 0%       28.1MB ± 0%  +0.07%  (p=0.000 n=10+10)
Reflect         78.7MB ± 0%       78.7MB ± 0%  +0.03%  (p=0.000 n=8+10)
Tar             34.4MB ± 0%       34.5MB ± 0%  +0.12%  (p=0.000 n=10+10)
XML             43.2MB ± 0%       43.2MB ± 0%  +0.13%  (p=0.000 n=10+10)

name       old allocs/op     new allocs/op     delta
Template          330k ± 0%         330k ± 0%  -0.01%  (p=0.017 n=10+10)
Unicode           337k ± 0%         337k ± 0%  +0.01%  (p=0.000 n=9+10)
GoTypes          1.15M ± 0%        1.15M ± 0%  +0.03%  (p=0.000 n=10+10)
Compiler         4.77M ± 0%        4.77M ± 0%  +0.03%  (p=0.000 n=9+10)
SSA              12.5M ± 0%        12.6M ± 0%  +1.16%  (p=0.000 n=10+10)
Flate             221k ± 0%         221k ± 0%  +0.05%  (p=0.000 n=9+10)
GoParser          275k ± 0%         275k ± 0%  +0.01%  (p=0.014 n=10+9)
Reflect           944k ± 0%         944k ± 0%  -0.02%  (p=0.000 n=10+10)
Tar               324k ± 0%         323k ± 0%  -0.12%  (p=0.000 n=10+10)
XML               384k ± 0%         384k ± 0%  -0.01%  (p=0.001 n=10+10)

name       old object-bytes  new object-bytes  delta
Template         476kB ± 0%        476kB ± 0%  -0.04%  (p=0.000 n=10+10)
Unicode          218kB ± 0%        218kB ± 0%    ~     (all equal)
GoTypes         1.58MB ± 0%       1.58MB ± 0%  -0.04%  (p=0.000 n=10+10)
Compiler        6.25MB ± 0%       6.24MB ± 0%  -0.09%  (p=0.000 n=10+10)
SSA             15.9MB ± 0%       16.1MB ± 0%  +1.22%  (p=0.000 n=10+10)
Flate            304kB ± 0%        304kB ± 0%  -0.13%  (p=0.000 n=10+10)
GoParser         370kB ± 0%        370kB ± 0%  -0.00%  (p=0.000 n=10+10)
Reflect         1.27MB ± 0%       1.27MB ± 0%  -0.12%  (p=0.000 n=10+10)
Tar              421kB ± 0%        419kB ± 0%  -0.64%  (p=0.000 n=10+10)
XML              518kB ± 0%        517kB ± 0%  -0.12%  (p=0.000 n=10+10)

name       old export-bytes  new export-bytes  delta
Template        16.7kB ± 0%       16.7kB ± 0%    ~     (all equal)
Unicode         6.52kB ± 0%       6.52kB ± 0%    ~     (all equal)
GoTypes         29.2kB ± 0%       29.2kB ± 0%    ~     (all equal)
Compiler        88.0kB ± 0%       88.0kB ± 0%    ~     (all equal)
SSA              109kB ± 0%        109kB ± 0%    ~     (all equal)
Flate           4.49kB ± 0%       4.49kB ± 0%    ~     (all equal)
GoParser        8.10kB ± 0%       8.10kB ± 0%    ~     (all equal)
Reflect         7.71kB ± 0%       7.71kB ± 0%    ~     (all equal)
Tar             9.15kB ± 0%       9.15kB ± 0%    ~     (all equal)
XML             12.3kB ± 0%       12.3kB ± 0%    ~     (all equal)

name       old text-bytes    new text-bytes    delta
HelloSize        676kB ± 0%        672kB ± 0%  -0.59%  (p=0.000 n=10+10)
CmdGoSize       7.26MB ± 0%       7.24MB ± 0%  -0.18%  (p=0.000 n=10+10)

name       old data-bytes    new data-bytes    delta
HelloSize       10.2kB ± 0%       10.2kB ± 0%    ~     (all equal)
CmdGoSize        248kB ± 0%        248kB ± 0%    ~     (all equal)

name       old bss-bytes     new bss-bytes     delta
HelloSize        125kB ± 0%        125kB ± 0%    ~     (all equal)
CmdGoSize        145kB ± 0%        145kB ± 0%    ~     (all equal)

name       old exe-bytes     new exe-bytes     delta
HelloSize       1.46MB ± 0%       1.45MB ± 0%  -0.31%  (p=0.000 n=10+10)
CmdGoSize       14.7MB ± 0%       14.7MB ± 0%  -0.17%  (p=0.000 n=10+10)

Change-Id: Ic72b0c189dd542f391e1c9ab88a76e9148dc4285
Reviewed-on: https://go-review.googlesource.com/106495
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
