cmd/compile: go1.7beta2 performance regression #16141

Open
samuel opened this Issue Jun 21, 2016 · 14 comments


samuel commented Jun 21, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 darwin/amd64
go version go1.7beta2 darwin/amd64

  2. What operating system and processor architecture are you using (go env)?

Intel i7-3540M (also tried on i7-2677M but didn't see the same regression)

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/30/6hyj4x_x783f12hmbcmn5tt00000gn/T/go-build183604977=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

  3. What did you do?

https://play.golang.org/p/3WDEr-_QZR

Disassembly (-gcflags -S): https://gist.github.com/samuel/3053bafe149a0459322f6eeaf8bd5ae5

  4. What did you expect to see?

Go 1.7beta2 to have the same performance as Go 1.6, or better.

  5. What did you see instead?

    benchmark             old ns/op   new ns/op   delta
    BenchmarkVScaleF32-4  2906        3801        +30.80%
    BenchmarkVMaxF32-4    2682        3951        +47.32%

    benchmark             old MB/s    new MB/s    speedup
    BenchmarkVScaleF32-4  1409.19     1077.37     0.76x
    BenchmarkVMaxF32-4    1527.02     1036.64     0.68x

ianlancetaylor added this to the Go1.8 milestone Jun 21, 2016

Contributor

ianlancetaylor commented Jun 21, 2016

CC @randall77 @josharian

Marking as 1.8 because it is late in the release cycle, but I'm not opposed to a fix for 1.7 if it is safe.

Contributor

josharian commented Jun 22, 2016

Thanks for providing the disassembly.

I'm away from my computer for the week, so I can't confirm directly, but it looks like this is a register allocation issue. Looking at max, there's an unnecessary reg-reg mov in the loop in the SSA version. I'm surprised the impact is so big, but I'd guess it is because it creates a data dependency and prevents pipelining.

Also, I don't see why we are doing FP adjustments, since we don't call any functions (we only reference math.b; math.Inf is inlined). That's probably a bug, but it's not responsible for the slowdown. I'll file a separate issue next week.

This should wait for 1.8.

Contributor

randall77 commented Aug 31, 2016

I confess I have no idea what is going on here.
@TocarIP
1.6 disassembly:

    0x0045 00069 (tmp1.go:10)   MOVSS   (CX), X0
    0x0049 00073 (tmp1.go:11)   CMPQ    AX, DI
    0x004c 00076 (tmp1.go:11)   JCC $1, 103
    0x004e 00078 (tmp1.go:11)   LEAQ    (R9)(AX*4), BX
    0x0052 00082 (tmp1.go:11)   MULSS   X2, X0
    0x0056 00086 (tmp1.go:11)   MOVSS   X0, (BX)
    0x005a 00090 (tmp1.go:10)   ADDQ    $4, CX
    0x005e 00094 (tmp1.go:10)   INCQ    AX
    0x0061 00097 (tmp1.go:10)   CMPQ    AX, SI
    0x0064 00100 (tmp1.go:10)   JLT $0, 69

tip disassembly:

    0x0044 00068 (tmp1.go:10)   MOVSS   (BX), X1
    0x0048 00072 (tmp1.go:11)   MULSS   X0, X1
    0x004c 00076 (tmp1.go:11)   CMPQ    SI, DX
    0x004f 00079 (tmp1.go:11)   JCC $0, 99
    0x0051 00081 (tmp1.go:11)   MOVSS   X1, (CX)(SI*4)
    0x0056 00086 (tmp1.go:10)   ADDQ    $4, BX
    0x005a 00090 (tmp1.go:10)   INCQ    SI
    0x005d 00093 (tmp1.go:10)   CMPQ    SI, AX
    0x0060 00096 (tmp1.go:10)   JLT $0, 68

Yet tip is 28% slower, for no reason I can discern.

Max is even worse.
go1.6:

    0x004a 00074 (tmp1.go:17)   MOVSS   (AX), X2
    0x004e 00078 (tmp1.go:18)   UCOMISS X3, X2
    0x0051 00081 (tmp1.go:18)   JHI 106
    0x0053 00083 (tmp1.go:17)   ADDQ    $4, AX
    0x0057 00087 (tmp1.go:17)   INCQ    CX
    0x005a 00090 (tmp1.go:17)   CMPQ    CX, DX
    0x005d 00093 (tmp1.go:17)   JLT $0, 74

    0x006a 00106 (tmp1.go:19)   MOVSS   X2, X3
    0x006e 00110 (tmp1.go:17)   JMP 83

tip:

    0x0036 00054 (tmp1.go:17)   MOVSS   (CX), X1
    0x003a 00058 (tmp1.go:18)   UCOMISS X0, X1
    0x003d 00061 (tmp1.go:18)   JLS 94
    0x003f 00063 (tmp1.go:17)   ADDQ    $4, CX
    0x0043 00067 (tmp1.go:17)   INCQ    DX
    0x0046 00070 (tmp1.go:22)   MOVUPS  X1, X0
    0x0049 00073 (tmp1.go:17)   CMPQ    DX, AX
    0x004c 00076 (tmp1.go:17)   JLT $0, 54

    0x005e 00094 (tmp1.go:22)   MOVUPS  X0, X1
    0x0061 00097 (tmp1.go:17)   JMP 63

Tip has 1 extra reg->reg move, yet it is 2.3x slower! 2.3x, people. You could compute Max twice using 1.6 and still have some time left over. What the heck is going on?
(I tried MOVUPS->MOVSS; it didn't help.)

gopherbot commented Aug 31, 2016

CL https://golang.org/cl/28273 mentions this issue.

Contributor

TocarIP commented Sep 1, 2016

I can't access the playground link (Forbidden), so I couldn't build it myself and play with the code or check perf counters.
At a glance, it looks like the tip version makes UCOMISS always depend on the previous iteration (X0 is modified on every iteration).

rjammala commented Sep 1, 2016

gopherbot pushed a commit that referenced this issue Sep 1, 2016

cmd/compile: missing float indexed loads/stores on amd64
Update #16141

Change-Id: I7d32c5cdc197d86491a67ea579fa16cb3d675b51
Reviewed-on: https://go-review.googlesource.com/28273
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Contributor

TocarIP commented Sep 5, 2016

Looks like VScale is fixed by c538795, so I focused on VMaxF32.

I've tried implementing MAXSS recognition as part of phiopt. It produces smaller code:

    // 46bc06:
    movss (%rcx),%xmm1
    add $0x4,%rcx
    inc %rdx
    maxss %xmm1,%xmm0
    cmp %rax,%rdx
    jl 46bc06 <_/localdisk/itocar/go_tmp/golang/src.VMaxF32+0x36>
    ... // return

But it executes at almost the same speed as tip:
    name       old time/op    new time/op    delta
    VMaxF32-4  3.62µs ± 7%    3.40µs ± 0%    -6.04%  (p=0.000 n=20+20)

    name       old speed      new speed      delta
    VMaxF32-4  1.13GB/s ± 7%  1.20GB/s ± 0%  +6.32%  (p=0.000 n=20+18)

And it has lower IPC: 2.01 instructions per cycle vs 3.16 (according to perf stat).

Looks like in the 1.7 case, as well as in the maxss case, there is a true dependency on the previous value of X1,
which needs to be updated on each iteration.
But in 1.6, because the branch is highly predictable (< 0.1% mispredictions),
several iterations can execute (speculatively) in parallel:
only in the unlikely case (the new random float is greater) do we need to update X3 (the current max).

I've tried reducing the size of the array to 1 << 6.
This makes maxss ~= 1.6 > 1.7, but with only a 60% gap between 1.6 and 1.7.
This supports the hypothesis that 1.6 is faster on iterations where the current max is not updated.

I'm not sure what should be done:
Implementing max recognition will speed up max for small arrays or for just 2 elements,
and this is what GCC does. But this won't help this case much, and I'm not sure that all architectures have a max instruction. Unrolling and vectorization may help hide the latency in the future.
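The unrolling idea can be sketched at the source level (my own illustration, not a compiler change; the four-accumulator split is an arbitrary choice): keeping several independent running maxima means each compare/update chain depends only on its own accumulator, and the loop-carried dependency is paid only once, in the final reduction.

```go
package main

import "math"

// VMaxF32Unrolled keeps four independent running maxima, so the four
// compare/update chains carry no dependency on each other; they are
// merged only once, after the loop.
func VMaxF32Unrolled(input []float32) float32 {
	m0 := float32(math.Inf(-1))
	m1, m2, m3 := m0, m0, m0
	i := 0
	for ; i+4 <= len(input); i += 4 {
		if input[i] > m0 {
			m0 = input[i]
		}
		if input[i+1] > m1 {
			m1 = input[i+1]
		}
		if input[i+2] > m2 {
			m2 = input[i+2]
		}
		if input[i+3] > m3 {
			m3 = input[i+3]
		}
	}
	// Tail: fold any leftover elements into m0.
	for ; i < len(input); i++ {
		if input[i] > m0 {
			m0 = input[i]
		}
	}
	// Single reduction pays the dependency cost once.
	if m1 > m0 {
		m0 = m1
	}
	if m2 > m0 {
		m0 = m2
	}
	if m3 > m0 {
		m0 = m3
	}
	return m0
}
```

Like the original VMaxF32, this returns -Inf for an empty slice; whether four chains is the right width would have to be measured.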

Btw, b.SetBytes(benchSize) should be b.SetBytes(benchSize*4) if the intention is to measure MB/s instead of mega-float32s/s.
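The SetBytes fix is one line; for example, a corrected BenchmarkVScaleF32 (restating the VScaleF32 from the original report) would report bytes rather than element count:

```go
package main

import "testing"

const benchSize = 1 << 12 // number of float32 elements

// VScaleF32 is the function from the original report.
func VScaleF32(input, output []float32, scale float32) {
	n := len(input)
	if len(output) < n {
		n = len(output)
	}
	for i, v := range input[:n] {
		output[i] = v * scale
	}
}

func BenchmarkVScaleF32(b *testing.B) {
	input := make([]float32, benchSize)
	output := make([]float32, len(input))
	b.SetBytes(benchSize * 4) // 4 bytes per float32, so MB/s means megabytes
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		VScaleF32(input, output, 1.0/benchSize)
	}
}
```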

Member

rasky commented Sep 6, 2016

@TocarIP wouldn't it be sufficient to make the compiler understand that the branch is unlikely rather than likely? It would generate a jump-ahead in the unlikely case (like Go 1.6 does), and that would in turn remove the extra MOV that causes the dependency.

Contributor

TocarIP commented Sep 6, 2016

@rasky do you mean something like __builtin_expect in GCC, or pattern detection in the compiler?
It could help against 1.7. But compared to maxss, the behavior depends on the data (e.g. sorted / sorted in reverse) and on the array size (maxss is faster for small arrays), and I'm not convinced that this is the right choice.

Member

rasky commented Sep 6, 2016

I was thinking pattern detection in the compiler. If you can detect a min/max loop and tune the likeliness of the branch, it would generate better code for the likely case, removing the dependency and improving the results. It would not be the BEST possible result, but it would be uniformly better than what Go 1.7 does and on par with Go 1.6. Exploring the use of maxss is then a separate possible optimization.

quentinmit added the NeedsFix label Oct 11, 2016

Contributor

rsc commented Oct 21, 2016

@randall77, @TocarIP, is there anything left to do here?

Contributor

randall77 commented Oct 21, 2016

I don't see any obvious fix. We're getting bitten by partial register dependencies that are the result of unlucky choices by regalloc.
Punting to 1.9.

Contributor

josharian commented May 18, 2017

Reproduces at tip (638ebb0), although there are improvements from 1.7 to 1.8 to tip.

    name \ time/op  17           18           tip
    VScaleF32-8     2.97µs ± 2%  2.55µs ± 1%  2.24µs ± 1%
    VMaxF32-8       3.90µs ± 1%  3.88µs ± 1%  3.85µs ± 1%

Single-file reproduction code:

    package dsp

    import (
    	"math"
    	"math/rand"
    	"testing"
    )

    func VScaleF32(input, output []float32, scale float32) {
    	n := len(input)
    	if len(output) < n {
    		n = len(output)
    	}
    	for i, v := range input[:n] {
    		output[i] = v * scale
    	}
    }

    func VMaxF32(input []float32) float32 {
    	max := float32(math.Inf(-1))
    	for _, v := range input {
    		if v > max {
    			max = v
    		}
    	}
    	return max
    }

    const benchSize = 1 << 12

    func BenchmarkVScaleF32(b *testing.B) {
    	input := make([]float32, benchSize)
    	output := make([]float32, len(input))
    	b.SetBytes(benchSize)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		VScaleF32(input, output, 1.0/benchSize)
    	}
    }

    func BenchmarkVMaxF32(b *testing.B) {
    	input := make([]float32, benchSize)
    	rand.Seed(0)
    	for i := 0; i < len(input); i++ {
    		input[i] = rand.Float32()
    	}
    	b.SetBytes(benchSize)
    	b.ResetTimer()
    	for i := 0; i < b.N; i++ {
    		VMaxF32(input)
    	}
    }

Member

mdempsky commented Nov 30, 2017

At tip, it seems VScaleF32-4 is ~23% faster, but VMaxF32-4 is ~20% slower at 1.10 than at 1.6.

bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
