cmd/compile: BLAS regressions at tip #19595

btracey opened this Issue Mar 17, 2017 · 18 comments

btracey commented Mar 17, 2017

Comparing go1.8 with go version devel +b9f6b22 Fri Mar 17 13:59:31 2017 +0000 darwin/amd64

For all the benchmarks, the package is github.com/gonum/blas/native. Below, the noasm build tag makes the code pure Go rather than using the assembly implementations of ddot and daxpy (which use SIMD and are thus significantly faster).

The first benchmark is Daxpy. The command is go test -bench Daxpy -tags noasm -count 5

brendan:~/Documents/mygo/src/github.com/gonum$ benchstat daxpy18.txt dgaxpytip.txt 
name                      old time/op  new time/op   delta
DaxpySmallBothUnitary-8   21.8ns ± 2%   23.2ns ± 6%   +6.80%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.6ns ± 1%   29.7ns ± 1%  +16.08%  (p=0.008 n=5+5)
DaxpySmallUniInc-8        24.8ns ± 1%   29.0ns ± 4%  +17.00%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       25.3ns ± 1%   28.6ns ± 1%  +13.04%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   872ns ± 1%   1153ns ± 0%  +32.14%  (p=0.008 n=5+5)
DaxpyMediumIncUni-8       1.07µs ± 1%   1.17µs ± 1%   +9.10%  (p=0.008 n=5+5)
DaxpyMediumUniInc-8        928ns ± 1%   1162ns ± 1%  +25.12%  (p=0.008 n=5+5)
DaxpyMediumBothInc-8      1.15µs ± 1%   1.22µs ± 1%   +6.68%  (p=0.008 n=5+5)
DaxpyLargeBothUnitary-8   88.6µs ± 1%  116.5µs ± 4%  +31.57%  (p=0.008 n=5+5)
DaxpyLargeIncUni-8         181µs ± 1%    185µs ± 2%   +2.35%  (p=0.016 n=5+5)
DaxpyLargeUniInc-8         140µs ± 1%    146µs ± 2%   +4.20%  (p=0.008 n=5+5)
DaxpyLargeBothInc-8        294µs ± 1%    296µs ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeBothUnitary-8    13.2ms ± 1%   14.2ms ± 2%   +8.00%  (p=0.008 n=5+5)
DaxpyHugeIncUni-8         30.5ms ± 3%   30.2ms ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeUniInc-8         29.3ms ± 1%   29.1ms ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeBothInc-8        46.6ms ± 2%   46.3ms ± 1%     ~     (p=0.222 n=5+5)

There are significant slowdowns at the smaller sizes, though they disappear at the larger sizes. I'm not sure these explain the regressions we're seeing in Dgemv, so it may be a separate issue.

The command is go test -bench Dgemv -tags noasm -count 5.

brendan:~/Documents/mygo/src/github.com/gonum$ benchstat dgemv18.txt dgemvtip.txt 
name                      old time/op  new time/op   delta
DgemvSmSmNoTransInc1-8     171ns ± 0%    179ns ± 1%   +4.81%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     206ns ± 1%    228ns ± 0%  +10.66%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       209ns ± 0%    220ns ± 1%   +5.07%  (p=0.016 n=4+5)
DgemvSmSmTransIncN-8       221ns ± 1%    257ns ± 1%  +16.17%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.93µs ± 3%   9.90µs ± 4%     ~     (p=0.841 n=5+5)
DgemvMedMedNoTransIncN-8  10.1µs ± 1%   12.9µs ± 1%  +27.62%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    10.7µs ± 1%   12.8µs ± 1%  +19.38%  (p=0.008 n=5+5)
DgemvMedMedTransIncN-8    11.4µs ± 1%   13.9µs ± 1%  +22.11%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     919µs ± 1%    922µs ± 2%     ~     (p=0.841 n=5+5)
DgemvLgLgNoTransIncN-8     937µs ± 2%   1219µs ± 2%  +30.12%  (p=0.008 n=5+5)
DgemvLgLgTransInc1-8       948µs ± 1%   1236µs ± 1%  +30.29%  (p=0.008 n=5+5)
DgemvLgLgTransIncN-8       988µs ± 1%   1223µs ± 1%  +23.85%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    14.4µs ± 1%   14.5µs ± 1%     ~     (p=0.421 n=5+5)
DgemvLgSmNoTransIncN-8    17.4µs ± 1%   20.2µs ± 1%  +15.57%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      17.6µs ± 2%   18.3µs ± 2%   +4.18%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      18.0µs ± 1%   22.4µs ± 2%  +24.35%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.65µs ± 1%   8.76µs ± 1%   +1.21%  (p=0.032 n=5+5)
DgemvSmLgNoTransIncN-8    8.74µs ± 1%  11.64µs ± 1%  +33.15%  (p=0.008 n=5+5)
DgemvSmLgTransInc1-8      9.36µs ± 1%  12.15µs ± 1%  +29.83%  (p=0.008 n=5+5)
DgemvSmLgTransIncN-8      10.1µs ± 0%   12.5µs ± 1%  +24.03%  (p=0.008 n=5+5)

As we see, here some of the largest regressions are at the larger sizes. These work out to about a 15% regression on Dgemm for almost all sizes, and about a 15% regression for the yet higher-level Cholesky decomposition.

@josharian josharian changed the title from BLAS regressions at tip to cmd/compile: BLAS regressions at tip Mar 17, 2017

@bradfitz bradfitz added this to the Go1.9 milestone Mar 17, 2017

randall77 commented Mar 18, 2017

Looks like there are some extra reg->reg moves in tip compared to 1.8.

I strongly suspect the introduction of MULSDmem ops. Previously we had

    m = some constant
    for ... {
        x = load ...
        x = m * x
    }

which works fine with no reg moves. After introducing MULSDmem ops, we have

    m = some constant
    for ... {
        x = m * load ...
    }

The op for x = m * load ... requires its output register (for x) to be the same as its input register (m). So there are two additional register moves: one to copy m to another register, and one to move it back. The copy back then requires an additional branch instruction.

@TocarIP

source code:

func f(dst []float64, alpha float64, x, y []float64) {
	for i, v := range x {
		dst[i] = alpha*v + y[i]
	}
}

1.8:

	0x0038 00056 (/Users/khr/go/tmp1.go:4)	MOVSD	(DI), X1
	0x003c 00060 (/Users/khr/go/tmp1.go:5)	MULSD	X0, X1
	0x0040 00064 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, BX
	0x0043 00067 (/Users/khr/go/tmp1.go:5)	JCC	$0, 111
	0x0045 00069 (/Users/khr/go/tmp1.go:5)	MOVSD	(SI)(R8*8), X2
	0x004b 00075 (/Users/khr/go/tmp1.go:5)	ADDSD	X2, X1
	0x004f 00079 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, DX
	0x0052 00082 (/Users/khr/go/tmp1.go:5)	JCC	$0, 111
	0x0054 00084 (/Users/khr/go/tmp1.go:5)	MOVSD	X1, (CX)(R8*8)
	0x005a 00090 (/Users/khr/go/tmp1.go:4)	ADDQ	$8, DI
	0x005e 00094 (/Users/khr/go/tmp1.go:4)	INCQ	R8
	0x0061 00097 (/Users/khr/go/tmp1.go:4)	CMPQ	R8, AX
	0x0064 00100 (/Users/khr/go/tmp1.go:4)	JLT	$0, 56

tip:

	0x0038 00056 (/Users/khr/go/tmp1.go:5)	MOVUPS	X0, X1
	0x003b 00059 (/Users/khr/go/tmp1.go:5)	MULSD	(DI), X0
	0x003f 00063 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, SI
	0x0042 00066 (/Users/khr/go/tmp1.go:5)	JCC	$0, 122
	0x0044 00068 (/Users/khr/go/tmp1.go:5)	MOVSD	(CX)(R8*8), X2
	0x004a 00074 (/Users/khr/go/tmp1.go:5)	ADDSD	X2, X0
	0x004e 00078 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, BX
	0x0051 00081 (/Users/khr/go/tmp1.go:5)	JCC	$0, 115
	0x0053 00083 (/Users/khr/go/tmp1.go:5)	MOVSD	X0, (DX)(R8*8)
	0x0059 00089 (/Users/khr/go/tmp1.go:4)	ADDQ	$8, DI
	0x005d 00093 (/Users/khr/go/tmp1.go:4)	INCQ	R8
	0x0060 00096 (/Users/khr/go/tmp1.go:4)	CMPQ	R8, AX
	0x0063 00099 (/Users/khr/go/tmp1.go:4)	JGE	$0, 106
	0x0065 00101 (/Users/khr/go/tmp1.go:5)	MOVUPS	X1, X0
	0x0068 00104 (/Users/khr/go/tmp1.go:4)	JMP	56
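
For anyone who wants to measure the reduced case outside the gonum suite, it can be timed on its own with testing.Benchmark. This is a minimal harness sketch; the slice size here is arbitrary, not one of the gonum benchmark sizes:

```go
package main

import (
	"fmt"
	"testing"
)

// f is the reduced case from this thread: a scaled element-wise add
// whose inner loop the compiler turns into the MULSD sequence above.
func f(dst []float64, alpha float64, x, y []float64) {
	for i, v := range x {
		dst[i] = alpha*v + y[i]
	}
}

func main() {
	const n = 1000 // arbitrary size, chosen only for illustration
	dst := make([]float64, n)
	x := make([]float64, n)
	y := make([]float64, n)
	for i := range x {
		x[i] = float64(i)
		y[i] = 1
	}
	// testing.Benchmark runs the closure with an auto-scaled b.N,
	// so the loop body dominates the measurement.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			f(dst, 2, x, y)
		}
	})
	fmt.Println(res)
}
```

Building this with go1.8 and with tip and comparing the printed ns/op should show the same effect as the disassembly above.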
@gopherbot

CL https://golang.org/cl/38337 mentions this issue.

randall77 commented Mar 18, 2017

@btracey Please try the patch above and see if it fixes everything for you.


@randall77 randall77 self-assigned this Mar 18, 2017

btracey commented Mar 20, 2017

It seems to have partially helped, but not fixed the problem. I'm not sure I applied the patch correctly, though (see #19619).

DaxpySmallBothUnitary-8   21.6ns ± 1%   22.3ns ± 1%   +3.15%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.6ns ± 0%   27.8ns ± 1%   +8.75%  (p=0.016 n=4+5)
DaxpySmallUniInc-8        24.9ns ± 1%   27.4ns ± 2%  +10.13%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       25.4ns ± 0%   27.8ns ± 1%   +9.61%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   880ns ± 1%    884ns ± 1%     ~     (p=0.341 n=5+5)
DaxpyMediumIncUni-8       1.07µs ± 1%   1.18µs ± 1%  +10.12%  (p=0.008 n=5+5)
DaxpyMediumUniInc-8        927ns ± 0%   1168ns ± 1%  +26.07%  (p=0.008 n=5+5)
DaxpyMediumBothInc-8      1.15µs ± 2%   1.24µs ± 4%   +8.47%  (p=0.008 n=5+5)
DaxpyLargeBothUnitary-8   88.2µs ± 1%   89.2µs ± 0%   +1.18%  (p=0.008 n=5+5)
DaxpyLargeIncUni-8         185µs ± 1%    189µs ± 1%   +2.45%  (p=0.008 n=5+5)
DaxpyLargeUniInc-8         142µs ± 1%    146µs ± 1%   +3.11%  (p=0.008 n=5+5)
DaxpyLargeBothInc-8        292µs ± 0%    304µs ± 1%   +4.25%  (p=0.008 n=5+5)
DaxpyHugeBothUnitary-8    13.3ms ± 1%   13.1ms ± 1%   -1.43%  (p=0.016 n=5+5)
DaxpyHugeIncUni-8         30.4ms ± 2%   30.5ms ± 1%     ~     (p=0.690 n=5+5)
DaxpyHugeUniInc-8         29.1ms ± 0%   29.3ms ± 1%     ~     (p=0.286 n=4+5)
DaxpyHugeBothInc-8        47.4ms ± 1%   46.6ms ± 1%   -1.60%  (p=0.032 n=5+5)
DgemvSmSmNoTransInc1-8     171ns ± 1%    203ns ± 0%  +19.09%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     206ns ± 1%    228ns ± 1%  +10.37%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       210ns ± 1%    212ns ± 2%     ~     (p=0.333 n=5+5)
DgemvSmSmTransIncN-8       222ns ± 1%    245ns ± 1%  +10.08%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.68µs ± 1%  10.59µs ± 1%   +9.41%  (p=0.008 n=5+5)
DgemvMedMedNoTransIncN-8  10.1µs ± 1%   13.0µs ± 1%  +28.10%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    10.8µs ± 1%   10.9µs ± 3%     ~     (p=0.151 n=5+5)
DgemvMedMedTransIncN-8    11.3µs ± 0%   13.9µs ± 1%  +22.81%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     924µs ± 2%    934µs ± 1%     ~     (p=0.310 n=5+5)
DgemvLgLgNoTransIncN-8     929µs ± 1%   1212µs ± 1%  +30.49%  (p=0.008 n=5+5)
DgemvLgLgTransInc1-8       952µs ± 2%    957µs ± 2%     ~     (p=0.690 n=5+5)
DgemvLgLgTransIncN-8       974µs ± 0%   1212µs ± 1%  +24.38%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    14.4µs ± 1%   17.5µs ± 1%  +21.47%  (p=0.008 n=5+5)
DgemvLgSmNoTransIncN-8    17.9µs ± 1%   19.9µs ± 1%  +11.48%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      17.5µs ± 1%   17.5µs ± 0%     ~     (p=0.690 n=5+5)
DgemvLgSmTransIncN-8      18.0µs ± 0%   20.5µs ± 1%  +14.09%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.70µs ± 1%   8.85µs ± 2%   +1.75%  (p=0.016 n=5+5)
DgemvSmLgNoTransIncN-8    8.72µs ± 1%  11.65µs ± 2%  +33.57%  (p=0.008 n=5+5)
DgemvSmLgTransInc1-8      9.41µs ± 0%   9.39µs ± 1%     ~     (p=0.421 n=5+5)
DgemvSmLgTransIncN-8      10.2µs ± 2%   12.7µs ± 1%  +24.70%  (p=0.008 n=5+5)
btracey commented Mar 20, 2017

It looks like all of the Daxpy benchmarks got faster with this patch, even though there still seems to be some regression overall. The Dgemv benchmarks have a less clear story.


randall77 commented Mar 20, 2017

I've taken a closer look at DgemvMedMedNoTransIncN. Its inner loop is gonum/internal/asm/f64/axpy.go:AxpyInc.
There's no difference in the inner loop with CL 38337.
The difference between 1.8 and tip is that we're no longer peeling the loop-exit branch to the start of the loop. See CL 36205 and issues #18977 and #15837.
So tip looks like:

loop:
    CMPQ ...
    JGE exit
    ...
    JMP loop
exit:

whereas 1.8 is:

    CMPQ ...
    JGE exit
loop:
    ...
    CMPQ ...
    JLT loop
exit:

I'm pretty surprised this matters at the ~30% slowdown level. There might be something I'm missing.
In any case, we plan to fix this one for 1.9.
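
For context, the pure-Go AxpyInc kernel has roughly this shape. This is a sketch of the loop under discussion, not the actual gonum code, and the plain-int index parameters are my assumption:

```go
package main

import "fmt"

// axpyInc computes y[iy] += alpha * x[ix] over n elements with
// arbitrary strides. The loop-closing test on i is what tip compiles
// as CMPQ/JGE at the top plus a JMP at the bottom, versus 1.8's
// single conditional branch at the bottom.
func axpyInc(alpha float64, x, y []float64, n, incX, incY, ix, iy int) {
	for i := 0; i < n; i++ {
		y[iy] += alpha * x[ix]
		ix += incX
		iy += incY
	}
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := make([]float64, 4)
	axpyInc(2, x, y, 2, 2, 1, 0, 0) // strided: reads x[0] and x[2]
	fmt.Println(y)                  // prints [2 6 0 0]
}
```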


dr2chase commented Mar 20, 2017

We might be able to get the effect of the loop peel just by changing the weights on block ordering: for loops where the header block has a loop-exit edge, weight the entry branch to the header lower than the backedge to the header.


randall77 commented Mar 20, 2017

@dr2chase I feel like that would interact badly with the register allocator. If you put the loop header block last, then the second block of the loop appears first in the schedule. But none of its predecessors have been processed yet, so the register allocator won't know what register state to start with. It could be fixed, I suppose, by computing a regalloc schedule instead of just processing blocks in order.
It would generate:

    JMP entry
loop:
    ...
entry:
    CMPQ ...
    JLT loop
exit:

Which isn't quite the same as 1.8 either. Better or worse? I'm not sure.
It definitely has the appeal of doing less CFG surgery.


dr2chase commented Mar 20, 2017

When I did the explicit loop peel to fix "for i, e := range slice" loops in the presence of backedge rescheduling checks, PPC did get a little faster.


@gopherbot


CL https://golang.org/cl/38431 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 23, 2017

cmd/compile: don't merge load+op if other op arg is still live
We want to merge a load and op into a single instruction

    l = LOAD ptr mem
    y = OP x l

into

    y = OPload x ptr mem

However, all of our OPload instructions require that y uses
the same register as x. If x is needed past this instruction, then
we must copy x somewhere else, losing the whole benefit of merging
the instructions in the first place.

Disable this optimization if x is live past the OP.

Also disable this optimization if the OP is in a deeper loop than the load.

Update #19595

Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a
Reviewed-on: https://go-review.googlesource.com/38337
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
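
The liveness condition that CL checks can be illustrated at the source level with two hypothetical functions: in scaleAll the multiplier alpha is live past every multiply, so folding the element load into the multiply would cost the extra moves described above, while in mulOnce the non-memory operand dies at the multiply and the fold is free.

```go
package main

import "fmt"

// scaleAll: alpha is still needed after each multiply (it is live
// past the OP), so merging the load of v into an OPload would tie
// alpha's register to the result and force a copy per iteration.
func scaleAll(dst, src []float64, alpha float64) {
	for i, v := range src {
		dst[i] = alpha * v
	}
}

// mulOnce: a is dead after the multiply, so y = OPload a ptr mem
// needs no extra register moves.
func mulOnce(a float64, p *float64) float64 {
	return a * *p
}

func main() {
	dst := make([]float64, 3)
	scaleAll(dst, []float64{1, 2, 3}, 2)
	v := 4.0
	fmt.Println(dst, mulOnce(3, &v)) // prints [2 4 6] 12
}
```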

gopherbot pushed a commit that referenced this issue Mar 23, 2017

cmd/compile: initialize loop depths
Regalloc uses loop depths - make sure they are initialized!

Test to make sure we aren't pushing spills into loops.

This fixes a generated-code performance bug introduced with
the better spill placement change:
https://go-review.googlesource.com/c/34822/

Update #19595

Change-Id: Ib9f0da6fb588503518847d7aab51e569fd3fa61e
Reviewed-on: https://go-review.googlesource.com/38434
Reviewed-by: David Chase <drchase@google.com>
@gopherbot

CL https://golang.org/cl/38434 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 24, 2017

cmd/compile: rotate loops so conditional branch is at the end
Old loops look like this:
   loop:
     CMPQ ...
     JGE exit
     ...
     JMP loop
   exit:

New loops look like this:
    JMP entry
  loop:
    ...
  entry:
    CMPQ ...
    JLT loop

This removes one instruction (the unconditional jump) from
the inner loop.
Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.  I'm not sure if it is better or
worse to do that (peeling gets rid of the JMP but means more
code duplication), but this CL is certainly a much simpler
compiler change, so I'll try this way first.

The obj library used to do peeling before
CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only
part of that reordering that we'd like to keep.

Fixes #14758 (append loop)
name    old time/op    new time/op    delta
Foo-12     817ns ± 4%     538ns ± 0%  -34.08%   (p=0.000 n=10+9)
Bar-12     850ns ±11%     570ns ±13%  -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)
name                       old time/op  new time/op  delta
DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)
name    old time/op    new time/op    delta
Foo-12     810ns ± 1%     540ns ± 0%  -33.30%   (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)
name         old time/op    new time/op    delta
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
This one makes no sense.  There's strictly 1 less instruction in the
inner loop (17 instead of 18).  They are exactly the same instructions
except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive.  But the gains for the
specific issues above make this CL still probably worth it.
name                      old time/op    new time/op    delta
BinaryTree17-8              2.32s ± 0%     2.34s ± 0%  +1.14%    (p=0.000 n=9+7)
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
FmtFprintfEmpty-8          44.1ns ± 1%    46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
FmtFprintfString-8         67.8ns ± 0%    74.4ns ± 1%  +9.80%   (p=0.000 n=10+9)
FmtFprintfInt-8            74.9ns ± 0%    78.4ns ± 0%  +4.67%   (p=0.000 n=8+10)
FmtFprintfIntInt-8          117ns ± 1%     123ns ± 1%  +4.69%   (p=0.000 n=9+10)
FmtFprintfPrefixedInt-8     160ns ± 1%     146ns ± 0%  -8.22%   (p=0.000 n=8+10)
FmtFprintfFloat-8           214ns ± 0%     206ns ± 0%  -3.91%    (p=0.000 n=8+8)
FmtManyArgs-8               468ns ± 0%     497ns ± 1%  +6.09%   (p=0.000 n=8+10)
GobDecode-8                6.16ms ± 0%    6.21ms ± 1%  +0.76%   (p=0.000 n=9+10)
GobEncode-8                4.90ms ± 0%    4.92ms ± 1%  +0.37%   (p=0.028 n=9+10)
Gzip-8                      209ms ± 0%     212ms ± 0%  +1.33%  (p=0.000 n=10+10)
Gunzip-8                   36.6ms ± 0%    38.0ms ± 1%  +4.03%    (p=0.000 n=9+9)
HTTPClientServer-8         84.2µs ± 0%    86.0µs ± 1%  +2.14%    (p=0.000 n=9+9)
JSONEncode-8               13.6ms ± 3%    13.8ms ± 1%  +1.55%   (p=0.003 n=9+10)
JSONDecode-8               53.2ms ± 5%    52.9ms ± 0%    ~     (p=0.280 n=10+10)
Mandelbrot200-8            3.78ms ± 0%    3.78ms ± 1%    ~      (p=0.661 n=10+9)
GoParse-8                  2.89ms ± 0%    2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      68.5ns ± 2%    68.9ns ± 1%    ~     (p=0.136 n=10+10)
RegexpMatchEasy0_1K-8       220ns ± 1%     225ns ± 1%  +2.41%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-8      64.7ns ± 0%    64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
RegexpMatchEasy1_1K-8       348ns ± 1%     355ns ± 0%  +1.90%  (p=0.000 n=10+10)
RegexpMatchMedium_32-8      102ns ± 1%     105ns ± 1%  +2.95%  (p=0.000 n=10+10)
RegexpMatchMedium_1K-8     33.1µs ± 3%    32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.71µs ± 1%    1.70µs ± 1%  -0.84%   (p=0.002 n=10+9)
RegexpMatchHard_1K-8       51.1µs ± 0%    50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
Revcomp-8                   411ms ± 1%     402ms ± 0%  -2.22%   (p=0.000 n=10+9)
Template-8                 61.8ms ± 1%    59.7ms ± 0%  -3.44%    (p=0.000 n=9+9)
TimeParse-8                 306ns ± 0%     318ns ± 0%  +3.83%  (p=0.000 n=10+10)
TimeFormat-8                320ns ± 0%     318ns ± 1%  -0.53%   (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
josharian commented May 11, 2017

@btracey would you mind re-measuring with the latest tip, to assess whether this still might need fixes for 1.9? Thanks.


btracey commented May 14, 2017

Here are the current stats. Mostly better:

DaxpySmallBothUnitary-8   19.5ns ± 1%  21.5ns ± 1%  +10.27%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.9ns ± 1%  28.4ns ± 2%   +9.58%  (p=0.008 n=5+5)
DaxpySmallUniInc-8        25.5ns ± 1%  27.7ns ± 2%   +8.62%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       25.1ns ± 1%  27.4ns ± 4%   +8.92%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   865ns ± 1%   861ns ± 1%     ~     (p=0.190 n=5+5)
DaxpyMediumIncUni-8       1.07µs ± 1%  1.08µs ± 2%     ~     (p=0.690 n=5+5)
DaxpyMediumUniInc-8        922ns ± 1%   915ns ± 1%     ~     (p=0.103 n=5+5)
DaxpyMediumBothInc-8      1.14µs ± 1%  1.15µs ± 1%     ~     (p=0.841 n=5+5)
DaxpyLargeBothUnitary-8   87.9µs ± 2%  87.3µs ± 3%     ~     (p=0.151 n=5+5)
DaxpyLargeIncUni-8         173µs ± 2%   174µs ± 2%     ~     (p=1.000 n=5+5)
DaxpyLargeUniInc-8         137µs ± 1%   138µs ± 1%     ~     (p=0.056 n=5+5)
DaxpyLargeBothInc-8        299µs ± 1%   287µs ± 2%   -4.03%  (p=0.008 n=5+5)
DaxpyHugeBothUnitary-8    12.9ms ± 1%  12.9ms ± 0%     ~     (p=0.841 n=5+5)
DaxpyHugeIncUni-8         29.9ms ± 1%  29.9ms ± 3%     ~     (p=0.841 n=5+5)
DaxpyHugeUniInc-8         28.8ms ± 1%  28.7ms ± 0%     ~     (p=0.111 n=5+4)
DaxpyHugeBothInc-8        46.1ms ± 2%  45.8ms ± 1%     ~     (p=0.151 n=5+5)
DgemmSmSmSm-8             1.73µs ± 1%  1.73µs ± 2%     ~     (p=0.690 n=5+5)
DgemmMedMedMed-8           469µs ± 2%   469µs ± 1%     ~     (p=0.841 n=5+5)
DgemmMedLgMed-8           3.20ms ± 2%  3.39ms ±14%     ~     (p=0.310 n=5+5)
DgemmLgLgLg-8              313ms ± 1%   317ms ± 1%   +1.58%  (p=0.008 n=5+5)
DgemmLgSmLg-8             4.85ms ± 2%  4.88ms ± 1%     ~     (p=0.690 n=5+5)
DgemmLgLgSm-8             3.38ms ± 3%  3.39ms ± 2%     ~     (p=0.841 n=5+5)
DgemmHgHgSm-8              299ms ± 2%   297ms ± 4%     ~     (p=0.690 n=5+5)
DgemmMedMedMedTNT-8        486µs ± 1%   483µs ± 4%     ~     (p=0.310 n=5+5)
DgemmMedMedMedNTT-8        448µs ± 1%   453µs ± 4%     ~     (p=0.690 n=5+5)
DgemmMedMedMedTT-8         556µs ± 1%   576µs ± 2%   +3.72%  (p=0.008 n=5+5)
DgemvSmSmNoTransInc1-8     168ns ± 1%   181ns ± 1%   +7.98%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     207ns ± 1%   213ns ± 1%   +2.89%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       188ns ± 1%   204ns ± 1%   +8.61%  (p=0.008 n=5+5)
DgemvSmSmTransIncN-8       222ns ± 1%   244ns ± 0%   +9.81%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.43µs ± 1%  9.59µs ± 1%   +1.75%  (p=0.008 n=5+5)
DgemvMedMedNoTransIncN-8  10.6µs ± 2%  11.0µs ± 3%   +3.23%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    10.0µs ± 2%   9.9µs ± 0%   -1.24%  (p=0.032 n=5+5)
DgemvMedMedTransIncN-8    11.2µs ± 0%  11.5µs ± 1%   +2.45%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     909µs ± 1%   908µs ± 1%     ~     (p=0.690 n=5+5)
DgemvLgLgNoTransIncN-8     929µs ± 2%   929µs ± 1%     ~     (p=0.548 n=5+5)
DgemvLgLgTransInc1-8       920µs ± 1%   920µs ± 1%     ~     (p=1.000 n=5+5)
DgemvLgLgTransIncN-8       966µs ± 1%   977µs ± 1%   +1.19%  (p=0.016 n=5+5)
DgemvLgSmNoTransInc1-8    13.7µs ± 1%  14.7µs ± 1%   +6.91%  (p=0.008 n=5+5)
DgemvLgSmNoTransIncN-8    18.0µs ± 1%  18.0µs ± 1%     ~     (p=0.841 n=5+5)
DgemvLgSmTransInc1-8      15.2µs ± 1%  16.3µs ± 1%   +7.52%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      18.4µs ± 1%  19.7µs ± 1%   +7.27%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.93µs ± 1%  8.71µs ± 3%     ~     (p=0.056 n=5+5)
DgemvSmLgNoTransIncN-8    9.07µs ± 1%  9.03µs ± 1%     ~     (p=0.548 n=5+5)
DgemvSmLgTransInc1-8      9.59µs ± 1%  9.55µs ± 0%     ~     (p=0.341 n=5+5)
DgemvSmLgTransIncN-8      10.4µs ± 1%  10.3µs ± 1%     ~     (p=0.421 n=5+5)
DgemvSmLgTransIncN-8      10.4µs ± 1%  10.3µs ± 1%     ~     (p=0.421 n=5+5)

Contributor

josharian commented May 14, 2017

But a few non-trivial regressions remain. If you're up for it, would you mind patching in CLs 43293 and 43491 and seeing whether they help any? (No particular reason to think they will, other than that they have a significant effect on code layout in general.)

Contributor

btracey commented May 14, 2017

With both applied:

name                      old time/op  new time/op   delta
DaxpySmallBothUnitary-8   19.1ns ± 1%   19.7ns ± 1%   +3.03%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.3ns ± 1%   28.8ns ± 1%  +13.65%  (p=0.008 n=5+5)
DaxpySmallUniInc-8        25.1ns ± 1%   25.7ns ± 1%   +2.47%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       24.7ns ± 1%   26.5ns ± 1%   +7.19%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   846ns ± 1%    858ns ± 3%     ~     (p=0.222 n=5+5)
DaxpyMediumIncUni-8       1.05µs ± 1%   1.22µs ±11%  +16.51%  (p=0.008 n=5+5)
DaxpyMediumUniInc-8        902ns ± 1%   1134ns ± 1%  +25.75%  (p=0.008 n=5+5)
DaxpyMediumBothInc-8      1.13µs ± 1%   1.19µs ± 3%   +5.80%  (p=0.008 n=5+5)
DaxpyLargeBothUnitary-8   85.8µs ± 1%   93.2µs ± 4%   +8.69%  (p=0.016 n=4+5)
DaxpyLargeIncUni-8         164µs ± 1%    179µs ± 2%   +8.89%  (p=0.008 n=5+5)
DaxpyLargeUniInc-8         134µs ± 0%    141µs ± 1%   +4.79%  (p=0.008 n=5+5)
DaxpyLargeBothInc-8        288µs ± 2%    287µs ± 2%     ~     (p=1.000 n=5+5)
DaxpyHugeBothUnitary-8    12.8ms ± 1%   12.8ms ± 1%     ~     (p=0.421 n=5+5)
DaxpyHugeIncUni-8         29.4ms ± 1%   29.3ms ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeUniInc-8         28.5ms ± 1%   28.4ms ± 1%     ~     (p=0.421 n=5+5)
DaxpyHugeBothInc-8        45.5ms ± 2%   45.5ms ± 1%     ~     (p=0.841 n=5+5)
DgemmSmSmSm-8             1.70µs ± 2%   1.66µs ± 1%   -2.55%  (p=0.008 n=5+5)
DgemmMedMedMed-8           467µs ± 3%    471µs ± 3%     ~     (p=0.421 n=5+5)
DgemmMedLgMed-8           3.16ms ± 3%   3.20ms ± 3%     ~     (p=0.151 n=5+5)
DgemmLgLgLg-8              315ms ± 2%    314ms ± 0%     ~     (p=0.730 n=5+4)
DgemmLgSmLg-8             4.87ms ± 5%   4.81ms ± 1%     ~     (p=0.690 n=5+5)
DgemmLgLgSm-8             3.37ms ± 3%   3.33ms ± 1%     ~     (p=0.222 n=5+5)
DgemmHgHgSm-8              298ms ± 3%    299ms ± 5%     ~     (p=1.000 n=5+5)
DgemmMedMedMedTNT-8        485µs ± 1%    492µs ± 2%     ~     (p=0.095 n=5+5)
DgemmMedMedMedNTT-8        452µs ± 1%    448µs ± 2%     ~     (p=0.310 n=5+5)
DgemmMedMedMedTT-8         561µs ± 3%    603µs ± 2%   +7.56%  (p=0.008 n=5+5)
DgemvSmSmNoTransInc1-8     166ns ± 1%    168ns ± 1%   +1.69%  (p=0.032 n=5+5)
DgemvSmSmNoTransIncN-8     205ns ± 1%    202ns ± 1%   -1.27%  (p=0.032 n=5+5)
DgemvSmSmTransInc1-8       186ns ± 1%    182ns ± 1%   -2.15%  (p=0.008 n=5+5)
DgemvSmSmTransIncN-8       219ns ± 1%    220ns ± 0%     ~     (p=0.175 n=5+4)
DgemvMedMedNoTransInc1-8  9.35µs ± 2%   9.36µs ± 0%     ~     (p=0.690 n=5+5)
DgemvMedMedNoTransIncN-8  10.4µs ± 0%   10.0µs ± 1%   -4.68%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    9.82µs ± 0%   9.77µs ± 1%   -0.52%  (p=0.032 n=5+5)
DgemvMedMedTransIncN-8    11.0µs ± 1%   12.4µs ± 1%  +12.07%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     893µs ± 1%    888µs ± 1%     ~     (p=0.421 n=5+5)
DgemvLgLgNoTransIncN-8     908µs ± 1%    904µs ± 0%     ~     (p=0.151 n=5+5)
DgemvLgLgTransInc1-8       905µs ± 1%    898µs ± 1%     ~     (p=0.056 n=5+5)
DgemvLgLgTransIncN-8       951µs ± 0%   1158µs ± 1%  +21.80%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    13.5µs ± 2%   13.4µs ± 1%     ~     (p=0.690 n=5+5)
DgemvLgSmNoTransIncN-8    17.8µs ± 2%   17.1µs ± 1%   -3.83%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      15.0µs ± 0%   14.6µs ± 1%   -2.41%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      18.0µs ± 1%   18.0µs ± 0%     ~     (p=0.738 n=5+5)
DgemvSmLgNoTransInc1-8    8.49µs ± 1%   8.44µs ± 1%     ~     (p=0.381 n=5+5)
DgemvSmLgNoTransIncN-8    8.56µs ± 1%   8.56µs ± 3%     ~     (p=0.690 n=5+5)
DgemvSmLgTransInc1-8      9.02µs ± 0%   9.11µs ± 2%   +0.91%  (p=0.016 n=5+5)
DgemvSmLgTransIncN-8      9.84µs ± 1%  12.11µs ± 2%  +23.10%  (p=0.008 n=5+5)

Contributor

josharian commented May 15, 2017

Thanks. Looks like those CLs are a mixed bag, maybe a bit worse on balance. To confirm, are these latest numbers from 1.8 to tip plus the two CLs? I ask because the before numbers for the last benchmark don't line up: 10.4µs two comments back vs 9.84µs in the last comment. If you have all the benchmark files around, could you post the benchstat going from tip to tip plus the two CLs, maybe running with -geomean? Ta.

Contributor

btracey commented May 15, 2017

It was 1.8.1 to the most recent. I'm running the benchmarks on my laptop (which I've been using all day), so the benchmarks could shift a bit. This is why I re-ran the 1.8.1 results (and why they don't exactly match up).

-geomean doesn't do what I expected. It prints the exact same output as before, except with a [Geo mean] 41.0µs 42.2µs +2.94% at the bottom (I ran benchstat -geomean blasbench181.txt blasbenchtip.txt, using the latest from golang.org/x/perf/cmd/benchstat). It seems like it's taking the geometric mean of all the runs together rather than of each benchmark individually.


Contributor

randall77 commented Jun 6, 2017

Punting to 1.10.

@randall77 randall77 modified the milestones: Go1.10, Go1.9 Jun 6, 2017

@mdempsky mdempsky modified the milestones: Go1.10, Go1.11 Nov 30, 2017

@bradfitz bradfitz modified the milestones: Go1.11, Unreleased Jun 13, 2018
