Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: suboptimal assembly for a + b + 1 #31900

Open
josharian opened this issue May 8, 2019 · 8 comments

Comments

Projects
None yet
5 participants
@josharian
Copy link
Contributor

commented May 8, 2019

package p

func f(a, b uint64) uint64 {
	return a + b + 1
}

On amd64, this assembles to:

"".f STEXT nosplit size=24 args=0x18 locals=0x0
	0x0000 00000 (x.go:4)	MOVQ	"".a+8(SP), AX
	0x0005 00005 (x.go:4)	MOVQ	"".b+16(SP), CX
	0x000a 00010 (x.go:4)	LEAQ	(CX)(AX*1), AX
	0x000e 00014 (x.go:4)	LEAQ	1(AX), AX
	0x0012 00018 (x.go:4)	MOVQ	AX, "".~r2+24(SP)
	0x0017 00023 (x.go:4)	RET

I believe the two LEAQs should instead be ADDQ CX, AX, INC AX. This would be six bytes of instructions instead of 8.

I need to double-check, but I believe that this happens because we optimize a + b + const to LEAQ const(a)(b*1), dst, but then break apart the three-part LEAQ.

We can't fix this when lowering to instructions, because LEAQ doesn't clobber flags and ADD and INC do. So we need to somehow catch this in the rewrite rules. I wonder whether we should reconsider the previous strategy of breaking apart "slow" LEAQs at the last minute (for #21735).

cc @martisch @randall77

@josharian josharian added this to the Unplanned milestone May 8, 2019

@ulikunitz

This comment has been minimized.

Copy link
Contributor

commented May 11, 2019

Splitting the 3-operand LEA instruction into two 2-operand LEA instructions is only faster in tight loops that fit into the µOp cache. If there is no tight loop the cost of decoding two instructions instead of one is higher than the one cycle the 2 two-operand LEAs save during execution.

For the example above the 3-operand LEA will be faster than two LEAs or ADDs. The situation changes if the function is inlined into a loop that fits the µOp cache.

Intel formulated following rule in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the
scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth
out of the trace cache are the critical factor, then use the LEA instruction.

@martisch

This comment has been minimized.

Copy link
Member

commented May 11, 2019

Im fine with reverting the lea splitting to keep it simple and to reduce binary size.

However, Last time i benchmarked decoding two 2 operand leas was better than the latency of a 3 operand lea. Maybe thats changed and they require more than one or two cycles for decoding combined now. Otherwise i do not understand why they would be slower in the example above. The intel rule also only seems to suggest to me that the 3 operand lea is better for trace cache utilization and not that two leas are only better if served from the trace cache. I need to have a look again.

@ulikunitz can you post your cpu type and the benchmark and measurements that show 3 operand lea being faster than two 2 operand leas for comparison in the example above? Thanks

Generally i think we should prefer add 1 over inc. the later only has the upside of size while the former seems never slower (some archs take an extra cycle for flag update of inc), fuses better and doesnt partially update flags.

@ulikunitz

This comment has been minimized.

Copy link
Contributor

commented May 11, 2019

I have microbenchmarked the three variants of the functions on multiple platforms.

Variant 1: ADDQ + INCQ
Variant 2: LEAQ + LEAQ
Variant 3: LEAQ (3 operands)

On Intel variant 3 is always the fastest. The 3-operand LEAQ was only slower on an old Athlon X2 5600.

I used following commands to produce the output.

$ sed -n '/family/,+3p;/processor.*1/q' /proc/cpuinfo
$ go test -bench . -count 10 | tee bench.txt
$ benchstat bench.txt

Skylake (Xeon 2GHz)

cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
name     time/op
Add/1-4  2.38ns ± 1%
Add/2-4  2.38ns ± 0%
Add/3-4  2.35ns ± 1%

Haswell

cpu family	: 6
model		: 69
model name	: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
stepping	: 1
name     time/op
Add/1-4  2.46ns ± 6%
Add/2-4  2.41ns ± 5%
Add/3-4  2.39ns ± 5%

Nehalem - before Sandy Bridge, where LEAQ implementation changed

cpu family	: 6
model		: 26
model name	: Intel(R) Core(TM) i7 CPU         950  @ 3.07GHz
stepping	: 5
name     time/op
Add/1-8  2.42ns ± 1%
Add/2-8  2.45ns ± 2%
Add/3-8  2.42ns ± 0%

Athlon X2

cpu family	: 15
model		: 67
model name	: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
stepping	: 3
name     time/op
Add/1-2  3.94ns ± 0%
Add/2-2  3.98ns ± 2%
Add/3-2  4.38ns ± 3%

The assembler code for reference:

#include "textflag.h"

// func add1_1(x uint64, y uint64) uint64
TEXT ·add1_1(SB), NOSPLIT, $0-24
	MOVQ x+0(FP), AX
	MOVQ y+8(FP), CX
	ADDQ AX, CX
	INCQ CX
	MOVQ CX, ret+16(FP)
	RET

// func add1_2(x uint64, y uint64) uint64
TEXT ·add1_2(SB), NOSPLIT, $0-24
	MOVQ x+0(FP), AX
	MOVQ y+8(FP), CX
	LEAQ (AX)(CX*1), CX
	LEAQ 1(CX), CX
	MOVQ CX, ret+16(FP)
	RET

// func add1_3(x uint64, y uint64) uint64
TEXT ·add1_3(SB), NOSPLIT, $0-24
	MOVQ x+0(FP), AX
	MOVQ y+8(FP), CX
	LEAQ 1(AX)(CX*1), CX
	MOVQ CX, ret+16(FP)
	RET
@martisch

This comment has been minimized.

Copy link
Member

commented May 12, 2019

Thank you very much for the data. A difference of 0.03 ns doesnt usually mean it is generally slower but looks to be within benchmarking variance (going by the -+6%) which we will not be able to completely narrow down even on higher counts. where 0.2 or 0.3ns likely indeed mean its likely slower by a clock cycle. If these are really parallel benchmarks I usually run with -cpu=1 for these to reduce the load and interference and disable frequency scaling and turbo boost too.

I will run the 2 lea vs 3 op lea variant on my benchmarking computer once near it too.

@martisch

This comment has been minimized.

Copy link
Member

commented May 12, 2019

Currently only have my laptop at hand (i7-3520M 2.9ghz, Ivy Bridge) to make a quick benchmark.

On it disabling slow lea splitting (go tip 83f205f , ssa.go#L608 set to false then go install cmd/compile)

if v.AuxInt != 0 && v.Aux == nil {

makes the benchmarks 1 clock cycle (around 0.3ns) slower. I regard 0.1ns as within variance from runs here.

old = go tip 83f205f
new = go tip 83f205f with slow lea splitting disabled

go test -cpu=1 -count=10 -bench=.*
benchstat ~/lea2.bench ~/lea3.bench 

name              old time/op  new time/op  delta
LEA22_1_noinline  4.24ns ± 0%  4.52ns ± 0%   +6.70%  (p=0.000 n=8+10)
LEA22_4_noinline  4.31ns ± 2%  4.61ns ± 2%   +6.91%  (p=0.000 n=10+10)
LEA22_1_inline    0.58ns ± 2%  0.87ns ± 4%  +50.05%  (p=0.000 n=10+10)
LEA22_4_inline    0.59ns ± 3%  0.97ns ± 3%  +64.47%  (p=0.000 n=10+9)
go test -cpu=1 -count=100 -bench="LEA22_1" > ~/lea3.bench
benchstat ~/lea2.bench ~/lea3.bench 
name              old time/op  new time/op  delta
LEA22_1_noinline  4.33ns ± 4%  4.60ns ± 3%   +6.41%  (p=0.000 n=100+100)
LEA22_1_inline    0.58ns ± 5%  0.86ns ± 4%  +49.52%  (p=0.000 n=98+98)

I used e.g. go test -c go tool objdump -s BenchmarkLEA22_4_inline to check that the expected LEA instructions were emitted.

Benchmarks

var global int

func BenchmarkLEA22_1_noinline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_1_noinline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_4_noinline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_4_noinline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_1_inline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_1_inline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_4_inline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_4_inline(sink, sink)
	}
	global = sink
}

Functions

//go:noinline
func lea22_1_noinline(a, b int) int {
	return 1 + a + b
}

func lea22_1_inline(a, b int) int {
	return 1 + a + b
}

//go:noinline
func lea22_4_noinline(a, b int) int {
	return 1 + (a + 4*b)
}

func lea22_4_inline(a, b int) int {
	return 1 + (a + 4*b)
}
@ulikunitz

This comment has been minimized.

Copy link
Contributor

commented May 12, 2019

@gopherbot

This comment has been minimized.

Copy link

commented May 12, 2019

Change https://golang.org/cl/176622 mentions this issue: cmd/compile: benchmark for slow lea

@martisch

This comment has been minimized.

Copy link
Member

commented May 12, 2019

Uploaded as https://go-review.googlesource.com/c/go/+/176622, also contains the change to generate slow leas in src/cmd/compile/internal/amd64/ssa.go.

For the original issue posted:
If it is better binary wise or for port utilization we could emit two adds. Thats also what gccgo 7.2.0 emits.

To make this work it would be nice if we have a last rule based optimization pass after the normal pass to do these kinds of low level transformations that should not interfere with other rules but need to be done before emitting instructions. This would also make amd64/ssa.go code simpler. Replacing mov with xor and some other optimization could fit this category as well. see #27034 where I had commented in that direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.