-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: avoid slow versions of LEA instructions on x86 #21735
Comments
If we are going to do this, we can split lea into lea+lea, to avoid clobbering flags.
|
Looks like I found a specific example of this.
|
Thanks for noticing the regression and finding a specific test case Ilya. I will work on a slow LEA fix within the next week unless somebody else wants to have a go at it earlier. |
can it be fixed in the assembler? I found that there are inefficient LEA used in assembly code. |
Fixing it in asm will need a manual fix (or script detecting it + cl to fix) as i dont think we should rewrite handwritten code when translating to binary. Maybe its needed to achieve some alignment of instructions. Also for asm depending on the context a LEA+ADD or MUL could be better or equally valid instead of LEA+LEA so i would say those should be fixed on a case by case basis with benchmarks if possible. Maybe some vet/lint check for asm might be nice. I was working on a fix in the amd64 opcode code that emits the assembler instructions. Similar how MOV of 0 is rewritten as XOR. |
I think we should leave assembly as is. Magic in the assembler is always confusing. So is the fix just to disable the rules that can generate 3-part LEA instructions? Basically, enforce that LEAQ[1248] can't have either an Aux or AuxInt? This is a (pretty weak) argument for introducing GOAMD64 architecture variants. If we could figure out reliably which chips have this problem and which don't. |
While looking into this I found and started on 3 different ways to reduce slow LEAs:
Still working on testing these as well as gathering statistics about occurrences and rerunning benchmarks so we have slow downs that we know covered. AFAIK Slow LEAs with 3 operands exist on modern Intel and AMD chips independent of 386 or x64 mode and word width. But 16bit LEA are slower even with 2 operands. |
We have a bunch of rules for merging lea into mov, so I feel like option 3 will have lowest amount of unintended consequences. It also makes it easier to switch between versions, if we want to e. g. optimize some function for size, instead of speed or update it for future CPUs. |
Change https://golang.org/cl/114655 mentions this issue: |
@TocarIP
|
Minimal example for that multiplication is: package foo
func mul(x uint32) uint32 {
return x * 0x101
} 0x0000 (../foo.go:4) MOVL "".x+8(SP), AX
0x0004 (../foo.go:4) MOVL AX, CX
0x0006 (../foo.go:4) SHLL $8, AX
0x0009 (../foo.go:4) LEAL (AX)(CX*1), AX
0x000c (../foo.go:4) MOVL AX, "".~r1+16(SP)
0x0010 (../foo.go:4) RET It used to emit 0x0000 (../foo.go:4) MOVL "".x+8(SP), AX
0x0004 (../foo.go:4) IMUL3L $257, AX, AX
0x000a (../foo.go:4) MOVL AX, "".~r1+16(SP)
0x000e (../foo.go:4) RET And prior to that, it was 0x0000 (../foo.go:3) MOVL "".x+8(SP), AX
0x0004 (../foo.go:4) IMULL $257, AX
0x000a (../foo.go:4) MOVL AX, "".~r1+16(SP)
0x000e (../foo.go:4) RET |
@quasilyte Thanks but i do not think this is a slow LEA version affecting the benchmark. (However the particular rewrite to MOV+SHL+(fast)LEA still looks like to be slower) |
@martisch
So it looks like |
Are there objections to waiting till 1.12? The performance improvements are overall very small (0.16% is what I measure pretty consistently, including across a much larger selection of benchmarks), and it's getting very late in the 1.11 cycle and we seem to be behind schedule. To summarize the performance changes, across the low-noise 256 benchmarks (out of 488 total, low noise is less than 2% max standard deviation reported across 30 trials) 15 showed >= 1% improvement, 16 showed >= 1% slowdown. For comparison, the experiment to pretend we are only targeting more-recent versions of intel hardware (CL 117925) had 83 improved, 6 slowed down, and the geomean improvement was 1.28% |
Abusing "early-in-cycle" so we don't forget it, this one is low risk. |
go tool objdump ../bin/go | grep "\.go\:" | grep -c "LEA.*0x.*[(].*[(].*" Before: 1012 After: 20 Updates #21735 Benchmarks thanks to drchase@google.com Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz benchstat -geomean *.stdout | grep -v pkg: name old time/op new time/op delta FastTest2KB-12 131.0ns ± 0% 131.0ns ± 0% ~ (all equal) BaseTest2KB-12 601.3ns ± 0% 601.3ns ± 0% ~ (p=0.942 n=45+41) Encoding4KBVerySparse-12 15.38µs ± 1% 15.41µs ± 0% +0.24% (p=0.000 n=44+43) Join_8-12 2.117s ± 2% 2.128s ± 2% +0.51% (p=0.007 n=48+48) HashimotoLight-12 1.663ms ± 1% 1.668ms ± 1% +0.35% (p=0.006 n=49+49) Sha3_224_MTU-12 4.843µs ± 0% 4.836µs ± 0% -0.14% (p=0.000 n=45+42) GenSharedKeyP256-12 73.74µs ± 0% 70.92µs ± 0% -3.82% (p=0.000 n=48+47) Run/10k/1-12 24.81s ± 0% 24.88s ± 0% +0.30% (p=0.000 n=48+47) Run/10k/16-12 4.621s ± 2% 4.625s ± 3% ~ (p=0.776 n=50+50) Dnrm2MediumPosInc-12 4.018µs ± 0% 4.019µs ± 0% ~ (p=0.060 n=50+48) DasumMediumUnitaryInc-12 855.3ns ± 0% 855.0ns ± 0% -0.03% (p=0.000 n=47+42) Dgeev/Circulant10-12 40.45µs ± 1% 41.11µs ± 1% +1.62% (p=0.000 n=45+49) Dgeev/Circulant100-12 10.42ms ± 2% 10.61ms ± 1% +1.83% (p=0.000 n=49+48) MulWorkspaceDense1000Hundredth-12 64.69ms ± 1% 64.63ms ± 1% ~ (p=0.718 n=48+48) ScaleVec10000Inc20-12 22.31µs ± 1% 22.29µs ± 1% ~ (p=0.424 n=50+50) ValidateVersionTildeFail-12 737.6ns ± 1% 736.0ns ± 1% -0.22% (p=0.000 n=49+49) StripHTML-12 2.846µs ± 0% 2.806µs ± 1% -1.40% (p=0.000 n=43+50) ReaderContains-12 6.073µs ± 0% 5.999µs ± 0% -1.22% (p=0.000 n=48+48) EncodeCodecFromInternalProtobuf-12 5.817µs ± 2% 5.555µs ± 2% -4.51% (p=0.000 n=47+47) TarjanSCCGnp_10_tenth-12 7.091µs ± 5% 7.132µs ± 7% ~ (p=0.361 n=50+50) TarjanSCCGnp_1000_half-12 82.25ms ± 3% 81.29ms ± 2% -1.16% (p=0.000 n=50+43) AStarUndirectedmallWorld_10_2_2_2_Heur-12 15.18µs ± 8% 15.11µs ± 7% ~ (p=0.511 n=50+49) LouvainDirectedMultiplex-12 20.92ms ± 1% 21.00ms ± 1% +0.36% (p=0.000 n=48+49) WalkAllBreadthFirstGnp_10_tenth-12 2.974µs ± 4% 2.964µs ± 5% ~ (p=0.504 n=50+50) WalkAllBreadthFirstGnp_1000_tenth-12 9.733ms ± 4% 9.741ms ± 4% ~ (p=0.774 n=48+50) TextMovementBetweenSegments-12 432.8µs ± 0% 433.2µs ± 1% ~ (p=0.128 n=50+50) Growth_MultiSegment-12 13.11ms ± 0% 13.19ms ± 1% +0.58% (p=0.000 n=44+46) AddingFields/Zap.Sugar-12 1.296µs ± 1% 1.310µs ± 2% +1.09% (p=0.000 n=43+43) AddingFields/apex/log-12 34.19µs ± 1% 34.31µs ± 1% +0.35% (p=0.000 n=45+45) AddingFields/inconshreveable/log15-12 30.08µs ± 2% 30.07µs ± 2% ~ (p=0.803 n=48+47) AddingFields/sirupsen/logrus-12 6.683µs ± 3% 6.735µs ± 3% +0.78% (p=0.000 n=43+42) [Geo mean] 143.5µs 143.3µs -0.16% Change-Id: I637203c75c837737f1febced75d5985703e51044 Reviewed-on: https://go-review.googlesource.com/114655 Run-TryBot: Martin Möhrmann <moehrmann@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
See also #36468 (comment), for an approach that can reduce the number of slow LEAs. |
Wondering what's left to do here since #21735 (comment) has been merged. Fix the 16bit LEA? |
Even for LEAQ and LEAL my CL doesnt manage to split all 3 operand LEA. There are some that had aux set and arent transformed. |
Change https://go.dev/cl/440035 mentions this issue: |
On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split them to 2 LEA instructions in genssa. This CL make late lower pass run after addressing modes, and split 3 operand LEA in late lower pass so that we can do common-subexpression elimination for splited LEAs. Updates #21735 Change-Id: Ied49139c7abab655e1a14a6fd793bdf9f987d1f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/440035 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Joedian Reid <joedian@golang.org>
On newer amd64 cpus 3 operand LEA instructions are slow, CL 114655 split them to 2 LEA instructions in genssa. This CL make late lower pass run after addressing modes, and split 3 operand LEA in late lower pass so that we can do common-subexpression elimination for splited LEAs. Updates golang#21735 Change-Id: Ied49139c7abab655e1a14a6fd793bdf9f987d1f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/440035 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Joedian Reid <joedian@golang.org>
Change https://go.dev/cl/482820 mentions this issue: |
CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would go bad if the wrong-typed register was spilled. CL 482536 attempted to fix this by correcting the type in the rewrite, but that "fix" broke something on windows-amd64-race. Instead / for-now, remove the offending rewrite rules. Updates #21735. Fixes #59432. Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9 Reviewed-on: https://go-review.googlesource.com/c/go/+/482820 Run-TryBot: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/482164 mentions this issue: |
CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would go bad if the wrong-typed register was spilled. CL 482536 attempted to fix this by correcting the type in the rewrite, but that "fix" broke something on windows-amd64-race. Instead / for-now, remove the offending rewrite rules. Updates #21735. Updates #59432. Fixes #59468. Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9 Reviewed-on: https://go-review.googlesource.com/c/go/+/482820 Run-TryBot: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> (cherry picked from commit 6a97a60) Reviewed-on: https://go-review.googlesource.com/c/go/+/482164 Auto-Submit: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>
CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would go bad if the wrong-typed register was spilled. CL 482536 attempted to fix this by correcting the type in the rewrite, but that "fix" broke something on windows-amd64-race. Instead / for-now, remove the offending rewrite rules. Updates golang#21735. Updates golang#59432. Fixes golang#59468. Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9 Reviewed-on: https://go-review.googlesource.com/c/go/+/482820 Run-TryBot: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> (cherry picked from commit 6a97a60) Reviewed-on: https://go-review.googlesource.com/c/go/+/482164 Auto-Submit: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>
CL 440035 added rewrite rules to simplify "costly" LEA instructions, but the types in the rewrites were wrong and the code would go bad if the wrong-typed register was spilled. CL 482536 attempted to fix this by correcting the type in the rewrite, but that "fix" broke something on windows-amd64-race. Instead / for-now, remove the offending rewrite rules. Updates golang#21735. Updates golang#59432. Fixes golang#59468. Change-Id: I0497c42db414f2055e1378e0a53e2bceee9cd5d9 Reviewed-on: https://go-review.googlesource.com/c/go/+/482820 Run-TryBot: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> (cherry picked from commit 6a97a60) Reviewed-on: https://go-review.googlesource.com/c/go/+/482164 Auto-Submit: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>
On newer x86 cpus (amd and intel) 3 operand LEA instructions with base, index and offset have a higher latency and less throughput than 2 operand LEA instructions.
The compiler when emitting the instructions could rewrite slow leas into e.g. LEA + ADD instructions where possible (flag clobbering ok) similar how MOV $0 R is rewritten to XOR R R.
relevant llvm optimization ticket: https://reviews.llvm.org/D32277
/cc @TocarIP @randall77 @josharian
The text was updated successfully, but these errors were encountered: