cmd/compile: tight code optimization opportunities #47120
Comments
Here is another, separate opportunity, for GOAMD64=v3 compilation. The SHRXQ instruction takes an explicit shift register, has separate source and destination operands, and can read source from memory. That allows reducing the loop to
That change runs at 3400 MB/s (!). (The DFA tables were carefully constructed exactly to enable this implementation.)
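For readers without the original listing, here is a minimal Go sketch of the source-level shape this comment is about: a value loaded from memory and shifted right by a variable count. The names `table` and `next` are hypothetical stand-ins, not the issue's code. Whether the compiler actually emits a single SHRXQ with a memory operand depends on the Go version and on building with `GOAMD64=v3`; inspecting the output of `go build -gcflags=-S` is one way to check.

```go
package main

import "fmt"

// table is a hypothetical stand-in for the issue's DFA table.
var table [256]uint64

// next loads table[b] and shifts it right by the low 6 bits of state.
// Masking the count with 63 keeps it in range for a 64-bit shift, which
// matches what the hardware shift instructions do with the count register.
func next(state uint64, b byte) uint64 {
	return table[b] >> (state & 63)
}

func main() {
	table['a'] = 0xABCD
	fmt.Printf("%#x\n", next(4, 'a'))
}
```

The load-and-shift pair in `next` is the candidate for fusing into one instruction; everything else in the sketch is scaffolding.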
@rsc sorry for hijacked, but what means |
I see this hasn't had attention for a while but this is a problem I've noticed in ppc64 code too. Invariant values are not moved out of loops. I thought at one time there was work to do this but it must have been abandoned. Here is one example:
Change https://go.dev/cl/385174 mentions this issue: |
The SHRX/SHLX instruction can take any general register as the shift count operand, and can read source from memory. This CL introduces some operators to combine load and shift to one instruction.

For #47120

Change-Id: I13b48f53c7d30067a72eb2c8382242045dead36a
Reviewed-on: https://go-review.googlesource.com/c/go/+/385174
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
The generated x86 code can be improved in some fairly simple ways - hoisting computed constants out of loop bodies, and avoiding unnecessary register moves - that have a significant performance impact on tight loops. In the following example those improvements produce a 35% speedup.
Here is an alternate, DFA-based implementation of `utf8.Valid` that I have been playing with:

There are no big benchmarks of Valid in the package, but here are some that could be added:
The old Valid implementation runs at around 1450 MB/s.
The implementation above runs at around 1600 MB/s.
Better but not what I had hoped.
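For context, a shift-based DFA of the general kind discussed here can be sketched as follows. This is an illustrative reconstruction, not the listing from the issue: the state numbering, the `step` function, the `cont` rule table and the table layout are all assumptions chosen for clarity. Each state is stored as a 6-bit offset, and each table entry packs the next state for every current state, so one load and one shift advance the automaton.

```go
package main

import "fmt"

// States of a UTF-8 validation DFA. State s is stored as bit offset s*6.
const (
	sAccept   = iota // start state, also "between code points"
	sError           // absorbing error state
	sCont1           // need 1 more continuation byte
	sCont2           // need 2 more continuation bytes
	sCont3           // need 3 more continuation bytes
	sE0              // after E0: next byte must be A0-BF
	sED              // after ED: next byte must be 80-9F
	sF0              // after F0: next byte must be 90-BF
	sF4              // after F4: next byte must be 80-8F
	numStates
)

// rule says: in this state, bytes lo..hi are accepted and lead to next.
type rule struct {
	lo, hi byte
	next   int
}

var cont = map[int]rule{
	sCont1: {0x80, 0xBF, sAccept},
	sCont2: {0x80, 0xBF, sCont1},
	sCont3: {0x80, 0xBF, sCont2},
	sE0:    {0xA0, 0xBF, sCont1},
	sED:    {0x80, 0x9F, sCont1},
	sF0:    {0x90, 0xBF, sCont2},
	sF4:    {0x80, 0x8F, sCont2},
}

// step is the plain transition function the packed table is built from.
func step(s int, b byte) int {
	if s == sAccept {
		switch {
		case b < 0x80:
			return sAccept
		case 0xC2 <= b && b <= 0xDF:
			return sCont1
		case b == 0xE0:
			return sE0
		case b == 0xED:
			return sED
		case 0xE1 <= b && b <= 0xEF:
			return sCont2
		case b == 0xF0:
			return sF0
		case 0xF1 <= b && b <= 0xF3:
			return sCont3
		case b == 0xF4:
			return sF4
		}
		return sError
	}
	if r, ok := cont[s]; ok && r.lo <= b && b <= r.hi {
		return r.next
	}
	return sError
}

// buildTable packs, for each byte, the next-state offset of every state.
func buildTable() (t [256]uint64) {
	for b := 0; b < 256; b++ {
		for s := 0; s < numStates; s++ {
			t[b] |= uint64(step(s, byte(b))*6) << (s * 6)
		}
	}
	return t
}

var dfa = buildTable()

// validDFA does one table load and one shift per input byte.
func validDFA(p []byte) bool {
	state := uint64(sAccept * 6)
	for _, b := range p {
		state = dfa[b] >> (state & 63)
	}
	return state&63 == sAccept*6
}

func main() {
	for _, s := range []string{"hello", "h\xc3\xa9llo", "\xc0\x80", "\xe2\x82"} {
		fmt.Printf("%q valid=%v\n", s, validDFA([]byte(s)))
	}
}
```

Nine states at 6 bits each fit in a 64-bit word, which is what makes the single-shift transition possible. The inner loop of `validDFA` is the kind of tight loop whose generated code the rest of this issue examines.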
My implementation compiles as follows:
Translating this to proper non-regabi assembly I get:
This also runs at about 1600 MB/s.
First optimization: the `LEAQ ·dfa(SB), R8` should be hoisted out of the loop body. (I tried to do this in the Go version with `dfa := &dfa` but it got constant propagated away!)

That change brings it up to 1750 MB/s.
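The source-level hoisting attempt described above might look roughly like the sketch below. The names `table`, `foldGlobal` and `foldLocal` are hypothetical, and the loop is a simplified stand-in for the real one. The two functions compute the same result; the only difference is that the second takes the table's address once, before the loop. According to the report above, the compiler propagates that address back into the loop, so this rewrite is not expected to change the generated code.

```go
package main

import "fmt"

// table is a hypothetical stand-in for the package-level dfa table.
var table [256]uint64

// foldGlobal indexes the package-level table directly inside the loop.
func foldGlobal(p []byte) uint64 {
	state := uint64(0)
	for _, b := range p {
		state = table[b] >> (state & 63)
	}
	return state
}

// foldLocal tries to hoist the table address into a local before the loop.
func foldLocal(p []byte) uint64 {
	t := &table // manual hoisting attempt
	state := uint64(0)
	for _, b := range p {
		state = t[b] >> (state & 63)
	}
	return state
}

func main() {
	for i := range table {
		table[i] = uint64(i) * 0x0101010101010101
	}
	p := []byte("loop-invariant")
	fmt.Println(foldGlobal(p) == foldLocal(p))
}
```

Comparing the assembly of the two functions with `go build -gcflags=-S` is one way to see whether the address computation stays inside the loop on a given compiler version.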
Second optimization: use `DI` for `i` instead of `CX`, to avoid the pressure on `CX`. This lets the `LEAQ 1(CX), DI` and the later `MOVQ DI, CX` collapse to just `LEAQ 1(DI), DI`.

That change brings it up to 1900 MB/s.
The body is now:
Third optimization: since `DX` is moving into `CX`, do that one instruction earlier, allowing the use of `SI` to be optimized into `DX` to eliminate the final `MOVQ`:

I think this ends up being just "compute the shift amount before the shifted value".
That change brings it up to 2150 MB/s.
This is still a direct translation of the Go code: there are no tricks the compiler couldn't do.
For this particular loop, the optimizations make the code run 35% faster.
Final assembly: