cmd/compile: tight code optimization opportunities #47120
The generated x86 code can be improved in some fairly simple ways - hoisting computed constants out of loop bodies, and avoiding unnecessary register moves - that have a significant performance impact on tight loops. In the following example those improvements produce a 35% speedup.
Here is an alternate, DFA-based implementation of
There are no big benchmarks of Valid in the package, but here are some that could be added:
The old Valid implementation runs at around 1450 MB/s.
Translating this to proper non-regabi assembly I get:
This also runs at about 1600 MB/s.
First optimization: the
That change brings it up to 1750 MB/s.
Second optimization: use DI for
That change brings it up to 1900 MB/s.
The body is now:
Third optimization: since
I think this ends up being just "compute the shift amount before the shifted value".
This is still a direct translation of the Go code: there are no tricks the compiler couldn't do.
Here is another, separate opportunity, for GOAMD64=v3 compilation. The SHRXQ instruction takes an explicit shift-count register, has separate source and destination operands, and can read its source operand from memory. That allows reducing the loop to
That change runs at 3400 MB/s (!).
(The DFA tables were carefully constructed exactly to enable this implementation.)
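For readers unfamiliar with the trick, here is a toy shift-based DFA written in the same style. It is not the UTF-8 table from this issue: the three states, the names `table` and `valid`, and the accepted language (ASCII text in which every CR is immediately followed by LF) are all made up for illustration. Each state is a bit offset that is a multiple of 6, and each table entry packs the next state for every current state into one uint64, so a transition is a single load and shift.

```go
package main

// States are bit offsets into a table entry. The error state is
// offset 0, so an all-zero field means "go to error".
const (
	stErr = 0 * 6 // reject; never left once entered
	stOK  = 1 * 6 // start state, also the only accepting state
	stCR  = 2 * 6 // just saw '\r'; only '\n' may follow
)

// table[b] holds, in the 6 bits starting at offset s, the next
// state when byte b is read in state s.
var table [256]uint64

func init() {
	for b := 0; b < 256; b++ {
		var okNext, crNext uint64 = stErr, stErr
		switch {
		case b == '\r':
			okNext = stCR
		case b == '\n':
			okNext, crNext = stOK, stOK
		case b < 0x80:
			okNext = stOK
		}
		// Bits [0,6) stay zero, so the error state maps to itself.
		table[b] = okNext<<stOK | crNext<<stCR
	}
}

// valid reports whether x is ASCII with every CR followed by LF.
// Only the low 6 bits of state are meaningful; the bits above
// them are leftovers from the shift and are masked off each step.
func valid(x []byte) bool {
	state := uint64(stOK)
	for _, b := range x {
		state = table[b] >> (state & 63)
	}
	return state&63 == stOK
}
```

The `& 63` mask matters for code generation as well as correctness: Go defines a shift by 64 or more to produce zero, so masking the count should let the compiler skip the extra check it would otherwise need around the hardware shift.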