cmd/compile: various low level x86 instruction generation improvements #28671
While reading (too much) Go-generated assembly code I picked up a few x86 code sequences that seemed suboptimal. I do not remember where I spotted each of them, and some might just come from my imagination, from compiler optimization guides, or from outside the std library.
Instead of creating an issue per possibility, here is a list of some possible low-level performance improvements. Note that this does not mean they are common and therefore worth introducing. That can be evaluated. However, these can serve to spark ideas for other improvements, and new compiler contributors can use them to try out adding SSA optimization rules or codegen improvements and benchmarking their effects and frequency. UPDATE: CLs should make sure to include statistics/examples of use in the std lib and/or generally when introducing optimizations.
The assembly that gc and gccgo currently generate can be quickly checked with https://godbolt.org/.
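As a starting point for that kind of inspection, here is a minimal sketch of a file that can be pasted into the compiler explorer or compiled locally. The functions `minusOne` and `clearLowestBit` are hypothetical examples chosen for illustration, not sequences taken from the list in this issue; the `//go:noinline` directives only keep each function's body visible as a separate symbol in the output.

```go
package main

import "fmt"

// minusOne returns a constant with all bits set. How a compiler
// materializes a constant like this is one detail worth looking at
// in the generated code. (Hypothetical example.)
//
//go:noinline
func minusOne() int64 {
	return -1
}

// clearLowestBit clears the lowest set bit of x. Whether this becomes
// several instructions or a single one depends on the compiler and on
// the target microarchitecture level. (Hypothetical example.)
//
//go:noinline
func clearLowestBit(x uint64) uint64 {
	return x & (x - 1)
}

func main() {
	fmt.Println(minusOne(), clearLowestBit(12))
}
```

Locally, `go build -gcflags=-S` prints the assembly gc generates for the package, which can then be compared against what gccgo produces for the same source.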
Many of them should be considered as examples for more general optimizations.
Thanks. You are right that ORQ could introduce an unwarranted dependency (or at least it's not known that all x86 CPUs optimize the dependency away). It is likely not always a win unless optimizing for size only, and it needs too much complex analysis to be sure it's a win in the concrete instruction flow.
Granted, these are hard to measure, especially in microbenchmarks. A smaller binary footprint means less cache use. Depending on the instruction and microarchitecture, the instruction decoders can also decode more short/simple instructions per cycle, and some archs seem to "only" fetch 16/32 bytes of instructions per cycle. More instructions will fit into some loop buffers if they are smaller in bytes, so more loops will be able to make use of loop buffers and could thereby execute faster.
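For anyone who wants to try measuring such an effect anyway, a rough sketch of a microbenchmark harness follows. The loop in `sumLoop` and the helper `benchSum` are hypothetical stand-ins for whatever code sequence a change is expected to affect, and a result from a harness like this says little about cache or decoder effects in a larger program.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from discarding the benchmarked result.
var sink uint64

// sumLoop is a small hot loop, used here only as a stand-in for the
// code whose generated instructions are being compared.
//
//go:noinline
func sumLoop(xs []uint64) uint64 {
	var s uint64
	for _, x := range xs {
		s += x
	}
	return s
}

// benchSum times sumLoop over a fixed 1024-element slice.
func benchSum(b *testing.B) {
	xs := make([]uint64, 1024)
	for i := range xs {
		xs[i] = uint64(i)
	}
	b.ResetTimer() // exclude the slice setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = sumLoop(xs)
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test".
	r := testing.Benchmark(benchSum)
	fmt.Printf("%d iterations, %d ns/op\n", r.N, r.NsPerOp())
}
```

Running the same harness with binaries built before and after a compiler change, several times each, gives numbers to compare; small differences are easily lost in run-to-run noise.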