cmd/compile: prefer AND instead of SHR+SHL #33826
In cl/19485 we added a generic SSA rule to replace an AND with specific constants by two SHIFT instructions.
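As a rough illustration of the equivalence that rule relies on (not the SSA rule itself), masking off the top or bottom bits of a 64-bit value can be written either as one AND or as a pair of shifts. The helper names below are hypothetical and only exist for this sketch:

```go
package main

import "fmt"

// lowMaskAnd keeps the low n bits of x with a single AND.
func lowMaskAnd(x uint64, n uint) uint64 {
	return x & (1<<n - 1)
}

// lowMaskShifts clears the top 64-n bits with a left shift
// followed by an unsigned right shift.
func lowMaskShifts(x uint64, n uint) uint64 {
	s := 64 - n
	return x << s >> s
}

// highMaskAnd clears the low n bits of x with a single AND NOT.
func highMaskAnd(x uint64, n uint) uint64 {
	return x &^ (1<<n - 1)
}

// highMaskShifts clears the low n bits with an unsigned right shift
// followed by a left shift.
func highMaskShifts(x uint64, n uint) uint64 {
	return x >> n << n
}

func main() {
	samples := []uint64{0, 1, 0xdeadbeefcafef00d, ^uint64(0)}
	for n := uint(1); n < 64; n++ {
		for _, x := range samples {
			if lowMaskAnd(x, n) != lowMaskShifts(x, n) ||
				highMaskAnd(x, n) != highMaskShifts(x, n) {
				fmt.Printf("mismatch for x=%#x n=%d\n", x, n)
				return
			}
		}
	}
	fmt.Println("AND and shift-pair forms agree")
}
```

Both forms compute the same value; the question in this issue is only which instruction sequence the compiler should emit.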
While this optimization avoids loading the constant to be ANDed into an extra register and has a shorter encoding on amd64, it uses two data-dependent instructions. There was already some discussion on the CL, after its accidental early submission, that micro benchmarks do not show two shifts to be faster. Some seem to show it can be slower, e.g.
There has also been some unwanted interaction with other optimization rules, e.g. #32781.
Removing the And64 case too can make binaries slightly larger, but in common cases where there is no register pressure it should be as fast as or faster than two SHIFTs on modern amd64 CPUs. It should also interfere less with other SSA rules, which then need not consider the additional case of an AND mask having been rewritten to SHIFTs.
I intend to send CLs for evaluation and submission that remove the generic rule rewriting AND into a pair of SHIFTs for go1.14, and to follow up with CLs that avoid regressions in the interaction with other rules that are based on optimizing SHIFTs instead of ANDs. For example the AND instruction in
This issue is to document the related CLs and to discuss this (de)optimization and whether any 64-bit platforms supported by go gc should keep rewriting some ANDs to two SHIFTs.
Looking at instruction timing it should be pretty much the same.
I am not saying that this is a bad change, just that I don't think this benchmark is accurate if it is emitting what you write.
On Ivy Bridge (your CPU) and pretty much all other CPUs, the instructions involved all have a 1-cycle latency. Since your benchmark is serial, it should be pretty much the same, assuming it is emitting MOVQ+ANDQ. If it is only ANDQ, it should be 2x faster (excluding the loop overhead).
It could also be that Ivy Bridge is able to do some magic and can ignore the MOVQ.
Either way; with the
I would agree if viewed standalone. If there is surrounding code, as in the benchmark loop, I would expect the MOV to be issued in parallel with, or ahead of, some earlier instruction (unless the CPU is already saturated at 4 retired instructions per cycle). The two shifts together always need at least 2 cycles after their input is ready from a previous loop iteration.
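A minimal sketch of such a serial dependency chain, for anyone who wants to reproduce the comparison. This is not the benchmark from the issue; the function names, the 40-bit mask, and the iteration count are assumptions made for illustration, and the `+ 1` adds the same extra latency to both chains. The compiler is free to canonicalize either form into the other depending on the Go version, so the generated code should be checked with `go build -gcflags=-S` before drawing conclusions.

```go
package main

import (
	"fmt"
	"time"
)

// lowMask needs more than 32 bits, so on amd64 it cannot be an AND immediate.
const lowMask = 1<<40 - 1

// sink keeps the results alive so the loops are not optimized away.
var sink uint64

// chainAnd feeds each iteration's result into the next one through an AND.
//
//go:noinline
func chainAnd(x uint64, n int) uint64 {
	for i := 0; i < n; i++ {
		x = (x + 1) & lowMask
	}
	return x
}

// chainShifts is the same chain written as a shift pair (64-40 = 24).
//
//go:noinline
func chainShifts(x uint64, n int) uint64 {
	for i := 0; i < n; i++ {
		x = (x + 1) << 24 >> 24
	}
	return x
}

// timeIt runs f for n iterations and returns the elapsed wall-clock time.
func timeIt(f func(uint64, int) uint64, n int) time.Duration {
	start := time.Now()
	sink = f(sink, n)
	return time.Since(start)
}

func main() {
	const n = 100000000
	fmt.Println("AND:   ", timeIt(chainAnd, n))
	fmt.Println("shifts:", timeIt(chainShifts, n))
}
```

A wall-clock loop like this is crude; a real measurement would use `testing.B` with several runs and compare the results statistically.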
Checked again and the MOV is inside the loop. That a MOV is needed only becomes clear very late, in the arch-specific SSA (see why below). I do not think a loop hoisting pass runs after that stage.
AND on amd64 does not support 64-bit immediates: https://www.felixcloutier.com/x86/and
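To make the immediate restriction concrete: the 64-bit AND forms take an 8-bit or 32-bit immediate that is sign-extended to 64 bits, so a mask can only be encoded directly if it survives that round trip. The sketch below checks this for a few masks; `fitsImm32` is a hypothetical helper written for this example, not a compiler function.

```go
package main

import "fmt"

// fitsImm32 reports whether c is unchanged after truncation to 32 bits
// and sign extension back to 64 bits, i.e. whether it could be encoded
// as a sign-extended 32-bit immediate.
func fitsImm32(c int64) bool {
	return c == int64(int32(c))
}

func main() {
	masks := []uint64{
		0xff,
		0x7fffffff,
		0xffffffff,         // low 32 bits set: sign extension would set the top bits too
		1<<40 - 1,          // low 40 bits set
		0xffffffffffffff00, // clears the low 8 bits: this is -256 as int64
		0xffffffff00000000, // clears the low 32 bits
	}
	for _, m := range masks {
		fmt.Printf("%#018x fits imm32: %v\n", m, fitsImm32(int64(m)))
	}
}
```

Masks that do not fit have to be materialized in a register first, which is where the extra MOV comes from.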
Yeah, the CPU is doing magic.
A simple test, pure assembler:
The first attempt broke codegen tests in arm64 (bfxil) and s390x (abs and copysign).
Will make sure to run codegen tests for all platforms for the new version of the CL.
For a new version of the CL I added abs and copysign rules that need to detect the new AND variant; still working on the arm64 bitfield adjustment.
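For context, a sketch of why abs and copysign are affected: both manipulate the sign bit of a float64, which can be expressed either as an AND with a mask or as a shift pair. The exact s390x rules are not shown in this thread, so the functions below are only an illustration of the two source-level shapes, with hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// signBit is the IEEE 754 sign bit of a float64.
const signBit = 1 << 63

// absAnd clears the sign bit with a mask (the AND variant).
func absAnd(f float64) float64 {
	return math.Float64frombits(math.Float64bits(f) &^ signBit)
}

// absShifts clears the sign bit by shifting it out and back in as zero
// (the shift-pair variant).
func absShifts(f float64) float64 {
	return math.Float64frombits(math.Float64bits(f) << 1 >> 1)
}

// copysignAnd combines the magnitude of x with the sign of y using masks.
func copysignAnd(x, y float64) float64 {
	return math.Float64frombits(math.Float64bits(x)&^signBit | math.Float64bits(y)&signBit)
}

func main() {
	for _, f := range []float64{-2.5, 3, math.Inf(-1)} {
		fmt.Println(absAnd(f), absShifts(f), copysignAnd(f, -1))
	}
}
```

A rule that only recognizes one of these shapes stops firing once the generic rewrite that produced that shape is removed, which is consistent with the codegen test failures described above.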