
cmd/compile: prefer AND instead of SHR+SHL #33826

Open
martisch opened this issue Aug 25, 2019 · 7 comments

@martisch (Member) commented Aug 25, 2019

In cl/19485 we added a generic SSA rule that replaces an AND with certain constants by a pair of SHIFT instructions.

While this optimization avoids loading the AND constant into an extra register and has a shorter encoding on amd64, it uses two data-dependent instructions. There was already some discussion on the CL, after an accidental early submission, noting that micro-benchmarks do not show two shifts to be faster; some even show they can be slower, e.g. for x &= 1 << 63. I benchmarked math.Copysign and it showed the same throughput with either variant. LLVM and GCC do not appear to replace AND with shifts on amd64 either, and some platforms, like arm64 in the Go gc compiler, even reverse this rewrite:

https://go.googlesource.com/go/+/06f709a04acc6d5ba0ba181129e9ee93ed20f311/src/cmd/compile/internal/ssa/gen/ARM64.rules#1866
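For reference, the rewrite in question turns a high-bit mask into a shift pair. A minimal sketch of the two equivalent forms (function names are my own, for illustration only):

```go
package main

import "fmt"

// maskAnd keeps only the top bit of x using an AND with a large constant.
// On amd64 this needs a MOVQ to materialize the constant plus an ANDQ.
func maskAnd(x uint64) uint64 {
	return x & (1 << 63)
}

// maskShifts computes the same result with the SHR+SHL pair that the
// generic SSA rule emits instead.
func maskShifts(x uint64) uint64 {
	return (x >> 63) << 63
}

func main() {
	for _, x := range []uint64{0, 1, 1 << 63, ^uint64(0)} {
		fmt.Println(maskAnd(x) == maskShifts(x)) // the two forms always agree
	}
}
```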

There has also been some unwanted interaction with other optimizations rules e.g. #32781.

The initially added rules from cl/19485 operated on And64 and And32. The And32 case was already removed in cl/20410.

Removing the And64 case too can make binaries slightly larger, but in the common case without register pressure it should be as fast as or faster than two SHIFTs on modern amd64 CPUs. It should also create less interference with other SSA rules, which then need not consider the additional case of an AND mask having been rewritten to SHIFTs.

I intend to send CLs for evaluation and submission to remove the generic rule that rewrites AND into a pair of SHIFTs for go1.14, and to follow up with CLs that avoid regressions in interaction with other rules that optimize SHIFTs instead of ANDs. For example, the AND instruction in (u64>>32)&0xFFFFFFFF should still be optimized away by SSA.
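To make that example concrete: the AND in `(u64>>32)&0xFFFFFFFF` is redundant because the shift already clears the high 32 bits, so SSA should be able to drop it entirely. A small illustrative sketch (function name is my own):

```go
package main

import "fmt"

// highWord extracts the upper 32 bits of u. The mask is a no-op after the
// shift, since u>>32 already has its high 32 bits cleared, so the compiler
// should eliminate the AND.
func highWord(u uint64) uint64 {
	return (u >> 32) & 0xFFFFFFFF
}

func main() {
	u := uint64(0xDEADBEEF_CAFEBABE)
	fmt.Printf("%#x\n", highWord(u))  // 0xdeadbeef
	fmt.Println(highWord(u) == u>>32) // true: the mask changes nothing
}
```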

This issue is to document the related CLs, and to discuss this (de)optimization and whether any 64-bit platforms supported by Go's gc compiler should keep rewriting some ANDs to two SHIFTs.

/cc @klauspost @cherrymui @khr @dr2chase @laboger @mundaym

@martisch martisch added this to the Go1.14 milestone Aug 25, 2019

@martisch martisch self-assigned this Aug 25, 2019

@gopherbot commented Aug 25, 2019

Change https://golang.org/cl/191780 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

@klauspost (Contributor) commented Aug 25, 2019

Looking at instruction timing it should be pretty much the same.

I am not saying that this is a bad change, just that I don't think this benchmark is accurate if it is emitting what you write.

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

Is MOVQ $0x8000000000000000, RAX actually inside the loop? And wouldn't it emit ANDQ $0x8000000000000000, REG?

On Ivy Bridge (your CPU) and pretty much all other CPUs, the instructions involved all have a 1-cycle latency. Since your benchmark is serial, it should be pretty much the same assuming it is emitting MOVQ+ANDQ. If it is only ANDQ, it should be 2x faster (excluding the loop overhead).

It could also be that Ivy Bridge is able to do some magic and can ignore the MOVQ.

Either way; with the AND, the compiler will be able to do some optimizations. A clever compiler could even move the constant loading outside the loop, as mentioned above.
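A hand-hoisted variant of the loop illustrates the point (purely a sketch with my own names; it assumes a 64-bit uint, and the compiler may of course rematerialize the constant anyway):

```go
package main

import "fmt"

// andLoopHoisted repeatedly masks x with the high bit, loading the
// mask constant once before the loop rather than on every iteration.
func andLoopHoisted(x uint, n int) uint {
	mask := uint(1) << 63 // loaded once, outside the loop
	for i := 0; i < n; i++ {
		x &= mask
	}
	return x
}

func main() {
	fmt.Println(andLoopHoisted(1<<63, 3) == 1<<63) // high bit survives
	fmt.Println(andLoopHoisted(7, 1))              // low bits are cleared
}
```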

@martisch (Member, Author) commented Aug 25, 2019

Looking at instruction timing it should be pretty much the same.

I would agree if viewed standalone. With surrounding code, like the benchmark loop, I would expect the MOV to be issued in parallel, ahead with some earlier instruction (unless the CPU is already saturated at 4 retires per cycle). The two shifts together always need at least 2 cycles after their input is ready from a previous loop iteration.

Is MOVQ $0x8000000000000000, RAX actually inside the loop?

Checked again, and the MOV is inside the loop. That a MOV is needed only becomes clear very late, in the arch-specific SSA (see why below). I do not think a loop-hoisting pass runs after that stage.

And wouldn't it emit ANDQ $0x8000000000000000, REG?

AND on amd64 does not support 64-bit immediates: https://www.felixcloutier.com/x86/and
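Concretely, ANDQ accepts at most a sign-extended 32-bit immediate, so any constant outside that range must first be materialized with a MOVQ. A small sketch of that encoding check (my own illustrative helper, not compiler source):

```go
package main

import "fmt"

// fitsImm32 reports whether c can be encoded as a sign-extended
// 32-bit immediate operand, which is the most ANDQ allows on amd64.
func fitsImm32(c int64) bool {
	return c == int64(int32(c))
}

func main() {
	fmt.Println(fitsImm32(0x7FFFFFFF))      // true: fits in imm32
	fmt.Println(fitsImm32(-1))              // true: sign-extends from imm32
	fmt.Println(fitsImm32(int64(-1) << 63)) // false: 1<<63 needs a MOVQ first
}
```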

@klauspost (Contributor) commented Aug 25, 2019

Yeah, the CPU is doing magic.

A simple test, pure assembler:

TEXT ·testAnd(SB), NOSPLIT, $0
	MOVQ $65535, CX
	XORQ AX, AX

loop_and:
	MOVQ $0x8000000000000000, BX
	ANDQ BX, AX
	DECQ CX
	JNZ  loop_and
	RET

TEXT ·testShift(SB), NOSPLIT, $0
	MOVQ $65535, CX
	XORQ AX, AX

loop_shift:
	SHRQ $63, AX
	SHLQ $63, AX
	DECQ CX
	JNZ  loop_shift
	RET

Using AND is ~2x as fast on my Ryzen 3600. Moving the MOVQ outside the loop makes absolutely no difference, which indicates the magic :)

@laboger (Contributor) commented Aug 28, 2019

Tried your CL on ppc64le, and it seems to be better by about 30% on the benchmark you provided.

gopherbot pushed a commit that referenced this issue Sep 9, 2019
compile: prefer an AND instead of SHR+SHL instructions
On modern 64-bit CPUs a SHR, SHL or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts instead of an AND to mask the high bits of a
register has a shorter encoding and uses one less general-purpose
register, but it is slower, taking one clock cycle longer, when there
is no register pressure that would make the AND variant generate a
spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

after this CL the instructions are the same as GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

var GlobalU uint

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

Updates #33826
Updates #32781

Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@martisch (Member, Author) commented Sep 9, 2019

The first attempt broke codegen tests on arm64 (bfxil) and s390x (abs and copysign):
https://build.golang.org/log/a47a1c98bac2aa509d570d6bf583609857b1be5a
https://build.golang.org/log/6456da8a84cc82e7407981287f211caaab41c7eb

Will make sure to run codegen tests for all platforms for the new version of the CL.

For a new version of the CL I added abs and copysign rules that need to detect the new AND variant; I am still working on the arm64 bitfield adjustment.

@gopherbot commented Sep 9, 2019

Change https://golang.org/cl/194297 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions
