cmd/compile: handle 64-bit masks better on 32-bit systems #32781

Open
rsc opened this issue Jun 26, 2019 · 6 comments


@rsc rsc commented Jun 26, 2019

On a 32-bit system, an expression like uint64value & constantUint64Mask could easily be decomposed into two 32-bit mask operations. That does not seem to happen (see below, especially the OR instructions); it seems like it probably should.

$ cat x.go
package p

func f(x uint64) uint64 {
	return x & (1<<60 - 1)
}
$ GOARCH=arm go build -gcflags=-S x.go
# command-line-arguments
"".f STEXT size=40 args=0x10 locals=0x0 leaf
	0x0000 00000 (/Users/rsc/x.go:3)	TEXT	"".f(SB), LEAF|NOFRAME|ABIInternal, $-4-16
	0x0000 00000 (/Users/rsc/x.go:3)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/Users/rsc/x.go:3)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/Users/rsc/x.go:3)	FUNCDATA	$2, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (/Users/rsc/x.go:4)	PCDATA	$0, $0
	0x0000 00000 (/Users/rsc/x.go:4)	PCDATA	$1, $0
	0x0000 00000 (/Users/rsc/x.go:4)	MOVW	"".x(FP), R0
	0x0004 00004 (/Users/rsc/x.go:4)	SRL	$28, R0, R1
	0x0008 00008 (/Users/rsc/x.go:4)	MOVW	"".x+4(FP), R2
	0x000c 00012 (/Users/rsc/x.go:4)	ORR	R2<<4, R1, R1
	0x0010 00016 (/Users/rsc/x.go:4)	BFXU	$28, R0, $0, R0
	0x0014 00020 (/Users/rsc/x.go:4)	ORR	R1<<28, R0, R0
	0x0018 00024 (/Users/rsc/x.go:4)	MOVW	R0, "".~r1+8(FP)
	0x001c 00028 (/Users/rsc/x.go:4)	SRL	$4, R1, R0
	0x0020 00032 (/Users/rsc/x.go:4)	MOVW	R0, "".~r1+12(FP)
	0x0024 00036 (/Users/rsc/x.go:4)	JMP	(R14)
...
$ GOARCH=386 go build -gcflags=-S x.go
# command-line-arguments
"".f STEXT size=62 args=0x10 locals=0x0
	0x0000 00000 (/Users/rsc/x.go:3)	TEXT	"".f(SB), ABIInternal, $0-16
	0x0000 00000 (/Users/rsc/x.go:3)	MOVL	(TLS), CX
	0x0007 00007 (/Users/rsc/x.go:3)	CMPL	SP, 8(CX)
	0x000a 00010 (/Users/rsc/x.go:3)	JLS	55
	0x000c 00012 (/Users/rsc/x.go:3)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x000c 00012 (/Users/rsc/x.go:3)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x000c 00012 (/Users/rsc/x.go:3)	FUNCDATA	$2, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x000c 00012 (/Users/rsc/x.go:4)	PCDATA	$0, $0
	0x000c 00012 (/Users/rsc/x.go:4)	PCDATA	$1, $0
	0x000c 00012 (/Users/rsc/x.go:4)	MOVL	"".x+8(SP), AX
	0x0010 00016 (/Users/rsc/x.go:4)	SHLL	$4, AX
	0x0013 00019 (/Users/rsc/x.go:4)	MOVL	"".x+4(SP), CX
	0x0017 00023 (/Users/rsc/x.go:4)	MOVL	CX, DX
	0x0019 00025 (/Users/rsc/x.go:4)	SHRL	$28, CX
	0x001c 00028 (/Users/rsc/x.go:4)	ORL	AX, CX
	0x001e 00030 (/Users/rsc/x.go:4)	SHLL	$4, DX
	0x0021 00033 (/Users/rsc/x.go:4)	SHRL	$4, DX
	0x0024 00036 (/Users/rsc/x.go:4)	MOVL	CX, AX
	0x0026 00038 (/Users/rsc/x.go:4)	SHLL	$28, CX
	0x0029 00041 (/Users/rsc/x.go:4)	ORL	DX, CX
	0x002b 00043 (/Users/rsc/x.go:4)	MOVL	CX, "".~r1+12(SP)
	0x002f 00047 (/Users/rsc/x.go:4)	SHRL	$4, AX
	0x0032 00050 (/Users/rsc/x.go:4)	MOVL	AX, "".~r1+16(SP)
	0x0036 00054 (/Users/rsc/x.go:4)	RET
	0x0037 00055 (/Users/rsc/x.go:4)	NOP
	0x0037 00055 (/Users/rsc/x.go:3)	PCDATA	$1, $-1
	0x0037 00055 (/Users/rsc/x.go:3)	PCDATA	$0, $-1
	0x0037 00055 (/Users/rsc/x.go:3)	CALL	runtime.morestack_noctxt(SB)
	0x003c 00060 (/Users/rsc/x.go:3)	JMP	0
@rsc rsc added this to the Go1.14 milestone Jun 26, 2019
@rsc rsc added the NeedsFix label Jun 26, 2019
@cherrymui cherrymui commented Jun 26, 2019

Emmm. I'm pretty sure it decomposed a 64-bit mask into two 32-bit masks when I wrote the code a few years ago: https://go.googlesource.com/go/+/6d1aaf143c879ae7acbd20a1e60bf74681bf6055/src/cmd/compile/internal/ssa/gen/dec64.rules#88

This particular case is interesting: the generic opt pass (which runs before the decompose pass) changes the AND into a pair of shifts (x<<4>>4), and decomposing the shifts is more complex.

With a different mask, one that cannot be rewritten to shifts, it works as expected:

func f(x uint64) uint64 {
	return x & 0x1234567890abcdef
}
	0x0000 00000 (x64.go:4)	MOVW	"".x(FP), R0
	0x0004 00004 (x64.go:4)	AND	$-1867788817, R0, R0
	0x000c 00012 (x64.go:4)	MOVW	R0, "".~r1+8(FP)
	0x0010 00016 (x64.go:4)	MOVW	"".x+4(FP), R0
	0x0014 00020 (x64.go:4)	AND	$305419896, R0, R0
	0x001c 00028 (x64.go:4)	MOVW	R0, "".~r1+12(FP)
	0x0020 00032 (x64.go:4)	JMP	(R14)

Maybe we should either not do the mask-to-shift rewriting for 64-bit ops on 32-bit architectures, or somehow recognize this pattern and do better when decomposing.

@gopherbot gopherbot commented Jun 26, 2019

Change https://golang.org/cl/183957 mentions this issue: cmd/compile: not rewrite 64-bit AND to shifts on 32-bit machines

@gopherbot gopherbot commented Aug 25, 2019

Change https://golang.org/cl/191780 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 9, 2019
On modern 64-bit CPUs a SHR, SHL, or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register takes 2 cycles
and must wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction does not depend on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts instead of an AND to mask the high bits of a
register has a shorter encoding and uses one less general-purpose
register, but it is slower, taking one clock cycle longer whenever there
is no register pressure that would force the AND variant to generate a
spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

after this CL the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

var GlobalU uint

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

Updates #33826
Updates #32781

Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot gopherbot commented Sep 9, 2019

Change https://golang.org/cl/194297 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 21, 2019
On modern 64-bit CPUs a SHR, SHL, or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register takes 2 cycles
and must wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction does not depend on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts instead of an AND to mask the high bits of a
register has a shorter encoding and uses one less general-purpose
register, but it is slower, taking one clock cycle longer whenever there
is no register pressure that would force the AND variant to generate a
spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

after this CL the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
                x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
       since ORshiftRL generation fusing '>> rc' and '|' interferes
       with matching ((x << lc) >> rc) to generate UBFX. Previously
       ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb
Reviewed-on: https://go-review.googlesource.com/c/go/+/194297
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@gopherbot gopherbot commented Sep 23, 2019

Change https://golang.org/cl/196957 mentions this issue: Revert "compile: prefer an AND instead of SHR+SHL instructions"

gopherbot pushed a commit that referenced this issue Sep 23, 2019
This reverts CL 194297.

Reason for revert: introduced register allocation failures on PPC64LE builders.

Updates #33826
Updates #32781
Updates #34468

Change-Id: I7d0b55df8cdf8e7d2277f1814299b083c2692e48
Reviewed-on: https://go-review.googlesource.com/c/go/+/196957
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
@gopherbot gopherbot commented Sep 24, 2019

Change https://golang.org/cl/196810 mentions this issue: compile: prefer an AND instead of SHR+SHL instructions

gopherbot pushed a commit that referenced this issue Sep 24, 2019
On modern 64-bit CPUs a SHR, SHL, or AND instruction takes 1 cycle to execute.
A pair of shifts that operate on the same register takes 2 cycles
and must wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction cannot be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However, that MOV instruction does not depend on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts instead of an AND to mask the high bits of a
register has a shorter encoding and uses one less general-purpose
register, but it is slower, taking one clock cycle longer whenever there
is no register pressure that would force the AND variant to generate a
spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

after this CL the instructions are the same ones GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

    var GlobalU uint

    func BenchmarkAndHighBits(b *testing.B) {
        x := uint(0)
        for i := 0; i < b.N; i++ {
                x &= 1 << 63
        }
        GlobalU = x
    }

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25)

'go run run.go -all_codegen -v codegen' passes with the following adjustments:

ARM64: The BFXIL pattern ((x << lc) >> rc | y & ac) needed adjustment
       since ORshiftRL generation fusing '>> rc' and '|' interferes
       with matching ((x << lc) >> rc) to generate UBFX. Previously
       ORshiftLL was created first using the shifts generated for (y & ac).

S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs.

Updates #33826
Updates #32781

Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c
Reviewed-on: https://go-review.googlesource.com/c/go/+/196810
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019