
cmd/compile: redundant moves and stack variables when function using bits.Add64 is inlined #33349

Open
renthraysk opened this issue Jul 29, 2019 · 6 comments

Comments

@renthraysk

commented Jul 29, 2019

What version of Go are you using (go version)?

$ go version
go version go1.13beta1 linux/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jared/.cache/go-build"
GOENV="/home/jared/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/jared/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build580535950=/tmp/go-build -gno-record-gcc-switches"

What did you do?

https://play.golang.org/p/PWeQpDU4ilg

What did you expect to see?

No allocation of Zig()'s y variable on the stack, and no code duplication. Something along the lines of...

0x0089 00137 (bug.go:8)	MOVQ	"".x+168(SP), AX
0x0091 00145 (bug.go:8)	ADDQ	AX, AX
0x0094 00148 (bug.go:8)	SBBQ	CX, CX
0x009c 00156 (bug.go:9)	XORQ	AX, CX

What did you see instead?

Zig()'s bits.Add64()/ADDQ is executed twice, with an unnecessary use of the temporary variable y.

0x005a 00090 (bug.go:8)	MOVQ	"".x+168(SP), DX
0x0062 00098 (bug.go:8)	ADDQ	DX, DX
0x0065 00101 (bug.go:8)	MOVQ	DX, "".y+80(SP)
...
0x0089 00137 (bug.go:8)	MOVQ	"".x+168(SP), AX
0x0091 00145 (bug.go:8)	ADDQ	AX, AX
0x0094 00148 (bug.go:8)	SBBQ	AX, AX
0x0097 00151 (<unknown line number>)	NOP
0x0097 00151 (<unknown line number>)	NOP
0x0097 00151 (bug.go:9)	MOVQ	"".y+80(SP), CX
0x009c 00156 (bug.go:9)	XORQ	AX, CX
@bcmills

Member

commented Jul 29, 2019

The symptom is superficially similar to #32969. Any idea whether the root cause is the same?

(CC @randall77 @dr2chase)

@bcmills bcmills added this to the Unplanned milestone Jul 29, 2019

@bcmills bcmills changed the title cmd/compile: Unnecessary code generated and temporary var allocated on stack cmd/compile: redundant moves and stack variables when using bits.Add64 Jul 29, 2019

@randall77

Contributor

commented Jul 29, 2019

I can't reproduce. I get this when compiling Zig with tip (5235501c0509385c28294b3a314891e1e7165163):

	0x0000 00000 (/usr/local/google/home/khr/gowork/issue33349.go:8)	MOVQ	"".x+8(SP), AX
	0x0005 00005 (/usr/local/google/home/khr/gowork/issue33349.go:8)	ADDQ	AX, AX
	0x0008 00008 (/usr/local/google/home/khr/gowork/issue33349.go:8)	SBBQ	CX, CX
	0x000b 00011 (/usr/local/google/home/khr/gowork/issue33349.go:9)	XORQ	CX, AX
	0x000e 00014 (/usr/local/google/home/khr/gowork/issue33349.go:9)	MOVQ	AX, "".~r1+16(SP)
@renthraysk

Author

commented Jul 29, 2019

Yes, that's Zig()'s code. The problem occurs when it is inlined into AppendZig(). Apologies, I didn't make that clear.

@bcmills bcmills removed the WaitingForInfo label Jul 29, 2019

@bcmills bcmills changed the title cmd/compile: redundant moves and stack variables when using bits.Add64 cmd/compile: redundant moves and stack variables when code using bits.Add64 is inlined Jul 29, 2019

@bcmills bcmills changed the title cmd/compile: redundant moves and stack variables when code using bits.Add64 is inlined cmd/compile: redundant moves and stack variables when function using bits.Add64 is inlined Jul 29, 2019

@randall77

Contributor

commented Jul 29, 2019

Ok, that makes more sense.

Looks like we need a tweak to the schedule pass. Schedule produces:

v40 (+8) = ADDQcarry <uint64,flags> v10 v10
v65 (8) = Select0 <uint64> v40 (y[uint64])
v42 (8) = Select1 <flags> v40
v48 (25) = ADDQ <*byte> v115 v71
v50 (25) = MOVQstore <mem> v2 v48 v49
v53 (25) = MOVQstore <mem> [8] v2 v102 v50
v57 (25) = MOVQstore <mem> [16] v2 v86 v53
v58 (25) = CALLstatic <mem> {runtime.memmove} [24] v57
v45 (8) = SBBQcarrymask <uint64> v42

There's just a single ADDQ, and we use Select0/Select1 to get the 64-bit result and the carry, respectively. But then the carry isn't used until after the memmove. That means we need to spill the carry and restore it around the memmove call. And because there's no instruction to spill/restore flags (no cheap one, anyway), we reissue the flag-generating instruction after the memmove. After flagalloc, we have:

v40 (+8) = ADDQcarry <uint64,flags> v10 v10
v65 (8) = Select0 <uint64> v40 (y[uint64])
v42 (8) = Select1 <flags> v40
v48 (25) = ADDQ <*byte> v115 v71
v50 (25) = MOVQstore <mem> v2 v48 v49
v53 (25) = MOVQstore <mem> [8] v2 v102 v50
v57 (25) = MOVQstore <mem> [16] v2 v86 v53
v58 (25) = CALLstatic <mem> {runtime.memmove} [24] v57
v6 (8) = ADDQcarry <uint64,flags> v10 v10
v15 (8) = Select1 <flags> v6
v45 (8) = SBBQcarrymask <uint64> v15

v42 is now dead; we've recomputed it as v6/v15.

We already try to schedule flag users earlier, to make flag lifetimes short. Just pushing ReadFlags ahead of Memory (which is what memmove is) seems to fix this issue. I'll have to think a bit about whether that would break anything else.

@randall77

Contributor

commented Aug 26, 2019

Just prioritizing the flags op is not enough, as it leads to another bad ordering. With that change we do just a single add, but end up spilling both the add result and the result of the SBBQ to the stack before the memmove. Those values are then loaded and xored immediately after the memmove. It would be better to do the xor before spilling so there is only one thing to spill+restore.

This will probably require a more general upgrade to the scheduling pass. We need to incorporate information to enable prioritization of instructions which kill more arguments than they generate results (or perhaps even enable subsequent instructions to do so). Right now it is done just by instruction class, which isn't precise enough.

@renthraysk

Author

commented Aug 26, 2019

I did manage to find a similar issue with bits.Len(), as it also relies on flags.

package oddsched

import (
	"math/bits"
)

func AppendVarint(p []byte, x uint64) []byte {
	t := [10]byte{
		uint8(x) | 0x80,
		uint8(x>>7) | 0x80,
		uint8(x>>14) | 0x80,
		uint8(x>>21) | 0x80,
		uint8(x>>28) | 0x80,
		uint8(x>>35) | 0x80,
		uint8(x>>42) | 0x80,
		uint8(x>>49) | 0x80,
		uint8(x>>56) | 0x80,
		1,
	}
	n := (bits.Len64(x|1) + 6) / 7
	t[n-1] &= 0x7F
	return append(p, t[:n]...)
}

The assembly starts by computing x|1 and a BSRQ, then initialises t, and then has to perform another BSRQ to finally compute n.

	MOVQ	"".x+152(SP), DX
	MOVQ	DX, BX
	ORQ	$1, DX
	BSRQ	DX, SI
        ... init t...
   	BSRQ	DX, DX
	MOVQ	$-1, DX
	CMOVQEQ	DX, SI
	LEAQ	7(SI), DX
	MOVQ	$5270498306774157605, AX
	MOVQ	DX, CX
	IMULQ	DX
	SARQ	$1, DX
	SARQ	$63, CX
	SUBQ	CX, DX