Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: Load optimization suboptimal #48222

Open
klauspost opened this issue Sep 7, 2021 · 5 comments
Open

cmd/compile: Load optimization suboptimal #48222

klauspost opened this issue Sep 7, 2021 · 5 comments

Comments

@klauspost
Copy link
Contributor

@klauspost klauspost commented Sep 7, 2021

What version of Go are you using (go version)?

λ go version
go version go1.17 windows/amd64

Does this issue reproduce with the latest release?

Yes. And also previous versions.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=d:\temp\gocache
set GOENV=C:\Users\klaus\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=e:\gopath\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=e:\gopath
set GOPRIVATE=
set GOPROXY=https://goproxy.io
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.17
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=NUL
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=d:\temp\wintemp\go-build3764886800=/tmp/go-build -gno-record-gcc-switches 

What did you do?

Compile https://play.golang.org/p/oWj4bSaNiHN

Godbolt: https://go.godbolt.org/z/YnGcGW8Gc

What did you expect to see?

A single MOVL instruction used to load uint32.

What did you see instead?

The compiler tries to put the value together with the previous load, 2 other loads, 2 shifts and 2 or operations.

        MOVBLZX (AX), CX
        NOP
        TESTB   CL, CL
        JNE     foo_pc65
        MOVBLZX 1(AX), DX
        SHLL    $8, DX
        ORL     CX, DX
        MOVWLZX 2(AX), AX
        SHLL    $16, AX
        ORL     DX, AX

It it safe to rely on the value being in L1 when it has just been loaded. Some modern archs even do memory -> virtual register aliasing. I don't see any cases where this "optimization" is a benefit compared to straight up reloading the value.

@klauspost klauspost changed the title cmd/compile: Optimization adds complexity cmd/compile: Load optimization suboptimal Sep 7, 2021
@klauspost
Copy link
Contributor Author

@klauspost klauspost commented Sep 7, 2021

FWIW casting to *[4]byte spits out much nicer code, but IMO is worse to read and I cannot rely on Go 1.17 features for at least another year.

func foo(x []byte) uint32 {
    y := (*[4]byte)(x) // Avoid bounds check for clarity
	switch y[0] {
	case 0:
		return binary.LittleEndian.Uint32(y[:])
	}
	return 0
}

->

        CMPB    (AX), $0
        JNE     foo_pc42
        MOVL    (AX), AX

@josharian
Copy link
Contributor

@josharian josharian commented Sep 7, 2021

Want to do another one, @agarciamontoro?

@randall77
Copy link
Contributor

@randall77 randall77 commented Sep 7, 2021

Consider this program:

func foo(x []byte) uint32 {
    x = x[:4] // Avoid bounds check for clarity
    y := x[0]
	switch y {
	case 0:
		return uint32(y) | uint32(x[1])<<8 | uint32(x[2])<<16 | uint32(x[3])<<24
	}
	return 0
}

In this situation, we can't do a single 4-byte load inside the case because we might get into a situation where the bottom 8 bits of the returned value are not 0. That would be surprising. It can only happen if there's a data race, so technically it would be ok, but reloading from memory when the program doesn't do so, is something we try to avoid.

The OP's program has 2 separate loads, so it would be ok to do the 4-byte load optimization there. But unforunately, we do CSE before we do the 4-byte load optimization, so at that point the compiler can't tell the difference between the OP's program and mine. It errs on the side of caution and doesn't do the 4-byte load optimization.

@thanm thanm added this to the Backlog milestone Sep 7, 2021
@klauspost
Copy link
Contributor Author

@klauspost klauspost commented Sep 10, 2021

I would think it is always better to reload when constructing the value includes just a single load from memory.

Even constructing a 16 bit value:

        MOVBLZX (AX), CX
        [...]
        MOVBLZX 1(AX), DX
        SHLL    $8, DX
        ORL     CX, DX

... the second part has a considerably longer latency than a single MOVWLZX, 2 cycles on most CPUs.

@agarciamontoro
Copy link
Contributor

@agarciamontoro agarciamontoro commented Sep 30, 2021

Want to do another one, @agarciamontoro?

Thanks for the offer, @josharian! I'm right now investigating to do another one, so I'll pass on this opportunity and leave it open for someone else to grab :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants