
cmd/compile: multiple LittleEndian.Uint* calls on the same buffer block load merging #52708

Open
greatroar opened this issue May 4, 2022 · 2 comments
Labels: NeedsInvestigation, Performance

greatroar commented May 4, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18 linux/amd64
$ gotip version
go version devel go1.19-87cf92e7d51 Wed May 4 14:17:20 2022 +0000 linux/amd64

What did you do?

import "encoding/binary"

type T struct {
        buf [16]byte
}

func (t *T) readU32() uint32 {
	x := binary.LittleEndian.Uint32(t.buf[:])
	t.shiftBuf(4)
	return x
}

func (t *T) shiftBuf(n uint) {
	x := binary.LittleEndian.Uint64(t.buf[:])
	y := binary.LittleEndian.Uint64(t.buf[8:])

	shift := 8 * n
	x >>= shift
	x |= y << ((64 - shift) % 64)
	y >>= shift

	binary.LittleEndian.PutUint64(t.buf[:], x)
	binary.LittleEndian.PutUint64(t.buf[8:], y)
}

What did you expect to see?

A single MOVL for LittleEndian.Uint32, and a single MOVQ for each LittleEndian.(Put)?Uint64, or something similar.

What did you see instead?

(*T).shiftBuf gets inlined. The Uint32 call then gets broken up into four byte loads. The first Uint64 becomes a MOVL to fetch the upper word, which is then combined with the Uint32 result. The final PutUint64 is broken up into two MOVLs. With gotip:

"".(*T).readU32 STEXT nosplit size=115 args=0x8 locals=0x0 funcid=0x0 align=0x0
        0x0000 00000 (t.go:9)   TEXT    "".(*T).readU32(SB), NOSPLIT|ABIInternal, $0-8
        0x0000 00000 (t.go:9)   FUNCDATA        $0, gclocals·c2071639b6d8d876272b6494fd4db694cb4563c618e97968c244b75ebb6010cc(SB)
        0x0000 00000 (t.go:9)   FUNCDATA        $1, gclocals·27917eed0c3b3bbbded907160bb0e979825b780d6fad406e47efba824cbdf65b(SB)
        0x0000 00000 (t.go:9)   FUNCDATA        $5, "".(*T).readU32.arginfo1(SB)
        0x0000 00000 (t.go:9)   FUNCDATA        $6, "".(*T).readU32.argliveinfo(SB)
        0x0000 00000 (t.go:9)   PCDATA  $3, $1
        0x0000 00000 (t.go:10)  XCHGL   AX, AX
        0x0001 00001 (<unknown line number>)    NOP
        0x0001 00001 (t.go:11)  MOVBLZX (AX), CX
        0x0004 00004 ($GOROOT/src/encoding/binary/binary.go:81) MOVBLZX 1(AX), DX
        0x0008 00008 ($GOROOT/src/encoding/binary/binary.go:81) MOVBLZX 2(AX), BX
        0x000c 00012 ($GOROOT/src/encoding/binary/binary.go:81) MOVBLZX 3(AX), SI
        0x0010 00016 (t.go:16)  XCHGL   AX, AX
        0x0011 00017 (t.go:17)  XCHGL   AX, AX
        0x0012 00018 ($GOROOT/src/encoding/binary/binary.go:103)        MOVL    DX, DI
        0x0014 00020 ($GOROOT/src/encoding/binary/binary.go:103)        SHLQ    $8, DX
        0x0018 00024 ($GOROOT/src/encoding/binary/binary.go:103)        ORQ     CX, DX
        0x001b 00027 ($GOROOT/src/encoding/binary/binary.go:103)        MOVL    BX, R8
        0x001e 00030 ($GOROOT/src/encoding/binary/binary.go:103)        SHLQ    $16, BX
        0x0022 00034 ($GOROOT/src/encoding/binary/binary.go:103)        ORQ     DX, BX
        0x0025 00037 ($GOROOT/src/encoding/binary/binary.go:103)        MOVL    SI, DX
        0x0027 00039 ($GOROOT/src/encoding/binary/binary.go:103)        SHLQ    $24, SI
        0x002b 00043 ($GOROOT/src/encoding/binary/binary.go:103)        ORQ     BX, SI
        0x002e 00046 ($GOROOT/src/encoding/binary/binary.go:104)        MOVL    4(AX), BX
        0x0031 00049 ($GOROOT/src/encoding/binary/binary.go:104)        SHLQ    $32, BX
        0x0035 00053 ($GOROOT/src/encoding/binary/binary.go:104)        ORQ     BX, SI
        0x0038 00056 (t.go:20)  SHRQ    $32, SI
        0x003c 00060 ($GOROOT/src/encoding/binary/binary.go:104)        MOVQ    8(AX), BX
        0x0040 00064 (t.go:21)  MOVQ    BX, R9
        0x0043 00067 (t.go:21)  SHLQ    $32, BX
        0x0047 00071 (t.go:21)  ORQ     SI, BX
        0x004a 00074 (t.go:24)  XCHGL   AX, AX
        0x004b 00075 ($GOROOT/src/encoding/binary/binary.go:116)        MOVQ    BX, (AX)
        0x004e 00078 (t.go:22)  SHRQ    $32, R9
        0x0052 00082 (t.go:25)  XCHGL   AX, AX
        0x0053 00083 ($GOROOT/src/encoding/binary/binary.go:112)        MOVL    R9, 8(AX)
        0x0057 00087 ($GOROOT/src/encoding/binary/binary.go:116)        MOVL    $0, 12(AX)
        0x005e 00094 ($GOROOT/src/encoding/binary/binary.go:81) SHLL    $8, DI
        0x0061 00097 ($GOROOT/src/encoding/binary/binary.go:81) ORL     DI, CX
        0x0063 00099 ($GOROOT/src/encoding/binary/binary.go:81) SHLL    $16, R8
        0x0067 00103 ($GOROOT/src/encoding/binary/binary.go:81) ORL     CX, R8
        0x006a 00106 ($GOROOT/src/encoding/binary/binary.go:81) SHLL    $24, DX
        0x006d 00109 ($GOROOT/src/encoding/binary/binary.go:81) ORL     R8, DX
        0x0070 00112 (t.go:12)  MOVL    DX, AX
        0x0072 00114 (t.go:12)  RET

GOARCH=arm64 produces byte loads and shifts too.

@dr2chase dr2chase added Performance NeedsInvestigation labels May 4, 2022
@dr2chase dr2chase added this to the Backlog milestone May 4, 2022
greatroar (Author) commented May 4, 2022

Workaround:

uint32(binary.LittleEndian.Uint64(t.buf[:]))

But when I want to read a big-endian integer from t.buf, this doesn't work anymore:

binary.BigEndian.Uint32(t.buf[:])

has the same problem as the LE version, while

uint32(binary.BigEndian.Uint64(t.buf[:]) >> 32)

becomes eight byte loads on both amd64 and arm64. The workaround that works is

uint32(bits.ReverseBytes64(binary.LittleEndian.Uint64(t.buf[:])) >> 32)
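For reference, the workarounds side by side as a runnable sketch (the function names readBE32Direct and readBE32Merged are illustrative, not from this issue):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// readBE32Direct is the straightforward version, which currently
// compiles to per-byte loads in this scenario.
func readBE32Direct(buf []byte) uint32 {
	return binary.BigEndian.Uint32(buf)
}

// readBE32Merged is the workaround: one 64-bit little-endian load,
// byte-reverse it, then shift the wanted big-endian value down.
func readBE32Merged(buf []byte) uint32 {
	return uint32(bits.ReverseBytes64(binary.LittleEndian.Uint64(buf)) >> 32)
}

func main() {
	buf := []byte{0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}
	// Both read the big-endian uint32 from the first four bytes.
	fmt.Printf("%#x %#x\n", readBE32Direct(buf), readBE32Merged(buf))
}
```

Both calls return the same value; the second form just gives the compiler a single 64-bit load to work with.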

Jorropo (Contributor) commented May 5, 2022

The issue here is that, before the lowering step, the optimizer rewrites the function so that your byte loads have multiple uses each.
[Screenshot of the SSA graph, 2022-05-05] (see how the Load ops before lowering have 2 uses each)

Our load merging rules rely on having .Uses == 1.

In other words, we will merge:

// Slightly reformatted binary.LittleEndian.Uint16 implementation
y := uint16(b[1])<<8
x := uint16(b[0])
return x | y

But not:

yByte := b[1]
y := uint16(yByte)<<8
x := uint16(b[0])
return x | y, yByte

because the second version has two uses of b[1], and your code optimizes to the second case.
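As a concrete, runnable illustration of the two shapes (a hypothetical example with made-up function names; the compiler behavior described is per the explanation above):

```go
package main

import "fmt"

// u16SingleUse: each byte load has exactly one use, so the compiler's
// load-merging rules can fold the two loads into one 16-bit load on
// little-endian targets.
func u16SingleUse(b []byte) uint16 {
	y := uint16(b[1]) << 8
	x := uint16(b[0])
	return x | y
}

// u16TwoUses: b[1] is loaded once but used twice (inside y and as a
// second return value), which defeats the .Uses == 1 check, so the
// byte loads stay separate.
func u16TwoUses(b []byte) (uint16, byte) {
	yByte := b[1]
	y := uint16(yByte) << 8
	x := uint16(b[0])
	return x | y, yByte
}

func main() {
	b := []byte{0x34, 0x12}
	v1 := u16SingleUse(b)
	v2, hi := u16TwoUses(b)
	fmt.Printf("%#x %#x %#x\n", v1, v2, hi)
}
```

Both functions compute the same 16-bit value; only the generated loads differ.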

The reason your loads have two uses to begin with is that earlier optimization passes inline shiftBuf and remove the duplicate loads from the start of the array.
In other words, it gets compiled to roughly this:

type T struct {
        buf [16]byte
}

func (t *T) readU32() uint32 {
	x0 := t.buf[0]
	x1 := t.buf[1]
	x2 := t.buf[2]
	x3 := t.buf[3]
	// InlMark
	xLow64 := uint64(x0) | uint64(x1)<<8 | uint64(x2)<<16 | uint64(x3)<<24
	x := xLow64 | uint64(binary.LittleEndian.Uint32(t.buf[4:]))<<32
	y := binary.LittleEndian.Uint64(t.buf[8:])

	shift := 8 * uint(4) // n == 4 from the inlined shiftBuf(4) call
	x >>= shift
	x |= y << ((64 - shift) % 64)
	y >>= shift

	binary.LittleEndian.PutUint64(t.buf[:], x)
	binary.LittleEndian.PutUint64(t.buf[8:], y)
	// End Inl
	xLow32 := uint32(x0) | uint32(x1)<<8 | uint32(x2)<<16 | uint32(x3)<<24
	return xLow32
}

What's unfortunate is that the compiler cannot see that xLow32 and xLow64 could be merged: the register size difference scares it off.

I'm not sure what needs to happen here.
There are several options:

  • Refactor the code to allow merging loads even with .Uses > 1. This is a dangerous path: it's rare that more loads is a good thing.
    If I had to guess, the fastest thing here is probably two overlapping loads to the beginning of the array (the code you actually wrote), one for the 64-bit value and one for the 32-bit value.
    They are overlapping, pipelinable loads, so they should hit RAM together and shouldn't be slower than a single 64-bit load.
    Note that LLVM generates this for these cases.
    But that requires more than hyper-local vision, and gc's optimizer is not really written that way (the usual compile-time vs. runtime-speed tradeoff).
  • Find a way to do a single 64-bit load and then truncate. That would require merging loads earlier, ideally before optimizing.
    For example, the SSA generator could already replace binary.*.Uint32 with a 4-byte load on architectures that support unaligned loads.
  • Find a way to merge xLow32 and xLow64.
    This would generate this code instead:
type T struct {
        buf [16]byte
}

func (t *T) readU32() uint32 {
	xLow := binary.LittleEndian.Uint32(t.buf[:])
	// InlMark
	x := uint64(xLow) | uint64(binary.LittleEndian.Uint32(t.buf[4:]))<<32 // a single shift-and-or assembles x here
	y := binary.LittleEndian.Uint64(t.buf[8:])

	shift := 8 * uint(4) // n == 4 from the inlined shiftBuf(4) call
	x >>= shift
	x |= y << ((64 - shift) % 64)
	y >>= shift

	binary.LittleEndian.PutUint64(t.buf[:], x)
	binary.LittleEndian.PutUint64(t.buf[8:], y)
	// End Inl
	return xLow
}
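For what it's worth, here is a quick runnable check (my own sketch, not from the thread) that the hand-merged form computes the same result and leaves the buffer in the same state as the issue's original readU32:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type T struct {
	buf [16]byte
}

// readU32Orig is the issue's original version: Uint32, then shiftBuf(4).
func (t *T) readU32Orig() uint32 {
	x := binary.LittleEndian.Uint32(t.buf[:])
	t.shiftBuf(4)
	return x
}

func (t *T) shiftBuf(n uint) {
	x := binary.LittleEndian.Uint64(t.buf[:])
	y := binary.LittleEndian.Uint64(t.buf[8:])

	shift := 8 * n
	x >>= shift
	x |= y << ((64 - shift) % 64)
	y >>= shift

	binary.LittleEndian.PutUint64(t.buf[:], x)
	binary.LittleEndian.PutUint64(t.buf[8:], y)
}

// readU32Merged is the hand-merged form: the low 32 bits are loaded once
// and reused both as the return value and as part of the 64-bit word.
func (t *T) readU32Merged() uint32 {
	xLow := binary.LittleEndian.Uint32(t.buf[:])
	x := uint64(xLow) | uint64(binary.LittleEndian.Uint32(t.buf[4:]))<<32
	y := binary.LittleEndian.Uint64(t.buf[8:])

	shift := uint(32) // 8 * 4
	x >>= shift
	x |= y << ((64 - shift) % 64)
	y >>= shift

	binary.LittleEndian.PutUint64(t.buf[:], x)
	binary.LittleEndian.PutUint64(t.buf[8:], y)
	return xLow
}

func main() {
	var a, b T
	for i := range a.buf {
		a.buf[i] = byte(i + 1)
		b.buf[i] = byte(i + 1)
	}
	v1, v2 := a.readU32Orig(), b.readU32Merged()
	fmt.Println(v1 == v2, a.buf == b.buf)
}
```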
