Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: encoding/binary.PutUint32 generates unaligned data storage on arm64 arch #59856

Closed
mattypiper opened this issue Apr 27, 2023 · 8 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mattypiper
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.20.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/matt/Library/Caches/go-build"
GOENV="/Users/matt/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/matt/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/matt/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.20.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.20.3/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.20.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="cc"
CXX="c++"
CGO_ENABLED="1"
GOMOD=""
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/c6/500c7t352v54d_jjsr5vf2gw0000gn/T/go-build1931171172=/tmp/go-build -gno-record-gcc-switches -fno-common"
GOROOT/bin/go version: go version go1.20.3 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.20.3
uname -v: Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
ProductName:	macOS
ProductVersion:	12.6
BuildVersion:	21G115
lldb --version: lldb-1400.0.38.13
Apple Swift version 5.7.1 (swiftlang-5.7.1.135.3 clang-1400.0.29.51)
gdb --version: GNU gdb (GDB) 13.1

What did you do?

The following code generates some funky assembly on ARM64.

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	b := make([]byte, 4)
	x := uint8(0xff)
	binary.LittleEndian.PutUint32(b, uint32(x))
	fmt.Printf("%x\n", b)
}

What did you expect to see?

I'm compiling the executables with

GOARCH=amd64 GOOS=linux go build -o putuint32_bug_amd64 putuint32_bug.go
GOARCH=arm GOOS=linux go build -o putuint32_bug_arm putuint32_bug.go
GOARCH=arm64 GOOS=linux go build -o putuint32_bug_arm64 putuint32_bug.go

On x86_64 (amd64) and arm architectures, I get reasonable results.

In amd64, we have a movl $0xff,(%rax) and movups %xmm15,0x38(%rsp) (see line 4828cd below). This correctly sets b, producing output ff000000. The instructions are concise.

$ objdump -d putuint32_bug_amd64 | grep -A25 '<main\.main>:'
00000000004828a0 <main.main>:
  4828a0:	49 3b 66 10          	cmp    0x10(%r14),%rsp
  4828a4:	0f 86 85 00 00 00    	jbe    48292f <main.main+0x8f>
  4828aa:	48 83 ec 50          	sub    $0x50,%rsp
  4828ae:	48 89 6c 24 48       	mov    %rbp,0x48(%rsp)
  4828b3:	48 8d 6c 24 48       	lea    0x48(%rsp),%rbp
  4828b8:	48 8d 05 61 7c 00 00 	lea    0x7c61(%rip),%rax        # 48a520 <type:*+0x7520>
  4828bf:	bb 04 00 00 00       	mov    $0x4,%ebx
  4828c4:	48 89 d9             	mov    %rbx,%rcx
  4828c7:	e8 f4 36 fc ff       	call   445fc0 <runtime.makeslice>
  4828cc:	90                   	nop
  4828cd:	c7 00 ff 00 00 00    	movl   $0xff,(%rax)
  4828d3:	44 0f 11 7c 24 38    	movups %xmm15,0x38(%rsp)
  4828d9:	bb 04 00 00 00       	mov    $0x4,%ebx
  4828de:	48 89 d9             	mov    %rbx,%rcx
  4828e1:	e8 da 73 f8 ff       	call   409cc0 <runtime.convTslice>
  4828e6:	48 8d 15 f3 6d 00 00 	lea    0x6df3(%rip),%rdx        # 4896e0 <type:*+0x66e0>
  4828ed:	48 89 54 24 38       	mov    %rdx,0x38(%rsp)
  4828f2:	48 89 44 24 40       	mov    %rax,0x40(%rsp)
  4828f7:	48 8b 1d 52 c6 0a 00 	mov    0xac652(%rip),%rbx        # 52ef50 <os.Stdout>
  4828fe:	48 8d 05 13 60 03 00 	lea    0x36013(%rip),%rax        # 4b8918 <go:itab.*os.File,io.Writer>
  482905:	48 8d 0d b1 7b 01 00 	lea    0x17bb1(%rip),%rcx        # 49a4bd <go:string.*+0xa5>
  48290c:	bf 03 00 00 00       	mov    $0x3,%edi
  482911:	48 8d 74 24 38       	lea    0x38(%rsp),%rsi
  482916:	41 b8 01 00 00 00    	mov    $0x1,%r8d
  48291c:	4d 89 c1             	mov    %r8,%r9

On 32-bit arm, it's a similar story. mvn r1, #0 and strb r1, [r0] sets the first byte to 0xff, followed by three remaining strb r1 instructions to offsets from r0. See line a2d18 below. This works fine.

$ objdump -d putuint32_bug_arm | grep -A25 '<main\.main>:'
000a2ce8 <main.main>:
   a2ce8:	e59a1008 	ldr	r1, [sl, #8]
   a2cec:	e15d0001 	cmp	sp, r1
   a2cf0:	9a00002b 	bls	a2da4 <main.main+0xbc>
   a2cf4:	e52de034 	str	lr, [sp, #-52]!	; 0xffffffcc
   a2cf8:	e59f00b4 	ldr	r0, [pc, #180]	; a2db4 <main.main+0xcc>
   a2cfc:	e58d0004 	str	r0, [sp, #4]
   a2d00:	e3a00004 	mov	r0, #4
   a2d04:	e58d0008 	str	r0, [sp, #8]
   a2d08:	e58d000c 	str	r0, [sp, #12]
   a2d0c:	ebfef5e0 	bl	60494 <runtime.makeslice>
   a2d10:	e59d0010 	ldr	r0, [sp, #16]
   a2d14:	00000000 	andeq	r0, r0, r0
   a2d18:	e3e01000 	mvn	r1, #0
   a2d1c:	e5c01000 	strb	r1, [r0]
   a2d20:	e3a01000 	mov	r1, #0
   a2d24:	e5c01001 	strb	r1, [r0, #1]
   a2d28:	e5c01002 	strb	r1, [r0, #2]
   a2d2c:	e5c01003 	strb	r1, [r0, #3]
   a2d30:	e3a01000 	mov	r1, #0
   a2d34:	e58d102c 	str	r1, [sp, #44]	; 0x2c
   a2d38:	e3a01000 	mov	r1, #0
   a2d3c:	e58d1030 	str	r1, [sp, #48]	; 0x30
   a2d40:	e58d0004 	str	r0, [sp, #4]
   a2d44:	e3a00004 	mov	r0, #4
   a2d48:	e58d0008 	str	r0, [sp, #8]

What did you see instead?

On aarch64/arm64 architecture, something strange happens. It loads x3 with -1 , but then the strb/sturh/strb instructions produce a situation where there is a 1-2-1 byte store sequence. See line 8e880 below. This does correctly set the value, but it does so in a way that likely causes an unaligned data store. The system I'm working with generates a SIGBUS for the unaligned multi-byte store. Thus I cannot use the encoding/binary PutUint32 function for writing directly to mmap'd sections, at least not without writing to a temporary first and then using copy to go from temporary to mmap'd memory.

4 single byte strb instructions would've worked fine, or a 4-byte str. But the construct below is problematic due to the unaligned access.

% objdump -d putuint32_bug_arm64 | grep -A25 '<main\.main>:'
000000000008e850 <main.main>:
   8e850:	f9400b90 	ldr	x16, [x28, #16]
   8e854:	eb3063ff 	cmp	sp, x16
   8e858:	54000489 	b.ls	8e8e8 <main.main+0x98>  // b.plast
   8e85c:	f81a0ffe 	str	x30, [sp, #-96]!
   8e860:	f81f83fd 	stur	x29, [sp, #-8]
   8e864:	d10023fd 	sub	x29, sp, #0x8
   8e868:	b0000040 	adrp	x0, 97000 <type:*+0x7000>
   8e86c:	91148000 	add	x0, x0, #0x520
   8e870:	b27e03e1 	orr	x1, xzr, #0x4
   8e874:	aa0103e2 	mov	x2, x1
   8e878:	97ff1982 	bl	54e80 <runtime.makeslice>
   8e87c:	d503201f 	nop
   8e880:	92800003 	mov	x3, #0xffffffffffffffff    	// #-1
   8e884:	39000003 	strb	w3, [x0]
   8e888:	7800101f 	sturh	wzr, [x0, #1]
   8e88c:	39000c1f 	strb	wzr, [x0, #3]
   8e890:	a904ffff 	stp	xzr, xzr, [sp, #72]
   8e894:	b27e03e1 	orr	x1, xzr, #0x4
   8e898:	aa0103e2 	mov	x2, x1
   8e89c:	97fe29bd 	bl	18f90 <runtime.convTslice>
   8e8a0:	90000043 	adrp	x3, 96000 <type:*+0x6000>
   8e8a4:	911b8063 	add	x3, x3, #0x6e0
   8e8a8:	f90027e3 	str	x3, [sp, #72]
   8e8ac:	f9002be0 	str	x0, [sp, #80]
   8e8b0:	b00005db 	adrp	x27, 147000 <runtime.itabTableInit+0x1a0>

This could be an unintended optimization, since it is generally correct code, just a little awkward for something that should "PutUint32" and problematic/illegal for certain systems.

I would research and suggest a fix myself, but I am unfamiliar with how to find the assembly code generator for low level issues like this.

@ianlancetaylor ianlancetaylor changed the title encoding/binary.PutUint32 generates unaligned data storage on arm64 arch cmd/compile: encoding/binary.PutUint32 generates unaligned data storage on arm64 arch Apr 27, 2023
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 27, 2023
@ianlancetaylor ianlancetaylor added this to the Go1.21 milestone Apr 27, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 27, 2023
@ianlancetaylor
Copy link
Contributor

CC @randall77

@randall77
Copy link
Contributor

randall77 commented Apr 27, 2023

The system I'm working with generates a SIGBUS for the unaligned multi-byte store.

Unaligned accesses are definitely allowed on arm64. Is this a memory-mapped device or something?

I don't see how we support enforcing aligned accesses without turning all read and write combining off. For instance,

func f(p *[4]byte) {
    p[1] = 1
    p[2] = 2
}

We want to implement that with a 16-bit write of the constant 0x201. But that write might very well be unaligned. (The general problem has nothing to do with encoding/binary specifically, although that is a package in which this situation is likely to appear.)

This definitely requires some low-level support. I don't think there's any way for us to provide that by default. Using sync/atomic or assembly seem like the two obvious workarounds.

@erifan
Copy link

erifan commented Apr 27, 2023

I can confirm that arm64 allows unaligned access, the ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. Most unaligned accesses have no performance penalties. I'm also curious what kind of device this is.

@randall77
Copy link
Contributor

Note that the 1-2-1 store pattern is somewhat inefficient and I think should be fixed by https://go-review.googlesource.com/c/go/+/478475

@erifan
Copy link

erifan commented Apr 27, 2023

I am wondering if we can intrinsify this function into a 32bit store. However, the write address may still be unaligned, but we can check it before calling this function if alignment is needed.

@renthraysk
Copy link

@mattypiper
Copy link
Author

Apologies for confusion, absolutely arm64 supports unaligned access. The SIGBUS issue I experience is with a memory mapped device which does not support the unaligned access. I tried to carefully word the issue by saying things like "problematic" for "certain systems". And I understand how coalescing multiple byte writes into a single multi-byte instruction makes sense for performance. It just seemed strange for this particular go code to generate that 1-2-1 storage pattern. I would think that passing uint32(byte(0xff)) is enough to tell the compiler to generate a 4-byte store of 0x000000ff. Anyway, thanks for the quick response and consideration! I'll see if I can test the very recent patch from https://go-review.googlesource.com/c/go/+/478475.

@mattypiper
Copy link
Author

mattypiper commented May 1, 2023

@randall77 I can confirm that the 1-2-1 pattern is fixed in the master branch. Thanks for the link, good learning opportunity for me. There is now a single 32-bit store: str w3, [x0] in my example code. (Full assembly listing is pasted below).

And before I close the topic, I figured I'd try the 64-bit version as well. The code binary.LittleEndian.PutUint64(b, uint64(x)) produces a comparable and efficient orr x3, xzr, #0xff; str x3, [x0].

% objdump -d putuint32_bug_arm64 | grep -A25 '<main\.main>:'
000000000008b3c0 <main.main>:
   8b3c0:	f9400b90 	ldr	x16, [x28, #16]
   8b3c4:	eb3063ff 	cmp	sp, x16
   8b3c8:	54000449 	b.ls	8b450 <main.main+0x90>  // b.plast
   8b3cc:	f81a0ffe 	str	x30, [sp, #-96]!
   8b3d0:	f81f83fd 	stur	x29, [sp, #-8]
   8b3d4:	d10023fd 	sub	x29, sp, #0x8
   8b3d8:	90000060 	adrp	x0, 97000 <type:*+0x7000>
   8b3dc:	91058000 	add	x0, x0, #0x160
   8b3e0:	b27e03e1 	orr	x1, xzr, #0x4
   8b3e4:	aa0103e2 	mov	x2, x1
   8b3e8:	97ff2916 	bl	55840 <runtime.makeslice>
   8b3ec:	d503201f 	nop
   8b3f0:	b2401fe3 	orr	x3, xzr, #0xff
   8b3f4:	b9000003 	str	w3, [x0]
   8b3f8:	a904ffff 	stp	xzr, xzr, [sp, #72]
   8b3fc:	b27e03e1 	orr	x1, xzr, #0x4
   8b400:	aa0103e2 	mov	x2, x1
   8b404:	97fe344b 	bl	18530 <runtime.convTslice>
   8b408:	f0000043 	adrp	x3, 96000 <type:*+0x6000>
   8b40c:	910b8063 	add	x3, x3, #0x2e0
   8b410:	f90027e3 	str	x3, [sp, #72]
   8b414:	f9002be0 	str	x0, [sp, #80]
   8b418:	d000057b 	adrp	x27, 139000 <runtime.itabTableInit+0xc00>

@golang golang locked and limited conversation to collaborators Apr 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants