What version of Go are you using (go version)?
> go version
go version devel go1.22-8488309192 Fri Jul 28 02:46:31 2023 +0000 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
> go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/hugo/.cache/go-build'
GOENV='/home/hugo/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/hugo/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/hugo/go'
GOPRIVATE=''
GOPROXY='direct'
GOROOT='/home/hugo/k/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/hugo/k/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.22-8488309192 Fri Jul 28 02:46:31 2023 +0000'
GCCGO='gccgo'
GOAMD64='v3'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/hugo/k/go/src/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3945136505=/tmp/go-build -gno-record-gcc-switches'
What did you do?
Compile:
package a
//go:noinline
func f(s string) { /* do something */ }
func A(s []string) {
for _, v := range s {
f(v)
}
}
> GOSSAFUNC=A go build a.go
What did you expect to see?
I expect the CALL to require saving 2 registers into the stack.
What did you see instead?
3 registers are saved
00000 (6) TEXT command-line-arguments.A(SB), ABIInternal
00001 (6) FUNCDATA $0, gclocals·m/6RUmNv6NBhMUL8eleFFA==(SB)
00002 (6) FUNCDATA $1, gclocals·VtCL4RdUwCqwXEPeyJllRA==(SB)
00003 (6) FUNCDATA $5, command-line-arguments.A.arginfo1(SB)
00004 (6) FUNCDATA $6, command-line-arguments.A.argliveinfo(SB)
b1 00005 (6) PCDATA $3, $1
v23 00006 (7) MOVQ BX, command-line-arguments.s+8(SP) ; Here first save (in the loop init)
v23 00007 (7) PCDATA $3, $2
v5 00008 (7) XORL CX, CX
b1 00009 (7) JMP 22
v19 00010 (7) MOVQ CX, command-line-arguments..autotmp_9-16(SP) ; Here second save (inside the body)
v18 00011 (7) MOVQ AX, command-line-arguments..autotmp_10-8(SP) ; Third save (inside the body)
v27 00012 (7) MOVQ (AX), CX
v26 00013 (7) MOVQ 8(AX), BX
v22 00014 (+8) MOVQ CX, AX
v20 00015 (+8) PCDATA $1, $1
v20 00016 (+8) CALL command-line-arguments.f(SB)
v13 00017 (+7) MOVQ command-line-arguments..autotmp_10-8(SP), AX ; First restore (inside the body)
v30 00018 (7) ADDQ $16, AX
v29 00019 (7) MOVQ command-line-arguments..autotmp_9-16(SP), CX ; Second restore (inside the body)
v24 00020 (7) INCQ CX
v17 00021 (7) MOVQ command-line-arguments.s+8(SP), BX ; Third restore (inside the body)
v32 00022 (+7) CMPQ BX, CX
b2 00023 (7) JGT 10
b5 00024 (10) PCDATA $1, $-1
b5 00025 (10) RET
00026 (?) END
Potential Fixes
C++ like iteration
This happen due to how range rewrites loop, my function from before is rewritten like this:
func A(s []string) {
var v string
end := slice.len
i := 0
ptr := uintptr(slice.data)
for i < end {
ptrInsideTheLoop := unsafe.Pointer(ptr)
v = *(*string)(ptrInsideTheLoop)
f(v)
ptr = uintptr(ptrInsideTheLoop) + unsafe.Sizeof(v)
i++
}
}
3 registers are needed i, end and start.
I wrote a patch that do C++ style iteration (comparing pointers)
func A(s []string) {
var v string
end := uintptr(s.data) + uintptr(s.len) * unsafe.Sizeof(v)
ptr := uintptr(s.data)
for ptr < end {
ptrInsideTheLoop := unsafe.Pointer(ptr)
v = *(*string)(ptrInsideTheLoop)
f(v)
ptr = uintptr(ptrInsideTheLoop) + unsafe.Sizeof(a)
}
}
However this does not work because the loop body (here calling f) may cause a stack resize, and then end (which is a uintptr) is not fixed if it points to stack, while ptrInsideTheLoop is so their relationship gets all messed up.
We can't create an endInsideTheLoop which is an unsafe.Pointer like for ptr because end is one past the last valid index of the slice, which cause the GC to panic if it is scanned, we would need a pointer that is scanned by the stack relocator but not by the GC which AFAIK is not compatible with the runtime right now.
Downward count
I have an other patch (https://go-review.googlesource.com/c/go/+/512935) which later down the line (in the SSA) rewrites the 3 register loop into a 2 register one like this:
func A(s []string) {
var v string
i := slice.len
ptr := uintptr(slice.data)
for 0 < i {
ptrInsideTheLoop := unsafe.Pointer(ptr)
v = *(*string)(ptrInsideTheLoop)
f(v)
ptr = uintptr(ptrInsideTheLoop) + unsafe.Sizeof(v)
i--
}
}
It removes end by inverting the loop, instead of starting at a constant and counting up towards a variable (which require keeping alive the index and the bound), it start at the bound and count down until zero is reached.
It is slightly less efficient than the C++ like iteration because it does 2 math operation per iteration, adding to the running pointer and decrementing the index, however theses two instructions are not dependent on each other so they execute in parallel unless the loop body only leaves one free port when the loop's tail is reached, the huge amount of loops (such as the one in my example) will do memory anyway and wont be bottleneck by the increments, it also make the code slightly bigger such that maybe it crosses a cacheline boundry, however theses cases are rare and have minor impacts (it is always better than the 3 register form).
On the other side if you have a loop that you do not enter at a comparable rate at which you do iterations it is better, because it does not have an addition in the loop preheader (just load and compare).
Given the negligeable impacts of the 2 register downward count vs the C++ like range, I think this is a better solution than changing the runtime to have support for pointers that are handled by the stack rewriter but not the GC.
Note: it fixes all loops when an index is used to count from 0 to some bound but the index is not used, not just range loops where only the value is used, but this is a very popular place for this optimization to happen.
What version of Go are you using (
go version)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env)?go envOutputWhat did you do?
Compile:
What did you expect to see?
I expect the
CALLto require saving 2 registers into the stack.What did you see instead?
3 registers are saved
Potential Fixes
C++ like iteration
This happen due to how
rangerewrites loop, my function from before is rewritten like this:3 registers are needed
i,endandstart.I wrote a patch that do C++ style iteration (comparing pointers)
However this does not work because the loop body (here calling
f) may cause a stack resize, and thenend(which is auintptr) is not fixed if it points to stack, whileptrInsideTheLoopis so their relationship gets all messed up.We can't create an
endInsideTheLoopwhich is anunsafe.Pointerlike forptrbecause end is one past the last valid index of the slice, which cause the GC to panic if it is scanned, we would need a pointer that is scanned by the stack relocator but not by the GC which AFAIK is not compatible with the runtime right now.Downward count
I have an other patch (https://go-review.googlesource.com/c/go/+/512935) which later down the line (in the SSA) rewrites the 3 register loop into a 2 register one like this:
It removes
endby inverting the loop, instead of starting at a constant and counting up towards a variable (which require keeping alive the index and the bound), it start at the bound and count down until zero is reached.It is slightly less efficient than the C++ like iteration because it does 2 math operation per iteration, adding to the running pointer and decrementing the index, however theses two instructions are not dependent on each other so they execute in parallel unless the loop body only leaves one free port when the loop's tail is reached, the huge amount of loops (such as the one in my example) will do memory anyway and wont be bottleneck by the increments, it also make the code slightly bigger such that maybe it crosses a cacheline boundry, however theses cases are rare and have minor impacts (it is always better than the 3 register form).
On the other side if you have a loop that you do not enter at a comparable rate at which you do iterations it is better, because it does not have an addition in the loop preheader (just load and compare).
Given the negligeable impacts of the 2 register downward count vs the C++ like range, I think this is a better solution than changing the runtime to have support for pointers that are handled by the stack rewriter but not the GC.
Note: it fixes all loops when an index is used to count from 0 to some bound but the index is not used, not just range loops where only the value is used, but this is a very popular place for this optimization to happen.