Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: ranging exclusively values over slices and arrays that are not cheaply indexable keeps alive 3 registers instead of 2 #61629

Closed
Jorropo opened this issue Jul 28, 2023 · 1 comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@Jorropo
Copy link
Member

Jorropo commented Jul 28, 2023

What version of Go are you using (go version)?

> go version
go version devel go1.22-8488309192 Fri Jul 28 02:46:31 2023 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
> go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/hugo/.cache/go-build'
GOENV='/home/hugo/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/hugo/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/hugo/go'
GOPRIVATE=''
GOPROXY='direct'
GOROOT='/home/hugo/k/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/hugo/k/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.22-8488309192 Fri Jul 28 02:46:31 2023 +0000'
GCCGO='gccgo'
GOAMD64='v3'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/hugo/k/go/src/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3945136505=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Compile:

package a

//go:noinline
func f(s string) { /* do something */ }

func A(s []string) {
 for _, v := range s {
  f(v)
 }
}
> GOSSAFUNC=A go build a.go

What did you expect to see?

I expect the CALL to require saving 2 registers into the stack.

What did you see instead?

3 registers are saved

    00000 (6) TEXT command-line-arguments.A(SB), ABIInternal
    00001 (6) FUNCDATA $0, gclocals·m/6RUmNv6NBhMUL8eleFFA==(SB)
    00002 (6) FUNCDATA $1, gclocals·VtCL4RdUwCqwXEPeyJllRA==(SB)
    00003 (6) FUNCDATA $5, command-line-arguments.A.arginfo1(SB)
    00004 (6) FUNCDATA $6, command-line-arguments.A.argliveinfo(SB)
b1  00005 (6) PCDATA $3, $1
v23 00006 (7) MOVQ BX, command-line-arguments.s+8(SP) ; Here first save (in the loop init)
v23 00007 (7) PCDATA $3, $2
v5  00008 (7) XORL CX, CX
b1  00009 (7) JMP 22
v19 00010 (7) MOVQ CX, command-line-arguments..autotmp_9-16(SP) ; Here second save (inside the body)
v18 00011 (7) MOVQ AX, command-line-arguments..autotmp_10-8(SP) ; Third save (inside the body)
v27 00012 (7) MOVQ (AX), CX
v26 00013 (7) MOVQ 8(AX), BX
v22 00014 (+8) MOVQ CX, AX
v20 00015 (+8) PCDATA $1, $1
v20 00016 (+8) CALL command-line-arguments.f(SB)
v13 00017 (+7) MOVQ command-line-arguments..autotmp_10-8(SP), AX ; First restore (inside the body)
v30 00018 (7) ADDQ $16, AX
v29 00019 (7) MOVQ command-line-arguments..autotmp_9-16(SP), CX ; Second restore (inside the body)
v24 00020 (7) INCQ CX
v17 00021 (7) MOVQ command-line-arguments.s+8(SP), BX ; Third restore (inside the body)
v32 00022 (+7) CMPQ BX, CX
b2  00023 (7) JGT 10
b5  00024 (10) PCDATA $1, $-1
b5  00025 (10) RET
    00026 (?) END

Potential Fixes

C++ like iteration

This happen due to how range rewrites loop, my function from before is rewritten like this:

func A(s []string) {
 var v string
 end := slice.len
 i := 0
 ptr := uintptr(slice.data)
 for i < end {
  ptrInsideTheLoop := unsafe.Pointer(ptr)
  v = *(*string)(ptrInsideTheLoop)
  f(v)
  ptr = uintptr(ptrInsideTheLoop) + unsafe.Sizeof(v)
  i++
 }
} 

3 registers are needed i, end and start.
I wrote a patch that do C++ style iteration (comparing pointers)

func A(s []string) {
 var v string
 end := uintptr(s.data) + uintptr(s.len) * unsafe.Sizeof(v)
 ptr := uintptr(s.data)
 for ptr < end {
  ptrInsideTheLoop := unsafe.Pointer(ptr)
  v = *(*string)(ptrInsideTheLoop)
  f(v)
  ptr = uintptr(ptrInsideTheLoop) + unsafe.Sizeof(a)
 }
} 

However this does not work because the loop body (here calling f) may cause a stack resize, and then end (which is a uintptr) is not fixed if it points to stack, while ptrInsideTheLoop is so their relationship gets all messed up.
We can't create an endInsideTheLoop which is an unsafe.Pointer like for ptr because end is one past the last valid index of the slice, which cause the GC to panic if it is scanned, we would need a pointer that is scanned by the stack relocator but not by the GC which AFAIK is not compatible with the runtime right now.

Downward count

I have an other patch (https://go-review.googlesource.com/c/go/+/512935) which later down the line (in the SSA) rewrites the 3 register loop into a 2 register one like this:

func A(s []string) {
 var v string
 i := slice.len
 ptr := uintptr(slice.data)
 for 0 < i {
  ptrInsideTheLoop := unsafe.Pointer(ptr)
  v = *(*string)(ptrInsideTheLoop)
  f(v)
  ptr = uintptr(ptrInsideTheLoop) + unsafe.Sizeof(v)
  i--
 }
} 

It removes end by inverting the loop, instead of starting at a constant and counting up towards a variable (which require keeping alive the index and the bound), it start at the bound and count down until zero is reached.
It is slightly less efficient than the C++ like iteration because it does 2 math operation per iteration, adding to the running pointer and decrementing the index, however theses two instructions are not dependent on each other so they execute in parallel unless the loop body only leaves one free port when the loop's tail is reached, the huge amount of loops (such as the one in my example) will do memory anyway and wont be bottleneck by the increments, it also make the code slightly bigger such that maybe it crosses a cacheline boundry, however theses cases are rare and have minor impacts (it is always better than the 3 register form).
On the other side if you have a loop that you do not enter at a comparable rate at which you do iterations it is better, because it does not have an addition in the loop preheader (just load and compare).

Given the negligeable impacts of the 2 register downward count vs the C++ like range, I think this is a better solution than changing the runtime to have support for pointers that are handled by the stack rewriter but not the GC.

Note: it fixes all loops when an index is used to count from 0 to some bound but the index is not used, not just range loops where only the value is used, but this is a very popular place for this optimization to happen.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 28, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/512935 mentions this issue: cmd/compile: try to rewrite loops to count down

@randall77 randall77 added this to the Unplanned milestone Jul 28, 2023
@Jorropo Jorropo changed the title cmd/compile: ranging over slices and arrays that are not cheaply indexable keeps alive 3 registers instead of 2 cmd/compile: ranging exclusively values over slices and arrays that are not cheaply indexable keeps alive 3 registers instead of 2 Jul 29, 2023
@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

4 participants