Description
$ go version
go version go1.24.2 darwin/arm64
Consider the following program:
package x

type t struct {
	x0, x1, x2, x3 uint64
}

//go:nosplit
func (t *t) inc() {
	t.x0 += t.x1
}

//go:nosplit
func run(t t, n int, f func(t) t) {
	for {
		t.inc()
		t = f(t)
		n--
		if n == 0 {
			break
		}
	}
}

The function run produces the following abridged assembly on x86:
TEXT command-line-arguments.run(SB), NOSPLIT|ABIInternal, $48-48
PUSHQ BP
MOVQ SP, BP
SUBQ $40, SP
MOVQ R8, 96(SP)
MOVQ AX, 56(SP)
MOVQ BX, 64(SP)
MOVQ CX, 72(SP)
MOVQ DI, 80(SP)
JMP 2f
1:
DECQ SI
MOVQ 96(SP), R8
2:
MOVQ SI, 32(SP)
MOVQ 56(SP), AX ; *
ADDQ 64(SP), AX ; .inc (inlined)
MOVQ AX, 56(SP) ; (dead)
MOVQ (R8), SI
MOVQ 64(SP), BX ; *
MOVQ 72(SP), CX ; *
MOVQ 80(SP), DI ; *
MOVQ R8, DX
CALL SI
MOVQ AX, 56(SP) ; *
MOVQ BX, 64(SP) ; *
MOVQ CX, 72(SP) ; *
MOVQ DI, 80(SP) ; *
MOVQ 32(SP), SI
CMPQ SI, $1
JNE 1b
ADDQ $40, SP
POPQ BP
RET

Notice that around the call to the func (the call rsi instruction), the whole value of t is loaded and then spilled back to the stack, despite the fact that at all points in this function:
- That stack region and the argument registers rax, rbx, rcx, and rdi have the same value, except across the four spill instructions, of course.
- The value 56 + rsp is never loaded into a register (i.e., the pointer is never materialized).
However, this goes away if I change inc to take and return its receiver by value:
TEXT command-line-arguments.run(SB), NOSPLIT|ABIInternal, $48-48
PUSHQ BP
MOVQ SP, BP
SUBQ $40, SP
MOVQ R8, 96(SP)
JMP 2f
1:
DECQ SI
MOVQ 96(SP), R8
2:
MOVQ SI, 32(SP)
MOVQ (R8), SI
ADDQ BX, AX
MOVQ R8, DX
CALL SI
MOVQ 32(SP), SI
CMPQ SI, $1
JNE 1b
ADDQ $40, SP
POPQ BP
RET

So it seems that even though inc has been inlined, the Go compiler is not able to lift the pointer operations into registers. The same thing happens if I hand-inline inc, creating an explicit pointer in each iteration of the loop.
This is very surprising, because this sort of lifting is a basic cleanup pass in LLVM: the mem2reg pass lifts non-escaping pointers to stack allocas into SSA registers, even in the face of reads and writes across control flow edges.
When I caught this inside of some high-throughput code I'm working on and eliminated all implicit pointer creation, I saw a significant jump in microbenchmark performance, from 153191 ns/op to 143013 ns/op, a 7% performance improvement (this was the remaining barrier to making my code never spill any registers across calls).
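For reference, here is a minimal sketch of the two variants side by side, so the generated code can be compared with something like go build -gcflags=-S. The names incPtr, incVal, runPtr, runVal, step, and compare are hypothetical and not from the original program; the //go:nosplit directives are omitted, and the run functions return their final value (the original run returns nothing) only so the results can be checked. Whether the by-value form actually avoids the spills may depend on the compiler version.

```go
package main

import "fmt"

// t mirrors the struct from the report.
type t struct {
	x0, x1, x2, x3 uint64
}

// incPtr is the pointer-receiver form from the report (named inc there).
func (p *t) incPtr() {
	p.x0 += p.x1
}

// incVal is the by-value form: it receives a copy and returns the updated copy.
func (v t) incVal() t {
	v.x0 += v.x1
	return v
}

// runPtr follows the loop shape of the original run. Calling incPtr
// implicitly takes the address of v. n must be at least 1.
func runPtr(v t, n int, f func(t) t) t {
	for {
		v.incPtr()
		v = f(v)
		n--
		if n == 0 {
			break
		}
	}
	return v
}

// runVal is the same loop with no pointer to v ever created.
func runVal(v t, n int, f func(t) t) t {
	for {
		v = v.incVal()
		v = f(v)
		n--
		if n == 0 {
			break
		}
	}
	return v
}

// step is a hypothetical callback used only to exercise the loop.
func step(v t) t {
	v.x1++
	return v
}

// compare runs both variants on the same input and returns their x0 fields.
func compare(n int) (ptr, val uint64) {
	start := t{x0: 1, x1: 2}
	return runPtr(start, n, step).x0, runVal(start, n, step).x0
}

func main() {
	p, v := compare(3)
	fmt.Println(p, v, p == v)
}
```

Both loops compute the same result; the only source-level difference is whether the address of v is taken, which is the condition the report identifies as triggering the extra loads and stores.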