cmd/compile: non-escaping pointer operations block register lifting of values #73589

@mcy

$ go version
go version go1.24.2 darwin/arm64

Consider the following program:

package x

type t struct {
    x0, x1, x2, x3 uint64
}

//go:nosplit
func (t *t) inc() {
    t.x0 += t.x1
}

//go:nosplit
func run(t t, n int, f func(t) t) {
    for {
        t.inc()
        t = f(t)

        n--
        if n == 0 {
            break
        }
    }
}

The function run produces the following abridged assembly on x86:

TEXT    command-line-arguments.run(SB), NOSPLIT|ABIInternal, $48-48
    PUSHQ   BP
    MOVQ    SP, BP
    SUBQ    $40, SP
    MOVQ    R8, 96(SP)
    MOVQ    AX, 56(SP)
    MOVQ    BX, 64(SP)
    MOVQ    CX, 72(SP)
    MOVQ    DI, 80(SP)
    JMP     2f
1:
    DECQ    SI
    MOVQ    96(SP), R8
2:
    MOVQ    SI, 32(SP)
    MOVQ    56(SP), AX  ; *
    ADDQ    64(SP), AX  ; .inc (inlined) 
    MOVQ    AX, 56(SP)  ; (dead)
    MOVQ    (R8), SI
    MOVQ    64(SP), BX  ; *
    MOVQ    72(SP), CX  ; *
    MOVQ    80(SP), DI  ; *
    MOVQ    R8, DX
    CALL    SI
    MOVQ    AX, 56(SP)  ; *
    MOVQ    BX, 64(SP)  ; *
    MOVQ    CX, 72(SP)  ; *
    MOVQ    DI, 80(SP)  ; *
    MOVQ    32(SP), SI
    CMPQ    SI, $1
    JNE     1b
    ADDQ    $40, SP
    POPQ    BP
    RET

Notice that around the call to the func (the CALL SI instruction, i.e. call rsi), the whole value of t is loaded from the stack and then spilled back to it, despite the fact that at all points in this function:

  • That stack region and the argument registers rax, rbx, rcx, and rdi (AX, BX, CX, and DI in the listing) hold the same value, except partway through the four spill instructions, of course.
  • The address 56 + rsp is never loaded into a register (i.e., the pointer is never materialized).

However, this goes away if I change inc to take and return its receiver by value:

TEXT    command-line-arguments.run(SB), NOSPLIT|ABIInternal, $48-48
    PUSHQ   BP
    MOVQ    SP, BP
    SUBQ    $40, SP
    MOVQ    R8, 96(SP)
    JMP     2f
1:
    DECQ    SI
    MOVQ    96(SP), R8
2:
    MOVQ    SI, 32(SP)
    MOVQ    (R8), SI
    ADDQ    BX, AX
    MOVQ    R8, DX
    CALL    SI
    MOVQ    32(SP), SI
    CMPQ    SI, $1
    JNE     1b
    ADDQ    $40, SP
    POPQ    BP
    RET

So, it seems that despite the fact the function has been inlined, Go is not able to lift the pointer operations up into registers. The same thing happens if I hand-inline inc, creating an explicit pointer each iteration of the loop.

This is very surprising, because this sort of lifting is a basic cleanup pass in LLVM: the mem2reg pass lifts non-escaping pointers to stack allocas into SSA registers, even in the face of reads and writes across control flow edges.

When I caught this in some high-throughput code I'm working on and eliminated all implicit pointer creation, microbenchmark time dropped from 153191 ns/op to 143013 ns/op, an improvement of roughly 7%. This was the last remaining barrier to making my code never spill any registers across calls.
