Skip to content

cmd/compile: ARM64 CSE flags generation between CMP 0$ and instruction already setting the flags according to the result #74878

@Tabaie

Description

@Tabaie

Consider the following Montgomery Reduction implementation:

func MontReduce(v uint64) uint32 {
	const (
		R    = 1 << 32
		qInv = 2164260865
	)

	m := uint32(v) * qInv
	t, borrow := bits.Sub64(v, uint64(m)*q, 0)

	res := uint32(t / R)
	if borrow != 0 {
		res += q
	}
	return res
}

Using objdump I get the following assembly (the comments are my own):

    MOVD $-65535, R1            // low part of qInv
    MOVK $(33024<<16), R1       // high part of qInv. Now R1 = qInv
    MULW R1, R0, R1             // R0 = v, R1 = m = v * qInv
    MOVD $1, R2
    MOVK $(32512<<16), R2       // R2 = q
    MUL R2, R1, R1              // R1 = m*q
    SUBS R1, R0, R1             // R1 = t = v - m*q
    NGC ZR, R3                  // R3 = -borrow
    NEG R3, R3                  // R3 = borrow
    LSR $32, R1, R1             // R1 = res = t / R
    ADD R2, R1, R2              // R2 = res + q
    CMP $0, R3
    CSEL NE, R2, R1, R0         // if borrow ! = 0 etc
    RET

However, the following simpler code also works:

TEXT ·MontReduce(SB), NOSPLIT, $16-8

    MOVD v+0(FP), R0            // R0 = v

    MOVD $1, R2                 // low part of qInv
    MOVK $(33024<<16), R2       // high part of qInv. Now R2 = qInv
    MULW R2, R0, R1             // R0 = v, R1 = m = v * qInv
    MOVK $(32512<<16), R2       // R2 = q
    MUL R2, R1, R1              // R1 = m*q
    SUBS R1, R0, R1             // R1 = t = v - m*q
    LSR $32, R1, R1             // R1 = res = t / R
    ADD R2, R1, R2              // R2 = res + q

    CSEL CS, R1, R2, R0

    MOVD R0, ret+8(FP)
    RET

In the generated code, we're using an extra instruction to make sure a nonzero borrow value would be equal to 1, not -1, even though the condition only checks if it is zero, and is thus unaffected by the NEG.
Furthermore, since none of the instructions after the SUBS modify the carry flag, it doesn't have to explicitly reside in a register at all. One can directly use it in CSEL and remove the CMP (saving an addition equivalent.)

I don't know if something like this is easy enough to patch to be worth the modest perf gains. Just thought it was interesting. :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    FeatureRequestIssues asking for a new feature that does not need a proposal.ImplementationIssues describing a semantics-preserving change to the Go implementation.NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.arch-arm64compiler/runtimeIssues related to the Go compiler and/or runtime.

    Type

    No type

    Projects

    Status

    Todo

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions