cmd/compile: arm64 multiplication with constant optimization #67575

Open
egonelbre opened this issue May 22, 2024 · 15 comments
Labels: arch-arm64, compiler/runtime, help wanted, NeedsInvestigation, Performance
@egonelbre (Contributor) commented May 22, 2024

While looking into optimizing ed25519 verification on ARM64, I noticed that x * 19 is not optimized into shifted adds.

func multiply19(x uint64) uint64 {
    return x * 19
}
// which compiles to:
MOVD    $19, R1
MUL     R1, R0, R0

This can be reduced to:

ADD     R0<<3, R0, R1
ADD     R1<<1, R0, R0

The general form seems to be:

x * c, where
    c = 1 + 2^N + 2^(N+M);
    N >= 1, M >= 1 // N >= 1 keeps bit 0 of c-1 clear, M >= 1 keeps the two set bits of c-1 distinct

Then the multiplication can be rewritten as:
    x + (x + x << M) << N

This can be checked with:
    (c-1)&1 == 0 && bits.OnesCount64(c-1) == 2

Which holds for numbers like:
    7, 11, 13, 19, 21, 25, 35, 37, 41, 49, 67, 69, 73, 81, 97, 131, 133, 137, 145, 161, 193...
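
A minimal standalone sketch of the detection and the rewrite (decompose is a hypothetical helper name of mine, not compiler code):

package main

import (
	"fmt"
	"math/bits"
)

// decompose reports N, M such that c = 1 + 2^N + 2^(N+M), if they exist.
func decompose(c uint64) (n, m int, ok bool) {
	d := c - 1
	if c < 2 || d&1 != 0 || bits.OnesCount64(d) != 2 {
		return 0, 0, false
	}
	n = bits.TrailingZeros64(d)      // low set bit of c-1 is at N
	h := 63 - bits.LeadingZeros64(d) // high set bit of c-1 is at N+M
	return n, h - n, true
}

func main() {
	x := uint64(12345)
	n, m, _ := decompose(19)                 // n = 1, m = 3
	fmt.Println(x+(x+x<<m)<<n == x*19, n, m) // true 1 3
}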

There is also a similar reduction that can be done:

x * c, where
    c = 1 + 2^N + 2^M + 2^(N+M), N < M // i.e. c = (1 + 2^N) * (1 + 2^M)
    N >= 1 // keeps bit 0 of c-1 clear

Then the multiplication can be rewritten as:
    x = x + x<<N; x = x + x<<M

This can be checked with:
    (c-1)&1 == 0 && bits.OnesCount64(c-1) == 3 && highbit - lowbit == midbit // positions of the set bits of c-1

Which holds for numbers like:
    15, 27, 45, 51, 85, 99, 153, 165, 195...
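
A matching sketch for this variant (again a hypothetical helper name; it assumes the same math/bits import as the sketch above):

// decompose3 reports N, M such that c = (1 + 2^N) * (1 + 2^M)
//                                     = 1 + 2^N + 2^M + 2^(N+M), if they exist.
func decompose3(c uint64) (n, m int, ok bool) {
	d := c - 1
	if c < 2 || d&1 != 0 || bits.OnesCount64(d) != 3 {
		return 0, 0, false
	}
	lo := bits.TrailingZeros64(d)
	hi := 63 - bits.LeadingZeros64(d)
	mid := bits.TrailingZeros64(d &^ (uint64(1) << lo) &^ (uint64(1) << hi))
	if hi-lo != mid { // the three set bits must sit at N, M, and N+M
		return 0, 0, false
	}
	return lo, mid, true
}

// usage: n, m, _ := decompose3(15) // n = 1, m = 2
//        t := x + x<<n
//        t = t + t<<m // t == x*15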

I didn't verify, but this might be useful on amd64 as well.


I can send a CL about this, but I'm not sure whether there are some corner cases with high c values that I'm not thinking of. Similarly, I wasn't able to figure out how to write the second reduction in SSA rules.

gopherbot added the compiler/runtime label on May 22, 2024
@randall77 (Contributor) commented:

amd64 already reduces *19 like you suggest.

This all sounds reasonable to me.
The second reduction can be done using rules by just repeating the inner operation - the repeated part will get CSEd subsequently. (Some of the amd64 rules do exactly that.)
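
For illustration, such a rule could look roughly like this (modeled on the shape of existing ARM64.rules lowering rules; canDecompose3, lowBit, and midBit are hypothetical helpers, not ones that exist in the tree). ADDshiftLL computes arg0 + arg1<<aux; the inner ADDshiftLL is written out twice, and CSE later merges the two copies:

(MUL x (MOVDconst [c])) && canDecompose3(c)
  => (ADDshiftLL
       (ADDshiftLL <x.Type> x x [lowBit(c-1)])
       (ADDshiftLL <x.Type> x x [lowBit(c-1)])
       [midBit(c-1)])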

randall77 added this to the Unplanned milestone on May 22, 2024
cagedmantis added the NeedsInvestigation label on May 22, 2024
@egonelbre (Contributor, Author) commented May 23, 2024

Some additional notes. It seems there are a few additional variants that follow similar patterns:

c = 1 + 2^H - 2^L, where L < H
  => (x + x<<H) - x<<L

c = 2^H - 2^L, where L < H
  => (x<<H) - x<<L

c = 2^L + 2^H, where L < H
  => (x<<L) + x<<H
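
A quick sanity check of these identities (constants chosen by me for illustration):

package main

import "fmt"

func main() {
	x := uint64(12345)
	const l, h = 2, 5
	fmt.Println(x*(1+1<<h-1<<l) == x+x<<h-x<<l) // c = 1 + 2^5 - 2^2 = 29
	fmt.Println(x*(1<<h-1<<l) == x<<h-x<<l)     // c = 2^5 - 2^2 = 28
	fmt.Println(x*(1<<l+1<<h) == x<<l+x<<h)     // c = 2^2 + 2^5 = 36
}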

PS: does anyone know whether these rules have some standard naming?

@cooler-SAI commented:

> PS: does anyone know whether these rules have some standard naming?

They don't have specific standard names but are widely recognized techniques in the context of bit manipulation and arithmetic operations.

@gopherbot (Contributor) commented:

Change https://go.dev/cl/626998 mentions this issue: cmd/compile: improve multiplication strength reduction

@erifan commented Nov 12, 2024

Hi @randall77, I can't log in to my google-source account registered with my Arm email. Regarding this multiplication optimization, one thing to note on arm64 is that the latency and throughput of a shifted ADD depend on the shift amount. For example, on V2 it is as follows:

instruction          latency   throughput   pipeline
ADD (shift <= 4)        1          6            I
ADD (shift > 4)         2          2            M
MUL                     2          2            M

So in theory, when the shift amount is greater than 4, MUL and a shifted ADD cost the same. So when converting a constant MUL into two ADDs, the shift amounts need to be considered. However, this seems to be handled differently across compilers; gcc and clang don't agree with each other. Also you can refer to this link: dotnet/runtime#75119

@randall77 (Contributor) commented:

> on arm64 is that the latency and throughput of arm64 shift addition are related to the shift amount.

I'm not seeing that on an M2 ultra.

bench.s:

TEXT ·ShiftAdd3(SB),0,$0-0
	MOVD	$65536, R1
	MOVD	$0, R0
loop:
	ADD	R0<<3, R0, R0
	SUB	$1, R1, R1
	CBNZ	R1, loop
	RET

TEXT ·ShiftAdd6(SB),0,$0-0
	MOVD	$65536, R1
	MOVD	$0, R0
loop:
	ADD	R0<<6, R0, R0
	SUB	$1, R1, R1
	CBNZ	R1, loop
	RET

bench.go:

package p

func ShiftAdd3()
func ShiftAdd6()

bench_test.go:

package p

import "testing"

func BenchmarkShiftAdd3(b *testing.B) {
	for range b.N {
		ShiftAdd3()
	}
}

func BenchmarkShiftAdd6(b *testing.B) {
	for range b.N {
		ShiftAdd6()
	}
}

This gives me the same time for both benchmarks, at around 38000ns.
That's 0.58ns/iter, which at 3.5GHz is almost exactly 2 clocks.

If I change the add instruction to remove the shift, then I get a 2x speedup, which means add is 1 clock. Shifts are also 1 clock.
Multiplies look to be 3 clocks.

So maybe this would not be worth it if we end up using 2 shift-containing instructions. And maybe a wash with 1 add+shift instruction and 1 plain (add or shift but not both) instruction. (Although not having to load the constant into a register probably helps if that load can't be lifted out of a loop.)

Of course, that's all M2 ultra specific. Different arm chips I'm sure are different. There's no real easy way to deal with that fact, other than to pick something reasonable and live with it.

@erifan commented Nov 12, 2024

These are the test results on a V2 machine:

$ ./bench.test -test.bench BenchmarkShiftAdd
goos: linux
goarch: arm64
pkg: bench
BenchmarkShiftAdd3-72    	   54376	     20533 ns/op
BenchmarkShiftAdd6-72    	   29246	     41028 ns/op
PASS

The latency and throughput of a shifted ADD depend on the shift amount on almost all Arm v8/v9 chips.
We can also reproduce this result using C code. I think gcc may not handle this for other reasons, such as the interaction with other optimizations (I don't know how gcc implements this optimization). Or it may just not be handled properly.

@erifan commented Nov 12, 2024

But this also depends on the execution environment. If there is no loop, we should actually compare a MOV + MUL pair against a single shifted ADD, because MUL does not take an immediate, so we first need to move the constant into a register with MOV. But if there is a loop and the MOV is hoisted out of it, then the comparison is between one MUL and one shifted ADD.

@egonelbre (Contributor, Author) commented:

Yeah, that's why I got a bit stuck with it and forgot to follow up. The results were quite different on the M1 and the Raspberry Pi. FiloSottile/edwards25519@9c34bf2 has the old benchmarks for manual fixes in edwards25519 on an RPi.

The benchmark on RPI4:

goos: linux
goarch: arm64
pkg: bench.test
            │ results.log  │
            │    sec/op    │
ShiftAdd3-4   87.55µ ± ∞ ¹
ShiftAdd6-4   87.54µ ± ∞ ¹
geomean       87.55µ
¹ need >= 6 samples for confidence interval at level 0.95

The benchmark on M1:

goos: darwin
goarch: arm64
pkg: bench.test
cpu: Apple M1 Pro
             │ results2.log │
             │    sec/op    │
ShiftAdd3-10   40.86µ ± ∞ ¹
ShiftAdd6-10   41.08µ ± ∞ ¹
geomean        40.97µ
¹ need >= 6 samples for confidence interval at level 0.95

@gopherbot (Contributor) commented:

Change https://go.dev/cl/626076 mentions this issue: cmd/compile/internal/ssa: optimize multiplication by constant [ARM64]

@egonelbre (Contributor, Author) commented:

And if I compare just the multiplication-by-19 optimization, then I get the following results. The annoying part is that the performance difference is significant on both machines, in opposite directions. So ideally, there would be some way to target the "slow arm multiply" vs "fast arm multiply" chips.

Mac M1:

Mul19-10        61.14µ ± 1%
Mul19shift-10   81.64µ ± 1%

RPi4:

Mul19-4        218.8µ ± 0%
Mul19shift-4   175.1µ ± 3%

Benchmarked funcs:

TEXT ·Mul19(SB),0,$0-0
	MOVD	$65536, R2
	MOVD	$218643, R0
loop:
	MOVD    $19, R1
	MUL     R1, R0, R0
	SUB	$1, R2, R2
	CBNZ	R2, loop
	RET

TEXT ·Mul19shift(SB),0,$0-0
	MOVD	$65536, R2
	MOVD	$218643, R0
loop:
	ADD	R0<<3, R0, R1
	ADD	R1<<1, R0, R0
	SUB	$1, R2, R2
	CBNZ	R2, loop
	RET

@randall77 (Contributor) commented:

> So ideally, there would be some way to target the "slow arm multiply" vs "fast arm multiply".

Unfortunately our build model doesn't allow that. Maybe we could conditionalize on GOARM64 values, but most people aren't setting that.

I'm inclined to select the faster code when it is universal, and the smaller code when platforms disagree about the fastest code. If they are both tied, flip a coin.

In the case of *19, that means prefer Mul19 over Mul19shift (as the constant in Mul19 could ideally be lifted out of a loop).

@egonelbre (Contributor, Author) commented Feb 14, 2025

One option could be to base it on armv7 and below (using shift-adds) vs armv8 and above (using mul). The RPi4 in 64-bit mode is still armv8, but that cut-off is probably the closest. Although, that's probably pretty close to a 64-bit vs 32-bit split.

Or, another way to think about it: use mul, because people can always do the optimization manually when it's relevant in their context.

@egonelbre (Contributor, Author) commented:

One additional thing I remembered: when the multiplication can be simplified to shifts and adds, common subexpression elimination might be able to optimize things further (see the sketch below). But that might be useful only in some corner cases.
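
For example (my illustration, not from the CL): two multiplications by related constants can end up sharing a term once both are expressed as shift-adds:

func f(x uint64) (uint64, uint64) {
	// x*9  = x + x<<3
	// x*19 = x + (x + x<<3)<<1
	// after strength reduction, the x + x<<3 term can be CSEd
	return x * 9, x * 19
}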

@egonelbre (Contributor, Author) commented Feb 28, 2025

Here is what I gathered from experimenting with replacing different ops:

For the RPi4, there's a benefit from replacing a multiplication with two shift-adds; for the M1, it makes things slower, so it would be better to avoid that. This means my initial proposal doesn't seem like a good idea.

The approach in https://go-review.googlesource.com/c/go/+/626998, however, does seem reasonable. If it's possible to replace a multiplication with a shift-add (or shift-sub) plus a regular add/sub/shift, then on the M1 it seems to be the same performance-wise as multiplication, or maybe a bit better. And there's a clear win on the RPi4.

If the operation can be replaced with a single instruction, then there's a clear win on both.
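
For instance (my example), x*9 is a single shifted add:

ADD	R0<<3, R0, R0	// R0 = R0 + R0<<3 = 9*x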
