cmd/compile: arm64 multiplication with constant optimization #67575
amd64 already reduces *19 like you suggest. This all sounds reasonable to me.
Some additional notes. It seems there are a few additional variants that follow similar patterns:
PS: does anyone know whether these rules have some standard naming?
They don't have specific standard names, but they are widely recognized techniques in the context of bit manipulation and arithmetic operations.
Change https://go.dev/cl/626998 mentions this issue:
Hi @randall77, I can't log in to my google-source account registered with the Arm email. Regarding this multiplication optimization, one thing to note on arm64 is that the latency and throughput of a shifted add depend on the shift amount. For example, on V2 it is as follows:
So in theory, when the shift amount is greater than 4, a shifted ADD costs the same as a MUL. So when converting a constant MUL into two ADDs, the shift amounts of the ADDs need to be considered. However, different compilers seem to handle this differently; gcc and clang appear to take different approaches. Also see this link: dotnet/runtime#75119
I'm not seeing that on an M2 ultra. bench.s:
bench.go:
bench_test.go:
This gives me the same time for both benchmarks, at around 38000ns. If I change the add instruction to remove the shift, then I get a 2x speedup, which means add is 1 clock. Shifts are also 1 clock. So maybe this would not be worth it if we end up using 2 shift-containing instructions. And maybe a wash with 1 add+shift instruction and 1 plain (add or shift but not both) instruction. (Although not having to load the constant into a register probably helps if that load can't be lifted out of a loop.) Of course, that's all M2 ultra specific. Different arm chips I'm sure are different. There's no real easy way to deal with that fact, other than to pick something reasonable and live with it.
These are the test results on a V2 machine:
The latency and throughput of a shifted add depend on the shift amount; this is the case on almost all Arm v8/v9 chips.
But this also depends on the execution environment. If there is no loop, we should actually compare a MOV + MUL pair against a single ADD-with-shift instruction, because MUL does not support immediate values, so the constant must first be moved into a register with MOV. But if there is a loop, and the MOV instruction is hoisted out of the loop, then the performance comparison is between one MUL and one ADD-with-shift instruction.
Yeah, that's why I got a bit stuck with it and forgot to follow up. The results were quite different on the M1 and the Raspberry Pi. The old benchmarks for the manual fix in edwards25519 are at FiloSottile/edwards25519@9c34bf2. The benchmark on the RPi4:
The benchmark on M1:
Change https://go.dev/cl/626076 mentions this issue:
And if I compare just the multiplication-by-19 optimization, then I get the following results. The annoying part is that the performance difference is significant on both machines, in opposite directions. So ideally, there would be some way to target "slow arm multiply" vs "fast arm multiply" chips. Mac M1:
RPi4:
Benchmarked funcs:
Unfortunately our build model doesn't allow that. Maybe we could conditionalize on
I'm inclined to select the faster code when it is universally faster, and the smaller code when platforms disagree about which code is fastest. If they are tied, flip a coin. In the case of *19, that means preferring Mul19 over Mul19shift (as the constant load in Mul19 could ideally be lifted out of a loop).
One option could be to base it on armv7 and below (using shift-add) vs armv8 and above (using mul). The RPi4 in 64-bit mode is still armv8, but that cut-off is probably the closest. Although that's probably pretty close to a 64-bit vs 32-bit split. Another way to think about it is to just use mul, because people can always do the optimization manually when it's relevant in their context.
One additional thing I remembered: when the multiplication can be simplified to shifts and adds, common subexpression elimination might be able to optimize things further. But that might only be useful in some corner cases.
What I gathered from experimenting with replacing different ops: for the RPi4, there's a benefit from replacing a multiplication with two shift-adds; for the M1 it makes things slower, so it would be better to avoid that. This means my initial proposal doesn't seem like a good idea. The approach in https://go-review.googlesource.com/c/go/+/626998, however, does seem reasonable. If a multiplication can be replaced with a shift-add (or shift-sub) plus a regular add/sub/shift, then it seems to be the same performance-wise as a multiplication on the M1, or maybe a bit better. But there's a clear win on the RPi4. If the operation can be replaced with a single instruction, then there's a clear win on both.
While looking into optimizing ed25519 verification on ARM64, I noticed that
x * 19
is not optimized into shifted adds. This can be reduced to:
The general form seems to be:
There is also a similar reduction, that can be done:
I didn't verify, but this might be useful on amd64 as well.
I can send a CL about this, but I'm not sure whether there are some corner cases with high
c
values that I'm not thinking of. Similarly, I wasn't able to figure out how to write the second reduction in SSA rules.