[AArch64] Suboptimal code for multiplication by certain constants #89430

Closed · Kmeakin opened this issue Apr 19, 2024 · 7 comments · Fixed by #90199

Comments

Kmeakin (Contributor) commented Apr 19, 2024

For some constants, GCC is able to generate sequences of add where LLVM generates mul. I have checked all constants between 1 and 100 (https://godbolt.org/z/rxej44fGj).

For all of the examples below (11, 13, 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81, 85), LLVM generates

mulK:
        mov     w8, K
        mul     w0, w0, w8
        ret
while GCC generates:

mul11:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 1
        ret
mul13:
        add     w1, w0, w0, lsl 1
        add     w0, w0, w1, lsl 2
        ret
mul19:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 1
        ret
mul21:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 2
        ret
mul25:
        add     w0, w0, w0, lsl 2
        add     w0, w0, w0, lsl 2
        ret
mul27:
        add     w0, w0, w0, lsl 1
        add     w0, w0, w0, lsl 3
        ret
mul35:
        add     w1, w0, w0, lsl 4
        add     w0, w0, w1, lsl 1
        ret
mul37:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 2
        ret
mul41:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 3
        ret
mul49:
        add     w1, w0, w0, lsl 1
        add     w0, w0, w1, lsl 4
        ret
mul51:
        add     w0, w0, w0, lsl 1
        add     w0, w0, w0, lsl 4
        ret
mul69:
        add     w1, w0, w0, lsl 4
        add     w0, w0, w1, lsl 2
        ret
mul73:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 3
        ret
mul81:
        add     w0, w0, w0, lsl 3
        add     w0, w0, w0, lsl 3
        ret
mul85:
        add     w0, w0, w0, lsl 2
        add     w0, w0, w0, lsl 4
        ret
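
For reference, the comparison presumably compiles a family of single-multiply functions like the following (the exact Godbolt source is an assumption; only the constant varies), built for AArch64 at -O2:

#include <stdint.h>
typedef uint32_t u32;

/* Hypothetical test harness: one function per constant, mirroring the
 * mulK naming used in the assembly above. */
u32 mul11(u32 x) { return x * 11; }
u32 mul13(u32 x) { return x * 13; }
u32 mul85(u32 x) { return x * 85; }
/* ...and so on for 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81 */
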
llvmbot (Collaborator) commented Apr 19, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Karl Meakin (Kmeakin)

dtcxzyw (Member) commented Apr 19, 2024

I believe #88791 and its follow-up patches will fix this :)

efriedma-quic (Collaborator) commented

We have to be a bit careful weighing these optimizations; for certain combinations of target CPU/shift amount/register width, two add-with-shift instructions are actually more expensive than a multiply.

efriedma-quic (Collaborator) commented

Also, gcc misses some combinations, for example:

#include <stdint.h>
typedef uint32_t u32;
u32 a(u32 x, u32 y) { return x - y*4; }   // x - 4y
u32 b(u32 x) { return x * -7; }           // -7x
u32 c(u32 x) { return a(x, b(x)); }       // x - 4*(-7x) = 29x, i.e. x*29

vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 21, 2024
…+shl+add

Change the costmodel to lower a = b * C where C = (1 + 2^m) * 2^n + 1 to
          add   w8, w0, w0, lsl #m
          add   w0, w0, w8, lsl #n
Note: The latency can vary depending on the shift amount
Fix part of llvm#89430
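
As a quick sanity check of the identity (my own arithmetic, not part of the patch), the two-add sequence can be modelled in C; with m = 2 and n = 1 it gives C = (1 + 2^2) * 2^1 + 1 = 11, matching the mul11 sequence above:

#include <assert.h>
#include <stdint.h>

/* Sketch: model of the add+lsl pair for C = (1 + 2^m) * 2^n + 1. */
static uint32_t two_adds(uint32_t x, unsigned m, unsigned n) {
    uint32_t t = x + (x << m);  /* add w8, w0, w0, lsl #m  ->  (1 + 2^m) * x */
    return x + (t << n);        /* add w0, w0, w8, lsl #n  ->  ((1+2^m)*2^n + 1) * x */
}

int main(void) {
    assert(two_adds(3, 2, 1) == 3u * 11u);  /* (1+4)*2+1 = 11 */
    assert(two_adds(9, 1, 2) == 9u * 13u);  /* (1+2)*4+1 = 13 */
    return 0;
}
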
davemgreen (Collaborator) commented

Do we have any evidence that these are better as add+shift? As far as I understand, GCC optimized it this way because older cores had a slower mul and a faster add+lsl, but that has changed in more recent cores, and mul is now usually relatively quick.

vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 24, 2024
…+shl+add

Change the costmodel to lower a = b * C where C = (1 + 2^m) * 2^n + 1 to
          add   w8, w0, w0, lsl #m
          add   w0, w0, w8, lsl #n
Note: The latency of add can vary depending on the shift amount
      They are as cheap as a move when the shift amount is 4 or less.
Fix part of llvm#89430
vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 25, 2024
…+shl+add

Change the costmodel to lower a = b * C where C = (1 + 2^m) * 2^n + 1 to
          add   w8, w0, w0, lsl #m
          add   w0, w0, w8, lsl #n
Note: The latency of add can vary depending on the shift amount
      They are as cheap as a move when the shift amount is 4 or less.
Fix part of llvm#89430
vfdff (Contributor) commented Apr 25, 2024

All of the numbers listed above (11, 13, 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81, 85) can be optimized with FeatureALULSLFast (a quick enumeration sketch follows the list):

  • 11 = (((1<<2) + 1) << 1) + 1
  • 13 = (((1<<1) + 1) << 2) + 1
  • 19 = (((1<<3) + 1) << 1) + 1
  • 21 = (((1<<2) + 1) << 2) + 1
  • 25 = ((1<<2) + 1) * ((1<<2) + 1)
  • 27 = ((1<<1) + 1) * ((1<<3) + 1)
  • 35 = (((1<<4) + 1) << 1) + 1
  • ...
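
A brute-force enumeration (a sketch of my own, not from any patch) confirms which constants up to 100 fit the two shapes used above:

#include <stdint.h>
#include <stdio.h>

/* Enumerate small constants expressible as either
 *   (((1 << m) + 1) << n) + 1       -- chained add+lsl (e.g. 11, 13, 19, 21)
 *   ((1 << m) + 1) * ((1 << n) + 1) -- two independent add+lsl (e.g. 25, 27, 81)
 */
int main(void) {
    for (uint32_t c = 2; c <= 100; c++) {
        for (unsigned m = 1; m <= 4; m++) {
            for (unsigned n = 1; n <= 4; n++) {
                if ((((1u << m) + 1) << n) + 1 == c)
                    printf("%u = (((1<<%u) + 1) << %u) + 1\n", c, m, n);
                if (m <= n && ((1u << m) + 1) * ((1u << n) + 1) == c)
                    printf("%u = ((1<<%u) + 1) * ((1<<%u) + 1)\n", c, m, n);
            }
        }
    }
    return 0;
}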

vfdff (Contributor) commented Apr 25, 2024

Also, gcc misses some combinations, for example:

#include <stdint.h>
typedef uint32_t u32;
u32 a(u32 x, u32 y) { return x - y*4; }
u32 b(u32 x) { return x * -7; }
u32 c(u32 x) { return a(x, b(x)); }

This case is also not supported by LLVM at the moment.

vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 26, 2024
…+shl+sub

Change the costmodel to lower a = b * C where C = 1 - (1 - 2^m) * 2^n to
              sub  w8, w0, w0, lsl #m
              sub  w0, w0, w8, lsl #n
Fix llvm#89430
vfdff added a commit that referenced this issue May 6, 2024
…+shl+sub (#90199)

Change the costmodel to lower a = b * C where C = 1 - (1 - 2^m) * 2^n to
              sub  w8, w0, w0, lsl #m
              sub  w0, w0, w8, lsl #n
Fix #89430
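
Again as a quick check of the identity (my own arithmetic, not from the PR): with m = 2 and n = 1 the formula gives C = 1 - (1 - 2^2) * 2^1 = 7, and modelling the two-sub sequence in C (relying on wrapping unsigned arithmetic) reproduces x * 7:

#include <assert.h>
#include <stdint.h>

/* Sketch: model of the sub+lsl pair for C = 1 - (1 - 2^m) * 2^n. */
static uint32_t two_subs(uint32_t x, unsigned m, unsigned n) {
    uint32_t t = x - (x << m);  /* sub w8, w0, w0, lsl #m  ->  (1 - 2^m) * x */
    return x - (t << n);        /* sub w0, w0, w8, lsl #n  ->  (1 - (1-2^m)*2^n) * x */
}

int main(void) {
    assert(two_subs(5, 2, 1) == 5u * 7u);  /* 1 - (1-4)*2 = 7 */
    return 0;
}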