[AArch64] Suboptimal code for multiplication by certain constants #89430

Closed · Kmeakin opened this issue Apr 19, 2024 · 7 comments · Fixed by #90199

Comments

Kmeakin (Contributor) commented Apr 19, 2024

For some constants, GCC is able to generate sequences of add where LLVM generates mul. I have checked all constants between 1 and 100 (https://godbolt.org/z/rxej44fGj).

For all of the examples below (11, 13, 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81, 85), LLVM generates

mulK:
        mov     w8, K
        mul     w0, w0, w8
        ret
while GCC generates:

mul11:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 1
        ret
mul13:
        add     w1, w0, w0, lsl 1
        add     w0, w0, w1, lsl 2
        ret
mul19:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 1
        ret
mul21:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 2
        ret
mul25:
        add     w0, w0, w0, lsl 2
        add     w0, w0, w0, lsl 2
        ret
mul27:
        add     w0, w0, w0, lsl 1
        add     w0, w0, w0, lsl 3
        ret
mul35:
        add     w1, w0, w0, lsl 4
        add     w0, w0, w1, lsl 1
        ret
mul37:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 2
        ret
mul41:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 3
        ret
mul49:
        add     w1, w0, w0, lsl 1
        add     w0, w0, w1, lsl 4
        ret
mul51:
        add     w0, w0, w0, lsl 1
        add     w0, w0, w0, lsl 4
        ret
mul69:
        add     w1, w0, w0, lsl 4
        add     w0, w0, w1, lsl 2
        ret
mul73:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 3
        ret
mul81:
        add     w0, w0, w0, lsl 3
        add     w0, w0, w0, lsl 3
        ret
mul85:
        add     w0, w0, w0, lsl 2
        add     w0, w0, w0, lsl 4
        ret
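
For reference, the comparison presumably compiles a family of single-multiply functions like the following (the exact Godbolt source is an assumption; only the constant varies), built for AArch64 at -O2:

#include <stdint.h>
typedef uint32_t u32;

/* Hypothetical test harness: one function per constant, mirroring the
 * mulK naming used in the assembly above. */
u32 mul11(u32 x) { return x * 11; }
u32 mul13(u32 x) { return x * 13; }
u32 mul85(u32 x) { return x * 85; }
/* ...and so on for 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81 */
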
llvmbot (Collaborator) commented Apr 19, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Karl Meakin (Kmeakin)

dtcxzyw (Member) commented Apr 19, 2024

I believe #88791 and its follow-up patches will fix this :)

efriedma-quic (Collaborator) commented

We have to be a bit careful weighing these optimizations; for certain combinations of target CPU/shift amount/register width, two add-with-shift instructions are actually more expensive than a multiply.

efriedma-quic (Collaborator) commented

Also, gcc misses some combinations, for example:

#include <stdint.h>
typedef uint32_t u32;
u32 a(u32 x, u32 y) { return x - y*4; }   // x - 4y
u32 b(u32 x) { return x * -7; }           // -7x
u32 c(u32 x) { return a(x, b(x)); }       // x - 4*(-7x) = 29x, i.e. x*29

vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 21, 2024
…+shl+add

Change the costmodel to lower a = b * C where C = (1 + 2^m) * 2^n + 1 to
          add   w8, w0, w0, lsl #m
          add   w0, w0, w8, lsl #n
Note: The latency can vary depending on the shift amount
Fix part of llvm#89430
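
As a quick sanity check of the identity (my own arithmetic, not part of the patch), the two-add sequence can be modelled in C; with m = 2 and n = 1 it gives C = (1 + 2^2) * 2^1 + 1 = 11, matching the mul11 sequence above:

#include <assert.h>
#include <stdint.h>

/* Sketch: model of the add+lsl pair for C = (1 + 2^m) * 2^n + 1. */
static uint32_t two_adds(uint32_t x, unsigned m, unsigned n) {
    uint32_t t = x + (x << m);  /* add w8, w0, w0, lsl #m  ->  (1 + 2^m) * x */
    return x + (t << n);        /* add w0, w0, w8, lsl #n  ->  ((1+2^m)*2^n + 1) * x */
}

int main(void) {
    assert(two_adds(3, 2, 1) == 3u * 11u);  /* (1+4)*2+1 = 11 */
    assert(two_adds(9, 1, 2) == 9u * 13u);  /* (1+2)*4+1 = 13 */
    return 0;
}
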
davemgreen (Collaborator) commented

Do we have any evidence that these are better as add+shift? As far as I understand, GCC optimized it this way because older cores had a slower mul and a faster add+lsl, but that has changed in more recent cores, and mul is now usually relatively quick.

vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 24, 2024
…+shl+add

Change the costmodel to lower a = b * C where C = (1 + 2^m) * 2^n + 1 to
          add   w8, w0, w0, lsl #m
          add   w0, w0, w8, lsl #n
Note: The latency of add can vary depending on the shift amount
      They are as cheap as a move when the shift amount is 4 or less.
Fix part of llvm#89430
vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 25, 2024
…+shl+add

Change the costmodel to lower a = b * C where C = (1 + 2^m) * 2^n + 1 to
          add   w8, w0, w0, lsl #m
          add   w0, w0, w8, lsl #n
Note: The latency of add can vary depending on the shift amount
      They are as cheap as a move when the shift amount is 4 or less.
Fix part of llvm#89430
vfdff (Contributor) commented Apr 25, 2024

All of the numbers listed above (11, 13, 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81, 85) can be optimized with FeatureALULSLFast (a quick enumeration sketch follows the list):

  • 11 = (((1<<2) + 1) << 1) + 1
  • 13 = (((1<<1) + 1) << 2) + 1
  • 19 = (((1<<3) + 1) << 1) + 1
  • 21 = (((1<<2) + 1) << 2) + 1
  • 25 = ((1<<2) + 1) * ((1<<2) + 1)
  • 27 = ((1<<1) + 1) * ((1<<3) + 1)
  • 35 = (((1<<4) + 1) << 1) + 1
  • ...
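
A brute-force enumeration (a sketch of my own, not from any patch) confirms which constants up to 100 fit the two shapes used above:

#include <stdint.h>
#include <stdio.h>

/* Enumerate small constants expressible as either
 *   (((1 << m) + 1) << n) + 1       -- chained add+lsl (e.g. 11, 13, 19, 21)
 *   ((1 << m) + 1) * ((1 << n) + 1) -- two independent add+lsl (e.g. 25, 27, 81)
 */
int main(void) {
    for (uint32_t c = 2; c <= 100; c++) {
        for (unsigned m = 1; m <= 4; m++) {
            for (unsigned n = 1; n <= 4; n++) {
                if ((((1u << m) + 1) << n) + 1 == c)
                    printf("%u = (((1<<%u) + 1) << %u) + 1\n", c, m, n);
                if (m <= n && ((1u << m) + 1) * ((1u << n) + 1) == c)
                    printf("%u = ((1<<%u) + 1) * ((1<<%u) + 1)\n", c, m, n);
            }
        }
    }
    return 0;
}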

vfdff (Contributor) commented Apr 25, 2024

Also, gcc misses some combinations, for example:

#include <stdint.h>
typedef uint32_t u32;
u32 a(u32 x, u32 y) { return x - y*4; }
u32 b(u32 x) { return x * -7; }
u32 c(u32 x) { return a(x, b(x)); }

This case is also not supported by LLVM at the moment.

vfdff added a commit to vfdff/llvm-project that referenced this issue Apr 26, 2024
…+shl+sub

Change the costmodel to lower a = b * C where C = 1 - (1 - 2^m) * 2^n to
              sub  w8, w0, w0, lsl #m
              sub  w0, w0, w8, lsl #n
Fix llvm#89430
vfdff added a commit that referenced this issue May 6, 2024
…+shl+sub (#90199)

Change the costmodel to lower a = b * C where C = 1 - (1 - 2^m) * 2^n to
              sub  w8, w0, w0, lsl #m
              sub  w0, w0, w8, lsl #n
Fix #89430
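
Again as a quick check of the identity (my own arithmetic, not from the PR): with m = 2 and n = 1 the formula gives C = 1 - (1 - 2^2) * 2^1 = 7, and modelling the two-sub sequence in C (relying on wrapping unsigned arithmetic) reproduces x * 7:

#include <assert.h>
#include <stdint.h>

/* Sketch: model of the sub+lsl pair for C = 1 - (1 - 2^m) * 2^n. */
static uint32_t two_subs(uint32_t x, unsigned m, unsigned n) {
    uint32_t t = x - (x << m);  /* sub w8, w0, w0, lsl #m  ->  (1 - 2^m) * x */
    return x - (t << n);        /* sub w0, w0, w8, lsl #n  ->  (1 - (1-2^m)*2^n) * x */
}

int main(void) {
    assert(two_subs(5, 2, 1) == 5u * 7u);  /* 1 - (1-4)*2 = 7 */
    return 0;
}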