Suboptimal code generation for 256-bit integer addition containing constant bits in arm64 #60571
If the low 64 bits of y are not set to 00..01, llc's output looks okay: https://godbolt.org/z/rsfoWzv4j
I can see that it is DAGCombine that does this sequence of transformations:

```
x + (((1 | y1<<64) | y2<<128) | y3<<192)
=> x + (((y1<<64 | 1) | y2<<128) | y3<<192)
=> x + (((y1<<64 | y2<<128) | 1) | y3<<192)
=> x + (((y1<<64 | y2<<128) | y3<<192) | 1)
=> (x + ((y1<<64 | y2<<128) | y3<<192)) + 1
```

The last step happens at:

```cpp
// Reassociate (add (or x, c), y) -> (add add(x, y), c)) if (or x, c) is
// equivalent to (add x, c).
// Reassociate (add (xor x, c), y) -> (add add(x, y), c)) if (xor x, c) is
// equivalent to (add x, c).
auto ReassociateAddOr = [&](SDValue N0, SDValue N1) {
  if (isADDLike(N0, DAG) && N0.hasOneUse() &&
      isConstantOrConstantVector(N0.getOperand(1), /* NoOpaque */ true)) {
    return DAG.getNode(ISD::ADD, DL, VT,
                       DAG.getNode(ISD::ADD, DL, VT, N1, N0.getOperand(0)),
                       N0.getOperand(1));
  }
  return SDValue();
};
if (SDValue Add = ReassociateAddOr(N0, N1))
  return Add;
if (SDValue Add = ReassociateAddOr(N1, N0))
  return Add;
```

I think this transformation isn't beneficial if the add is not a legal operation in the architecture.
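For context on why the combine is valid at all: when the constant only sets bits that are known to be zero in `x`, the `or` behaves exactly like an `add`, so `(x | c) + y` equals `(x + y) + c`. A minimal, self-contained check of that identity (the example values are my own, chosen for illustration):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  uint64_t x = 0xdeadbeef00000000ULL; // low bits known to be zero
  uint64_t c = 0x1ULL;                // constant only touches those zero bits
  uint64_t y = 0x0123456789abcdefULL;

  assert((x & c) == 0);                    // precondition: the or is "ADD-like"
  assert(((x | c) + y) == ((x + y) + c));  // the reassociation DAGCombine performs
  return 0;
}
```

The identity itself is fine; the concern above is purely about cost: for an illegal type like i256, the rewritten form legalizes into an extra carry chain.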
Made a patch here: https://reviews.llvm.org/D144116
…t if benefit is unclear

This patch resolves the suboptimal code generation reported in #60571. DAGCombiner currently converts `(x or/xor const) + y` to `(x + y) + const` if this is valid. However, if `.. + const` is broken down into a sequence of adds with carries, the benefit is not clear: in the reported case it introduces two more add(-with-carry) ops (6 in total), whereas the optimal sequence needs only 4 add(-with-carry)s. This patch allows the conversion only when (1) `.. + const` is legal or promotable, or (2) `const` is a sign bit, since that does not introduce more adds.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D144116
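To make the 4-versus-6 count concrete, here is a limb-level sketch of the desired lowering. The limb layout and the use of Clang's `__builtin_addcll` are my own illustration, not code from the patch:

```cpp
#include <cstdint>

// Desired lowering: the constant low limb (the 0x1) is folded directly into
// the first add, so the whole 256-bit sum is one chain of four carry-linked
// adds (adds / adcs / adcs / adc on AArch64).
void add256_low_const(const uint64_t x[4], const uint64_t y_hi[3],
                      uint64_t out[4]) {
  unsigned long long c0, c1, c2, c3;
  out[0] = __builtin_addcll(x[0], 1, 0, &c0);
  out[1] = __builtin_addcll(x[1], y_hi[0], c0, &c1);
  out[2] = __builtin_addcll(x[2], y_hi[1], c1, &c2);
  out[3] = __builtin_addcll(x[3], y_hi[2], c2, &c3);
  (void)c3; // carry out of the top limb is discarded
}
```

In the reassociated form `(x + y) + 1`, the trailing `+ 1` needs its own carry-propagating additions through the upper limbs, which is where the extra add-with-carry operations come from.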
Hello all,
It seems llc is producing suboptimal code for a 256-bit integer add when the target is ARM64.
https://godbolt.org/z/MfEMGKE8n
The command I used is
llc --mtriple=arm64-unknown-unknown -mcpu=neoverse-n1 -O3
(which is also used in the Godbolt instance). The above code evaluates
[x3;x2;x1;x0] + [y3;y2;y1;0x1]
and stores the result into memory. The generated code is
.. which doesn't quite seem optimal.
I think it can be further reduced to something like this (please ignore register names here, they are arbitrary):
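For completeness, a hedged C++ reconstruction of the computation being compiled (the limb layout and names are mine, not the actual Godbolt source):

```cpp
#include <cstdint>

// Reconstruction of the described computation: [x3;x2;x1;x0] + [y3;y2;y1;0x1],
// with each 256-bit value held as four 64-bit limbs, least significant first.
void add256(const uint64_t x[4], const uint64_t y[4], uint64_t out[4]) {
  unsigned __int128 acc = 0;        // Clang/GCC 128-bit extension for the carry
  for (int i = 0; i < 4; ++i) {
    acc += x[i];
    acc += y[i];                    // y[0] is the constant 0x1 in the report
    out[i] = (uint64_t)acc;         // low 64 bits of the running sum
    acc >>= 64;                     // carry into the next limb
  }
}
```

The point of the report is that this should lower to a single chain of four carry-linked adds even when the low limb of y is a constant.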