[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734

topperc · 2024-03-19T04:53:19Z

This gives a 0.5% reduction in dynamic instruction count for 531.deepsjeng_r from spec2017. Using 2 sh3add matches what gcc generates.

This pattern appears when indexing arrays like uint64_t fillUpAttacks[64][8]. The first index needs to multiplied by 64 bytes and the second index needs to be multiplied by 8 bytes. Then both multiplied indices need to be added to the start of the array to calculate the full address. Alternatively, you can multiply the first index by 8 add it to the second index, then multiply the sum by 8 before adding the base pointer.

What we currently generate is a direct result of how GEPs are expanded in SelectionDAGBuilder.

This patch is a proof of concept I hacked together to do measurements and not necessarily how I think it should be implemented. There other variations of this pattern with different shift amounts that I did not handle and did not look for.

We do a lot of work during isel to find shXadd and slli instructions that are obscured. In the motivating example, Y in (slli Y, 6) is (srli W, 58). What becomes (slli (srli W, 58), 6) is (and (srl W, 52), -64) when we start isel. We convert to a shift pair during selection of the and. Unless we repeat all of that cleverness to find all the variations of this pattern, we need to do this optimization sometime after isel.

There could be deeper versions of this pattern with more indices too.

X86 misses the opportunity to use 2 LEAs on the same code.

Posting this patch for discussion.

… Y, Z), X). This gives a 0.5% reduction in dynamic instruction count for 531.deepsjeng_r from spec2017. This matches what gcc generates. This pattern appears when indexing arrays like `uint64_t fillUpAttacks[64][8]`. The first index needs to multipled by 64 and the second index needs to be multiplied by 8. The both multiplied indices need to be added to the start of the array to calculate the full address. Alternatively, you can multiply the first index by 8 add it to the second index, then multiply the sum by 8 before adding the base pointer. This patch is a proof of concept and not how I think it should be implemented. We do a lot of work during isel to find shXadd and slli instructions that are obscured. In the motivating example, Y in (slli Y, 6) is (srli W, 58). The pattern is (slli (srli W, 58), 6) is written as (and (srli W, 52), -64) when we start isel. We detect the shift pair during selection of the and. Unless we repeat all of that cleverness, we need to do this optimization sometime after isel. X86 misses the opportunity to use 2 LEAs on the same code. Posting this patch for discussion.

github-actions · 2024-03-19T04:55:55Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff 470040bd4d54f39f9ac0868a2197fa2ae3e6d4f5 dbd3f2e1775a57e63c40afbdbc35897d5672484f -- llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp llvm/lib/Target/RISCV/RISCVISelDAGToDAG.h

View the diff from clang-format here.

diff --git a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
index 42eb4104ae..e7ab25b6e8 100644
--- a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
@@ -3352,10 +3352,10 @@ bool RISCVDAGToDAGISel::doPeepholeSHXADD(SDNode *N) {
   SDValue Y = N11.getOperand(0);
   SDValue Z = N->getOperand(0);
 
-  SDNode *SH3ADD1 = CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0),
-                                          Y, Z);
-  SDNode *SH3ADD2 = CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0),
-                                           SDValue(SH3ADD1, 0), X);
+  SDNode *SH3ADD1 =
+      CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0), Y, Z);
+  SDNode *SH3ADD2 = CurDAG->getMachineNode(
+      RISCV::SH3ADD, SDLoc(N), N->getValueType(0), SDValue(SH3ADD1, 0), X);
   ReplaceUses(N, SH3ADD2);
   return true;
 }

mgudim · 2024-03-19T06:16:56Z

I tried to run isel on this IR:

define dso_local signext i64 @foo(i64 noundef signext %x, i64 noundef signext %y, i64 noundef signext %z) local_unnamed_addr #0 {
entry:
  %shl = shl i64 %z, 3
  %shl1 = shl i64 %y, 6
  %add = add nsw i64 %shl1, %shl
  %add2 = add nsw i64 %add, %x
  ret i64 %add2
}

which is (z << 3 + y << 6) + x and dag combine simplified it to (y << 3 + z) << 3 + x:

=== foo

Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 16 nodes:
  t0: ch,glue = EntryToken
          t4: i64,ch = CopyFromReg t0, Register:i64 %1
        t10: i64 = shl t4, Constant:i64<6>
          t6: i64,ch = CopyFromReg t0, Register:i64 %2
        t8: i64 = shl t6, Constant:i64<3>
      t11: i64 = add nsw t10, t8
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
    t12: i64 = add nsw t11, t2
  t14: ch,glue = CopyToReg t0, Register:i64 $x10, t12
  t15: ch = RISCVISD::RET_GLUE t14, Register:i64 $x10, t14:1



Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
            t4: i64,ch = CopyFromReg t0, Register:i64 %1
          t16: i64 = shl t4, Constant:i64<3>
          t6: i64,ch = CopyFromReg t0, Register:i64 %2
        t17: i64 = add t16, t6
      t18: i64 = shl t17, Constant:i64<3>
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
    t12: i64 = add nsw t18, t2
  t14: ch,glue = CopyToReg t0, Register:i64 $x10, t12
  t15: ch = RISCVISD::RET_GLUE t14, Register:i64 $x10, t14:1

so I think what's missing is the ability to reassociate the expression. I ran a simpler example to see if dag combine can do a simple reassosiation. From this IR:

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local signext i64 @foo(i64 noundef signext %x, i64 noundef signext %y, i64 noundef signext %z) local_unnamed_addr #0 {
entry:
  %add1 = add i64 %z, 3
  %add2 = add i64 %y, 6
  %add3 = add nsw i64 %add1, %add2
  ret i64 %add3
}

dag combine could reassociate:

=== foo

Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t8: i64 = add t6, Constant:i64<3>
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
      t10: i64 = add t4, Constant:i64<6>
    t11: i64 = add nsw t8, t10
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t11
  t14: ch = RISCVISD::RET_GLUE t13, Register:i64 $x10, t13:1



Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t17: i64 = add t4, t6
    t20: i64 = add t17, Constant:i64<9>
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t20
  t14: ch = RISCVISD::RET_GLUE t13, Register:i64 $x10, t13:1

So it looks like we can do this by generalizing existing dag combines?

llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp

topperc · 2024-03-19T06:54:29Z

which is (z << 3 + y << 6) + x and dag combine simplified it to (y << 3 + z) << 3 + x:

Weird that it did that unless it knew about the sh3add instruction existing. Without a sh3add instruction, it doesn't save code and increases the path length.

wangpc-pp · 2024-03-19T07:06:28Z

I'm think about introducing ISD::GetElementPtr just like TargetOpcode::G_PTR_ADD. Would that be more generic?

topperc · 2024-03-19T07:25:34Z

I'm think about introducing ISD::GetElementPtr just like TargetOpcode::G_PTR_ADD. Would that be more generic?

What's the motivation for that?

I don't know if that's more generic. G_PTR_ADD only takes two operands, a pointer and an offset. A GEP with two indices can be lowered to a G_PTR_ADD+SHL for each index or one G_PTR_ADD with ADDs and SHLs to calculate the offset. And there are a couple ways to calculate the offset.

wangpc-pp · 2024-03-19T07:34:13Z

I'm think about introducing ISD::GetElementPtr just like TargetOpcode::G_PTR_ADD. Would that be more generic?

What's the motivation for that?

It's mainly because:

We do a lot of work during isel to find shXadd and slli instructions that are obscured.

If we can lower GEP directly, I think that would be easier to optimize.

I don't know if that's more generic. G_PTR_ADD only takes two operands, a pointer and an offset. A GEP with two indices can be lowered to a G_PTR_ADD+SHL for each index or one G_PTR_ADD with ADDs and SHLs to calculate the offset. And there are a couple ways to calculate the offset.

topperc · 2024-03-19T16:22:22Z

Here's example IR extracted from the benchmark https://godbolt.org/z/fxc688sbG

mgudim · 2024-03-26T07:12:24Z

@topperc In case no one has started on this, I can assign it to myself and work on it.

topperc · 2024-03-27T04:54:48Z

@topperc In case no one has started on this, I can assign it to myself and work on it.

Are you going to tackle it has reassociate in DAG combine?

There are multiple variations of this that need to be handled.
-If the index to the gep is a srl, the shl will be combined to a shift and an and instead of two shifts.
-Need to handle the case where the GEP indices indices are zext from i32 so we can use shXadd.uw. The and and shl might could be in either order. We handle the (and (shl X, C), mask) case in selectSHXADD_UWOp during isel.

mgudim · 2024-03-27T05:12:48Z

Are you going to tackle it has reassociate in DAG combine?

Not sure yet. I'll take a closer look and keep the variations you mentioned in mind.

This transform is looking for the shNadd idiom for zba, but that can be obscured if there's another value being added to the result. The choice to restrict to one level of association is tactical - we could of course do more, but there's the usual compile time tradeoff, and this covers the motivating example. This is a solution to a reduced test case originally flagged in the description of llvm#85734.

preames · 2024-03-27T22:24:50Z

One possible approach: #86883

Thanks @mgudim for the reduced test case and the discussion which got me thinking about this.

…ore shXadd. This reassociates patterns like (sh3add Z, (add X, (slli Y, 6))) into (sh3add (sh3add Y, Z), X). This improves a pattern that occurs in 531.deepsjeng_r. Reducing the dynamic instruction count by 0.5%. This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases. I've used a GEP with 2 indices because that mostly closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2). See also llvm#85734 which had a hacky version of a similar optimization.

…-> (sh3add (sh3add Y, Z), X). This is an alternative to the new pass proposed in llvm#87544. This improves a pattern that occurs in 531.deepsjeng_r. Reducing the dynamic instruction count by 0.5%. This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases. I've used a GEP with 2 indices because that mostly closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2). See also llvm#85734 which had a hacky version of a similar optimization.

topperc requested review from asb, preames, kito-cheng, mshockwave, mgudim, rtayl and wangpc-pp March 19, 2024 04:53

wangpc-pp reviewed Mar 19, 2024

View reviewed changes

llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp Show resolved Hide resolved

preames mentioned this pull request Mar 27, 2024

[WIP][RISCV] Support 3-argument associative add for transformAddShlImm #86883

Open

topperc mentioned this pull request Apr 3, 2024

[RISCV] Add a MIR pass to reassociate shXadd, add, and slli to form more shXadd. #87544

Closed

topperc mentioned this pull request Apr 6, 2024

[RISCV] Add MachineCombiner to fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #87884

Merged

topperc closed this Apr 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734

[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734

topperc commented Mar 19, 2024

github-actions bot commented Mar 19, 2024

mgudim commented Mar 19, 2024

topperc commented Mar 19, 2024

wangpc-pp commented Mar 19, 2024

topperc commented Mar 19, 2024

wangpc-pp commented Mar 19, 2024

topperc commented Mar 19, 2024

mgudim commented Mar 26, 2024

topperc commented Mar 27, 2024

mgudim commented Mar 27, 2024

preames commented Mar 27, 2024

[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734

[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734

Conversation

topperc commented Mar 19, 2024

github-actions bot commented Mar 19, 2024

mgudim commented Mar 19, 2024

topperc commented Mar 19, 2024

wangpc-pp commented Mar 19, 2024

topperc commented Mar 19, 2024

wangpc-pp commented Mar 19, 2024

topperc commented Mar 19, 2024

mgudim commented Mar 26, 2024

topperc commented Mar 27, 2024

mgudim commented Mar 27, 2024

preames commented Mar 27, 2024