Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734

Closed
wants to merge 1 commit into from

Conversation

topperc
Copy link
Collaborator

@topperc topperc commented Mar 19, 2024

This gives a 0.5% reduction in dynamic instruction count for 531.deepsjeng_r from spec2017. Using 2 sh3add matches what gcc generates.

This pattern appears when indexing arrays like uint64_t fillUpAttacks[64][8]. The first index needs to multiplied by 64 bytes and the second index needs to be multiplied by 8 bytes. Then both multiplied indices need to be added to the start of the array to calculate the full address. Alternatively, you can multiply the first index by 8 add it to the second index, then multiply the sum by 8 before adding the base pointer.

What we currently generate is a direct result of how GEPs are expanded in SelectionDAGBuilder.

This patch is a proof of concept I hacked together to do measurements and not necessarily how I think it should be implemented. There other variations of this pattern with different shift amounts that I did not handle and did not look for.

We do a lot of work during isel to find shXadd and slli instructions that are obscured. In the motivating example, Y in (slli Y, 6) is (srli W, 58). What becomes (slli (srli W, 58), 6) is (and (srl W, 52), -64) when we start isel. We convert to a shift pair during selection of the and. Unless we repeat all of that cleverness to find all the variations of this pattern, we need to do this optimization sometime after isel.

There could be deeper versions of this pattern with more indices too.

X86 misses the opportunity to use 2 LEAs on the same code.

Posting this patch for discussion.

… Y, Z), X).

This gives a 0.5% reduction in dynamic instruction count for
531.deepsjeng_r from spec2017. This matches what gcc generates.

This pattern appears when indexing arrays like `uint64_t fillUpAttacks[64][8]`.
The first index needs to multipled by 64 and the second index needs
to be multiplied by 8. The both multiplied indices need to be added
to the start of the array to calculate the full address. Alternatively,
you can multiply the first index by 8 add it to the second index, then
multiply the sum by 8 before adding the base pointer.

This patch is a proof of concept and not how I think it should be
implemented.

We do a lot of work during isel to find shXadd and slli instructions
that are obscured. In the motivating example, Y in (slli Y, 6) is
(srli W, 58). The pattern is (slli (srli W, 58), 6) is written as
(and (srli W, 52), -64) when we start isel. We detect the shift pair
during selection of the and. Unless we repeat all of that cleverness,
we need to do this optimization sometime after isel.

X86 misses the opportunity to use 2 LEAs on the same code.

Posting this patch for discussion.
Copy link

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 470040bd4d54f39f9ac0868a2197fa2ae3e6d4f5 dbd3f2e1775a57e63c40afbdbc35897d5672484f -- llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp llvm/lib/Target/RISCV/RISCVISelDAGToDAG.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
index 42eb4104ae..e7ab25b6e8 100644
--- a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
@@ -3352,10 +3352,10 @@ bool RISCVDAGToDAGISel::doPeepholeSHXADD(SDNode *N) {
   SDValue Y = N11.getOperand(0);
   SDValue Z = N->getOperand(0);
 
-  SDNode *SH3ADD1 = CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0),
-                                          Y, Z);
-  SDNode *SH3ADD2 = CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0),
-                                           SDValue(SH3ADD1, 0), X);
+  SDNode *SH3ADD1 =
+      CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0), Y, Z);
+  SDNode *SH3ADD2 = CurDAG->getMachineNode(
+      RISCV::SH3ADD, SDLoc(N), N->getValueType(0), SDValue(SH3ADD1, 0), X);
   ReplaceUses(N, SH3ADD2);
   return true;
 }

@mgudim
Copy link
Contributor

mgudim commented Mar 19, 2024

I tried to run isel on this IR:

define dso_local signext i64 @foo(i64 noundef signext %x, i64 noundef signext %y, i64 noundef signext %z) local_unnamed_addr #0 {
entry:
  %shl = shl i64 %z, 3
  %shl1 = shl i64 %y, 6
  %add = add nsw i64 %shl1, %shl
  %add2 = add nsw i64 %add, %x
  ret i64 %add2
}

which is (z << 3 + y << 6) + x and dag combine simplified it to (y << 3 + z) << 3 + x:

=== foo

Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 16 nodes:
  t0: ch,glue = EntryToken
          t4: i64,ch = CopyFromReg t0, Register:i64 %1
        t10: i64 = shl t4, Constant:i64<6>
          t6: i64,ch = CopyFromReg t0, Register:i64 %2
        t8: i64 = shl t6, Constant:i64<3>
      t11: i64 = add nsw t10, t8
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
    t12: i64 = add nsw t11, t2
  t14: ch,glue = CopyToReg t0, Register:i64 $x10, t12
  t15: ch = RISCVISD::RET_GLUE t14, Register:i64 $x10, t14:1



Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
            t4: i64,ch = CopyFromReg t0, Register:i64 %1
          t16: i64 = shl t4, Constant:i64<3>
          t6: i64,ch = CopyFromReg t0, Register:i64 %2
        t17: i64 = add t16, t6
      t18: i64 = shl t17, Constant:i64<3>
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
    t12: i64 = add nsw t18, t2
  t14: ch,glue = CopyToReg t0, Register:i64 $x10, t12
  t15: ch = RISCVISD::RET_GLUE t14, Register:i64 $x10, t14:1

so I think what's missing is the ability to reassociate the expression. I ran a simpler example to see if dag combine can do a simple reassosiation. From this IR:

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local signext i64 @foo(i64 noundef signext %x, i64 noundef signext %y, i64 noundef signext %z) local_unnamed_addr #0 {
entry:
  %add1 = add i64 %z, 3
  %add2 = add i64 %y, 6
  %add3 = add nsw i64 %add1, %add2
  ret i64 %add3
}

dag combine could reassociate:

=== foo

Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t8: i64 = add t6, Constant:i64<3>
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
      t10: i64 = add t4, Constant:i64<6>
    t11: i64 = add nsw t8, t10
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t11
  t14: ch = RISCVISD::RET_GLUE t13, Register:i64 $x10, t13:1



Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t17: i64 = add t4, t6
    t20: i64 = add t17, Constant:i64<9>
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t20
  t14: ch = RISCVISD::RET_GLUE t13, Register:i64 $x10, t13:1

So it looks like we can do this by generalizing existing dag combines?

@topperc
Copy link
Collaborator Author

topperc commented Mar 19, 2024

which is (z << 3 + y << 6) + x and dag combine simplified it to (y << 3 + z) << 3 + x:

Weird that it did that unless it knew about the sh3add instruction existing. Without a sh3add instruction, it doesn't save code and increases the path length.

@wangpc-pp
Copy link
Contributor

I'm think about introducing ISD::GetElementPtr just like TargetOpcode::G_PTR_ADD. Would that be more generic?

@topperc
Copy link
Collaborator Author

topperc commented Mar 19, 2024

I'm think about introducing ISD::GetElementPtr just like TargetOpcode::G_PTR_ADD. Would that be more generic?

What's the motivation for that?

I don't know if that's more generic. G_PTR_ADD only takes two operands, a pointer and an offset. A GEP with two indices can be lowered to a G_PTR_ADD+SHL for each index or one G_PTR_ADD with ADDs and SHLs to calculate the offset. And there are a couple ways to calculate the offset.

@wangpc-pp
Copy link
Contributor

I'm think about introducing ISD::GetElementPtr just like TargetOpcode::G_PTR_ADD. Would that be more generic?

What's the motivation for that?

It's mainly because:

We do a lot of work during isel to find shXadd and slli instructions that are obscured.

If we can lower GEP directly, I think that would be easier to optimize.

I don't know if that's more generic. G_PTR_ADD only takes two operands, a pointer and an offset. A GEP with two indices can be lowered to a G_PTR_ADD+SHL for each index or one G_PTR_ADD with ADDs and SHLs to calculate the offset. And there are a couple ways to calculate the offset.

@topperc
Copy link
Collaborator Author

topperc commented Mar 19, 2024

Here's example IR extracted from the benchmark https://godbolt.org/z/fxc688sbG

@mgudim
Copy link
Contributor

mgudim commented Mar 26, 2024

@topperc In case no one has started on this, I can assign it to myself and work on it.

@topperc
Copy link
Collaborator Author

topperc commented Mar 27, 2024

@topperc In case no one has started on this, I can assign it to myself and work on it.

Are you going to tackle it has reassociate in DAG combine?

There are multiple variations of this that need to be handled.
-If the index to the gep is a srl, the shl will be combined to a shift and an and instead of two shifts.
-Need to handle the case where the GEP indices indices are zext from i32 so we can use shXadd.uw. The and and shl might could be in either order. We handle the (and (shl X, C), mask) case in selectSHXADD_UWOp during isel.

@mgudim
Copy link
Contributor

mgudim commented Mar 27, 2024

Are you going to tackle it has reassociate in DAG combine?

Not sure yet. I'll take a closer look and keep the variations you mentioned in mind.

preames added a commit to preames/llvm-project that referenced this pull request Mar 27, 2024
This transform is looking for the shNadd idiom for zba, but that can
be obscured if there's another value being added to the result.  The
choice to restrict to one level of association is tactical - we
could of course do more, but there's the usual compile time tradeoff,
and this covers the motivating example.

This is a solution to a reduced test case originally flagged in
the description of llvm#85734.
@preames
Copy link
Collaborator

preames commented Mar 27, 2024

One possible approach: #86883

Thanks @mgudim for the reduced test case and the discussion which got me thinking about this.

topperc added a commit to topperc/llvm-project that referenced this pull request Apr 3, 2024
…ore shXadd.

This reassociates patterns like (sh3add Z, (add X, (slli Y, 6)))
into (sh3add (sh3add Y, Z), X).

This improves a pattern that occurs in 531.deepsjeng_r. Reducing
the dynamic instruction count by 0.5%.

This may be possible to improve in SelectionDAG, but given the special
cases around shXadd formation, it's not obvious it can be done in a
robust way without adding multiple special cases.

I've used a GEP with 2 indices because that mostly closely resembles
the motivating case. Most of the test cases are the simplest GEP case.
One test has a logical right shift on an index which is closer to
the deepsjeng code. This requires special handling in isel to reverse
a DAGCombiner canonicalization that turns a pair of shifts into
(srl (and X, C1), C2).

See also llvm#85734 which had a hacky version of a similar optimization.
topperc added a commit to topperc/llvm-project that referenced this pull request Apr 5, 2024
…ore shXadd.

This reassociates patterns like (sh3add Z, (add X, (slli Y, 6)))
into (sh3add (sh3add Y, Z), X).

This improves a pattern that occurs in 531.deepsjeng_r. Reducing
the dynamic instruction count by 0.5%.

This may be possible to improve in SelectionDAG, but given the special
cases around shXadd formation, it's not obvious it can be done in a
robust way without adding multiple special cases.

I've used a GEP with 2 indices because that mostly closely resembles
the motivating case. Most of the test cases are the simplest GEP case.
One test has a logical right shift on an index which is closer to
the deepsjeng code. This requires special handling in isel to reverse
a DAGCombiner canonicalization that turns a pair of shifts into
(srl (and X, C1), C2).

See also llvm#85734 which had a hacky version of a similar optimization.
topperc added a commit to topperc/llvm-project that referenced this pull request Apr 6, 2024
…-> (sh3add (sh3add Y, Z), X).

This is an alternative to the new pass proposed in llvm#87544.

This improves a pattern that occurs in 531.deepsjeng_r. Reducing
the dynamic instruction count by 0.5%.

This may be possible to improve in SelectionDAG, but given the special
cases around shXadd formation, it's not obvious it can be done in a
robust way without adding multiple special cases.

I've used a GEP with 2 indices because that mostly closely resembles
the motivating case. Most of the test cases are the simplest GEP case.
One test has a logical right shift on an index which is closer to
the deepsjeng code. This requires special handling in isel to reverse
a DAGCombiner canonicalization that turns a pair of shifts into
(srl (and X, C1), C2).

See also llvm#85734 which had a hacky version of a similar optimization.
@topperc topperc closed this Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants