[RISCV][WIP] Fold (sh3add Z, (add X, (slli Y, 6))) -> (sh3add (sh3add Y, Z), X). #85734
Conversation
… Y, Z), X).

This gives a 0.5% reduction in dynamic instruction count for 531.deepsjeng_r from spec2017. This matches what gcc generates.

This pattern appears when indexing arrays like `uint64_t fillUpAttacks[64][8]`. The first index needs to be multiplied by 64 and the second index needs to be multiplied by 8. Both multiplied indices then need to be added to the start of the array to calculate the full address. Alternatively, you can multiply the first index by 8, add it to the second index, then multiply the sum by 8 before adding the base pointer.

This patch is a proof of concept and not how I think it should be implemented. We do a lot of work during isel to find shXadd and slli instructions that are obscured. In the motivating example, Y in `(slli Y, 6)` is `(srli W, 58)`. The pattern that was `(slli (srli W, 58), 6)` is written as `(and (srli W, 52), -64)` when we start isel. We detect the shift pair during selection of the `and`. Unless we repeat all of that cleverness, we need to do this optimization sometime after isel.

X86 misses the opportunity to use 2 LEAs on the same code.

Posting this patch for discussion.
You can test this locally with the following command:

```shell
git-clang-format --diff 470040bd4d54f39f9ac0868a2197fa2ae3e6d4f5 dbd3f2e1775a57e63c40afbdbc35897d5672484f -- llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp llvm/lib/Target/RISCV/RISCVISelDAGToDAG.h
```

View the diff from clang-format here:

```diff
diff --git a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
index 42eb4104ae..e7ab25b6e8 100644
--- a/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
@@ -3352,10 +3352,10 @@ bool RISCVDAGToDAGISel::doPeepholeSHXADD(SDNode *N) {
   SDValue Y = N11.getOperand(0);
   SDValue Z = N->getOperand(0);
-  SDNode *SH3ADD1 = CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0),
-                                           Y, Z);
-  SDNode *SH3ADD2 = CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0),
-                                           SDValue(SH3ADD1, 0), X);
+  SDNode *SH3ADD1 =
+      CurDAG->getMachineNode(RISCV::SH3ADD, SDLoc(N), N->getValueType(0), Y, Z);
+  SDNode *SH3ADD2 = CurDAG->getMachineNode(
+      RISCV::SH3ADD, SDLoc(N), N->getValueType(0), SDValue(SH3ADD1, 0), X);
   ReplaceUses(N, SH3ADD2);
   return true;
 }
```
I tried to run isel on this IR:
which is
so I think what's missing is the ability to reassociate the expression. I ran a simpler example to see if dag combine can do a simple reassociation. From this IR:
dag combine could reassociate:
So it looks like we can do this by generalizing existing dag combines?
Weird that it did that unless it knew about the sh3add instruction existing. Without a sh3add instruction, it doesn't save code and increases the path length.
I'm thinking about introducing
What's the motivation for that? I don't know if that's more generic. G_PTR_ADD only takes two operands, a pointer and an offset. A GEP with two indices can be lowered to a G_PTR_ADD+SHL for each index, or to one G_PTR_ADD with ADDs and SHLs to calculate the offset. And there are a couple of ways to calculate the offset.
It's mainly because:
If we can lower GEP directly, I think that would be easier to optimize.
Here's example IR extracted from the benchmark: https://godbolt.org/z/fxc688sbG
@topperc In case no one has started on this, I can assign it to myself and work on it.
Are you going to tackle it as a reassociation in DAG combine? There are multiple variations of this that need to be handled.
Not sure yet. I'll take a closer look and keep the variations you mentioned in mind.
This transform is looking for the shNadd idiom for zba, but that can be obscured if there's another value being added to the result. The choice to restrict to one level of association is tactical - we could of course do more, but there's the usual compile time tradeoff, and this covers the motivating example. This is a solution to a reduced test case originally flagged in the description of llvm#85734.
…ore shXadd.

This reassociates patterns like (sh3add Z, (add X, (slli Y, 6))) into (sh3add (sh3add Y, Z), X). This improves a pattern that occurs in 531.deepsjeng_r, reducing the dynamic instruction count by 0.5%.

This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases.

I've used a GEP with 2 indices because that most closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index, which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2).

See also llvm#85734, which had a hacky version of a similar optimization.
…-> (sh3add (sh3add Y, Z), X).

This is an alternative to the new pass proposed in llvm#87544. This improves a pattern that occurs in 531.deepsjeng_r, reducing the dynamic instruction count by 0.5%.

This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases.

I've used a GEP with 2 indices because that most closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index, which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2).

See also llvm#85734, which had a hacky version of a similar optimization.
This gives a 0.5% reduction in dynamic instruction count for 531.deepsjeng_r from spec2017. Using 2 sh3add matches what gcc generates.
This pattern appears when indexing arrays like `uint64_t fillUpAttacks[64][8]`. The first index needs to be multiplied by 64 bytes and the second index needs to be multiplied by 8 bytes. Then both multiplied indices need to be added to the start of the array to calculate the full address. Alternatively, you can multiply the first index by 8, add it to the second index, then multiply the sum by 8 before adding the base pointer.

What we currently generate is a direct result of how GEPs are expanded in SelectionDAGBuilder.
This patch is a proof of concept I hacked together to do measurements and not necessarily how I think it should be implemented. There are other variations of this pattern with different shift amounts that I did not handle and did not look for.
We do a lot of work during isel to find shXadd and slli instructions that are obscured. In the motivating example, Y in `(slli Y, 6)` is `(srli W, 58)`. What becomes `(slli (srli W, 58), 6)` is `(and (srl W, 52), -64)` when we start isel. We convert to a shift pair during selection of the `and`. Unless we repeat all of that cleverness to find all the variations of this pattern, we need to do this optimization sometime after isel.

There could be deeper versions of this pattern with more indices too.
X86 misses the opportunity to use 2 LEAs on the same code.
Posting this patch for discussion.