Conversation

rj-jesus commented Feb 5, 2024

This patch removes the m_OneUse check that was being done for the following patterns:
(A + RHS) + RHS --> A + (RHS << 1)
LHS + (A + LHS) --> A + (LHS << 1)

This may increase the instruction count if (A + RHS) or (A + LHS) is used more than once, but it will usually help with LICM later on if A is an induction variable.
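
As a rough illustration (a hypothetical C++ snippet, not taken from the patch), consider a loop where the induction variable i plays the role of A and a loop-invariant n plays the role of RHS:

  // Hypothetical loop: i plays the role of A, the loop-invariant n plays the
  // role of RHS. Rewriting (i + n) + n as i + (n << 1) exposes a
  // loop-invariant shift that LICM can hoist out of the loop.
  long sum_strided(const long *p, long n, long len) {
    long s = 0;
    for (long i = 0; i < len; ++i)
      s += p[(i + n) + n]; // (A + RHS) + RHS, with A = i, RHS = n
    return s;
  }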

This is exemplified here: https://godbolt.org/z/q94vbf5e9 - with the patch, the first instruction of the second iteration ([1,0]) can start much earlier, allowing for significantly higher IPC.
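
Conceptually, after the reassociation and a later LICM pass, the loop above ends up in a shape like this (a sketch, not the actual output at the link):

  // Sketch of the shape after the rewrite plus LICM: the shift is computed
  // once outside the loop, so each iteration's address needs only one add.
  long sum_strided_opt(const long *p, long n, long len) {
    long s = 0;
    long n2 = n << 1;              // hoisted (RHS << 1), loop invariant
    for (long i = 0; i < len; ++i)
      s += p[i + n2];              // A + (RHS << 1)
    return s;
  }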

I've benchmarked this with RAJAPerf on a Grace system and see an overall performance uplift of 2% across the whole benchmark, which comes mainly from ~45% uplifts in three kernels. The worst result I saw was a 1.2% slowdown in a single kernel, which I'll look into later.

Please let me know if I should move this elsewhere!

This patch removes the `m_OneUse` check that was being done for the
following patterns:
  (A + RHS) + RHS --> A + (RHS << 1)
  LHS + (A + LHS) --> A + (LHS << 1)

This may increase the instruction count if (A + RHS) or (A + LHS) is
used more than once, but it will usually help with LICM later on if A
is an induction variable.
llvmbot commented Feb 5, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Ricardo Jesus (rj-jesus)

Full diff: https://github.com/llvm/llvm-project/pull/80705.diff

1 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp (+5-2)
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp b/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
index 8a00b75a1f7404..984f7a619d4ab8 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
@@ -1521,11 +1521,14 @@ Instruction *InstCombinerImpl::visitAdd(BinaryOperator &I) {
     return BinaryOperator::CreateSub(A, B);
 
   // (A + RHS) + RHS --> A + (RHS << 1)
-  if (match(LHS, m_OneUse(m_c_Add(m_Value(A), m_Specific(RHS)))))
+  // This may increase instruction count if (A + RHS) is used more than once
+  // but helps make (RHS << 1) suitable for potential LICM.
+  if (match(LHS, m_c_Add(m_Value(A), m_Specific(RHS))))
     return BinaryOperator::CreateAdd(A, Builder.CreateShl(RHS, 1, "reass.add"));
 
   // LHS + (A + LHS) --> A + (LHS << 1)
-  if (match(RHS, m_OneUse(m_c_Add(m_Value(A), m_Specific(LHS)))))
+  // Same as above.
+  if (match(RHS, m_c_Add(m_Value(A), m_Specific(LHS))))
     return BinaryOperator::CreateAdd(A, Builder.CreateShl(LHS, 1, "reass.add"));
 
   {

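For context on the caveat in the new comment, here is a hypothetical sketch (not from the patch) of the multi-use case that the removed m_OneUse guard used to rule out:

  // Hypothetical multi-use case: t has a second use, so it cannot be removed
  // after the rewrite and the shift is a net extra instruction.
  long multi_use(long a, long n) {
    long t = a + n;   // LHS, used again below
    long u = t + n;   // becomes a + (n << 1); t still has to be materialised
    return t * u;
  }
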
rj-jesus marked this pull request as draft February 5, 2024 16:22
rj-jesus closed this Feb 12, 2024