[LoopUnroll] Structural cost savings analysis for full loop unrolling #114579

Open · wants to merge 6 commits into main

Conversation

@lucas-rami (Contributor) commented Nov 1, 2024

This PR was originally meant to optimize a specific nested loop structure when targeting the AMDGPU backend. Following feedback from reviewers, the contribution was moved to the target-independent part of the loop-unrolling pass. In the process, the scope of the contribution increased non-trivially to work well with the current pass design.

When the loop unroll pass deems full unrolling of a loop possible, it checks whether the estimated full unroll cost is lower than a pre-determined threshold. When this check fails, the pass attempts to run a costly iteration-by-iteration analysis to obtain much more precise unrolled/rolled costs for the loop, then uses those results to boost the threshold by a certain amount. Due to its complexity, this analysis can only run on innermost loops with small trip counts, which prevents full unrolling of loops that seem too costly in all other cases.

This PR introduces an additional, simpler analysis that runs when the former analysis fails and that can work on outer loops independently of trip count. The analysis attempts to come up with a more accurate unrolled cost for the input loop to later compare against the threshold. It does so by identifying instructions that will be removable in the unrolled form of the loop (so-called "known instructions") and deducting their cost from the previously established full unroll cost. Additionally, the analysis derives cost savings from the existence of subloops whose trip count would become runtime-independent following full unrolling of the outer loop. Similarly, more cost savings can come from the targets of conditional branches becoming known once the loop is unrolled. If the analysis lowers the overall unrolled cost of the loop enough, the loop ends up fully unrolled.
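A minimal sketch of that decision flow (editor's illustration only; the struct and helper names below are hypothetical, not the pass's actual API):

#include <cstdint>
#include <optional>

// Costs produced by the precise iteration-by-iteration analysis, when it runs.
struct PreciseCosts {
  uint64_t UnrolledCost;     // precise cost of the fully unrolled loop
  uint64_t BoostedThreshold; // threshold after the boost derived from it
};

// Order of the checks once full unrolling is deemed possible (sketch only).
bool shouldFullyUnroll(uint64_t FullUnrollCost, uint64_t Threshold,
                       std::optional<PreciseCosts> Precise,
                       uint64_t StructuralSavings) {
  // 1. Cheap check: estimated full unroll cost against the base threshold.
  if (FullUnrollCost <= Threshold)
    return true;
  // 2. Precise analysis (innermost loops with small trip counts only); its
  //    results may justify a boosted threshold.
  if (Precise && Precise->UnrolledCost <= Precise->BoostedThreshold)
    return true;
  // 3. New structural analysis: deduct the cost of "known" (removable)
  //    instructions plus subloop/branch savings, then retry the comparison.
  uint64_t Adjusted =
      StructuralSavings < FullUnrollCost ? FullUnrollCost - StructuralSavings : 0;
  return Adjusted <= Threshold;
}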

Original PR description

AMDGPU TTI's unrolling preferences disallow unrolling loops with a runtime trip count. However, it is common enough for an inner loop's bounds to only depend on an outer loop's induction variable as well as constants, for example when iterating over half of a matrix.

for (size_t i = 0; i < N; ++i) {
  for (size_t j = i; j < N; ++j) {
    // matrix[i][j]
  }
}

In such cases, unrolling the outer loop can not only be beneficial by itself but can also enable the unrolling of the (copies of the) inner loop whose trip count will have been rendered runtime-independent, yielding significant performance benefits in some cases.

This commit adds a new static bonus to the unrolling threshold of outer loops for each of their inner loops whose trip count would go from runtime-dependent to runtime-independent if the outer loop were to be unrolled. This increases the likelihood of exposing interesting optimization opportunities in loop nests that have the particular structure outlined above.
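To make the enabling effect concrete (editor's illustration only; it assumes N is a compile-time constant, here 4), fully unrolling the outer loop of the example above conceptually leaves inner-loop copies whose trip counts (4, 3, 2, 1) are now fixed, making each of them a candidate for full unrolling in turn:

#include <cstddef>

// Conceptual shape of the loop nest after fully unrolling the outer loop
// for N == 4; each copy of the inner loop now has a constant trip count.
void after_outer_unroll(int matrix[4][4]) {
  for (size_t j = 0; j < 4; ++j) { matrix[0][j] = 0; }
  for (size_t j = 1; j < 4; ++j) { matrix[1][j] = 0; }
  for (size_t j = 2; j < 4; ++j) { matrix[2][j] = 0; }
  for (size_t j = 3; j < 4; ++j) { matrix[3][j] = 0; }
}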

Fixes SWDEV-485683.


github-actions bot commented Nov 1, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot (Member) commented Nov 1, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)


Full diff: https://github.com/llvm/llvm-project/pull/114579.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+67-1)
  • (added) llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll (+84)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 5160851f8c4424..79250ad1f83064 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -47,6 +47,13 @@ static cl::opt<unsigned> UnrollThresholdIf(
   cl::desc("Unroll threshold increment for AMDGPU for each if statement inside loop"),
   cl::init(200), cl::Hidden);
 
+static cl::opt<unsigned> UnrollThresholdNestedStatic(
+    "amdgpu-unroll-threshold-nested-static",
+    cl::desc("Unroll threshold increment for AMDGPU for each nested loop whose "
+             "trip count will be made runtime-independent when fully-unrolling "
+             "the outer loop"),
+    cl::init(200), cl::Hidden);
+
 static cl::opt<bool> UnrollRuntimeLocal(
   "amdgpu-unroll-runtime-local",
   cl::desc("Allow runtime unroll for AMDGPU if local memory used in a loop"),
@@ -148,8 +155,67 @@ void AMDGPUTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
       }
     }
   }
-
   unsigned MaxBoost = std::max(ThresholdPrivate, ThresholdLocal);
+
+  if (llvm::PHINode *IV = L->getInductionVariable(SE)) {
+    // Look for subloops whose trip count would go from runtime-dependent to
+    // runtime-independent if we were to unroll the loop. Give a bonus to the
+    // current loop's unrolling threshold for each of these, as fully unrolling
+    // it would likely expose additional optimization opportunities.
+    for (const Loop *SubLoop : L->getSubLoops()) {
+      std::optional<Loop::LoopBounds> Bounds = SubLoop->getBounds(SE);
+      if (!Bounds)
+        continue;
+      Value *InitIV = &Bounds->getInitialIVValue();
+      Value *FinalIV = &Bounds->getFinalIVValue();
+      Value *StepVal = Bounds->getStepValue();
+      if (!StepVal)
+        continue;
+
+      // Determines whether SubIV's derivation depends exclusively on constants
+      // and/or IV; if it does, SubIVDependsOnIV is set to true if IV is
+      // involved in the derivation.
+      bool SubIVDependsOnIV = false;
+      std::function<bool(const Value *, unsigned)> FromConstsOrLoopIV =
+          [&](const Value *SubIV, unsigned Depth) -> bool {
+        if (SubIV == IV) {
+          SubIVDependsOnIV = true;
+          return true;
+        }
+        if (isa<Constant>(SubIV))
+          return true;
+        if (Depth >= 10)
+          return false;
+
+        const Instruction *I = dyn_cast<Instruction>(SubIV);
+        // No point in checking outside the loop since IV is necessarily inside
+        // it; also stop searching when encountering an instruction that will
+        // likely not allow SubIV's value to be statically computed.
+        if (!I || !L->contains(I) || !isa<BinaryOperator, CastInst, PHINode>(I))
+          return false;
+
+        // SubIV depends on constants or IV if all of the instruction's
+        // operands involved in its derivation also depend on constants or IV.
+        return llvm::all_of(I->operand_values(), [&](const Value *V) {
+          return FromConstsOrLoopIV(V, Depth + 1);
+        });
+      };
+
+      if (FromConstsOrLoopIV(InitIV, 0) && FromConstsOrLoopIV(FinalIV, 0) &&
+          FromConstsOrLoopIV(StepVal, 0) && SubIVDependsOnIV) {
+        UP.Threshold += UnrollThresholdNestedStatic;
+        LLVM_DEBUG(dbgs() << "Set unroll threshold " << UP.Threshold
+                          << " for loop:\n"
+                          << *L
+                          << " due to subloop's trip count becoming "
+                             "runtime-independent after unrolling:\n  "
+                          << *SubLoop);
+        if (UP.Threshold >= MaxBoost)
+          return;
+      }
+    }
+  }
+
   for (const BasicBlock *BB : L->getBlocks()) {
     const DataLayout &DL = BB->getDataLayout();
     unsigned LocalGEPsSeen = 0;
diff --git a/llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll b/llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll
new file mode 100644
index 00000000000000..36101c50db98ac
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll
@@ -0,0 +1,84 @@
+; RUN: opt -S -mtriple=amdgcn-- -passes=loop-unroll -debug-only=AMDGPUtti < %s 2>&1 | FileCheck %s
+
+; For @dependent_sub_fullunroll, the threshold bonus should apply
+; CHECK: due to subloop's trip count becoming runtime-independent after unrolling
+
+; For @dependent_sub_no_fullunroll, the threshold bonus should not apply
+; CHECK-NOT: due to subloop's trip count becoming runtime-independent after unrolling
+
+; Check that the outer loop of a double-nested loop where the inner loop's trip
+; count depends exclusively on constants and the outer IV is fully unrolled
+; thanks to receiving a threshold bonus in AMDGPU's TTI.
+
+; CHECK-LABEL: @dependent_sub_fullunroll
+; CHECK: inner.header_latch_exiting.7
+; CHECK: outer.latch_exiting.7
+
+define void @dependent_sub_fullunroll(ptr noundef %mem) {
+entry:
+  br label %outer.header
+
+outer.header:                                                 ; preds = %entry, %outer.latch_exiting
+  %outer.iv = phi i32 [ 0, %entry ], [ %outer.iv_next, %outer.latch_exiting ]
+  br label %inner.header_latch_exiting
+
+inner.header_latch_exiting:                                   ; preds = %outer.header, %inner.header_latch_exiting
+  %inner.iv = phi i32 [ %outer.iv, %outer.header ], [ %inner.iv_next, %inner.header_latch_exiting ]
+  %inner.iv_next = add nuw nsw i32 %inner.iv, 1
+  %outer.iv.ext = zext nneg i32 %outer.iv to i64
+  %idx_part = mul nuw nsw i64 %outer.iv.ext, 16 
+  %inner.iv.ext = zext nneg i32 %inner.iv to i64
+  %idx = add nuw nsw i64 %idx_part, %inner.iv.ext 
+  %addr = getelementptr inbounds i8, ptr %mem, i64 %idx
+  store i32 0, ptr %addr
+  %inner.cond = icmp ult i32 %inner.iv_next, 8
+  br i1 %inner.cond, label %inner.header_latch_exiting, label %outer.latch_exiting, !llvm.loop !1
+
+outer.latch_exiting:                                          ; preds = %inner.header_latch_exiting
+  %outer.iv_next = add nuw nsw i32 %outer.iv, 1
+  %outer.cond = icmp ult i32 %outer.iv_next, 8
+  br i1 %outer.cond, label %outer.header, label %end, !llvm.loop !1
+  
+end:                                                          ; preds = %outer.latch_exiting
+  ret void
+}
+
+; Check that the outer loop of the same loop nest as dependent_sub_fullunroll
+; is not fully unrolled when the inner loop's final IV value depends on a
+; function argument instead of a combination of the outer IV and constants.
+
+; CHECK-LABEL: @dependent_sub_no_fullunroll
+; CHECK-NOT: inner.header_latch_exiting.7
+; CHECK-NOT: outer.latch_exiting.7
+
+define void @dependent_sub_no_fullunroll(ptr noundef %mem, i32 noundef %inner.ub) {
+entry:
+  br label %outer.header
+
+outer.header:                                                 ; preds = %entry, %outer.latch_exiting
+  %outer.iv = phi i32 [ 0, %entry ], [ %outer.iv_next, %outer.latch_exiting ]
+  br label %inner.header_latch_exiting
+
+inner.header_latch_exiting:                                   ; preds = %outer.header, %inner.header_latch_exiting
+  %inner.iv = phi i32 [ %outer.iv, %outer.header ], [ %inner.iv_next, %inner.header_latch_exiting ]
+  %inner.iv_next = add nuw nsw i32 %inner.iv, 1
+  %outer.iv.ext = zext nneg i32 %outer.iv to i64
+  %idx_part = mul nuw nsw i64 %outer.iv.ext, 16 
+  %inner.iv.ext = zext nneg i32 %inner.iv to i64
+  %idx = add nuw nsw i64 %idx_part, %inner.iv.ext 
+  %addr = getelementptr inbounds i8, ptr %mem, i64 %idx
+  store i32 0, ptr %addr
+  %inner.cond = icmp ult i32 %inner.iv_next, %inner.ub
+  br i1 %inner.cond, label %inner.header_latch_exiting, label %outer.latch_exiting, !llvm.loop !1
+
+outer.latch_exiting:                                          ; preds = %inner.header_latch_exiting
+  %outer.iv_next = add nuw nsw i32 %outer.iv, 1
+  %outer.cond = icmp ult i32 %outer.iv_next, 8
+  br i1 %outer.cond, label %outer.header, label %end, !llvm.loop !1
+  
+end:                                                          ; preds = %outer.latch_exiting
+  ret void
+}
+
+!1 = !{!1, !2}
+!2 = !{!"amdgpu.loop.unroll.threshold", i32 100}

@arsenm (Contributor) left a comment

I would expect this to be in the generic unroll logic, and not in this backend hook

@lucas-rami (Contributor, Author)

Thank you both for the fast review and comments.

I would expect this to be in the generic unroll logic, and not in this backend hook

Should I move this to the target-independent logic then? If so it seems that the analyzeLoopUnrollCost function would be the most meaningful place to incorporate this in, but the analysis is guarded by a check that the loop under consideration is an innermost loop, and I am guessing we don't want to change that. What do you think?

@shiltian (Contributor)

I agree with @arsenm that this should go to a target independent part. If you don't want to break anything existing, the simplest solution would be to check the recurrence of the new argument, and only proceed when it is set.


github-actions bot commented Dec 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@lucas-rami (Contributor, Author)

I moved the optimization to the target-independent part of the loop unrolling pass. In the process, I significantly changed the way the analysis increases the likelihood of the loop being fully unrolled to be consistent with the way unroll cost estimation is generally done in the pass.

Instead of adding an absolute value to the unrolling threshold on encountering an optimizable subloop, my analysis now tries to estimate the per-iteration cost savings that fully unrolling a loop would yield, in an attempt to compare a more precise full unrolling cost against the threshold. The general idea is to find instructions in the loop that can be optimized away once the loop IV is known. Subloops that may become unrollable if we were to unroll the outer loop yield additional cost savings. Simple control structures akin to if/then(/else) yield additional cost savings as well.
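As an illustration of the kind of savings meant here (example loop written by the editor, not taken from the patch or its tests): in the function below, once the outer loop is fully unrolled, i is a constant in each copy, so the marked instructions fold away and their cost can be deducted per iteration.

// Illustrative only: per-iteration savings once i becomes a constant.
int savingsExample(const int *table, const int *data) {
  int sum = 0;
  for (int i = 0; i < 8; ++i) {
    int bound = i + 4;                // folds to a constant per copy
    if (i & 1)                        // branch direction becomes known per copy
      sum += table[i];                // i-dependent index folds as well
    for (int j = 0; j < bound; ++j)   // subloop trip count becomes constant,
      sum += data[i * 16 + j];        //   making the subloop unrollable too
  }
  return sum;
}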

@lucas-rami lucas-rami changed the title [AMDGPU][TTI] Threshold bonus to loops whose unrolling makes nested loops unrollable [LoopUnroll] Structural cost savings analysis for full loop unrolling Dec 19, 2024
@nikic (Contributor) commented Dec 19, 2024

This seems conceptually similar to something I tried a while ago (I have an old patch stub here: 26df05e). The problem I ran into, and why I didn't end up pursuing this further, was that if we more accurately model full unrolling cost, we end up unrolling more (duh) and from inspecting IR diffs the additional unrolls weren't always profitable. If we make cost modelling more accurate, we may have to reduce unrolling thresholds to compensate.

The other thing to watch out for, especially for your motivating example, is catastrophic unrolling. If you have this kind of nested loop and unroll the outer loop first, you can get quadratic unrolling. See #57865 for an example where simple input IR results in massive output IR that hangs the compiler and #67137 for a related patch. If I understand correctly, your patch is going to make this problem worse than it is right now, because now there will be an additional bonus to unrolling the outer loop.
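To put a number on the quadratic blow-up (editor's illustration, not taken from #57865): with both trip counts equal to N, unrolling the outer loop first creates N copies of the inner loop, and fully unrolling each of those then yields about N bodies per copy, i.e. roughly N * N statements overall (10,000 for N = 100).

// Illustration only: a nest whose full unrolling is quadratic in N.
constexpr int N = 100;
void quadraticBlowUp(int *out) {
  for (int i = 0; i < N; ++i)    // unrolled first: N inner-loop copies
    for (int j = 0; j < N; ++j)  // each copy then unrolls into N bodies
      out[i * N + j] = i + j;    // ends up duplicated ~N*N times
}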

@lucas-rami (Contributor, Author)

Thanks a lot for adding context here. I see the danger of catastrophic unrolling and I now understand that my patch will make it worse. My original motivation for this patch was in fact an AMDGPU kernel structurally similar to my motivating example, where all loops need to be fully unrolled to achieve good performance (see the Discourse post here as well).

I think a solution would be, when there are subloops, to severely limit (or even nullify) the cost savings my analysis estimates if these subloops are too big or have large trip counts (with "too big" still to be defined somehow). If the latter cannot be determined, we would have to err on the "do not unroll" side. Do you think this would prevent catastrophic unrolling and still make the analysis a somewhat useful optimization?

Comment on lines 425 to 426
if (!isa<BinaryOperator, CastInst, CmpInst>(I))
return false;
Contributor

Why do you need to track this specific set of operations?

@lucas-rami (Contributor, Author) commented Dec 20, 2024

I may be too restrictive indeed. My rationale is that I only want to count instructions that would be folded by unrolling because all of their operands would be constants. I may just need to filter out instructions with side effects, phis, and terminators?

The only functional change concerns the selection of potentially
foldable instructions inside loops ("foldable instructions" replaces
"known instructions" since the former seems like a more canonical term).
Instead of filtering in instruction types that I think are foldable, I
now filter out instructions with side effects, phis, and terminators.
SelectInst is special-cased because, unlike other instructions, it is
foldable as soon as its single "select operand" is known.
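A minimal sketch of the filter this commit describes (the helper name is hypothetical and this is not the patch's actual code):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Filter *out* what cannot fold even with known operands, instead of
// filtering *in* a fixed list of opcodes. SelectInst is special-cased
// elsewhere because it folds as soon as its condition operand is known.
static bool isPotentiallyFoldable(const Instruction &I) {
  return !I.mayHaveSideEffects() && !isa<PHINode>(I) && !I.isTerminator();
}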
@lucas-rami (Contributor, Author)

This is just addressing the inline remarks, I am still thinking of solutions on the larger issue of improving the heuristics to avoid catastrophic unrolling.
