[LoopUnroll] Structural cost savings analysis for full loop unrolling #114579

Open · wants to merge 6 commits into main

Conversation

@lucas-rami (Contributor) commented Nov 1, 2024

This PR was originally meant to optimize a specific nested loop structure when targeting the AMDGPU backend. Following feedback from reviewers, the contribution was moved to the target-independent part of the loop-unrolling pass. In the process, the scope of the contribution increased non-trivially to work well with the current pass design.

When the loop unroll pass deems full unrolling of a loop possible, it checks whether the estimated full unroll cost is lower than a pre-determined threshold. When this check fails, the pass attempts to run a costly iteration-by-iteration analysis to obtain much more precise unrolled/rolled costs for the loop, then uses those results to boost the threshold by a certain amount. Due to its complexity, this analysis can only run on innermost loops with small trip counts, which prevents full unrolling of loops that seem too costly in all other cases.

This PR introduces an additional, simpler analysis that runs when the former analysis fails and that can work on outer loops independently of trip count. The analysis attempts to come up with a more accurate unrolled cost for the input loop to later compare against the threshold. It does so by identifying instructions that will be removable in the unrolled form of the loop (so-called "known instructions") and deducting their cost from the previously established full unroll cost. Additionally, the analysis derives cost savings from the existence of subloops whose trip count would become runtime-independent following full unrolling of the outer loop. Similarly, more cost savings can come from the targets of conditional branches becoming known once the loop is unrolled. If the analysis lowers the overall unrolled cost of the loop enough, the loop ends up fully unrolled.
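A minimal sketch of that decision flow (editor's illustration only; the struct and helper names below are hypothetical, not the pass's actual API):

#include <cstdint>
#include <optional>

// Costs produced by the precise iteration-by-iteration analysis, when it runs.
struct PreciseCosts {
  uint64_t UnrolledCost;     // precise cost of the fully unrolled loop
  uint64_t BoostedThreshold; // threshold after the boost derived from it
};

// Order of the checks once full unrolling is deemed possible (sketch only).
bool shouldFullyUnroll(uint64_t FullUnrollCost, uint64_t Threshold,
                       std::optional<PreciseCosts> Precise,
                       uint64_t StructuralSavings) {
  // 1. Cheap check: estimated full unroll cost against the base threshold.
  if (FullUnrollCost <= Threshold)
    return true;
  // 2. Precise analysis (innermost loops with small trip counts only); its
  //    results may justify a boosted threshold.
  if (Precise && Precise->UnrolledCost <= Precise->BoostedThreshold)
    return true;
  // 3. New structural analysis: deduct the cost of "known" (removable)
  //    instructions plus subloop/branch savings, then retry the comparison.
  uint64_t Adjusted =
      StructuralSavings < FullUnrollCost ? FullUnrollCost - StructuralSavings : 0;
  return Adjusted <= Threshold;
}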

Original PR description

AMDGPU TTI's unrolling preferences disallow unrolling loops with a runtime trip count. However, it is common enough for an inner loop's bounds to only depend on an outer loop's induction variable as well as constants, for example when iterating over half of a matrix.

for (size_t i = 0; i < N; ++i) {
  for (size_t j = i; j < N; ++j) {
    // matrix[i][j]
  }
}

In such cases, unrolling the outer loop can not only be beneficial by itself but can also enable the unrolling of the (copies of the) inner loop whose trip count will have been rendered runtime-independent, yielding significant performance benefits in some cases.

This commit adds a new static bonus to the unrolling threshold of outer loops for each of their inner loops whose trip count would go from runtime-dependent to runtime-independent if the outer loop were to be unrolled. This increases the likelihood of exposing interesting optimization opportunities in loop nests that have the particular structure outlined above.
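To make the enabling effect concrete (editor's illustration only; it assumes N is a compile-time constant, here 4), fully unrolling the outer loop of the example above conceptually leaves inner-loop copies whose trip counts (4, 3, 2, 1) are now fixed, making each of them a candidate for full unrolling in turn:

#include <cstddef>

// Conceptual shape of the loop nest after fully unrolling the outer loop
// for N == 4; each copy of the inner loop now has a constant trip count.
void after_outer_unroll(int matrix[4][4]) {
  for (size_t j = 0; j < 4; ++j) { matrix[0][j] = 0; }
  for (size_t j = 1; j < 4; ++j) { matrix[1][j] = 0; }
  for (size_t j = 2; j < 4; ++j) { matrix[2][j] = 0; }
  for (size_t j = 3; j < 4; ++j) { matrix[3][j] = 0; }
}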

Fixes SWDEV-485683.


github-actions bot commented Nov 1, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot (Member) commented Nov 1, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)


Full diff: https://github.com/llvm/llvm-project/pull/114579.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+67-1)
  • (added) llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll (+84)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 5160851f8c4424..79250ad1f83064 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -47,6 +47,13 @@ static cl::opt<unsigned> UnrollThresholdIf(
   cl::desc("Unroll threshold increment for AMDGPU for each if statement inside loop"),
   cl::init(200), cl::Hidden);
 
+static cl::opt<unsigned> UnrollThresholdNestedStatic(
+    "amdgpu-unroll-threshold-nested-static",
+    cl::desc("Unroll threshold increment for AMDGPU for each nested loop whose "
+             "trip count will be made runtime-independent when fully-unrolling "
+             "the outer loop"),
+    cl::init(200), cl::Hidden);
+
 static cl::opt<bool> UnrollRuntimeLocal(
   "amdgpu-unroll-runtime-local",
   cl::desc("Allow runtime unroll for AMDGPU if local memory used in a loop"),
@@ -148,8 +155,67 @@ void AMDGPUTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
       }
     }
   }
-
   unsigned MaxBoost = std::max(ThresholdPrivate, ThresholdLocal);
+
+  if (llvm::PHINode *IV = L->getInductionVariable(SE)) {
+    // Look for subloops whose trip count would go from runtime-dependent to
+    // runtime-independent if we were to unroll the loop. Give a bonus to the
+    // current loop's unrolling threshold for each of these, as fully unrolling
+    // it would likely expose additional optimization opportunities.
+    for (const Loop *SubLoop : L->getSubLoops()) {
+      std::optional<Loop::LoopBounds> Bounds = SubLoop->getBounds(SE);
+      if (!Bounds)
+        continue;
+      Value *InitIV = &Bounds->getInitialIVValue();
+      Value *FinalIV = &Bounds->getFinalIVValue();
+      Value *StepVal = Bounds->getStepValue();
+      if (!StepVal)
+        continue;
+
+      // Determines whether SubIV's derivation depends exclusively on constants
+      // and/or IV; if it does, SubIVDependsOnIV is set to true if IV is
+      // involved in the derivation.
+      bool SubIVDependsOnIV = false;
+      std::function<bool(const Value *, unsigned)> FromConstsOrLoopIV =
+          [&](const Value *SubIV, unsigned Depth) -> bool {
+        if (SubIV == IV) {
+          SubIVDependsOnIV = true;
+          return true;
+        }
+        if (isa<Constant>(SubIV))
+          return true;
+        if (Depth >= 10)
+          return false;
+
+        const Instruction *I = dyn_cast<Instruction>(SubIV);
+        // No point in checking outside the loop since IV is necessarily inside
+        // it; also stop searching when encountering an instruction that will
+        // likely not allow SubIV's value to be statically computed.
+        if (!I || !L->contains(I) || !isa<BinaryOperator, CastInst, PHINode>(I))
+          return false;
+
+        // SubIV depends on constants or IV if all of the instruction's
+        // operands involved in its derivation also depend on constants or IV.
+        return llvm::all_of(I->operand_values(), [&](const Value *V) {
+          return FromConstsOrLoopIV(V, Depth + 1);
+        });
+      };
+
+      if (FromConstsOrLoopIV(InitIV, 0) && FromConstsOrLoopIV(FinalIV, 0) &&
+          FromConstsOrLoopIV(StepVal, 0) && SubIVDependsOnIV) {
+        UP.Threshold += UnrollThresholdNestedStatic;
+        LLVM_DEBUG(dbgs() << "Set unroll threshold " << UP.Threshold
+                          << " for loop:\n"
+                          << *L
+                          << " due to subloop's trip count becoming "
+                             "runtime-independent after unrolling:\n  "
+                          << *SubLoop);
+        if (UP.Threshold >= MaxBoost)
+          return;
+      }
+    }
+  }
+
   for (const BasicBlock *BB : L->getBlocks()) {
     const DataLayout &DL = BB->getDataLayout();
     unsigned LocalGEPsSeen = 0;
diff --git a/llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll b/llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll
new file mode 100644
index 00000000000000..36101c50db98ac
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-dependent-sub.ll
@@ -0,0 +1,84 @@
+; RUN: opt -S -mtriple=amdgcn-- -passes=loop-unroll -debug-only=AMDGPUtti < %s 2>&1 | FileCheck %s
+
+; For @dependent_sub_fullunroll, the threshold bonus should apply
+; CHECK: due to subloop's trip count becoming runtime-independent after unrolling
+
+; For @dependent_sub_no_fullunroll, the threshold bonus should not apply
+; CHECK-NOT: due to subloop's trip count becoming runtime-independent after unrolling
+
+; Check that the outer loop of a double-nested loop where the inner loop's trip
+; count depends exclusively on constants and the outer IV is fully unrolled
+; thanks to receiving a threshold bonus in AMDGPU's TTI.
+
+; CHECK-LABEL: @dependent_sub_fullunroll
+; CHECK: inner.header_latch_exiting.7
+; CHECK: outer.latch_exiting.7
+
+define void @dependent_sub_fullunroll(ptr noundef %mem) {
+entry:
+  br label %outer.header
+
+outer.header:                                                 ; preds = %entry, %outer.latch_exiting
+  %outer.iv = phi i32 [ 0, %entry ], [ %outer.iv_next, %outer.latch_exiting ]
+  br label %inner.header_latch_exiting
+
+inner.header_latch_exiting:                                   ; preds = %outer.header, %inner.header_latch_exiting
+  %inner.iv = phi i32 [ %outer.iv, %outer.header ], [ %inner.iv_next, %inner.header_latch_exiting ]
+  %inner.iv_next = add nuw nsw i32 %inner.iv, 1
+  %outer.iv.ext = zext nneg i32 %outer.iv to i64
+  %idx_part = mul nuw nsw i64 %outer.iv.ext, 16 
+  %inner.iv.ext = zext nneg i32 %inner.iv to i64
+  %idx = add nuw nsw i64 %idx_part, %inner.iv.ext 
+  %addr = getelementptr inbounds i8, ptr %mem, i64 %idx
+  store i32 0, ptr %addr
+  %inner.cond = icmp ult i32 %inner.iv_next, 8
+  br i1 %inner.cond, label %inner.header_latch_exiting, label %outer.latch_exiting, !llvm.loop !1
+
+outer.latch_exiting:                                          ; preds = %inner.header_latch_exiting
+  %outer.iv_next = add nuw nsw i32 %outer.iv, 1
+  %outer.cond = icmp ult i32 %outer.iv_next, 8
+  br i1 %outer.cond, label %outer.header, label %end, !llvm.loop !1
+  
+end:                                                          ; preds = %outer.latch_exiting
+  ret void
+}
+
+; Check that the outer loop of the same loop nest as dependent_sub_fullunroll
+; is not fully unrolled when the inner loop's final IV value depends on a
+; function argument instead of a combination of the outer IV and constants.
+
+; CHECK-LABEL: @dependent_sub_no_fullunroll
+; CHECK-NOT: inner.header_latch_exiting.7
+; CHECK-NOT: outer.latch_exiting.7
+
+define void @dependent_sub_no_fullunroll(ptr noundef %mem, i32 noundef %inner.ub) {
+entry:
+  br label %outer.header
+
+outer.header:                                                 ; preds = %entry, %outer.latch_exiting
+  %outer.iv = phi i32 [ 0, %entry ], [ %outer.iv_next, %outer.latch_exiting ]
+  br label %inner.header_latch_exiting
+
+inner.header_latch_exiting:                                   ; preds = %outer.header, %inner.header_latch_exiting
+  %inner.iv = phi i32 [ %outer.iv, %outer.header ], [ %inner.iv_next, %inner.header_latch_exiting ]
+  %inner.iv_next = add nuw nsw i32 %inner.iv, 1
+  %outer.iv.ext = zext nneg i32 %outer.iv to i64
+  %idx_part = mul nuw nsw i64 %outer.iv.ext, 16 
+  %inner.iv.ext = zext nneg i32 %inner.iv to i64
+  %idx = add nuw nsw i64 %idx_part, %inner.iv.ext 
+  %addr = getelementptr inbounds i8, ptr %mem, i64 %idx
+  store i32 0, ptr %addr
+  %inner.cond = icmp ult i32 %inner.iv_next, %inner.ub
+  br i1 %inner.cond, label %inner.header_latch_exiting, label %outer.latch_exiting, !llvm.loop !1
+
+outer.latch_exiting:                                          ; preds = %inner.header_latch_exiting
+  %outer.iv_next = add nuw nsw i32 %outer.iv, 1
+  %outer.cond = icmp ult i32 %outer.iv_next, 8
+  br i1 %outer.cond, label %outer.header, label %end, !llvm.loop !1
+  
+end:                                                          ; preds = %outer.latch_exiting
+  ret void
+}
+
+!1 = !{!1, !2}
+!2 = !{!"amdgpu.loop.unroll.threshold", i32 100}

@arsenm (Contributor) left a comment

I would expect this to be in the generic unroll logic, and not in this backend hook

@lucas-rami (Contributor, Author)

Thank you both for the fast review and comments.

I would expect this to be in the generic unroll logic, and not in this backend hook

Should I move this to the target-independent logic then? If so it seems that the analyzeLoopUnrollCost function would be the most meaningful place to incorporate this in, but the analysis is guarded by a check that the loop under consideration is an innermost loop, and I am guessing we don't want to change that. What do you think?

@shiltian (Contributor)

I agree with @arsenm that this should go to a target independent part. If you don't want to break anything existing, the simplest solution would be to check the recurrence of the new argument, and only proceed when it is set.


github-actions bot commented Dec 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@lucas-rami (Contributor, Author)

I moved the optimization to the target-independent part of the loop unrolling pass. In the process, I significantly changed the way the analysis increases the likelihood of the loop being fully unrolled to be consistent with the way unroll cost estimation is generally done in the pass.

Instead of adding an absolute value to the unrolling threshold on encountering an optimizable subloop, my analysis now tries to estimate the per-iteration cost savings that fully unrolling a loop would yield, in an attempt to compare a more precise full unrolling cost against the threshold. The general idea is to find instructions in the loop that can be optimized away once the loop IV is known. Subloops that may become unrollable if we were to unroll the outer loop yield additional cost savings. Simple control structures akin to if/then(/else) yield additional cost savings as well.
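As an illustration of the kind of savings meant here (example loop written by the editor, not taken from the patch or its tests): in the function below, once the outer loop is fully unrolled, i is a constant in each copy, so the marked instructions fold away and their cost can be deducted per iteration.

// Illustrative only: per-iteration savings once i becomes a constant.
int savingsExample(const int *table, const int *data) {
  int sum = 0;
  for (int i = 0; i < 8; ++i) {
    int bound = i + 4;                // folds to a constant per copy
    if (i & 1)                        // branch direction becomes known per copy
      sum += table[i];                // i-dependent index folds as well
    for (int j = 0; j < bound; ++j)   // subloop trip count becomes constant,
      sum += data[i * 16 + j];        //   making the subloop unrollable too
  }
  return sum;
}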

@lucas-rami lucas-rami changed the title [AMDGPU][TTI] Threshold bonus to loops whose unrolling makes nested loops unrollable [LoopUnroll] Structural cost savings analysis for full loop unrolling Dec 19, 2024
@nikic (Contributor) commented Dec 19, 2024

This seems conceptually similar to something I tried a while ago (I have an old patch stub here: 26df05e). The problem I ran into, and why I didn't end up pursuing this further, was that if we more accurately model full unrolling cost, we end up unrolling more (duh) and from inspecting IR diffs the additional unrolls weren't always profitable. If we make cost modelling more accurate, we may have to reduce unrolling thresholds to compensate.

The other thing to watch out for, especially for your motivating example, is catastrophic unrolling. If you have this kind of nested loop and unroll the outer loop first, you can get quadratic unrolling. See #57865 for an example where simple input IR results in massive output IR that hangs the compiler and #67137 for a related patch. If I understand correctly, your patch is going to make this problem worse than it is right now, because now there will be an additional bonus to unrolling the outer loop.
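To put a number on the quadratic blow-up (editor's illustration, not taken from #57865): with both trip counts equal to N, unrolling the outer loop first creates N copies of the inner loop, and fully unrolling each of those then yields about N bodies per copy, i.e. roughly N * N statements overall (10,000 for N = 100).

// Illustration only: a nest whose full unrolling is quadratic in N.
constexpr int N = 100;
void quadraticBlowUp(int *out) {
  for (int i = 0; i < N; ++i)    // unrolled first: N inner-loop copies
    for (int j = 0; j < N; ++j)  // each copy then unrolls into N bodies
      out[i * N + j] = i + j;    // ends up duplicated ~N*N times
}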

@lucas-rami (Contributor, Author)

Thanks a lot for adding context here. I see the danger of catastrophic unrolling and I now understand that my patch will make it worse. My original motivation for this patch was in fact an AMDGPU kernel structurally similar to my motivating example, where all loops need to be fully unrolled to achieve good performance (see the Discourse post here as well).

I think a solution would be, when there are subloops, to severely limit (or even nullify) the cost savings my analysis estimates if these subloops are too big or have large trip counts (with "too big" still to be defined somehow). If the latter cannot be determined, we would have to err on the "do not unroll" side. Do you think this would prevent catastrophic unrolling and still make the analysis a somewhat useful optimization?

Comment on lines 425 to 426
if (!isa<BinaryOperator, CastInst, CmpInst>(I))
return false;
Contributor

Why do you need to track this specific set of operations?

@lucas-rami (Contributor, Author) commented Dec 20, 2024

I may be too restrictive indeed. My rationale is that I only want to count instructions that would be folded by unrolling because all of their operands would be constants. I may just need to filter out instructions with side effects, phis, and terminators?

The only functional change concerns the selection of potentially
foldable instructions inside loops ("foldable instructions" replaces
"known instructions" since the former seems like a more canonical term).
Instead of filtering in instruction types that I think are foldable, I
now filter out instructions with side effects, phis, and terminators.
SelectInst is special-cased because, unlike other instructions, it is
foldable as soon as its single "select operand" is known.
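A minimal sketch of the filter this commit describes (the helper name is hypothetical and this is not the patch's actual code):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Filter *out* what cannot fold even with known operands, instead of
// filtering *in* a fixed list of opcodes. SelectInst is special-cased
// elsewhere because it folds as soon as its condition operand is known.
static bool isPotentiallyFoldable(const Instruction &I) {
  return !I.mayHaveSideEffects() && !isa<PHINode>(I) && !I.isTerminator();
}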
@lucas-rami (Contributor, Author)

This is just addressing the inline remarks, I am still thinking of solutions on the larger issue of improving the heuristics to avoid catastrophic unrolling.
