[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" #74703

xiangzh1 · 2023-12-07T08:39:17Z

Back ground:
This is an extension of 74268, in 74268 we try to fix loop unroll fail at SimplifyCFG.

SimplifyCFG folding loop branches then cause loop unroll failed for "#program unroll" loop.
for example:

#program unroll
for (int I = 0; I < ConstNum; ++I) { // folding "I < ConstNum" and "Cond2"
  if (Cond2) {
  break;
  }
  xxx loop body;
}

Duo to the fix in SimplifyCFG will miss branch folding optimization. As @nikic suggested, we try fix it at loop unroll stage.
The pragma unroll metadata only takes effect if there is an exact trip count, but not if there is an upper bound trip count. This patch make it work with an upper bound trip count as well in shouldPragmaUnroll().

Why we do this:
Loop unroll is important in stack nervous devices (e.g. GPU, and that is why a lot of GPU code mark loop with "#program unroll").
It usually much simplify the address (offset) calculations in old iterations, then we can do a lot of others optimizations, e.g, SROA, for these simplifed address (escape alloca the whole aggregates).

llvmbot · 2023-12-07T08:39:43Z

@llvm/pr-subscribers-llvm-transforms

Author: XiangZhang (xiangzh1)

Changes

Back ground:
This is an extension of 74268, in 74268 we try to fix loop unroll fail at SimplifyCFG.

SimplifyCFG folding loop branches then cause loop unroll failed for "#program unroll" loop.
for example:

#program unroll
for (int I = 0; I &lt; ConstNum; ++I) { // ConstNum &gt; 1
  if (Cond2) {
  break;
  }
  xxx loop body;
}

Duo to the fix in SimplifyCFG will miss branch folding optimization. As @nikic suggested, we try fix it at loop unroll stage.
The pragma unroll metadata only takes effect if there is an exact trip count, but not if there is an upper bound trip count. This patch make it work with an upper bound trip count as well in shouldPragmaUnroll().

Why we do this:
Loop unroll is important in stack nervous devices (e.g. GPU, and that is why a lot of GPU code mark loop with "#program unroll").
It usually much simplify the address (offset) calculations in old iterations, then we can do a lot of others optimizations, e.g, SROA, for these simplifed address (escape alloca the whole aggregates).

Patch is 64.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74703.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (+7-2)
(added) llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll (+939)

diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 446aa497026d3..963f97d796ae7 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -755,7 +755,7 @@ static unsigned getFullUnrollBoostingFactor(const EstimatedUnrollCost &Cost,
 static std::optional<unsigned>
 shouldPragmaUnroll(Loop *L, const PragmaInfo &PInfo,
                    const unsigned TripMultiple, const unsigned TripCount,
-                   const UnrollCostEstimator UCE,
+                   unsigned MaxTripCount, const UnrollCostEstimator UCE,
                    const TargetTransformInfo::UnrollingPreferences &UP) {
 
   // Using unroll pragma
@@ -776,6 +776,11 @@ shouldPragmaUnroll(Loop *L, const PragmaInfo &PInfo,
   if (PInfo.PragmaFullUnroll && TripCount != 0)
     return TripCount;
 
+  // Small MaxTripCount is clearly calculated with "pragma unroll".
+  if (PInfo.PragmaEnableUnroll && !TripCount && MaxTripCount &&
+      MaxTripCount <= UnrollMaxUpperBound)
+    return MaxTripCount;
+
   // if didn't return until here, should continue to other priorties
   return std::nullopt;
 }
@@ -902,7 +907,7 @@ bool llvm::computeUnrollCount(
   // 1st priority is unroll count set by "unroll-count" option.
   // 2nd priority is unroll count set by pragma.
   if (auto UnrollFactor = shouldPragmaUnroll(L, PInfo, TripMultiple, TripCount,
-                                             UCE, UP)) {
+                                             MaxTripCount, UCE, UP)) {
     UP.Count = *UnrollFactor;
 
     if (UserUnrollCount || (PragmaCount > 0)) {
diff --git a/llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll b/llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll
new file mode 100644
index 0000000000000..9f7a95637c064
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll
@@ -0,0 +1,939 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt < %s -S -passes=simplifycfg | FileCheck %s --check-prefixes=CHECK-CFG
+; RUN: opt < %s -S -passes=simplifycfg,loop-unroll --unroll-max-upperbound=17 | FileCheck %s --check-prefixes=CHECK-UNROLL
+
+; This test designed to check:
+; We can still unroll loop with 'pragma unroll' if loop count(trip count) was destroyed by previous optimization.
+; For exmaple, in following test, loop condition "Dim < 16" was 'merged' with "Dim == Dims" in folding branches
+; at simplifycfg. But if custumer mark the loop with "#pragma unroll", we can still successfully unroll it under
+; unroll-max-upperbound.
+;
+; __device__ void func(int Idx, int *Arr[], int Dims, int *Out) {
+;   #pragma unroll
+;   for (int Dim = 0; Dim < 16; ++Dim) {
+;     if (Dim == Dims) {
+;       break;
+;     }
+;     int divmod = Arr[Dim][Idx];
+;     Idx = divmod + 1;
+;
+;     for (int arg = 0; arg < 4; arg++) {
+;       Out[arg] += Arr[Dim][arg];
+;       bar();
+;     }
+;   }
+; }
+
+define void @func(i32 noundef %Idx, ptr noundef %Arr, i32 noundef %Dims, ptr noundef %Out) {
+; CHECK-CFG-LABEL: define void @func(
+; CHECK-CFG-SAME: i32 noundef [[IDX:%.*]], ptr noundef [[ARR:%.*]], i32 noundef [[DIMS:%.*]], ptr noundef [[OUT:%.*]]) {
+; CHECK-CFG-NEXT:  entry:
+; CHECK-CFG-NEXT:    br label [[FOR_COND:%.*]]
+; CHECK-CFG:       for.cond:
+; CHECK-CFG-NEXT:    [[DIM_0:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC16:%.*]], [[FOR_COND_CLEANUP6:%.*]] ]
+; CHECK-CFG-NEXT:    [[IDX_ADDR_0:%.*]] = phi i32 [ [[IDX]], [[ENTRY]] ], [ [[ADD:%.*]], [[FOR_COND_CLEANUP6]] ]
+; CHECK-CFG-NEXT:    [[CMP:%.*]] = icmp sge i32 [[DIM_0]], 16
+; CHECK-CFG-NEXT:    [[CMP1:%.*]] = icmp eq i32 [[DIM_0]], [[DIMS]]
+; CHECK-CFG-NEXT:    [[OR_COND:%.*]] = or i1 [[CMP]], [[CMP1]]
+; CHECK-CFG-NEXT:    br i1 [[OR_COND]], label [[CLEANUP:%.*]], label [[IF_END:%.*]]
+; CHECK-CFG:       if.end:
+; CHECK-CFG-NEXT:    [[IDXPROM:%.*]] = sext i32 [[DIM_0]] to i64
+; CHECK-CFG-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 [[IDXPROM]]
+; CHECK-CFG-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARRAYIDX]], align 8
+; CHECK-CFG-NEXT:    [[IDXPROM2:%.*]] = sext i32 [[IDX_ADDR_0]] to i64
+; CHECK-CFG-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 [[IDXPROM2]]
+; CHECK-CFG-NEXT:    [[TMP1:%.*]] = load i32, ptr [[ARRAYIDX3]], align 4
+; CHECK-CFG-NEXT:    [[ADD]] = add nsw i32 [[TMP1]], 1
+; CHECK-CFG-NEXT:    br label [[FOR_COND4:%.*]]
+; CHECK-CFG:       for.cond4:
+; CHECK-CFG-NEXT:    [[ARG_0:%.*]] = phi i32 [ 0, [[IF_END]] ], [ [[INC:%.*]], [[FOR_BODY7:%.*]] ]
+; CHECK-CFG-NEXT:    [[CMP5:%.*]] = icmp slt i32 [[ARG_0]], 4
+; CHECK-CFG-NEXT:    br i1 [[CMP5]], label [[FOR_BODY7]], label [[FOR_COND_CLEANUP6]]
+; CHECK-CFG:       for.cond.cleanup6:
+; CHECK-CFG-NEXT:    [[INC16]] = add nsw i32 [[DIM_0]], 1
+; CHECK-CFG-NEXT:    br label [[FOR_COND]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CFG:       for.body7:
+; CHECK-CFG-NEXT:    [[TMP2:%.*]] = load ptr, ptr [[ARRAYIDX]], align 8
+; CHECK-CFG-NEXT:    [[IDXPROM10:%.*]] = sext i32 [[ARG_0]] to i64
+; CHECK-CFG-NEXT:    [[ARRAYIDX11:%.*]] = getelementptr inbounds i32, ptr [[TMP2]], i64 [[IDXPROM10]]
+; CHECK-CFG-NEXT:    [[TMP3:%.*]] = load i32, ptr [[ARRAYIDX11]], align 4
+; CHECK-CFG-NEXT:    [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 [[IDXPROM10]]
+; CHECK-CFG-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX13]], align 4
+; CHECK-CFG-NEXT:    [[ADD14:%.*]] = add nsw i32 [[TMP4]], [[TMP3]]
+; CHECK-CFG-NEXT:    store i32 [[ADD14]], ptr [[ARRAYIDX13]], align 4
+; CHECK-CFG-NEXT:    call void @_Z3barv()
+; CHECK-CFG-NEXT:    [[INC]] = add nsw i32 [[ARG_0]], 1
+; CHECK-CFG-NEXT:    br label [[FOR_COND4]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-CFG:       cleanup:
+; CHECK-CFG-NEXT:    ret void
+;
+; CHECK-UNROLL-LABEL: define void @func(
+; CHECK-UNROLL-SAME: i32 noundef [[IDX:%.*]], ptr noundef [[ARR:%.*]], i32 noundef [[DIMS:%.*]], ptr noundef [[OUT:%.*]]) {
+; CHECK-UNROLL-NEXT:  entry:
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND:%.*]]
+; CHECK-UNROLL:       for.cond:
+; CHECK-UNROLL-NEXT:    [[CMP1:%.*]] = icmp eq i32 0, [[DIMS]]
+; CHECK-UNROLL-NEXT:    br i1 [[CMP1]], label [[CLEANUP:%.*]], label [[IF_END:%.*]]
+; CHECK-UNROLL:       if.end:
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND4:%.*]]
+; CHECK-UNROLL:       for.cond4:
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7:%.*]]
+; CHECK-UNROLL:       for.cond.cleanup6:
+; CHECK-UNROLL-NEXT:    [[CMP1_1:%.*]] = icmp eq i32 1, [[DIMS]]
+; CHECK-UNROLL-NEXT:    br i1 [[CMP1_1]], label [[CLEANUP]], label [[IF_END_1:%.*]]
+; CHECK-UNROLL:       if.end.1:
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 1
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND4_1:%.*]]
+; CHECK-UNROLL:       for.cond4.1:
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_12:%.*]]
+; CHECK-UNROLL:       for.body7.12:
+; CHECK-UNROLL-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARRAYIDX_1]], align 8
+; CHECK-UNROLL-NEXT:    [[TMP1:%.*]] = load i32, ptr [[TMP0]], align 4
+; CHECK-UNROLL-NEXT:    [[TMP2:%.*]] = load i32, ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_11:%.*]] = add nsw i32 [[TMP2]], [[TMP1]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_11]], ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_1_1:%.*]]
+; CHECK-UNROLL:       for.body7.1.1:
+; CHECK-UNROLL-NEXT:    [[TMP3:%.*]] = load ptr, ptr [[ARRAYIDX_1]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_1_1:%.*]] = getelementptr inbounds i32, ptr [[TMP3]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX11_1_1]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_1_1:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP5:%.*]] = load i32, ptr [[ARRAYIDX13_1_1]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_1_1:%.*]] = add nsw i32 [[TMP5]], [[TMP4]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_1_1]], ptr [[ARRAYIDX13_1_1]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_2_1:%.*]]
+; CHECK-UNROLL:       for.body7.2.1:
+; CHECK-UNROLL-NEXT:    [[TMP6:%.*]] = load ptr, ptr [[ARRAYIDX_1]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_2_1:%.*]] = getelementptr inbounds i32, ptr [[TMP6]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP7:%.*]] = load i32, ptr [[ARRAYIDX11_2_1]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_2_1:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP8:%.*]] = load i32, ptr [[ARRAYIDX13_2_1]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_2_1:%.*]] = add nsw i32 [[TMP8]], [[TMP7]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_2_1]], ptr [[ARRAYIDX13_2_1]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_3_1:%.*]]
+; CHECK-UNROLL:       for.body7.3.1:
+; CHECK-UNROLL-NEXT:    [[TMP9:%.*]] = load ptr, ptr [[ARRAYIDX_1]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_3_1:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP10:%.*]] = load i32, ptr [[ARRAYIDX11_3_1]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_3_1:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP11:%.*]] = load i32, ptr [[ARRAYIDX13_3_1]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_3_1:%.*]] = add nsw i32 [[TMP11]], [[TMP10]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_3_1]], ptr [[ARRAYIDX13_3_1]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br i1 false, label [[FOR_BODY7_4:%.*]], label [[FOR_COND_CLEANUP6_1:%.*]]
+; CHECK-UNROLL:       for.cond.cleanup6.1:
+; CHECK-UNROLL-NEXT:    [[CMP1_2:%.*]] = icmp eq i32 2, [[DIMS]]
+; CHECK-UNROLL-NEXT:    br i1 [[CMP1_2]], label [[CLEANUP]], label [[IF_END_2:%.*]]
+; CHECK-UNROLL:       if.end.2:
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX_2:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 2
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND4_2:%.*]]
+; CHECK-UNROLL:       for.cond4.2:
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_24:%.*]]
+; CHECK-UNROLL:       for.body7.24:
+; CHECK-UNROLL-NEXT:    [[TMP12:%.*]] = load ptr, ptr [[ARRAYIDX_2]], align 8
+; CHECK-UNROLL-NEXT:    [[TMP13:%.*]] = load i32, ptr [[TMP12]], align 4
+; CHECK-UNROLL-NEXT:    [[TMP14:%.*]] = load i32, ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_23:%.*]] = add nsw i32 [[TMP14]], [[TMP13]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_23]], ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_1_2:%.*]]
+; CHECK-UNROLL:       for.body7.1.2:
+; CHECK-UNROLL-NEXT:    [[TMP15:%.*]] = load ptr, ptr [[ARRAYIDX_2]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_1_2:%.*]] = getelementptr inbounds i32, ptr [[TMP15]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP16:%.*]] = load i32, ptr [[ARRAYIDX11_1_2]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_1_2:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP17:%.*]] = load i32, ptr [[ARRAYIDX13_1_2]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_1_2:%.*]] = add nsw i32 [[TMP17]], [[TMP16]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_1_2]], ptr [[ARRAYIDX13_1_2]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_2_2:%.*]]
+; CHECK-UNROLL:       for.body7.2.2:
+; CHECK-UNROLL-NEXT:    [[TMP18:%.*]] = load ptr, ptr [[ARRAYIDX_2]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_2_2:%.*]] = getelementptr inbounds i32, ptr [[TMP18]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP19:%.*]] = load i32, ptr [[ARRAYIDX11_2_2]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_2_2:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP20:%.*]] = load i32, ptr [[ARRAYIDX13_2_2]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_2_2:%.*]] = add nsw i32 [[TMP20]], [[TMP19]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_2_2]], ptr [[ARRAYIDX13_2_2]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_3_2:%.*]]
+; CHECK-UNROLL:       for.body7.3.2:
+; CHECK-UNROLL-NEXT:    [[TMP21:%.*]] = load ptr, ptr [[ARRAYIDX_2]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_3_2:%.*]] = getelementptr inbounds i32, ptr [[TMP21]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX11_3_2]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_3_2:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP23:%.*]] = load i32, ptr [[ARRAYIDX13_3_2]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_3_2:%.*]] = add nsw i32 [[TMP23]], [[TMP22]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_3_2]], ptr [[ARRAYIDX13_3_2]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br i1 false, label [[FOR_BODY7_4]], label [[FOR_COND_CLEANUP6_2:%.*]]
+; CHECK-UNROLL:       for.cond.cleanup6.2:
+; CHECK-UNROLL-NEXT:    [[CMP1_3:%.*]] = icmp eq i32 3, [[DIMS]]
+; CHECK-UNROLL-NEXT:    br i1 [[CMP1_3]], label [[CLEANUP]], label [[IF_END_3:%.*]]
+; CHECK-UNROLL:       if.end.3:
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX_3:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 3
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND4_3:%.*]]
+; CHECK-UNROLL:       for.cond4.3:
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_36:%.*]]
+; CHECK-UNROLL:       for.body7.36:
+; CHECK-UNROLL-NEXT:    [[TMP24:%.*]] = load ptr, ptr [[ARRAYIDX_3]], align 8
+; CHECK-UNROLL-NEXT:    [[TMP25:%.*]] = load i32, ptr [[TMP24]], align 4
+; CHECK-UNROLL-NEXT:    [[TMP26:%.*]] = load i32, ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_35:%.*]] = add nsw i32 [[TMP26]], [[TMP25]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_35]], ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_1_3:%.*]]
+; CHECK-UNROLL:       for.body7.1.3:
+; CHECK-UNROLL-NEXT:    [[TMP27:%.*]] = load ptr, ptr [[ARRAYIDX_3]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_1_3:%.*]] = getelementptr inbounds i32, ptr [[TMP27]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP28:%.*]] = load i32, ptr [[ARRAYIDX11_1_3]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_1_3:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP29:%.*]] = load i32, ptr [[ARRAYIDX13_1_3]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_1_3:%.*]] = add nsw i32 [[TMP29]], [[TMP28]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_1_3]], ptr [[ARRAYIDX13_1_3]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_2_3:%.*]]
+; CHECK-UNROLL:       for.body7.2.3:
+; CHECK-UNROLL-NEXT:    [[TMP30:%.*]] = load ptr, ptr [[ARRAYIDX_3]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_2_3:%.*]] = getelementptr inbounds i32, ptr [[TMP30]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP31:%.*]] = load i32, ptr [[ARRAYIDX11_2_3]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_2_3:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP32:%.*]] = load i32, ptr [[ARRAYIDX13_2_3]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_2_3:%.*]] = add nsw i32 [[TMP32]], [[TMP31]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_2_3]], ptr [[ARRAYIDX13_2_3]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_3_3:%.*]]
+; CHECK-UNROLL:       for.body7.3.3:
+; CHECK-UNROLL-NEXT:    [[TMP33:%.*]] = load ptr, ptr [[ARRAYIDX_3]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_3_3:%.*]] = getelementptr inbounds i32, ptr [[TMP33]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP34:%.*]] = load i32, ptr [[ARRAYIDX11_3_3]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_3_3:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP35:%.*]] = load i32, ptr [[ARRAYIDX13_3_3]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_3_3:%.*]] = add nsw i32 [[TMP35]], [[TMP34]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_3_3]], ptr [[ARRAYIDX13_3_3]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br i1 false, label [[FOR_BODY7_4]], label [[FOR_COND_CLEANUP6_3:%.*]]
+; CHECK-UNROLL:       for.cond.cleanup6.3:
+; CHECK-UNROLL-NEXT:    [[CMP1_4:%.*]] = icmp eq i32 4, [[DIMS]]
+; CHECK-UNROLL-NEXT:    br i1 [[CMP1_4]], label [[CLEANUP]], label [[IF_END_4:%.*]]
+; CHECK-UNROLL:       if.end.4:
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX_4:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 4
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND4_4:%.*]]
+; CHECK-UNROLL:       for.cond4.4:
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_48:%.*]]
+; CHECK-UNROLL:       for.body7.48:
+; CHECK-UNROLL-NEXT:    [[TMP36:%.*]] = load ptr, ptr [[ARRAYIDX_4]], align 8
+; CHECK-UNROLL-NEXT:    [[TMP37:%.*]] = load i32, ptr [[TMP36]], align 4
+; CHECK-UNROLL-NEXT:    [[TMP38:%.*]] = load i32, ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_47:%.*]] = add nsw i32 [[TMP38]], [[TMP37]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_47]], ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_1_4:%.*]]
+; CHECK-UNROLL:       for.body7.1.4:
+; CHECK-UNROLL-NEXT:    [[TMP39:%.*]] = load ptr, ptr [[ARRAYIDX_4]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_1_4:%.*]] = getelementptr inbounds i32, ptr [[TMP39]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP40:%.*]] = load i32, ptr [[ARRAYIDX11_1_4]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_1_4:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 1
+; CHECK-UNROLL-NEXT:    [[TMP41:%.*]] = load i32, ptr [[ARRAYIDX13_1_4]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_1_4:%.*]] = add nsw i32 [[TMP41]], [[TMP40]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_1_4]], ptr [[ARRAYIDX13_1_4]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_2_4:%.*]]
+; CHECK-UNROLL:       for.body7.2.4:
+; CHECK-UNROLL-NEXT:    [[TMP42:%.*]] = load ptr, ptr [[ARRAYIDX_4]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_2_4:%.*]] = getelementptr inbounds i32, ptr [[TMP42]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP43:%.*]] = load i32, ptr [[ARRAYIDX11_2_4]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_2_4:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 2
+; CHECK-UNROLL-NEXT:    [[TMP44:%.*]] = load i32, ptr [[ARRAYIDX13_2_4]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_2_4:%.*]] = add nsw i32 [[TMP44]], [[TMP43]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_2_4]], ptr [[ARRAYIDX13_2_4]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_3_4:%.*]]
+; CHECK-UNROLL:       for.body7.3.4:
+; CHECK-UNROLL-NEXT:    [[TMP45:%.*]] = load ptr, ptr [[ARRAYIDX_4]], align 8
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX11_3_4:%.*]] = getelementptr inbounds i32, ptr [[TMP45]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP46:%.*]] = load i32, ptr [[ARRAYIDX11_3_4]], align 4
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX13_3_4:%.*]] = getelementptr inbounds i32, ptr [[OUT]], i64 3
+; CHECK-UNROLL-NEXT:    [[TMP47:%.*]] = load i32, ptr [[ARRAYIDX13_3_4]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_3_4:%.*]] = add nsw i32 [[TMP47]], [[TMP46]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_3_4]], ptr [[ARRAYIDX13_3_4]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br i1 false, label [[FOR_BODY7_4]], label [[FOR_COND_CLEANUP6_4:%.*]]
+; CHECK-UNROLL:       for.cond.cleanup6.4:
+; CHECK-UNROLL-NEXT:    [[CMP1_5:%.*]] = icmp eq i32 5, [[DIMS]]
+; CHECK-UNROLL-NEXT:    br i1 [[CMP1_5]], label [[CLEANUP]], label [[IF_END_5:%.*]]
+; CHECK-UNROLL:       if.end.5:
+; CHECK-UNROLL-NEXT:    [[ARRAYIDX_5:%.*]] = getelementptr inbounds ptr, ptr [[ARR]], i64 5
+; CHECK-UNROLL-NEXT:    br label [[FOR_COND4_5:%.*]]
+; CHECK-UNROLL:       for.cond4.5:
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_5:%.*]]
+; CHECK-UNROLL:       for.body7.5:
+; CHECK-UNROLL-NEXT:    [[TMP48:%.*]] = load ptr, ptr [[ARRAYIDX_5]], align 8
+; CHECK-UNROLL-NEXT:    [[TMP49:%.*]] = load i32, ptr [[TMP48]], align 4
+; CHECK-UNROLL-NEXT:    [[TMP50:%.*]] = load i32, ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    [[ADD14_5:%.*]] = add nsw i32 [[TMP50]], [[TMP49]]
+; CHECK-UNROLL-NEXT:    store i32 [[ADD14_5]], ptr [[OUT]], align 4
+; CHECK-UNROLL-NEXT:    call void @_Z3barv()
+; CHECK-UNROLL-NEXT:    br label [[FOR_BODY7_1_5:%.*]]
+; CHECK-UNROLL:       for.body7.1.5:
+; CHECK-UNROLL-NEXT:    [[TMP51:%.*]] = load ptr, ptr [[ARRAYIDX_5]], align 8
+; CHECK-UNROLL...
[truncated]

nikic

Looks conceptually fine to me.

nikic · 2023-12-07T13:20:44Z

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp

@@ -776,6 +776,11 @@ shouldPragmaUnroll(Loop *L, const PragmaInfo &PInfo,
  if (PInfo.PragmaFullUnroll && TripCount != 0)
    return TripCount;

+  // Small MaxTripCount is clearly calculated with "pragma unroll".


I don't understand what this comment is trying to tell. The code would be clearer without it.

ok, let me rm it, I just want to express the "MaxTripCount is clearly calculated" (e.g 17) not roughly estimated （e.g "int32 I=0, I < UnkowInt32LoopCount, I++", we may roughly set the MaxTripCount is Int_Max (0x7ffffffff)）

Ah, I get what you mean now. It would probably be okay to use a higher limit than UnrollMaxUpperBound if pragma is involved, but I agree that simply unrolling to any MaxTripCount would not be safe.

llvm/test/Transforms/SimplifyCFG/simplify-cfg-unroll.ll

…n simplify cfg

xiangzh1 · 2023-12-08T01:56:01Z

update, address @nikic's comments

…unt is destroyed by preious optimization

nikic

LGTM

xiangzh1 · 2023-12-08T11:35:52Z

Thanks a lot!

… unroll" (llvm#74703)" This reverts commit 1d6a678. Change-Id: Ie10f8f68b4e3ad47339c853b576e16dc868cd395

…llvm#74703) Fix loop unroll fail caused by branches folding. For example: SimplifyCFG foldloop branches then cause loop unroll failed for "#program unroll" loop. ``` for (int I = 0; I < ConstNum; ++I) { // folding "I < ConstNum" and "Cond2" if (Cond2) { break; } xxx loop body; } ``` The pragma unroll metadata only takes effect if there is an exact trip count, but not if there is an upper bound trip count. This patch make it work with an upper bound trip count as well in shouldPragmaUnroll(). Loop unroll is important in stack nervous devices (e.g. GPU, and that is why a lot of GPU code mark loop with "#program unroll"). It usually much simplify the address (offset) calculations in old iterations, then we can do a lot of others optimizations, e.g, SROA, for these simplifed address (escape alloca the whole aggregates). Change-Id: Ibd1a1a6cdcf98fd36be43dd2594976a7b0c588cf

… unroll" (llvm#74703)" This reverts commit c5afecc. Change-Id: Iaa26202b30f8a2bd206d3714476c8eaaef65f582

… unroll" (llvm#74703)" This reverts commit 1d6a678. Change-Id: Ied8f01a15f9af8579e711ff01ab8c3c8a8e4976f

llvmbot added the llvm:transforms label Dec 7, 2023

xiangzh1 requested review from nikic, bcl5980, rotateright, topperc and goldsteinn December 7, 2023 08:39

xiangzh1 mentioned this pull request Dec 7, 2023

[SimplifyCFG] Not folding branch in loop header with constant iterations #74268

Closed

nikic reviewed Dec 7, 2023

View reviewed changes

[LoopUnroll] Pre-commit test for loop unroll after folding branches i…

0e5655e

…n simplify cfg

xiangzh1 force-pushed the users/xiangzhangllvm/loop-unroll-folding-branches branch from 675e3cb to c5043e5 Compare December 8, 2023 01:50

[LoopUnroll] Unroll small loops with 'pragma unroll' when its trip co…

d0dc993

…unt is destroyed by preious optimization

xiangzh1 force-pushed the users/xiangzhangllvm/loop-unroll-folding-branches branch from c5043e5 to d0dc993 Compare December 8, 2023 02:18

nikic approved these changes Dec 8, 2023

View reviewed changes

nikic changed the title ~~Users/xiangzhangllvm/loop unroll folding branches~~ [LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" Dec 8, 2023

xiangzh1 merged commit 1d6a678 into main Dec 8, 2023
4 checks passed

xiangzh1 deleted the users/xiangzhangllvm/loop-unroll-folding-branches branch December 8, 2023 11:43

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 23, 2024

Revert "[LoopUnroll] Make use of MaxTripCount for loops with "#pragma…

79fe06a

… unroll" (llvm#74703)" This reverts commit 1d6a678. Change-Id: Ie10f8f68b4e3ad47339c853b576e16dc868cd395

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 28, 2024

Revert "[LoopUnroll] Make use of MaxTripCount for loops with "#pragma…

9087c66

… unroll" (llvm#74703)" This reverts commit c5afecc. Change-Id: Iaa26202b30f8a2bd206d3714476c8eaaef65f582

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 28, 2024

Revert "[LoopUnroll] Make use of MaxTripCount for loops with "#pragma…

bdfac03

… unroll" (llvm#74703)" This reverts commit 1d6a678. Change-Id: Ied8f01a15f9af8579e711ff01ab8c3c8a8e4976f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" #74703

[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" #74703

xiangzh1 commented Dec 7, 2023 •

edited

llvmbot commented Dec 7, 2023

nikic left a comment

nikic Dec 7, 2023

xiangzh1 Dec 7, 2023

nikic Dec 8, 2023

xiangzh1 commented Dec 8, 2023

nikic left a comment

xiangzh1 commented Dec 8, 2023

[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" #74703

[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" #74703

Conversation

xiangzh1 commented Dec 7, 2023 • edited

llvmbot commented Dec 7, 2023

nikic left a comment

Choose a reason for hiding this comment

nikic Dec 7, 2023

Choose a reason for hiding this comment

xiangzh1 Dec 7, 2023

Choose a reason for hiding this comment

nikic Dec 8, 2023

Choose a reason for hiding this comment

xiangzh1 commented Dec 8, 2023

nikic left a comment

Choose a reason for hiding this comment

xiangzh1 commented Dec 8, 2023

xiangzh1 commented Dec 7, 2023 •

edited