
[LoopUnroll] Penalize interior control flow #67137

Status: Open · nikic wants to merge 1 commit into main
Conversation

@nikic (Contributor) commented Sep 22, 2023

This patch adds a small cost penalty for each (non-trivial, non-exiting, non-latch) branch in the loop.

At a high level, I believe that fully unrolling loops with interior control flow tends to be less profitable, because we don't get straight-line (or extended-BB) code out of it, so it is less amenable to further optimization. It also dilutes branch predictors.

The specific motivation for this is twofold:

  • This avoids an undesirable unroll in x264 that led to the revert of https://reviews.llvm.org/D156532. While the loop vectorizer can handle interior control flow, the SLP vectorizer can't (and likely won't be able to in any foreseeable future).
  • While looking into #57865, I found a case where we produce a quadratic unroll result, because we first unroll an outer loop and then, after simplification, unroll (N copies of) the inner loop as well, resulting in huge IR output. The penalty applied here does not prevent this entirely, but does disincentivize it. (We don't want to forbid unrolling outer loops entirely; it is necessary for vectorization in some cases.)

The penalty of 2 here is chosen somewhat arbitrarily as the lowest value that fixes both issues. I think the outer loop case may want an even larger penalty, but this seems like a conservative starting point.
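As a hypothetical illustration (not part of the patch; function and block names are invented), this is the shape of branch the penalty targets. The conditional branch in %header is neither exiting nor a latch, so it would add 2 to the estimated loop size, while the exiting latch branch is untouched:

define void @interior_cf(ptr %p, i32 %n) {
entry:
  br label %header

header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  %v = load i32, ptr %p, align 4
  %cond = icmp sgt i32 %v, 0                    ; data-dependent: stays a branch after unrolling
  br i1 %cond, label %then, label %latch        ; interior branch: penalized

then:
  store i32 0, ptr %p, align 4
  br label %latch                               ; single successor: trivial, not penalized

latch:
  %i.next = add i32 %i, 1
  %exit.cond = icmp eq i32 %i.next, %n
  br i1 %exit.cond, label %exit, label %header  ; exiting latch: not penalized

exit:
  ret void
}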

@llvmbot (Collaborator) commented Sep 22, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Changes



Patch is 51.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67137.diff

4 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (+14)
  • (added) llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll (+340)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/quant_4x4.ll (+16-428)
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 335b489d3cb25ba..41c82783a7a4d79 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -173,6 +173,10 @@ static cl::opt<unsigned>
                            cl::desc("Default threshold (max size of unrolled "
                                     "loop), used in all but O3 optimizations"));
 
+static cl::opt<unsigned> UnrollInteriorControlFlowPenalty(
+    "unroll-interior-control-flow-penalty", cl::init(2), cl::Hidden,
+    cl::desc("Penalty for non-exiting branches in the loop"));
+
 /// A magic value for use with the Threshold parameter to indicate
 /// that the loop unroll should be performed regardless of how much
 /// code expansion would result.
@@ -676,6 +680,16 @@ InstructionCost llvm::ApproximateLoopSize(
 
   InstructionCost LoopSize = Metrics.NumInsts;
 
+  // Add a penalty for interior control flow (excluding exits and latches).
+  // Unrolling such loops is less profitable, as it does not result in
+  // straight-line code (or extended basic blocks if multiple exits). This
+  // also disincentivizes unrolling outer loops, which may result in large
+  // size increases if the inner loop is also unrolled later.
+  for (BasicBlock *BB : L->blocks())
+    if (!BB->getSingleSuccessor() && !L->isLoopExiting(BB) &&
+        !L->isLoopLatch(BB))
+      LoopSize += UnrollInteriorControlFlowPenalty;
+
 // Don't allow an estimate of size zero.  This would allow unrolling of loops
   // with huge iteration counts, which is a compile time problem even if it's
   // not a problem for code quality. Also, the code using this size may assume
diff --git a/llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll b/llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll
new file mode 100644
index 000000000000000..bfd88a12c9ca034
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll
@@ -0,0 +1,340 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -S -passes='loop(loop-unroll-full),simplifycfg,loop(loop-unroll-full)' -unroll-threshold=300 < %s | FileCheck %s
+
+; Check that we don't produce quadratic output by unrolling both the inner
+; and outer loop.
+
+@a = global i32 0, align 4
+@d = global ptr null, align 8
+@e = external global ptr
+
+define i8 @f(ptr %p) {
+; CHECK-LABEL: define i8 @f(
+; CHECK-SAME: ptr [[P:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_COND1_PREHEADER:%.*]]
+; CHECK:       for.cond1.preheader:
+; CHECK-NEXT:    [[H_017:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC12:%.*]], [[FOR_INC11:%.*]] ]
+; CHECK-NEXT:    br label [[FOR_BODY4:%.*]]
+; CHECK:       for.body4:
+; CHECK-NEXT:    [[V:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT:%.*]] = icmp eq i32 [[V]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT]], label [[FOR_END:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_1:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_1:%.*]] = icmp eq i32 [[V_1]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_1]], label [[FOR_END_1:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.1:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_2:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_2:%.*]] = icmp eq i32 [[V_2]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_2]], label [[FOR_END_2:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.2:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_3:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_3:%.*]] = icmp eq i32 [[V_3]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_3]], label [[FOR_END_3:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.3:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_4:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_4:%.*]] = icmp eq i32 [[V_4]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_4]], label [[FOR_END_4:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.4:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_5:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_5:%.*]] = icmp eq i32 [[V_5]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_5]], label [[FOR_END_5:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.5:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_6:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_6:%.*]] = icmp eq i32 [[V_6]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_6]], label [[FOR_END_6:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.6:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_7:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_7:%.*]] = icmp eq i32 [[V_7]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_7]], label [[FOR_END_7:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.7:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_8:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_8:%.*]] = icmp eq i32 [[V_8]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_8]], label [[FOR_END_8:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.8:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_9:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_9:%.*]] = icmp eq i32 [[V_9]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_9]], label [[FOR_END_9:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.9:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_10:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_10:%.*]] = icmp eq i32 [[V_10]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_10]], label [[FOR_END_10:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.10:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_11:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_11:%.*]] = icmp eq i32 [[V_11]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_11]], label [[FOR_END_11:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.11:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_12:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_12:%.*]] = icmp eq i32 [[V_12]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_12]], label [[FOR_END_12:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.12:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_13:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_13:%.*]] = icmp eq i32 [[V_13]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_13]], label [[FOR_END_13:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.13:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_14:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_14:%.*]] = icmp eq i32 [[V_14]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_14]], label [[FOR_END_14:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.14:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_15:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_15:%.*]] = icmp eq i32 [[V_15]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_15]], label [[FOR_END_15:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.15:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_16:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_16:%.*]] = icmp eq i32 [[V_16]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_16]], label [[FOR_END_16:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.16:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_17:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_17:%.*]] = icmp eq i32 [[V_17]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_17]], label [[FOR_END_17:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.17:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_18:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_18:%.*]] = icmp eq i32 [[V_18]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_18]], label [[FOR_END_18:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.18:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_19:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_19:%.*]] = icmp eq i32 [[V_19]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_19]], label [[FOR_END_19:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.19:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_20:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_20:%.*]] = icmp eq i32 [[V_20]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_20]], label [[FOR_END_20:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.20:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_21:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_21:%.*]] = icmp eq i32 [[V_21]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_21]], label [[FOR_END_21:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.21:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_22:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_22:%.*]] = icmp eq i32 [[V_22]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_22]], label [[FOR_END_22:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.22:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_23:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_23:%.*]] = icmp eq i32 [[V_23]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_23]], label [[FOR_END_23:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.23:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_24:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_24:%.*]] = icmp eq i32 [[V_24]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_24]], label [[FOR_END_24:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.24:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_25:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_25:%.*]] = icmp eq i32 [[V_25]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_25]], label [[FOR_END_25:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.25:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_26:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_26:%.*]] = icmp eq i32 [[V_26]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_26]], label [[FOR_END_26:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.26:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_27:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_27:%.*]] = icmp eq i32 [[V_27]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_27]], label [[FOR_END_27:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.27:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_28:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_28:%.*]] = icmp eq i32 [[V_28]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_28]], label [[FOR_END_28:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.28:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_29:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_29:%.*]] = icmp eq i32 [[V_29]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_29]], label [[FOR_END_29:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.29:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_30:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_30:%.*]] = icmp eq i32 [[V_30]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_30]], label [[FOR_END_30:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.30:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_31:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_31:%.*]] = icmp eq i32 [[V_31]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_31]], label [[FOR_END_31:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.31:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_32:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_32:%.*]] = icmp eq i32 [[V_32]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_32]], label [[FOR_END_32:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.32:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_33:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_33:%.*]] = icmp eq i32 [[V_33]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_33]], label [[FOR_END_33:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.33:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_34:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_34:%.*]] = icmp eq i32 [[V_34]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_34]], label [[FOR_END_34:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.34:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_35:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_35:%.*]] = icmp eq i32 [[V_35]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_35]], label [[FOR_END_35:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.35:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_36:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_36:%.*]] = icmp eq i32 [[V_36]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_36]], label [[FOR_END_36:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.36:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_37:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_37:%.*]] = icmp eq i32 [[V_37]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_37]], label [[FOR_END_37:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.37:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_38:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_38:%.*]] = icmp eq i32 [[V_38]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_38]], label [[FOR_END_38:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.38:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_39:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_39:%.*]] = icmp eq i32 [[V_39]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_39]], label [[FOR_END_39:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.39:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_40:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_40:%.*]] = icmp eq i32 [[V_40]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_40]], label [[FOR_END_40:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.40:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_41:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_41:%.*]] = icmp eq i32 [[V_41]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_41]], label [[FOR_END_41:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.41:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_42:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_42:%.*]] = icmp eq i32 [[V_42]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_42]], label [[FOR_END_42:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.42:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_43:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_43:%.*]] = icmp eq i32 [[V_43]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_43]], label [[FOR_END_43:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.43:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_44:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_44:%.*]] = icmp eq i32 [[V_44]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_44]], label [[FOR_END_44:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.44:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_45:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_45:%.*]] = icmp eq i32 [[V_45]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_45]], label [[FOR_END_45:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.45:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_46:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_46:%.*]] = icmp eq i32 [[V_46]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_46]], label [[FOR_END_46:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.46:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    br label [[FOR_INC11]]
+; CHECK:       for.inc11:
+; CHECK-NEXT:    [[INC12]] = add nuw nsw i32 [[H_017]], 1
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[H_017]], 21
+; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_COND1_PREHEADER]], label [[FOR_END13:%.*]]
+; CHECK:       for.end13:
+; CHECK-NEXT:    ret i8 undef
+;
+entry:
+  br label %for.cond1.preheader
+
+for.cond1.preheader:                              ; preds = %entry, %for.inc11
+  %h.017 = phi i32 [ 0, %entry ], [ %inc12, %for.inc11 ]
+  br label %for.body4
+
+for.body4:                                        ; preds = %for.cond1.preheader, %for.end
+  %g.016 = phi i16 [ -22, %for.cond1.preheader ], [ %inc9, %for.end ]
+  %v = load i32, ptr %p, align 4
+  %tobool.not = icmp eq i32 %v, 0
+  br i1 %tobool.not, label %for.cond5.preheader, label %for.inc11
+
+for.cond5.preheader:                              ; preds = %for.body4
+  br label %for.end
+
+for.end:                                          ; preds = %for.cond5.preheader
+  store i32 0, ptr @a, align 4
+  store ptr @e, ptr @d, align 8
+  %inc9 = add i16 %g.016, 1
+  %cmp2 = icmp slt i16 %inc9, 25
+  br i1 %cmp2, label %for.body4, label %for.inc11
+
+for.inc11:                                        ; preds = %for.end, %for.body4
+  %inc12 = add nuw nsw i32 %h.017, 1
+  %cmp = icmp ult i32 %h.017, 21
+  br i1 %cmp, label %for.cond1.preheader, label %for.end13
+
+for.end13:                                        ; preds = %for.inc11
+  ret i8 undef
+}
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll b/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll
index de548523134568f..7027cac9c9ea355 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll...
[truncated]
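Since the penalty is exposed through a hidden cl::opt, its effect can be checked directly from the opt command line. A hypothetical invocation (the flag name comes from the patch, the pass pipeline mirrors the RUN line of the new test, and input.ll is a placeholder):

  opt -S -passes='loop(loop-unroll-full)' -unroll-interior-control-flow-penalty=4 input.ll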

@nikic (Contributor, Author) commented on the diff:
We no longer unroll the scalar fallback loop here, which I think is a win?

@fhahn (Contributor) replied:
I think so, yes!

@preames (Collaborator) left a comment:
LGTM - At a high level, this makes sense to me. Please give it a couple days for others to comment.

@fhahn (Contributor) left a comment:

Do you have any data on how widespread the impact of this is?

Penalizing loops with internal control flow makes sense to me in general, unless unrolling allows the branches to be folded. Should this be integrated with UnrolledInstAnalyzer? I haven't looked at that in a while, though, so I am not sure whether it can actually estimate such things correctly.


@davemgreen (Collaborator) commented:
I think this would be OK for the original case in x264 that got worse, and it sounds like it would make sense in general. I'm not sure about other specific cases, though; I know of cases where outer-loop unrolling can help the inner loop unroll too, like the example in https://reviews.llvm.org/D156532#4574569 but with more inner loops.

@nikic (Contributor, Author) commented Sep 25, 2023

> Do you have any data on how widespread the impact of this is?

Here is the code size impact on CTMark: http://llvm-compile-time-tracker.com/compare.php?from=1c9b63f1038d13be7d83015565e2ecc395964c04&to=46f453b2872f5c31de0a9f6c4e35cbe52765a608&stat=instructions:u And here are the corresponding IR diffs: nikic/llvm-ir-diffs@badc90e

> Penalizing loops with internal control flow makes sense to me in general, unless unrolling allows the branches to be folded. Should this be integrated with UnrolledInstAnalyzer?

Yeah, I think the "unless unrolling allows the branches to be folded" part is the key bit. This is not something we capture right now. The problem is that we have two modes of analyzing unrolling: analyzeLoopUnrollCost(), which is used for full unrolling up to 10 iterations and is very precise, but also very expensive (#Insts * #Iterations); and ApproximateLoopSize() for everything else, which is completely crude and does not reason about simplifications at all.

I'm considering making ApproximateLoopSize() smarter first, by determining which instructions we expect to constant-fold after full unrolling without any per-iteration analysis (basically, if we have a simple recurrence with a constant start and constant step, then any side-effect-free instruction depending only on it will constant-fold). This would make estimates for full unrolling more accurate and would let us skip the branch penalty for branches that will fold away. I'm somewhat afraid this will backfire, though, because it will increase the amount of unrolling if thresholds remain the same (and, generally speaking, we are at a point where average unrolling needs to decrease rather than increase for better performance).
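For contrast with the penalized case, here is a hypothetical loop (not from the PR; names invented) where the interior branch would fold away under that scheme: %i is a simple recurrence with constant start and step, and %cond depends only on it, so after full unrolling every copy of the branch becomes unconditional:

define void @foldable(ptr %p) {
entry:
  br label %header

header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  %cond = icmp eq i32 %i, 2                  ; depends only on the recurrence
  br i1 %cond, label %special, label %latch  ; folds to a constant in each unrolled copy

special:
  store i32 1, ptr %p, align 4
  br label %latch

latch:
  %i.next = add i32 %i, 1
  %done = icmp eq i32 %i.next, 4
  br i1 %done, label %exit, label %header

exit:
  ret void
}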

@nikic (Contributor, Author) commented Sep 28, 2023

I gave making full unroll cost modelling more precise a try; here's how it could look: 26df05e. But of course, that does result in non-trivial increases to text size: http://llvm-compile-time-tracker.com/compare.php?from=18be23f82a4c66ef6c838d3324979825acbcf49f&to=26df05e27ce6c6742c6f523c926f4874ff5ca3ef&stat=size-text I'm not entirely sure where to go from here -- making the cost modelling more precise clearly seems like a good idea, but that probably also needs a recalibration of unroll thresholds.
