
[LoopUnroll] Penalize interior control flow #67137

Status: Open · nikic wants to merge 1 commit into main
Conversation

@nikic (Contributor) commented Sep 22, 2023

This patch adds a small cost penalty for each (non-trivial, non-exiting, non-latch) branch in the loop.

At a high level, I believe that fully unrolling loops with interior control flow tends to be less profitable, because we don't get straight-line (or extended-BB) code out of it, so it is less amenable to further optimization. It also dilutes branch predictors.

The specific motivation for this is twofold:

  • This avoids an undesirable unroll in x264 that led to the revert of https://reviews.llvm.org/D156532. While the loop vectorizer can handle interior control flow, the SLP vectorizer can't (and likely won't be able to in any foreseeable future).
  • While looking into #57865, I found a case where we produce a quadratic unroll result, because we first unroll an outer loop and then, after simplification, unroll (N copies of) the inner loop as well, resulting in huge IR output. The penalty applied here does not prevent this entirely, but does disincentivize it. (We don't want to forbid unrolling outer loops entirely; it is necessary for vectorization in some cases.)

The penalty of 2 here is chosen somewhat arbitrarily as the lowest value that fixes both issues. I think the outer loop case may want an even larger penalty, but this seems like a conservative starting point.
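As a hypothetical illustration (not part of the patch; function and block names are invented), this is the shape of branch the penalty targets. The conditional branch in %header is neither exiting nor a latch, so it would add 2 to the estimated loop size, while the exiting latch branch is untouched:

define void @interior_cf(ptr %p, i32 %n) {
entry:
  br label %header

header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  %v = load i32, ptr %p, align 4
  %cond = icmp sgt i32 %v, 0                    ; data-dependent: stays a branch after unrolling
  br i1 %cond, label %then, label %latch        ; interior branch: penalized

then:
  store i32 0, ptr %p, align 4
  br label %latch                               ; single successor: trivial, not penalized

latch:
  %i.next = add i32 %i, 1
  %exit.cond = icmp eq i32 %i.next, %n
  br i1 %exit.cond, label %exit, label %header  ; exiting latch: not penalized

exit:
  ret void
}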

@llvmbot (Collaborator) commented Sep 22, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Changes



Patch is 51.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67137.diff

4 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (+14)
  • (added) llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll (+340)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/quant_4x4.ll (+16-428)
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 335b489d3cb25ba..41c82783a7a4d79 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -173,6 +173,10 @@ static cl::opt<unsigned>
                            cl::desc("Default threshold (max size of unrolled "
                                     "loop), used in all but O3 optimizations"));
 
+static cl::opt<unsigned> UnrollInteriorControlFlowPenalty(
+    "unroll-interior-control-flow-penalty", cl::init(2), cl::Hidden,
+    cl::desc("Penalty for non-exiting branches in the loop"));
+
 /// A magic value for use with the Threshold parameter to indicate
 /// that the loop unroll should be performed regardless of how much
 /// code expansion would result.
@@ -676,6 +680,16 @@ InstructionCost llvm::ApproximateLoopSize(
 
   InstructionCost LoopSize = Metrics.NumInsts;
 
+  // Add a penalty for interior control flow (excluding exits and latches).
+  // Unrolling such loops is less profitable, as it does not result in
+  // straight-line code (or extended basic blocks if multiple exits). This
+  // also disincentivizes unrolling outer loops, which may result in large
+  // size increases if the inner loop is also unrolled later.
+  for (BasicBlock *BB : L->blocks())
+    if (!BB->getSingleSuccessor() && !L->isLoopExiting(BB) &&
+        !L->isLoopLatch(BB))
+      LoopSize += UnrollInteriorControlFlowPenalty;
+
 // Don't allow an estimate of size zero.  This would allow unrolling of loops
   // with huge iteration counts, which is a compile time problem even if it's
   // not a problem for code quality. Also, the code using this size may assume
diff --git a/llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll b/llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll
new file mode 100644
index 000000000000000..bfd88a12c9ca034
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/quadratic-unroll.ll
@@ -0,0 +1,340 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -S -passes='loop(loop-unroll-full),simplifycfg,loop(loop-unroll-full)' -unroll-threshold=300 < %s | FileCheck %s
+
+; Check that we don't produce quadratic output by unrolling both the inner
+; and outer loop.
+
+@a = global i32 0, align 4
+@d = global ptr null, align 8
+@e = external global ptr
+
+define i8 @f(ptr %p) {
+; CHECK-LABEL: define i8 @f(
+; CHECK-SAME: ptr [[P:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_COND1_PREHEADER:%.*]]
+; CHECK:       for.cond1.preheader:
+; CHECK-NEXT:    [[H_017:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC12:%.*]], [[FOR_INC11:%.*]] ]
+; CHECK-NEXT:    br label [[FOR_BODY4:%.*]]
+; CHECK:       for.body4:
+; CHECK-NEXT:    [[V:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT:%.*]] = icmp eq i32 [[V]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT]], label [[FOR_END:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_1:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_1:%.*]] = icmp eq i32 [[V_1]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_1]], label [[FOR_END_1:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.1:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_2:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_2:%.*]] = icmp eq i32 [[V_2]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_2]], label [[FOR_END_2:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.2:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_3:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_3:%.*]] = icmp eq i32 [[V_3]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_3]], label [[FOR_END_3:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.3:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_4:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_4:%.*]] = icmp eq i32 [[V_4]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_4]], label [[FOR_END_4:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.4:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_5:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_5:%.*]] = icmp eq i32 [[V_5]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_5]], label [[FOR_END_5:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.5:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_6:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_6:%.*]] = icmp eq i32 [[V_6]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_6]], label [[FOR_END_6:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.6:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_7:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_7:%.*]] = icmp eq i32 [[V_7]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_7]], label [[FOR_END_7:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.7:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_8:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_8:%.*]] = icmp eq i32 [[V_8]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_8]], label [[FOR_END_8:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.8:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_9:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_9:%.*]] = icmp eq i32 [[V_9]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_9]], label [[FOR_END_9:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.9:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_10:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_10:%.*]] = icmp eq i32 [[V_10]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_10]], label [[FOR_END_10:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.10:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_11:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_11:%.*]] = icmp eq i32 [[V_11]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_11]], label [[FOR_END_11:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.11:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_12:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_12:%.*]] = icmp eq i32 [[V_12]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_12]], label [[FOR_END_12:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.12:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_13:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_13:%.*]] = icmp eq i32 [[V_13]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_13]], label [[FOR_END_13:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.13:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_14:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_14:%.*]] = icmp eq i32 [[V_14]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_14]], label [[FOR_END_14:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.14:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_15:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_15:%.*]] = icmp eq i32 [[V_15]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_15]], label [[FOR_END_15:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.15:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_16:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_16:%.*]] = icmp eq i32 [[V_16]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_16]], label [[FOR_END_16:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.16:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_17:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_17:%.*]] = icmp eq i32 [[V_17]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_17]], label [[FOR_END_17:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.17:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_18:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_18:%.*]] = icmp eq i32 [[V_18]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_18]], label [[FOR_END_18:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.18:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_19:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_19:%.*]] = icmp eq i32 [[V_19]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_19]], label [[FOR_END_19:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.19:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_20:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_20:%.*]] = icmp eq i32 [[V_20]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_20]], label [[FOR_END_20:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.20:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_21:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_21:%.*]] = icmp eq i32 [[V_21]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_21]], label [[FOR_END_21:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.21:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_22:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_22:%.*]] = icmp eq i32 [[V_22]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_22]], label [[FOR_END_22:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.22:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_23:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_23:%.*]] = icmp eq i32 [[V_23]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_23]], label [[FOR_END_23:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.23:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_24:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_24:%.*]] = icmp eq i32 [[V_24]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_24]], label [[FOR_END_24:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.24:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_25:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_25:%.*]] = icmp eq i32 [[V_25]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_25]], label [[FOR_END_25:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.25:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_26:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_26:%.*]] = icmp eq i32 [[V_26]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_26]], label [[FOR_END_26:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.26:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_27:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_27:%.*]] = icmp eq i32 [[V_27]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_27]], label [[FOR_END_27:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.27:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_28:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_28:%.*]] = icmp eq i32 [[V_28]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_28]], label [[FOR_END_28:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.28:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_29:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_29:%.*]] = icmp eq i32 [[V_29]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_29]], label [[FOR_END_29:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.29:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_30:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_30:%.*]] = icmp eq i32 [[V_30]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_30]], label [[FOR_END_30:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.30:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_31:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_31:%.*]] = icmp eq i32 [[V_31]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_31]], label [[FOR_END_31:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.31:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_32:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_32:%.*]] = icmp eq i32 [[V_32]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_32]], label [[FOR_END_32:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.32:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_33:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_33:%.*]] = icmp eq i32 [[V_33]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_33]], label [[FOR_END_33:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.33:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_34:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_34:%.*]] = icmp eq i32 [[V_34]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_34]], label [[FOR_END_34:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.34:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_35:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_35:%.*]] = icmp eq i32 [[V_35]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_35]], label [[FOR_END_35:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.35:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_36:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_36:%.*]] = icmp eq i32 [[V_36]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_36]], label [[FOR_END_36:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.36:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_37:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_37:%.*]] = icmp eq i32 [[V_37]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_37]], label [[FOR_END_37:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.37:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_38:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_38:%.*]] = icmp eq i32 [[V_38]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_38]], label [[FOR_END_38:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.38:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_39:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_39:%.*]] = icmp eq i32 [[V_39]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_39]], label [[FOR_END_39:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.39:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_40:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_40:%.*]] = icmp eq i32 [[V_40]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_40]], label [[FOR_END_40:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.40:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_41:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_41:%.*]] = icmp eq i32 [[V_41]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_41]], label [[FOR_END_41:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.41:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_42:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_42:%.*]] = icmp eq i32 [[V_42]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_42]], label [[FOR_END_42:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.42:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_43:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_43:%.*]] = icmp eq i32 [[V_43]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_43]], label [[FOR_END_43:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.43:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_44:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_44:%.*]] = icmp eq i32 [[V_44]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_44]], label [[FOR_END_44:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.44:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_45:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_45:%.*]] = icmp eq i32 [[V_45]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_45]], label [[FOR_END_45:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.45:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    [[V_46:%.*]] = load i32, ptr [[P]], align 4
+; CHECK-NEXT:    [[TOBOOL_NOT_46:%.*]] = icmp eq i32 [[V_46]], 0
+; CHECK-NEXT:    br i1 [[TOBOOL_NOT_46]], label [[FOR_END_46:%.*]], label [[FOR_INC11]]
+; CHECK:       for.end.46:
+; CHECK-NEXT:    store i32 0, ptr @a, align 4
+; CHECK-NEXT:    store ptr @e, ptr @d, align 8
+; CHECK-NEXT:    br label [[FOR_INC11]]
+; CHECK:       for.inc11:
+; CHECK-NEXT:    [[INC12]] = add nuw nsw i32 [[H_017]], 1
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[H_017]], 21
+; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_COND1_PREHEADER]], label [[FOR_END13:%.*]]
+; CHECK:       for.end13:
+; CHECK-NEXT:    ret i8 undef
+;
+entry:
+  br label %for.cond1.preheader
+
+for.cond1.preheader:                              ; preds = %entry, %for.inc11
+  %h.017 = phi i32 [ 0, %entry ], [ %inc12, %for.inc11 ]
+  br label %for.body4
+
+for.body4:                                        ; preds = %for.cond1.preheader, %for.end
+  %g.016 = phi i16 [ -22, %for.cond1.preheader ], [ %inc9, %for.end ]
+  %v = load i32, ptr %p, align 4
+  %tobool.not = icmp eq i32 %v, 0
+  br i1 %tobool.not, label %for.cond5.preheader, label %for.inc11
+
+for.cond5.preheader:                              ; preds = %for.body4
+  br label %for.end
+
+for.end:                                          ; preds = %for.cond5.preheader
+  store i32 0, ptr @a, align 4
+  store ptr @e, ptr @d, align 8
+  %inc9 = add i16 %g.016, 1
+  %cmp2 = icmp slt i16 %inc9, 25
+  br i1 %cmp2, label %for.body4, label %for.inc11
+
+for.inc11:                                        ; preds = %for.end, %for.body4
+  %inc12 = add nuw nsw i32 %h.017, 1
+  %cmp = icmp ult i32 %h.017, 21
+  br i1 %cmp, label %for.cond1.preheader, label %for.end13
+
+for.end13:                                        ; preds = %for.inc11
+  ret i8 undef
+}
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll b/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll
index de548523134568f..7027cac9c9ea355 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll...
[truncated]
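Since the penalty is exposed through a hidden cl::opt, its effect can be checked directly from the opt command line. A hypothetical invocation (the flag name comes from the patch, the pass pipeline mirrors the RUN line of the new test, and input.ll is a placeholder):

  opt -S -passes='loop(loop-unroll-full)' -unroll-interior-control-flow-penalty=4 input.ll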

@nikic (Contributor, Author) commented on the diff:
We no longer unroll the scalar fallback loop here, which I think is a win?

@fhahn (Contributor) replied:
I think so, yes!

@preames (Collaborator) left a comment:
LGTM - At a high level, this makes sense to me. Please give it a couple days for others to comment.

@fhahn (Contributor) left a comment:

Do you have any data on how widespread the impact of this is?

Penalizing loops with internal control flow makes sense to me in general, unless unrolling allows the branches to be folded. Should this be integrated with UnrolledInstAnalyzer? I haven't looked at that in a while, though, so I am not sure whether it can actually estimate such things correctly.


@davemgreen (Collaborator) commented:
I think this would be OK for the original case in x264 that got worse, and it sounds like it would make sense in general. I'm not sure about other specific cases, though; I know of cases where outer-loop unrolling can help the inner loop unroll too, like the example in https://reviews.llvm.org/D156532#4574569 but with more inner loops.

@nikic (Contributor, Author) commented Sep 25, 2023

> Do you have any data on how widespread the impact of this is?

Here is the code size impact on CTMark: http://llvm-compile-time-tracker.com/compare.php?from=1c9b63f1038d13be7d83015565e2ecc395964c04&to=46f453b2872f5c31de0a9f6c4e35cbe52765a608&stat=instructions:u And here are the corresponding IR diffs: nikic/llvm-ir-diffs@badc90e

> Penalizing loops with internal control flow makes sense to me in general, unless unrolling allows the branches to be folded. Should this be integrated with UnrolledInstAnalyzer?

Yeah, I think the "unless unrolling allows the branches to be folded" part is the key bit. This is not something we capture right now. The problem is that we have two modes of analyzing unrolling: analyzeLoopUnrollCost(), which is used for full unrolling up to 10 iterations and is very precise, but also very expensive (#Insts * #Iterations); and ApproximateLoopSize() for everything else, which is completely crude and does not reason about simplifications at all.

I'm considering making ApproximateLoopSize() smarter first, by determining which instructions we expect to constant-fold after full unrolling without any per-iteration analysis (basically, if we have a simple recurrence with a constant start and constant step, then any side-effect-free instruction depending only on it will constant-fold). This would make estimates for full unrolling more accurate and would let us skip the branch penalty for branches that will fold away. I'm somewhat afraid this will backfire, though, because it will increase the amount of unrolling if thresholds remain the same (and, generally speaking, we are at a point where average unrolling needs to decrease rather than increase for better performance).
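For contrast with the penalized case, here is a hypothetical loop (not from the PR; names invented) where the interior branch would fold away under that scheme: %i is a simple recurrence with constant start and step, and %cond depends only on it, so after full unrolling every copy of the branch becomes unconditional:

define void @foldable(ptr %p) {
entry:
  br label %header

header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  %cond = icmp eq i32 %i, 2                  ; depends only on the recurrence
  br i1 %cond, label %special, label %latch  ; folds to a constant in each unrolled copy

special:
  store i32 1, ptr %p, align 4
  br label %latch

latch:
  %i.next = add i32 %i, 1
  %done = icmp eq i32 %i.next, 4
  br i1 %done, label %exit, label %header

exit:
  ret void
}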

@nikic (Contributor, Author) commented Sep 28, 2023

I gave making full unroll cost modelling more precise a try; here's how it could look: 26df05e. But of course, that does result in non-trivial increases to text size: http://llvm-compile-time-tracker.com/compare.php?from=18be23f82a4c66ef6c838d3324979825acbcf49f&to=26df05e27ce6c6742c6f523c926f4874ff5ca3ef&stat=size-text I'm not entirely sure where to go from here -- making the cost modelling more precise clearly seems like a good idea, but that probably also needs a recalibration of unroll thresholds.
