[Passes] Run SimpleLoopUnswitch after introducing invariant branches. #81271

fhahn · 2024-02-09T16:40:40Z

IndVars may be able to replace a loop-dependent condition with a loop-invariant one, but loop-unswitch runs before IndVars, so the invariant check remains in the loop.

For an example, consider a read-only loop with a bounds check: https://godbolt.org/z/8cdj4qhbG
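
As a rough illustration of the motivating case, here is a standalone sketch shaped after the test `read_only_loop_with_runtime_check`. It is not the code behind the godbolt link: the function names are hypothetical, and a thrown exception stands in for `llvm.trap` so the behaviour can be checked. `sumChecked` re-tests the bound on every iteration. `sumUnswitched` shows the shape once the check has been made loop-invariant and hoisted, which is only valid here because the loop just reads memory, so nothing observable happens before the trap.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical source-level loop: every iteration re-checks i against count.
int32_t sumChecked(const int32_t *array, uint32_t count, uint32_t n) {
  int32_t sum = 0;
  for (uint32_t i = 0; i < n; ++i) {
    if (i >= count)
      throw std::out_of_range("index out of bounds"); // stands in for a trap
    sum += array[i];
  }
  return sum;
}

// Sketch of the unswitched form: some iteration fails the check exactly when
// the last index, n - 1, is out of bounds, so one invariant test guards the
// whole loop and the loop body is left check-free.
int32_t sumUnswitched(const int32_t *array, uint32_t count, uint32_t n) {
  if (n == 0)
    return 0;
  if (n - 1 >= count)
    throw std::out_of_range("index out of bounds");
  int32_t sum = 0;
  for (uint32_t i = 0; i < n; ++i)
    sum += array[i];
  return sum;
}
```

With the check out of the loop body, the remaining loop is a plain reduction, which is the form the updated CHECK lines in `hoist-runtime-checks.ll` show being vectorized.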

This patch uses an approach similar to the way extra cleanup passes are run on demand after vectorization (added in acea6e9).

It introduces a new ShouldRunExtraSimpleLoopUnswitch analysis marker, which IndVars can use to indicate that extra unswitching is beneficial.

ExtraSimpleLoopUnswitchPassManager uses this analysis to determine whether to run its passes on a loop.
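
The marker pattern described above can be modelled without any LLVM headers. The sketch below is a toy under stated assumptions, not the patch's implementation: `ToyAnalysisCache`, `toyIndVars` and `ToyExtraPassManager` are hypothetical names, loops are identified by strings, and a set stands in for the analysis manager's cache. It only shows the control flow: the producer caches a marker when it creates an invariant condition, and the consumer runs its nested passes only when the marker is cached, then drops it.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a loop analysis cache: a loop's name is in the set while
// the marker result is cached for that loop.
struct ToyAnalysisCache {
  std::set<std::string> Marked;
  bool hasCachedMarker(const std::string &Loop) const {
    return Marked.count(Loop) != 0;
  }
};

// Toy producer: records the marker only when it made a condition invariant.
void toyIndVars(const std::string &Loop, bool MadeConditionInvariant,
                ToyAnalysisCache &Cache) {
  if (MadeConditionInvariant)
    Cache.Marked.insert(Loop);
}

// Toy consumer: runs its nested passes only if the marker is cached, then
// drops the marker so a later run does not act on a stale request.
struct ToyExtraPassManager {
  std::vector<std::function<void(const std::string &)>> Passes;

  // Returns true if the nested passes were run on this loop.
  bool run(const std::string &Loop, ToyAnalysisCache &Cache) {
    bool Ran = false;
    if (Cache.hasCachedMarker(Loop)) {
      for (auto &P : Passes)
        P(Loop);
      Ran = true;
    }
    Cache.Marked.erase(Loop);
    return Ran;
  }
};
```

The point of the design is that loops where IndVars exposed nothing pay only for a cache lookup, which is consistent with the small compile-time difference reported for the on-demand variant.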

Compile-time impact (geomean) ranges from +0.0% to +0.02%: https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=19e6e99eeb280d426907ea73a21b139ba7225627&stat=instructions%3Au

Compile-time impact (geomean) of unconditionally running SimpleLoopUnswitch ranges from +0.05% to +0.16%:
https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=2930dfd5accdce2e6f8d5146ae4d626add2065a2&stat=instructions:u

The results from unconditionally running SimpleLoopUnswitch seem to indicate that there are multiple other scenarios where we fail to run unswitching when opportunities remain.

Fixes #85551.

llvmbot · 2024-02-09T16:41:12Z

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Full diff: https://github.com/llvm/llvm-project/pull/81271.diff

8 Files Affected:

(modified) llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h (+35)
(modified) llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h (+8-4)
(modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+7)
(modified) llvm/lib/Passes/PassRegistry.def (+3)
(modified) llvm/lib/Transforms/Scalar/IndVarSimplify.cpp (+15-2)
(modified) llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp (+1)
(modified) llvm/lib/Transforms/Utils/SimplifyIndVar.cpp (+11-6)
(modified) llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll (+43-11)

diff --git a/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h b/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h
index b97ee23fc0e65d..a40b0a2a9ab72f 100644
--- a/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h
+++ b/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h
@@ -12,6 +12,7 @@
 #include "llvm/ADT/STLFunctionalExtras.h"
 #include "llvm/Analysis/LoopAnalysisManager.h"
 #include "llvm/IR/PassManager.h"
+#include "llvm/Transforms/Scalar/LoopPassManager.h"
 
 namespace llvm {
 
@@ -20,6 +21,40 @@ class Loop;
 class StringRef;
 class raw_ostream;
 
+struct ShouldRunExtraSimpleLoopUnswitch
+    : public AnalysisInfoMixin<ShouldRunExtraSimpleLoopUnswitch> {
+  static AnalysisKey Key;
+  struct Result {
+    bool invalidate(Loop &L, const PreservedAnalyses &PA,
+                    LoopAnalysisManager::Invalidator &) {
+      // Check whether the analysis has been explicitly invalidated. Otherwise,
+      // it remains preserved.
+      auto PAC = PA.getChecker<ShouldRunExtraSimpleLoopUnswitch>();
+      return !PAC.preservedWhenStateless();
+    }
+  };
+
+  Result run(Loop &L, LoopAnalysisManager &AM,
+             LoopStandardAnalysisResults &AR) {
+    return Result();
+  }
+
+  static bool isRequired() { return true; }
+};
+
+struct ExtraSimpleLoopUnswitchPassManager : public LoopPassManager {
+  PreservedAnalyses run(Loop &L, LoopAnalysisManager &AM,
+                        LoopStandardAnalysisResults &AR, LPMUpdater &U) {
+    auto PA = PreservedAnalyses::all();
+    if (AM.getCachedResult<ShouldRunExtraSimpleLoopUnswitch>(L))
+      PA.intersect(LoopPassManager::run(L, AM, AR, U));
+    PA.abandon<ShouldRunExtraSimpleLoopUnswitch>();
+    return PA;
+  }
+
+  static bool isRequired() { return true; }
+};
+
 /// This pass transforms loops that contain branches or switches on loop-
 /// invariant conditions to have multiple loops. For example, it turns the left
 /// into the right code:
diff --git a/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h b/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h
index ff60811b616859..b39ca05f3f4b14 100644
--- a/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h
+++ b/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h
@@ -15,6 +15,8 @@
 #ifndef LLVM_TRANSFORMS_UTILS_SIMPLIFYINDVAR_H
 #define LLVM_TRANSFORMS_UTILS_SIMPLIFYINDVAR_H
 
+#include <utility>
+
 namespace llvm {
 
 class Type;
@@ -47,10 +49,12 @@ class IVVisitor {
 
 /// simplifyUsersOfIV - Simplify instructions that use this induction variable
 /// by using ScalarEvolution to analyze the IV's recurrence.
-bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
-                       LoopInfo *LI, const TargetTransformInfo *TTI,
-                       SmallVectorImpl<WeakTrackingVH> &Dead,
-                       SCEVExpander &Rewriter, IVVisitor *V = nullptr);
+std::pair<bool, bool> simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE,
+                                        DominatorTree *DT, LoopInfo *LI,
+                                        const TargetTransformInfo *TTI,
+                                        SmallVectorImpl<WeakTrackingVH> &Dead,
+                                        SCEVExpander &Rewriter,
+                                        IVVisitor *V = nullptr);
 
 /// SimplifyLoopIVs - Simplify users of induction variables within this
 /// loop. This does not actually change or add IVs.
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 6ede8638291206..8187d78d41e928 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -627,6 +627,13 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
   LPM2.addPass(LoopIdiomRecognizePass());
   LPM2.addPass(IndVarSimplifyPass());
 
+  {
+    ExtraSimpleLoopUnswitchPassManager ExtraPasses;
+    ExtraPasses.addPass(SimpleLoopUnswitchPass(/* NonTrivial */ Level ==
+                                               OptimizationLevel::O3));
+    LPM2.addPass(std::move(ExtraPasses));
+  }
+
   invokeLateLoopOptimizationsEPCallbacks(LPM2, Level);
 
   LPM2.addPass(LoopDeletionPass());
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 6cb87fba426463..4ac2a6b48df220 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -583,6 +583,9 @@ LOOP_ANALYSIS("ddg", DDGAnalysis())
 LOOP_ANALYSIS("iv-users", IVUsersAnalysis())
 LOOP_ANALYSIS("no-op-loop", NoOpLoopAnalysis())
 LOOP_ANALYSIS("pass-instrumentation", PassInstrumentationAnalysis(PIC))
+LOOP_ANALYSIS("should-run-extra-simple-loop-unswitch",
+                  ShouldRunExtraSimpleLoopUnswitch())
+
 #undef LOOP_ANALYSIS
 
 #ifndef LOOP_PASS
diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
index 41c4d623617347..b2905109af327e 100644
--- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
+++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
@@ -70,6 +70,7 @@
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Scalar/SimpleLoopUnswitch.h"
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"
 #include "llvm/Transforms/Utils/Local.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
@@ -137,6 +138,8 @@ class IndVarSimplify {
   SmallVector<WeakTrackingVH, 16> DeadInsts;
   bool WidenIndVars;
 
+  bool RunUnswitching = false;
+
   bool handleFloatingPointIV(Loop *L, PHINode *PH);
   bool rewriteNonIntegerIVs(Loop *L);
 
@@ -170,6 +173,8 @@ class IndVarSimplify {
   }
 
   bool run(Loop *L);
+
+  bool runUnswitching() const { return RunUnswitching; }
 };
 
 } // end anonymous namespace
@@ -614,9 +619,11 @@ bool IndVarSimplify::simplifyAndExtend(Loop *L,
       // Information about sign/zero extensions of CurrIV.
       IndVarSimplifyVisitor Visitor(CurrIV, SE, TTI, DT);
 
-      Changed |= simplifyUsersOfIV(CurrIV, SE, DT, LI, TTI, DeadInsts, Rewriter,
-                                   &Visitor);
+      const auto &[C, U] = simplifyUsersOfIV(CurrIV, SE, DT, LI, TTI, DeadInsts,
+                                             Rewriter, &Visitor);
 
+      Changed |= C;
+      RunUnswitching |= U;
       if (Visitor.WI.WidestNativeType) {
         WideIVs.push_back(Visitor.WI);
       }
@@ -1873,6 +1880,7 @@ bool IndVarSimplify::predicateLoopExits(Loop *L, SCEVExpander &Rewriter) {
     if (OldCond->use_empty())
       DeadInsts.emplace_back(OldCond);
     Changed = true;
+    RunUnswitching = true;
   }
 
   return Changed;
@@ -2058,6 +2066,11 @@ PreservedAnalyses IndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &AM,
 
   auto PA = getLoopPassPreservedAnalyses();
   PA.preserveSet<CFGAnalyses>();
+  if (IVS.runUnswitching()) {
+    AM.getResult<ShouldRunExtraSimpleLoopUnswitch>(L, AR);
+    PA.preserve<ShouldRunExtraSimpleLoopUnswitch>();
+  }
+
   if (AR.MSSA)
     PA.preserve<MemorySSAAnalysis>();
   return PA;
diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index 7eb0ba1c2c1793..2bfee6e4c0e421 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -133,6 +133,7 @@ static cl::opt<unsigned> InjectInvariantConditionHotnesThreshold(
                          "not-taken 1/<this option> times or less."),
     cl::init(16));
 
+AnalysisKey ShouldRunExtraSimpleLoopUnswitch::Key;
 namespace {
 struct CompareDesc {
   BranchInst *Term;
diff --git a/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp b/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
index 1b142f14d81139..07949094431919 100644
--- a/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
@@ -59,6 +59,7 @@ namespace {
     SmallVectorImpl<WeakTrackingVH> &DeadInsts;
 
     bool Changed = false;
+    bool RunUnswitching = false;
 
   public:
     SimplifyIndvar(Loop *Loop, ScalarEvolution *SE, DominatorTree *DT,
@@ -71,6 +72,7 @@ namespace {
     }
 
     bool hasChanged() const { return Changed; }
+    bool runUnswitching() const { return RunUnswitching; }
 
     /// Iteratively perform simplification on a worklist of users of the
     /// specified induction variable. This is the top-level driver that applies
@@ -232,6 +234,7 @@ bool SimplifyIndvar::makeIVComparisonInvariant(ICmpInst *ICmp,
   ICmp->setPredicate(InvariantPredicate);
   ICmp->setOperand(0, NewLHS);
   ICmp->setOperand(1, NewRHS);
+  RunUnswitching = true;
   return true;
 }
 
@@ -981,14 +984,15 @@ void IVVisitor::anchor() { }
 
 /// Simplify instructions that use this induction variable
 /// by using ScalarEvolution to analyze the IV's recurrence.
-bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
-                       LoopInfo *LI, const TargetTransformInfo *TTI,
-                       SmallVectorImpl<WeakTrackingVH> &Dead,
-                       SCEVExpander &Rewriter, IVVisitor *V) {
+std::pair<bool, bool> simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE,
+                                        DominatorTree *DT, LoopInfo *LI,
+                                        const TargetTransformInfo *TTI,
+                                        SmallVectorImpl<WeakTrackingVH> &Dead,
+                                        SCEVExpander &Rewriter, IVVisitor *V) {
   SimplifyIndvar SIV(LI->getLoopFor(CurrIV->getParent()), SE, DT, LI, TTI,
                      Rewriter, Dead);
   SIV.simplifyUsers(CurrIV, V);
-  return SIV.hasChanged();
+  return {SIV.hasChanged(), SIV.runUnswitching()};
 }
 
 /// Simplify users of induction variables within this
@@ -1002,8 +1006,9 @@ bool simplifyLoopIVs(Loop *L, ScalarEvolution *SE, DominatorTree *DT,
 #endif
   bool Changed = false;
   for (BasicBlock::iterator I = L->getHeader()->begin(); isa<PHINode>(I); ++I) {
-    Changed |=
+    const auto &[C, _] =
         simplifyUsersOfIV(cast<PHINode>(I), SE, DT, LI, TTI, Dead, Rewriter);
+    Changed |= C;
   }
   return Changed;
 }
diff --git a/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll b/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll
index c6c9a52167d54e..ddb8e6133ebf63 100644
--- a/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll
+++ b/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll
@@ -14,24 +14,50 @@ define i32 @read_only_loop_with_runtime_check(ptr noundef %array, i32 noundef %c
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[N]] to i64
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[N]], -1
 ; CHECK-NEXT:    [[DOTNOT_NOT:%.*]] = icmp ult i32 [[TMP1]], [[COUNT]]
+; CHECK-NEXT:    br i1 [[DOTNOT_NOT]], label [[FOR_BODY_PREHEADER10:%.*]], label [[IF_THEN:%.*]]
+; CHECK:       for.body.preheader10:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY_PREHEADER13:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i64 [[TMP0]], 4294967288
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI11:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[INDEX]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i64 16
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD12:%.*]] = load <4 x i32>, ptr [[TMP3]], align 4
+; CHECK-NEXT:    [[TMP4]] = add <4 x i32> [[WIDE_LOAD]], [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP5]] = add <4 x i32> [[WIDE_LOAD12]], [[VEC_PHI11]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[BIN_RDX:%.*]] = add <4 x i32> [[TMP5]], [[TMP4]]
+; CHECK-NEXT:    [[TMP7:%.*]] = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX]])
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_PREHEADER13]]
+; CHECK:       for.body.preheader13:
+; CHECK-NEXT:    [[INDVARS_IV_PH:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER10]] ], [ [[N_VEC]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[SUM_07_PH:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER10]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.cond.cleanup:
-; CHECK-NEXT:    [[SUM_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD:%.*]], [[IF_END:%.*]] ]
+; CHECK-NEXT:    [[SUM_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
 ; CHECK-NEXT:    ret i32 [[SUM_0_LCSSA]]
 ; CHECK:       for.body:
-; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[IF_END]] ]
-; CHECK-NEXT:    [[SUM_07:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[ADD]], [[IF_END]] ]
-; CHECK-NEXT:    br i1 [[DOTNOT_NOT]], label [[IF_END]], label [[IF_THEN:%.*]]
-; CHECK:       if.then:
-; CHECK-NEXT:    tail call void @llvm.trap()
-; CHECK-NEXT:    unreachable
-; CHECK:       if.end:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[INDVARS_IV_PH]], [[FOR_BODY_PREHEADER13]] ]
+; CHECK-NEXT:    [[SUM_07:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[SUM_07_PH]], [[FOR_BODY_PREHEADER13]] ]
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[INDVARS_IV]]
-; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[ADD]] = add nsw i32 [[TMP2]], [[SUM_07]]
+; CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ADD]] = add nsw i32 [[TMP8]], [[SUM_07]]
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[TMP0]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       if.then:
+; CHECK-NEXT:    tail call void @llvm.trap()
+; CHECK-NEXT:    unreachable
 ;
 entry:
   %array.addr = alloca ptr, align 8
@@ -96,3 +122,9 @@ declare void @llvm.lifetime.start.p0(i64 immarg, ptr nocapture)
 declare void @llvm.trap()
 
 declare void @llvm.lifetime.end.p0(i64 immarg, ptr nocapture)
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.

After aea2a14, partial unrolling now happens much later in the pipeline. At this point, we miss a number of simplification passes to further optimize the partially unrolled loop. In some cases, this can cause notable performance regressions.

To improve codegen for partially unrolled loop bodies, run IndVars on demand on partially unrolled loops.

This patch uses an approach similar to the way extra cleanup passes are run on demand after vectorization (added in acea6e9). If we decide to make wider use of this pattern, we should probably factor out the pass manager/analysis logic before landing this. Another instance that can be improved by this pattern: llvm#81271

Compile-time impact (geomean) ranges from +0.11% to +0.19%: https://llvm-compile-time-tracker.com/compare.php?from=68cef34784e33539b75f2c379d8e21a87719d67c&to=446d5e5b2fbfb5efc6cbc40584a31564327df584&stat=instructions:u

This could be reduced by only running a small subset of IndVars to optimize inductions, where the impact ranges from +0.05% to +0.10%: https://llvm-compile-time-tracker.com/compare.php?from=68cef34784e33539b75f2c379d8e21a87719d67c&to=bbee5fa003136618446dc30e6ac3897ab03facd7&stat=instructions%3Au

alinas

This brings back the discussion that we really should have better infrastructure in the NPM to specify the conditional running of passes. The extra-vectorizer-passes change was supposed to be a one time approach.

Do you have a measure of how this affects execution time for some workloads (e.g. spec)? Could you also test this on https://github.com/google/fleetbench?

alinas · 2024-02-09T17:12:44Z

llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h

@@ -47,10 +49,12 @@ class IVVisitor {

 /// simplifyUsersOfIV - Simplify instructions that use this induction variable
 /// by using ScalarEvolution to analyze the IV's recurrence.


Add comment on what the new return value represents.

Added, thanks!
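
For readers skimming the thread, the sketch below shows one way a two-flag return like this could be documented and consumed. It is a toy model with hypothetical names (`ToyIVResult`, `toySimplifyUsersOfIV`, `toySimplifyAll`), and the doc comment is illustrative only; the wording of the comment actually added to the patch is not shown in this thread.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-in for the outcome of simplifying one induction variable.
struct ToyIVResult {
  bool Simplified;       // some user of the IV was rewritten
  bool MadeCmpInvariant; // a comparison became loop-invariant
};

/// Returns a pair: first is true if anything changed, second is true if the
/// change exposed a new opportunity for loop unswitching.
std::pair<bool, bool> toySimplifyUsersOfIV(const ToyIVResult &IV) {
  return {IV.Simplified || IV.MadeCmpInvariant, IV.MadeCmpInvariant};
}

// Accumulates both flags across all IVs, the way a caller that cares about
// unswitching would. A caller that does not care can ignore the second flag.
std::pair<bool, bool> toySimplifyAll(const std::vector<ToyIVResult> &IVs) {
  bool Changed = false;
  bool RunUnswitching = false;
  for (const ToyIVResult &IV : IVs) {
    std::pair<bool, bool> R = toySimplifyUsersOfIV(IV);
    Changed |= R.first;
    RunUnswitching |= R.second;
  }
  return {Changed, RunUnswitching};
}
```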

alinas · 2024-02-09T17:16:00Z

llvm/lib/Transforms/Utils/SimplifyIndVar.cpp

-bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
-                       LoopInfo *LI, const TargetTransformInfo *TTI,
-                       SmallVectorImpl<WeakTrackingVH> &Dead,
-                       SCEVExpander &Rewriter, IVVisitor *V) {


Same here, add a comment on what the pair return stands for.

Added, thanks!

fhahn · 2024-02-09T20:21:12Z

> This brings back the discussion that we really should have better infrastructure in the NPM to specify the conditional running of passes. The extra-vectorizer-passes change was supposed to be a one time approach.

Sure, I am not really tied to the current approach and would be happy to explore alternative options; the current implementation was just the easiest way to get this done I was aware of.

> Do you have a measure of how this affects execution time for some workloads (e.g. spec)? Could you also test this on https://github.com/google/fleetbench?

Not yet, but I am planning to do this. I think the bounds check example highlights a good missed IndVars -> SLU interaction.

fhahn · 2024-02-20T14:07:12Z

@alinas @aeubanks any preferences/suggestions on how to better handle this in general?

fhahn · 2024-03-12T15:00:54Z

ping :)

preames · 2024-03-13T17:32:59Z

Drive by comment on alternatives...

The loop pass manager already supports requeueing a loop. This is done inside LPM1 today by e.g. both unswitch and rotate. This can cause significant compile time problems of its own - which is how I noticed it again recently.

One option here would be to either a) requeue the loop in IndVarSimplify if a loop invariant condition is exposed as part of LPM2 (this would still require adding unswitch to LPM2), or b) update indvars to preserve memory ssa and move it to LPM1 which can already iterate.

As an aside, starting with trivial only unswitching is probably much safer. I saw a really nasty compile time problem recently caused by the requeue interaction with non-trivial unswitch.
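
To illustrate the requeue concern in isolation, here is a toy worklist, not LLVM's loop pass manager. `runWithRequeue` and its `MaxVisits` cap are hypothetical and exist only for this sketch; the pass callback returns true to ask for the loop to be revisited.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>

// Toy loop worklist. The pass receives the loop name and the number of pass
// invocations made so far, and returns true to request a requeue.
// Returns the total number of pass invocations; MaxVisits bounds the work.
unsigned runWithRequeue(
    std::deque<std::string> Worklist,
    const std::function<bool(const std::string &, unsigned)> &Pass,
    unsigned MaxVisits) {
  unsigned Visits = 0;
  while (!Worklist.empty() && Visits < MaxVisits) {
    std::string Loop = Worklist.front();
    Worklist.pop_front();
    if (Pass(Loop, Visits++))
      Worklist.push_back(Loop); // requeue: the loop is processed again
  }
  return Visits;
}
```

A pass that requeues once costs one extra visit, while a pass that keeps requeueing is limited only by whatever bound the driver imposes, which is the kind of compile-time risk described above.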

alinas · 2024-03-13T18:44:49Z

One way to approach conditional running of passes is by factoring out key transforms into static methods, akin to utils. This is possible for the other patch with IndVars (#81275), and it could work with some refactoring of SLU too, i.e. making the existing functions available. I'm not sure if that's better :-).

Adding to LPM2 as in the current commit should work fine (SLU doesn't really need MSSA; it can use it for a quick exit, it will update it if available but it's not necessary; and LPM2 doesn't have it available). The only caveat is spreading the pattern of going through a fake analysis to signal that a pass should be run conditionally.
Note also that adding SLU to LPM2 doesn't give you BFI either, which is available in LPM1.
So, is it an option to add IndVarSimplify to LPM1? (I don't think it makes any changes that would invalidate MSSA, so that should be fairly straightforward to add) Does it address the use case you are looking at? And following up with the same question as for the current approach: what's the compile time vs execution time impact then? i.e. is it worth adding it?

fhahn · 2024-04-05T12:01:49Z

> So, is it an option to add IndVarSimplify to LPM1? (I don't think it makes any changes that would invalidate MSSA, so that should be fairly straightforward to add) Does it address the use case you are looking at? And following up with the same question as for the current approach: what's the compile time vs execution time impact then? i.e. is it worth adding it?

I finally had time to try this out in aff602c, apologies for the long delay.

The impact of moving IndVarSimplify to LPM1 is much bigger, both in terms of compile-time and binary changes: https://llvm-compile-time-tracker.com/compare.php?from=1d06f41b72e429a5b3ba318ff639b8b997e21ff8&to=aff602c6f3d9b66d7539d72e9f78adbf49118890&stat=instructions:u. I think a number of optimizations went into IndVarSimplify to make it work very well at the specific position. There may be some interesting insights from some of the bigger changes caused by moving it, but it seems moving it to LPM1 would be quite a heavy hammer.

Exposing the unswitching logic directly and using that on demand in IndVarSimplify would likely be the most efficient approach. I also compared using a dedicated pass vs directly invoking functionality on demand in #83860, and direct invocation was noticeably faster. Should we go down this route?

nikic · 2024-04-10T06:16:15Z

In the interest of not making perfect the enemy of good, I'd be happy to go forward with your current approach.

Could you please resolve the comments regarding extra comments and check whether this patch fixes #85551 (and if so add another test)?

fhahn

> Could you please resolve the comments regarding extra comments and check whether this patch fixes #85551 (and if so add another test)?

Thanks, I added the comments and checked that it fixes #85551. Added a test case in 94ed57d.

fhahn · 2024-04-10T12:41:32Z

llvm/lib/Transforms/Utils/SimplifyIndVar.cpp

-bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
-                       LoopInfo *LI, const TargetTransformInfo *TTI,
-                       SmallVectorImpl<WeakTrackingVH> &Dead,
-                       SCEVExpander &Rewriter, IVVisitor *V) {


Added, thanks!

fhahn · 2024-04-10T12:41:37Z

llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h

@@ -47,10 +49,12 @@ class IVVisitor {

 /// simplifyUsersOfIV - Simplify instructions that use this induction variable
 /// by using ScalarEvolution to analyze the IV's recurrence.


Added, thanks!

nikic

LGTM

antmox · 2024-04-16T13:11:34Z

Hi @fhahn,
Looks like this broke the clang-aarch64-full-2stage bot: https://lab.llvm.org/buildbot/#/builders/179/builds/9867
Could you please take a look?

fhahn requested review from nikic, preames, alinas and aeubanks February 9, 2024 16:40

llvmbot added the llvm:transforms label Feb 9, 2024

fhahn mentioned this pull request Feb 9, 2024

[Passes] Run IndVars after late partial unrolling. #81275

Open

alinas reviewed Feb 9, 2024

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request Mar 12, 2024

pre-commit: test PR81271

eb8bbcd

PR Link: llvm/llvm-project#81271

dtcxzyw mentioned this pull request Mar 12, 2024

pre-commit: test PR81271 dtcxzyw/llvm-opt-benchmark#394

Closed

nikic mentioned this pull request Mar 16, 2024

Indexing bounds checks are not hoisted out of a loop #85551

Closed

fhahn added 3 commits April 10, 2024 10:29

Merge remote-tracking branch 'origin/main' into perf/extra-unswitch

1089034

Merge remote-tracking branch 'origin/main' into perf/extra-unswitch

c44e229

Conflicts: llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll

!fixup add comments

913e3a4

fhahn commented Apr 10, 2024

nikic approved these changes Apr 11, 2024

fhahn merged commit 0f82469 into llvm:main Apr 12, 2024
4 checks passed

fhahn deleted the perf/extra-unswitch branch April 12, 2024 21:07
