Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Passes] Run SimpleLoopUnswitch after introducing invariant branches. #81271

Merged
merged 4 commits into from
Apr 12, 2024

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Feb 9, 2024

IndVars may be able to replace a loop dependent condition with a loop invariant one, but loop-unswitch runs before IndVars, so the invariant check remains in the loop.

For an example, consider a read-only loop with a bounds check: https://godbolt.org/z/8cdj4qhbG

This patch uses a approach similar to the way extra cleanup passes are run on demand after vectorization (added in acea6e9).

It introduces a new ShouldRunExtraSimpleLoopUnswitch analysis marker, which IndVars can use to indicate that extra unswitching is beneficial.

ExtraSimpleLoopUnswitchPassManager uses this analysis to determine whether to run its passes on a loop.

Compile-time impact (geomean) ranges from +0.0% to 0.02% https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=19e6e99eeb280d426907ea73a21b139ba7225627&stat=instructions%3Au

Compile-time impact (geomean) of unconditionally running SimpleLoopUnswitch ranges from +0.05% - +0.16%
https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=2930dfd5accdce2e6f8d5146ae4d626add2065a2&stat=instructions:u

Unconditionally running SimpleLoopUnswitch seems to indicate that there are multiple other scenarios where we fail to run unswitching when opportunities remain.

Fixes #85551.

IndVars may be able to replace a loop dependent condition with a loop
invariant one, but loop-unswitch runs before IndVars, so the invariant
check remains in the loop.

For an example, consider a read-only loop with a bounds check:
https://godbolt.org/z/8cdj4qhbG

This patch uses a approach similar to the way extra cleanup passes are
run on demand after vectorization (added in acea6e9).

It introduces a new ShouldRunExtraSimpleLoopUnswitch analysis marker,
which IndVars can use to indicate that extra unswitching is beneficial.

ExtraSimpleLoopUnswitchPassManager uses this analysis to determine
whether to run its passes on a loop.

Compile-time impact (geomean) ranges from +0.0% to 0.02%
https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=19e6e99eeb280d426907ea73a21b139ba7225627&stat=instructions%3Au

Compile-time impact (geomean) of unconditionally running SimpleLoopUnswitch ranges
from +0.05% - +0.16%
https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=2930dfd5accdce2e6f8d5146ae4d626add2065a2&stat=instructions:u

Unconditionally running SimpleLoopUnswitch seems to indicate that there
are multiple other scenarios where we fail to run unswitching when
opportunities remain.
@llvmbot
Copy link
Collaborator

llvmbot commented Feb 9, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

IndVars may be able to replace a loop dependent condition with a loop invariant one, but loop-unswitch runs before IndVars, so the invariant check remains in the loop.

For an example, consider a read-only loop with a bounds check: https://godbolt.org/z/8cdj4qhbG

This patch uses a approach similar to the way extra cleanup passes are run on demand after vectorization (added in acea6e9).

It introduces a new ShouldRunExtraSimpleLoopUnswitch analysis marker, which IndVars can use to indicate that extra unswitching is beneficial.

ExtraSimpleLoopUnswitchPassManager uses this analysis to determine whether to run its passes on a loop.

Compile-time impact (geomean) ranges from +0.0% to 0.02% https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=19e6e99eeb280d426907ea73a21b139ba7225627&stat=instructions%3Au

Compile-time impact (geomean) of unconditionally running SimpleLoopUnswitch ranges from +0.05% - +0.16%
https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=2930dfd5accdce2e6f8d5146ae4d626add2065a2&stat=instructions:u

Unconditionally running SimpleLoopUnswitch seems to indicate that there are multiple other scenarios where we fail to run unswitching when opportunities remain.


Full diff: https://github.com/llvm/llvm-project/pull/81271.diff

8 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h (+35)
  • (modified) llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h (+8-4)
  • (modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+7)
  • (modified) llvm/lib/Passes/PassRegistry.def (+3)
  • (modified) llvm/lib/Transforms/Scalar/IndVarSimplify.cpp (+15-2)
  • (modified) llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp (+1)
  • (modified) llvm/lib/Transforms/Utils/SimplifyIndVar.cpp (+11-6)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll (+43-11)
diff --git a/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h b/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h
index b97ee23fc0e65d..a40b0a2a9ab72f 100644
--- a/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h
+++ b/llvm/include/llvm/Transforms/Scalar/SimpleLoopUnswitch.h
@@ -12,6 +12,7 @@
 #include "llvm/ADT/STLFunctionalExtras.h"
 #include "llvm/Analysis/LoopAnalysisManager.h"
 #include "llvm/IR/PassManager.h"
+#include "llvm/Transforms/Scalar/LoopPassManager.h"
 
 namespace llvm {
 
@@ -20,6 +21,40 @@ class Loop;
 class StringRef;
 class raw_ostream;
 
+struct ShouldRunExtraSimpleLoopUnswitch
+    : public AnalysisInfoMixin<ShouldRunExtraSimpleLoopUnswitch> {
+  static AnalysisKey Key;
+  struct Result {
+    bool invalidate(Loop &L, const PreservedAnalyses &PA,
+                    LoopAnalysisManager::Invalidator &) {
+      // Check whether the analysis has been explicitly invalidated. Otherwise,
+      // it remains preserved.
+      auto PAC = PA.getChecker<ShouldRunExtraSimpleLoopUnswitch>();
+      return !PAC.preservedWhenStateless();
+    }
+  };
+
+  Result run(Loop &L, LoopAnalysisManager &AM,
+             LoopStandardAnalysisResults &AR) {
+    return Result();
+  }
+
+  static bool isRequired() { return true; }
+};
+
+struct ExtraSimpleLoopUnswitchPassManager : public LoopPassManager {
+  PreservedAnalyses run(Loop &L, LoopAnalysisManager &AM,
+                        LoopStandardAnalysisResults &AR, LPMUpdater &U) {
+    auto PA = PreservedAnalyses::all();
+    if (AM.getCachedResult<ShouldRunExtraSimpleLoopUnswitch>(L))
+      PA.intersect(LoopPassManager::run(L, AM, AR, U));
+    PA.abandon<ShouldRunExtraSimpleLoopUnswitch>();
+    return PA;
+  }
+
+  static bool isRequired() { return true; }
+};
+
 /// This pass transforms loops that contain branches or switches on loop-
 /// invariant conditions to have multiple loops. For example, it turns the left
 /// into the right code:
diff --git a/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h b/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h
index ff60811b616859..b39ca05f3f4b14 100644
--- a/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h
+++ b/llvm/include/llvm/Transforms/Utils/SimplifyIndVar.h
@@ -15,6 +15,8 @@
 #ifndef LLVM_TRANSFORMS_UTILS_SIMPLIFYINDVAR_H
 #define LLVM_TRANSFORMS_UTILS_SIMPLIFYINDVAR_H
 
+#include <utility>
+
 namespace llvm {
 
 class Type;
@@ -47,10 +49,12 @@ class IVVisitor {
 
 /// simplifyUsersOfIV - Simplify instructions that use this induction variable
 /// by using ScalarEvolution to analyze the IV's recurrence.
-bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
-                       LoopInfo *LI, const TargetTransformInfo *TTI,
-                       SmallVectorImpl<WeakTrackingVH> &Dead,
-                       SCEVExpander &Rewriter, IVVisitor *V = nullptr);
+std::pair<bool, bool> simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE,
+                                        DominatorTree *DT, LoopInfo *LI,
+                                        const TargetTransformInfo *TTI,
+                                        SmallVectorImpl<WeakTrackingVH> &Dead,
+                                        SCEVExpander &Rewriter,
+                                        IVVisitor *V = nullptr);
 
 /// SimplifyLoopIVs - Simplify users of induction variables within this
 /// loop. This does not actually change or add IVs.
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 6ede8638291206..8187d78d41e928 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -627,6 +627,13 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
   LPM2.addPass(LoopIdiomRecognizePass());
   LPM2.addPass(IndVarSimplifyPass());
 
+  {
+    ExtraSimpleLoopUnswitchPassManager ExtraPasses;
+    ExtraPasses.addPass(SimpleLoopUnswitchPass(/* NonTrivial */ Level ==
+                                               OptimizationLevel::O3));
+    LPM2.addPass(std::move(ExtraPasses));
+  }
+
   invokeLateLoopOptimizationsEPCallbacks(LPM2, Level);
 
   LPM2.addPass(LoopDeletionPass());
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 6cb87fba426463..4ac2a6b48df220 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -583,6 +583,9 @@ LOOP_ANALYSIS("ddg", DDGAnalysis())
 LOOP_ANALYSIS("iv-users", IVUsersAnalysis())
 LOOP_ANALYSIS("no-op-loop", NoOpLoopAnalysis())
 LOOP_ANALYSIS("pass-instrumentation", PassInstrumentationAnalysis(PIC))
+LOOP_ANALYSIS("should-run-extra-simple-loop-unswitch",
+                  ShouldRunExtraSimpleLoopUnswitch())
+
 #undef LOOP_ANALYSIS
 
 #ifndef LOOP_PASS
diff --git a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
index 41c4d623617347..b2905109af327e 100644
--- a/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
+++ b/llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
@@ -70,6 +70,7 @@
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Scalar/SimpleLoopUnswitch.h"
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"
 #include "llvm/Transforms/Utils/Local.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
@@ -137,6 +138,8 @@ class IndVarSimplify {
   SmallVector<WeakTrackingVH, 16> DeadInsts;
   bool WidenIndVars;
 
+  bool RunUnswitching = false;
+
   bool handleFloatingPointIV(Loop *L, PHINode *PH);
   bool rewriteNonIntegerIVs(Loop *L);
 
@@ -170,6 +173,8 @@ class IndVarSimplify {
   }
 
   bool run(Loop *L);
+
+  bool runUnswitching() const { return RunUnswitching; }
 };
 
 } // end anonymous namespace
@@ -614,9 +619,11 @@ bool IndVarSimplify::simplifyAndExtend(Loop *L,
       // Information about sign/zero extensions of CurrIV.
       IndVarSimplifyVisitor Visitor(CurrIV, SE, TTI, DT);
 
-      Changed |= simplifyUsersOfIV(CurrIV, SE, DT, LI, TTI, DeadInsts, Rewriter,
-                                   &Visitor);
+      const auto &[C, U] = simplifyUsersOfIV(CurrIV, SE, DT, LI, TTI, DeadInsts,
+                                             Rewriter, &Visitor);
 
+      Changed |= C;
+      RunUnswitching |= U;
       if (Visitor.WI.WidestNativeType) {
         WideIVs.push_back(Visitor.WI);
       }
@@ -1873,6 +1880,7 @@ bool IndVarSimplify::predicateLoopExits(Loop *L, SCEVExpander &Rewriter) {
     if (OldCond->use_empty())
       DeadInsts.emplace_back(OldCond);
     Changed = true;
+    RunUnswitching = true;
   }
 
   return Changed;
@@ -2058,6 +2066,11 @@ PreservedAnalyses IndVarSimplifyPass::run(Loop &L, LoopAnalysisManager &AM,
 
   auto PA = getLoopPassPreservedAnalyses();
   PA.preserveSet<CFGAnalyses>();
+  if (IVS.runUnswitching()) {
+    AM.getResult<ShouldRunExtraSimpleLoopUnswitch>(L, AR);
+    PA.preserve<ShouldRunExtraSimpleLoopUnswitch>();
+  }
+
   if (AR.MSSA)
     PA.preserve<MemorySSAAnalysis>();
   return PA;
diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index 7eb0ba1c2c1793..2bfee6e4c0e421 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -133,6 +133,7 @@ static cl::opt<unsigned> InjectInvariantConditionHotnesThreshold(
                          "not-taken 1/<this option> times or less."),
     cl::init(16));
 
+AnalysisKey ShouldRunExtraSimpleLoopUnswitch::Key;
 namespace {
 struct CompareDesc {
   BranchInst *Term;
diff --git a/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp b/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
index 1b142f14d81139..07949094431919 100644
--- a/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
@@ -59,6 +59,7 @@ namespace {
     SmallVectorImpl<WeakTrackingVH> &DeadInsts;
 
     bool Changed = false;
+    bool RunUnswitching = false;
 
   public:
     SimplifyIndvar(Loop *Loop, ScalarEvolution *SE, DominatorTree *DT,
@@ -71,6 +72,7 @@ namespace {
     }
 
     bool hasChanged() const { return Changed; }
+    bool runUnswitching() const { return RunUnswitching; }
 
     /// Iteratively perform simplification on a worklist of users of the
     /// specified induction variable. This is the top-level driver that applies
@@ -232,6 +234,7 @@ bool SimplifyIndvar::makeIVComparisonInvariant(ICmpInst *ICmp,
   ICmp->setPredicate(InvariantPredicate);
   ICmp->setOperand(0, NewLHS);
   ICmp->setOperand(1, NewRHS);
+  RunUnswitching = true;
   return true;
 }
 
@@ -981,14 +984,15 @@ void IVVisitor::anchor() { }
 
 /// Simplify instructions that use this induction variable
 /// by using ScalarEvolution to analyze the IV's recurrence.
-bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
-                       LoopInfo *LI, const TargetTransformInfo *TTI,
-                       SmallVectorImpl<WeakTrackingVH> &Dead,
-                       SCEVExpander &Rewriter, IVVisitor *V) {
+std::pair<bool, bool> simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE,
+                                        DominatorTree *DT, LoopInfo *LI,
+                                        const TargetTransformInfo *TTI,
+                                        SmallVectorImpl<WeakTrackingVH> &Dead,
+                                        SCEVExpander &Rewriter, IVVisitor *V) {
   SimplifyIndvar SIV(LI->getLoopFor(CurrIV->getParent()), SE, DT, LI, TTI,
                      Rewriter, Dead);
   SIV.simplifyUsers(CurrIV, V);
-  return SIV.hasChanged();
+  return {SIV.hasChanged(), SIV.runUnswitching()};
 }
 
 /// Simplify users of induction variables within this
@@ -1002,8 +1006,9 @@ bool simplifyLoopIVs(Loop *L, ScalarEvolution *SE, DominatorTree *DT,
 #endif
   bool Changed = false;
   for (BasicBlock::iterator I = L->getHeader()->begin(); isa<PHINode>(I); ++I) {
-    Changed |=
+    const auto &[C, _] =
         simplifyUsersOfIV(cast<PHINode>(I), SE, DT, LI, TTI, Dead, Rewriter);
+    Changed |= C;
   }
   return Changed;
 }
diff --git a/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll b/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll
index c6c9a52167d54e..ddb8e6133ebf63 100644
--- a/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll
+++ b/llvm/test/Transforms/PhaseOrdering/AArch64/hoist-runtime-checks.ll
@@ -14,24 +14,50 @@ define i32 @read_only_loop_with_runtime_check(ptr noundef %array, i32 noundef %c
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[N]] to i64
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[N]], -1
 ; CHECK-NEXT:    [[DOTNOT_NOT:%.*]] = icmp ult i32 [[TMP1]], [[COUNT]]
+; CHECK-NEXT:    br i1 [[DOTNOT_NOT]], label [[FOR_BODY_PREHEADER10:%.*]], label [[IF_THEN:%.*]]
+; CHECK:       for.body.preheader10:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY_PREHEADER13:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_VEC:%.*]] = and i64 [[TMP0]], 4294967288
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI11:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[INDEX]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i64 16
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD12:%.*]] = load <4 x i32>, ptr [[TMP3]], align 4
+; CHECK-NEXT:    [[TMP4]] = add <4 x i32> [[WIDE_LOAD]], [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP5]] = add <4 x i32> [[WIDE_LOAD12]], [[VEC_PHI11]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[BIN_RDX:%.*]] = add <4 x i32> [[TMP5]], [[TMP4]]
+; CHECK-NEXT:    [[TMP7:%.*]] = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX]])
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_PREHEADER13]]
+; CHECK:       for.body.preheader13:
+; CHECK-NEXT:    [[INDVARS_IV_PH:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER10]] ], [ [[N_VEC]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[SUM_07_PH:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER10]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.cond.cleanup:
-; CHECK-NEXT:    [[SUM_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD:%.*]], [[IF_END:%.*]] ]
+; CHECK-NEXT:    [[SUM_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
 ; CHECK-NEXT:    ret i32 [[SUM_0_LCSSA]]
 ; CHECK:       for.body:
-; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[IF_END]] ]
-; CHECK-NEXT:    [[SUM_07:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[ADD]], [[IF_END]] ]
-; CHECK-NEXT:    br i1 [[DOTNOT_NOT]], label [[IF_END]], label [[IF_THEN:%.*]]
-; CHECK:       if.then:
-; CHECK-NEXT:    tail call void @llvm.trap()
-; CHECK-NEXT:    unreachable
-; CHECK:       if.end:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[INDVARS_IV_PH]], [[FOR_BODY_PREHEADER13]] ]
+; CHECK-NEXT:    [[SUM_07:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[SUM_07_PH]], [[FOR_BODY_PREHEADER13]] ]
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[INDVARS_IV]]
-; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[ADD]] = add nsw i32 [[TMP2]], [[SUM_07]]
+; CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ADD]] = add nsw i32 [[TMP8]], [[SUM_07]]
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[TMP0]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       if.then:
+; CHECK-NEXT:    tail call void @llvm.trap()
+; CHECK-NEXT:    unreachable
 ;
 entry:
   %array.addr = alloca ptr, align 8
@@ -96,3 +122,9 @@ declare void @llvm.lifetime.start.p0(i64 immarg, ptr nocapture)
 declare void @llvm.trap()
 
 declare void @llvm.lifetime.end.p0(i64 immarg, ptr nocapture)
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
+;.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Feb 9, 2024
After aea2a14, partial unrolling now happens much later in the
pipeline. At this point, we miss a number of simplification passes to
further optimize the partially unrolled loop.

In some cases, this can cause notable performance regressions.

To improve codegen for partial unrolled loop bodies, run IndVars on
demand on partially unrolled loops.

This patch uses a approach similar to the way extra cleanup passes are
run on demand after vectorization (added in acea6e9) and
if we decide to make wider use of this pattern, we should probably
factor our the pass manager/analysis logic, before landing this.

Another instance that can be improved by this pattern:
llvm#81271

Compile-time impact (geomeans) ranges from +0.11% to +0.19%.
https://llvm-compile-time-tracker.com/compare.php?from=68cef34784e33539b75f2c379d8e21a87719d67c&to=446d5e5b2fbfb5efc6cbc40584a31564327df584&stat=instructions:u

This could be reduced by only running a small subset of IndVars to
optimize inductions where the impact ranges from +0.05% to +0.10%.
https://llvm-compile-time-tracker.com/compare.php?from=68cef34784e33539b75f2c379d8e21a87719d67c&to=bbee5fa003136618446dc30e6ac3897ab03facd7&stat=instructions%3Au
Copy link
Contributor

@alinas alinas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This brings back the discussion that we really should have better infrastructure in the NPM to specify the conditional running of passes. The extra-vectorizer-passes change was supposed to be a one time approach.

Do you have a measure of how this affects execution time for some workloads (e.g. spec)? Could you also test this on https://github.com/google/fleetbench?

@@ -47,10 +49,12 @@ class IVVisitor {

/// simplifyUsersOfIV - Simplify instructions that use this induction variable
/// by using ScalarEvolution to analyze the IV's recurrence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment on what the new return value represents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
LoopInfo *LI, const TargetTransformInfo *TTI,
SmallVectorImpl<WeakTrackingVH> &Dead,
SCEVExpander &Rewriter, IVVisitor *V) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, add a comment on what the pair return stands for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

@fhahn
Copy link
Contributor Author

fhahn commented Feb 9, 2024

This brings back the discussion that we really should have better infrastructure in the NPM to specify the conditional running of passes. The extra-vectorizer-passes change was supposed to be a one time approach.

Sure, I am not really tied to the current approach and would be happy to explore alternative options; the current implementation was just the easiest way to get this done I was aware of.

Do you have a measure of how this affects execution time for some workloads (e.g. spec)? Could you also test this on https://github.com/google/fleetbench?

Not yet, but I am planning to do this. I think the bounds check example highlights a good missed indvars -> SLU interaction

@fhahn
Copy link
Contributor Author

fhahn commented Feb 20, 2024

@alinas @aeubanks any preferences/suggestions on how to better handle this in general?

@fhahn
Copy link
Contributor Author

fhahn commented Mar 12, 2024

ping :)

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request Mar 12, 2024
@preames
Copy link
Collaborator

preames commented Mar 13, 2024

Drive by comment on alternatives...

The loop pass manger already supports requeueing a loop. This is done inside LPM1 today by e.g. both unswitch and rotate. This can cause significant compile time problems of its own - which is how I noticed it again recently.

One option here would be to either a) requeue the loop in IndVarSimplify if a loop invariant condition is exposed as part of LPM2 (this would still require adding unswitch to LPM2), or b) update indvars to preserve memory ssa and move it to LPM1 which can already iterate.

As an aside, starting with trivial only unswitching is probably much safer. I saw a really nasty compile time problem recently caused by the requeue interaction with non-trivial unswitch.

@alinas
Copy link
Contributor

alinas commented Mar 13, 2024

One way to approach conditional running of passes is by factoring out key transforms into static method, akin to utils. This is possible for the other patch with IndVars (#81275), and it could work with some refactoring of SLU too, i.e. making the exiting functions available. I'm not sure if that's better :-).

Adding to LPM2 as in the current commit should work fine (SLU doesn't really need MSSA; it can use it for a quick exit, it will update it if available but it's not necessary; and LPM2 doesn't have it available). The only caveat is spreading the pattern of going through a fake analysis to signal that a pass should be run conditionally.
Note also that adding SLU to LPM2 doesn't give you BFI either, which is available in LPM1.
So, is it an option to add IndVarSimplify to LPM1? (I don't think it makes any changes that would invalidate MSSA, so that should be fairly straightforward to add) Does it address the use case you are looking at? And following up with the same question as for the current approach: what's the compile time vs execution time impact then? i.e. is it worth adding it?

@fhahn
Copy link
Contributor Author

fhahn commented Apr 5, 2024

So, is it an option to add IndVarSimplify to LPM1? (I don't think it makes any changes that would invalidate MSSA, so that should be fairly straightforward to add) Does it address the use case you are looking at? And following up with the same question as for the current approach: what's the compile time vs execution time impact then? i.e. is it worth adding it?

I finally had time to try this out in aff602c, apologies for the long delay.

The impact of moving indVarSimplify to LPM1 is much bigger, both in terms of compile-time and binary changes: https://llvm-compile-time-tracker.com/compare.php?from=1d06f41b72e429a5b3ba318ff639b8b997e21ff8&to=aff602c6f3d9b66d7539d72e9f78adbf49118890&stat=instructions:u. I think a number of optimizations went intoIndVarSimplify to make it work very well at the specific position. There may be some interesting insights from some of the bigger changes caused by moving it, but it seems moving it to LPM1 would be quite a heavy hammer.

Exposing the unswitching logic directly and use that on demand in IndVarSimplify would likely be the most efficient approach. It also compared using a dedicated pass vs directly invoking functionality on demand in #83860 and directly invocation was noticeably faster. Should we go down this route?

@nikic
Copy link
Contributor

nikic commented Apr 10, 2024

In the interest of not making perfect the enemy of good, I'd be happy to go forward with your current approach.

Could you please resolve the comments regarding extra comments and check whether this patch fixes #85551 (and if so add another test)?

Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please resolve the comments regarding extra comments and check whether this patch fixes #85551 (and if so add another test)?

Thanks, I added the comment and check that it fixes #85551. Added a test case in 94ed57d.

bool simplifyUsersOfIV(PHINode *CurrIV, ScalarEvolution *SE, DominatorTree *DT,
LoopInfo *LI, const TargetTransformInfo *TTI,
SmallVectorImpl<WeakTrackingVH> &Dead,
SCEVExpander &Rewriter, IVVisitor *V) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

@@ -47,10 +49,12 @@ class IVVisitor {

/// simplifyUsersOfIV - Simplify instructions that use this induction variable
/// by using ScalarEvolution to analyze the IV's recurrence.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fhahn fhahn merged commit 0f82469 into llvm:main Apr 12, 2024
4 checks passed
@fhahn fhahn deleted the perf/extra-unswitch branch April 12, 2024 21:07
bazuzi pushed a commit to bazuzi/llvm-project that referenced this pull request Apr 15, 2024
…llvm#81271)

IndVars may be able to replace a loop dependent condition with a loop
invariant one, but loop-unswitch runs before IndVars, so the invariant
check remains in the loop.

For an example, consider a read-only loop with a bounds check:
https://godbolt.org/z/8cdj4qhbG

This patch uses a approach similar to the way extra cleanup passes are
run on demand after vectorization (added in acea6e9).

It introduces a new ShouldRunExtraSimpleLoopUnswitch analysis marker,
which IndVars can use to indicate that extra unswitching is beneficial.

ExtraSimpleLoopUnswitchPassManager uses this analysis to determine
whether to run its passes on a loop.

Compile-time impact (geomean) ranges from +0.0% to 0.02%
https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=19e6e99eeb280d426907ea73a21b139ba7225627&stat=instructions%3Au

Compile-time impact (geomean) of unconditionally running
SimpleLoopUnswitch ranges from +0.05% - +0.16%

https://llvm-compile-time-tracker.com/compare.php?from=138c0beb109ffe47f75a0fe8c4dc2cdabe8a6532&to=2930dfd5accdce2e6f8d5146ae4d626add2065a2&stat=instructions:u

Unconditionally running SimpleLoopUnswitch seems to indicate that there
are multiple other scenarios where we fail to run unswitching when
opportunities remain.


Fixes llvm#85551.

PR: llvm#81271
@antmox
Copy link
Contributor

antmox commented Apr 16, 2024

Hi @fhahn,
Looks like this broke clang-aarch64-full-2stage bot : https://lab.llvm.org/buildbot/#/builders/179/builds/9867
Could you please look at this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexing bounds checks are not hoisted out of a loop
6 participants