
Conversation

@nasherm (Contributor) commented Nov 3, 2025

LoopFlatten can sometimes generate loops like the following:

```llvm
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %and = and i64 %index, 4294967295
  %index.next = add i64 %and, 1
  %exit.cond = icmp ugt i64 %index.next, %N
  br i1 %exit.cond, label %end, label %vector.body
```

The AND mask instruction is introduced by LoopFlatten. To enable flattening a loop, this pass attempts to widen the induction variables so that it can compute the new trip count. If widening succeeds, it introduces the AND mask instruction to check that the widened IV doesn't overflow the original type width. This avoids runtime checks on the IV, but in some cases it can slow down loop code considerably because it reduces the effectiveness of auto-vectorization.

A solution already exists: setting loop-flatten-widen-iv=false causes LoopFlatten to fall through to versioning the loop instead (if possible). This patch simply refactors the loop-versioning code into a dedicated function.
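For context, here is a minimal nest of the shape LoopFlatten targets (an illustrative example, not taken from the patch): the IVs are 32-bit, so the flattened trip count N * M may not fit in the IV type, which is exactly why the pass must widen or version.

```cpp
// Illustrative only: with 32-bit IVs the flattened trip count N * M can
// exceed UINT32_MAX, so LoopFlatten must either widen the IV to 64 bits or
// version the loop behind a runtime overflow check.
void zero_matrix(unsigned N, unsigned M, int *A) {
  for (unsigned i = 0; i < N; ++i)
    for (unsigned j = 0; j < M; ++j)
      A[i * M + j] = 0; // the linearised index is the flattened IV
}
```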

@llvmbot (Member) commented Nov 3, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Nashe Mncube (nasherm)

Changes

LoopFlatten can sometimes generate loops like the following:

```llvm
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %and = and i64 %index, 4294967295
  %index.next = add i64 %and, 1
  %exit.cond = icmp ugt i64 %index.next, %N
  br i1 %exit.cond, label %end, label %vector.body
```

The AND mask instruction is introduced by LoopFlatten. To enable flattening a loop, this pass attempts to widen the induction variables so that it can compute the new trip count. If widening succeeds, it introduces the AND mask instruction to check that the widened IV doesn't overflow the original type width. This avoids runtime checks on the IV, but in some cases it can slow down loop code considerably because it reduces the effectiveness of auto-vectorization.

This patch introduces the -loop-flatten-version-over-widen flag to the LoopFlatten pass. When enabled, this optional flag versions the original loop, introducing a runtime check on whether the IV overflows, instead of widening. We find that, combined with other loop-nest-optimization and loop-vectorization flags, it can improve performance on internal auto-vectorization workloads by up to 23% on AArch64.

Change-Id: I94572e65411cfeca3f617c60148f1c02500ab056
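In rough C++ terms, the versioned structure behaves as follows (a sketch of the semantics, assuming 32-bit trip counts as in the tests below, not the pass's actual output):

```cpp
#include <cstdint>

// Sketch: the pass emits @llvm.umul.with.overflow on the two trip counts and
// branches to the original nest whenever the flattened trip count would wrap.
void run_flattened_or_original(uint32_t OuterTC, uint32_t InnerTC) {
  uint64_t Wide = uint64_t(OuterTC) * InnerTC; // flatten.mul
  bool Overflow = Wide > UINT32_MAX;           // flatten.overflow
  uint32_t NewTripCount = uint32_t(Wide);      // flatten.tripcount
  if (Overflow) {
    // take the .lver.orig blocks: run the original, unflattened nest
  } else {
    // run the flattened single loop of NewTripCount iterations
    (void)NewTripCount;
  }
}
```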


Patch is 29.30 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/166156.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopFlatten.cpp (+46-26)
  • (modified) llvm/test/Transforms/LoopFlatten/loop-flatten-version.ll (+259)
```diff
diff --git a/llvm/lib/Transforms/Scalar/LoopFlatten.cpp b/llvm/lib/Transforms/Scalar/LoopFlatten.cpp
index 04039b885f3c5..05d414811eabe 100644
--- a/llvm/lib/Transforms/Scalar/LoopFlatten.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopFlatten.cpp
@@ -102,6 +102,10 @@ static cl::opt<bool>
     VersionLoops("loop-flatten-version-loops", cl::Hidden, cl::init(true),
                  cl::desc("Version loops if flattened loop could overflow"));
 
+static cl::opt<bool> VersionLoopsOverWiden(
+    "loop-flatten-version-over-widen", cl::Hidden, cl::init(false),
+    cl::desc("Version loops and generate runtime checks over widening the IV"));
+
 namespace {
 // We require all uses of both induction variables to match this pattern:
 //
@@ -835,14 +839,52 @@ static bool DoFlattenLoopPair(FlattenInfo &FI, DominatorTree *DT, LoopInfo *LI,
   return true;
 }
 
+static bool VersionLoop(FlattenInfo &FI, DominatorTree *DT, LoopInfo *LI,
+                        ScalarEvolution *SE, const LoopAccessInfo &LAI) {
+
+  // Version the loop. The overflow check isn't a runtime pointer check, so we
+  // pass an empty list of runtime pointer checks, causing LoopVersioning to
+  // emit 'false' as the branch condition, and add our own check afterwards.
+  BasicBlock *CheckBlock = FI.OuterLoop->getLoopPreheader();
+  ArrayRef<RuntimePointerCheck> Checks(nullptr, nullptr);
+  LoopVersioning LVer(LAI, Checks, FI.OuterLoop, LI, DT, SE);
+  LVer.versionLoop();
+
+  // Check for overflow by calculating the new tripcount using
+  // umul_with_overflow and then checking if it overflowed.
+  BranchInst *Br = cast<BranchInst>(CheckBlock->getTerminator());
+  if (!Br->isConditional())
+    return false;
+  if (!match(Br->getCondition(), m_Zero()))
+    return false;
+  IRBuilder<> Builder(Br);
+  Value *Call = Builder.CreateIntrinsic(Intrinsic::umul_with_overflow,
+                                        FI.OuterTripCount->getType(),
+                                        {FI.OuterTripCount, FI.InnerTripCount},
+                                        /*FMFSource=*/nullptr, "flatten.mul");
+  FI.NewTripCount = Builder.CreateExtractValue(Call, 0, "flatten.tripcount");
+  Value *Overflow = Builder.CreateExtractValue(Call, 1, "flatten.overflow");
+  Br->setCondition(Overflow);
+  return true;
+}
+
 static bool CanWidenIV(FlattenInfo &FI, DominatorTree *DT, LoopInfo *LI,
                        ScalarEvolution *SE, AssumptionCache *AC,
-                       const TargetTransformInfo *TTI) {
+                       const TargetTransformInfo *TTI,
+                       const LoopAccessInfo &LAI) {
   if (!WidenIV) {
     LLVM_DEBUG(dbgs() << "Widening the IVs is disabled\n");
     return false;
   }
 
+  // TODO: don't bother widening IV's if know that they
+  // can't overflow. If they can overflow opt for versioning
+  // the loop and remove requirement to truncate when using
+  // IV in the loop
+  if (VersionLoopsOverWiden)
+    if (VersionLoop(FI, DT, LI, SE, LAI))
+      return true;
+
   LLVM_DEBUG(dbgs() << "Try widening the IVs\n");
   Module *M = FI.InnerLoop->getHeader()->getParent()->getParent();
   auto &DL = M->getDataLayout();
@@ -916,7 +958,8 @@ static bool FlattenLoopPair(FlattenInfo &FI, DominatorTree *DT, LoopInfo *LI,
     return false;
 
   // Check if we can widen the induction variables to avoid overflow checks.
-  bool CanFlatten = CanWidenIV(FI, DT, LI, SE, AC, TTI);
+  // TODO: widening doesn't remove overflow checks in practice
+  bool CanFlatten = CanWidenIV(FI, DT, LI, SE, AC, TTI, LAI);
 
   // It can happen that after widening of the IV, flattening may not be
   // possible/happening, e.g. when it is deemed unprofitable. So bail here if
@@ -961,30 +1004,7 @@ static bool FlattenLoopPair(FlattenInfo &FI, DominatorTree *DT, LoopInfo *LI,
       return false;
     }
     LLVM_DEBUG(dbgs() << "Multiply might overflow, versioning loop\n");
-
-    // Version the loop. The overflow check isn't a runtime pointer check, so we
-    // pass an empty list of runtime pointer checks, causing LoopVersioning to
-    // emit 'false' as the branch condition, and add our own check afterwards.
-    BasicBlock *CheckBlock = FI.OuterLoop->getLoopPreheader();
-    ArrayRef<RuntimePointerCheck> Checks(nullptr, nullptr);
-    LoopVersioning LVer(LAI, Checks, FI.OuterLoop, LI, DT, SE);
-    LVer.versionLoop();
-
-    // Check for overflow by calculating the new tripcount using
-    // umul_with_overflow and then checking if it overflowed.
-    BranchInst *Br = cast<BranchInst>(CheckBlock->getTerminator());
-    assert(Br->isConditional() &&
-           "Expected LoopVersioning to generate a conditional branch");
-    assert(match(Br->getCondition(), m_Zero()) &&
-           "Expected branch condition to be false");
-    IRBuilder<> Builder(Br);
-    Value *Call = Builder.CreateIntrinsic(
-        Intrinsic::umul_with_overflow, FI.OuterTripCount->getType(),
-        {FI.OuterTripCount, FI.InnerTripCount},
-        /*FMFSource=*/nullptr, "flatten.mul");
-    FI.NewTripCount = Builder.CreateExtractValue(Call, 0, "flatten.tripcount");
-    Value *Overflow = Builder.CreateExtractValue(Call, 1, "flatten.overflow");
-    Br->setCondition(Overflow);
+    assert(VersionLoop(FI, DT, LI, SE, LAI) && "Failed to version loop");
   } else {
     LLVM_DEBUG(dbgs() << "Multiply cannot overflow, modifying loop in-place\n");
   }
diff --git a/llvm/test/Transforms/LoopFlatten/loop-flatten-version.ll b/llvm/test/Transforms/LoopFlatten/loop-flatten-version.ll
index 85072bf3a43f4..1de31d2c7c70d 100644
--- a/llvm/test/Transforms/LoopFlatten/loop-flatten-version.ll
+++ b/llvm/test/Transforms/LoopFlatten/loop-flatten-version.ll
@@ -1,5 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
 ; RUN: opt %s -S -passes='loop(loop-flatten),verify' -verify-loop-info -verify-dom-info -verify-scev -o - | FileCheck %s
+; RUN: opt %s -S -passes='loop(loop-flatten),verify' -loop-flatten-version-over-widen -verify-loop-info -verify-dom-info -verify-scev -o - | FileCheck %s --check-prefix=CHECK-VERSION-OVER-WIDEN
 
 target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
 
@@ -61,6 +62,62 @@ define void @noinbounds_gep(i32 %N, ptr %A) {
 ; CHECK:       for.end:
 ; CHECK-NEXT:    ret void
 ;
+; CHECK-VERSION-OVER-WIDEN-LABEL: define void @noinbounds_gep(
+; CHECK-VERSION-OVER-WIDEN-SAME: i32 [[N:%.*]], ptr [[A:%.*]]) {
+; CHECK-VERSION-OVER-WIDEN-NEXT:  entry:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP3:%.*]] = icmp ult i32 0, [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP3]], label [[FOR_INNER_PREHEADER_LVER_CHECK:%.*]], label [[FOR_END:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.lver.check:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_MUL:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[N]], i32 [[N]])
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_TRIPCOUNT:%.*]] = extractvalue { i32, i1 } [[FLATTEN_MUL]], 0
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[FLATTEN_MUL]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[FLATTEN_OVERFLOW]], label [[FOR_INNER_PREHEADER_PH_LVER_ORIG:%.*]], label [[FOR_INNER_PREHEADER_PH:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.ph.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER_PREHEADER_LVER_ORIG:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[I_LVER_ORIG:%.*]] = phi i32 [ 0, [[FOR_INNER_PREHEADER_PH_LVER_ORIG]] ], [ [[INC2_LVER_ORIG:%.*]], [[FOR_OUTER_LVER_ORIG:%.*]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER_LVER_ORIG:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[J_LVER_ORIG:%.*]] = phi i32 [ 0, [[FOR_INNER_PREHEADER_LVER_ORIG]] ], [ [[INC1_LVER_ORIG:%.*]], [[FOR_INNER_LVER_ORIG]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[MUL_LVER_ORIG:%.*]] = mul i32 [[I_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[GEP_LVER_ORIG:%.*]] = getelementptr i32, ptr [[A]], i32 [[MUL_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[ARRAYIDX_LVER_ORIG:%.*]] = getelementptr i32, ptr [[GEP_LVER_ORIG]], i32 [[J_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    store i32 0, ptr [[ARRAYIDX_LVER_ORIG]], align 4
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC1_LVER_ORIG]] = add nuw i32 [[J_LVER_ORIG]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP2_LVER_ORIG:%.*]] = icmp ult i32 [[INC1_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP2_LVER_ORIG]], label [[FOR_INNER_LVER_ORIG]], label [[FOR_OUTER_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN:       for.outer.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC2_LVER_ORIG]] = add i32 [[I_LVER_ORIG]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP1_LVER_ORIG:%.*]] = icmp ult i32 [[INC2_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP1_LVER_ORIG]], label [[FOR_INNER_PREHEADER_LVER_ORIG]], label [[FOR_END_LOOPEXIT_LOOPEXIT:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.ph:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER_PREHEADER:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[I:%.*]] = phi i32 [ 0, [[FOR_INNER_PREHEADER_PH]] ], [ [[INC2:%.*]], [[FOR_OUTER:%.*]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_ARRAYIDX:%.*]] = getelementptr i32, ptr [[A]], i32 [[I]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[J:%.*]] = phi i32 [ 0, [[FOR_INNER_PREHEADER]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[MUL:%.*]] = mul i32 [[I]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[A]], i32 [[MUL]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[ARRAYIDX:%.*]] = getelementptr i32, ptr [[GEP]], i32 [[J]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    store i32 0, ptr [[FLATTEN_ARRAYIDX]], align 4
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC1:%.*]] = add nuw i32 [[J]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP2:%.*]] = icmp ult i32 [[INC1]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_OUTER]]
+; CHECK-VERSION-OVER-WIDEN:       for.outer:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC2]] = add i32 [[I]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP1:%.*]] = icmp ult i32 [[INC2]], [[FLATTEN_TRIPCOUNT]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP1]], label [[FOR_INNER_PREHEADER]], label [[FOR_END_LOOPEXIT_LOOPEXIT1:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.end.loopexit.loopexit:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_END_LOOPEXIT:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.end.loopexit.loopexit1:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_END_LOOPEXIT]]
+; CHECK-VERSION-OVER-WIDEN:       for.end.loopexit:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_END]]
+; CHECK-VERSION-OVER-WIDEN:       for.end:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    ret void
+;
 entry:
   %cmp3 = icmp ult i32 0, %N
   br i1 %cmp3, label %for.outer.preheader, label %for.end
@@ -124,6 +181,62 @@ define void @noinbounds_gep_too_large_mul(i64 %N, ptr %A) {
 ; CHECK:       for.end:
 ; CHECK-NEXT:    ret void
 ;
+; CHECK-VERSION-OVER-WIDEN-LABEL: define void @noinbounds_gep_too_large_mul(
+; CHECK-VERSION-OVER-WIDEN-SAME: i64 [[N:%.*]], ptr [[A:%.*]]) {
+; CHECK-VERSION-OVER-WIDEN-NEXT:  entry:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP3:%.*]] = icmp ult i64 0, [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP3]], label [[FOR_INNER_PREHEADER_LVER_CHECK:%.*]], label [[FOR_END:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.lver.check:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_MUL:%.*]] = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 [[N]], i64 [[N]])
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_TRIPCOUNT:%.*]] = extractvalue { i64, i1 } [[FLATTEN_MUL]], 0
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_OVERFLOW:%.*]] = extractvalue { i64, i1 } [[FLATTEN_MUL]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[FLATTEN_OVERFLOW]], label [[FOR_INNER_PREHEADER_PH_LVER_ORIG:%.*]], label [[FOR_INNER_PREHEADER_PH:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.ph.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER_PREHEADER_LVER_ORIG:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[I_LVER_ORIG:%.*]] = phi i64 [ 0, [[FOR_INNER_PREHEADER_PH_LVER_ORIG]] ], [ [[INC2_LVER_ORIG:%.*]], [[FOR_OUTER_LVER_ORIG:%.*]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER_LVER_ORIG:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[J_LVER_ORIG:%.*]] = phi i64 [ 0, [[FOR_INNER_PREHEADER_LVER_ORIG]] ], [ [[INC1_LVER_ORIG:%.*]], [[FOR_INNER_LVER_ORIG]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[MUL_LVER_ORIG:%.*]] = mul i64 [[I_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[GEP_LVER_ORIG:%.*]] = getelementptr i32, ptr [[A]], i64 [[MUL_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[ARRAYIDX_LVER_ORIG:%.*]] = getelementptr i32, ptr [[GEP_LVER_ORIG]], i64 [[J_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    store i32 0, ptr [[ARRAYIDX_LVER_ORIG]], align 4
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC1_LVER_ORIG]] = add nuw i64 [[J_LVER_ORIG]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP2_LVER_ORIG:%.*]] = icmp ult i64 [[INC1_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP2_LVER_ORIG]], label [[FOR_INNER_LVER_ORIG]], label [[FOR_OUTER_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN:       for.outer.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC2_LVER_ORIG]] = add i64 [[I_LVER_ORIG]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP1_LVER_ORIG:%.*]] = icmp ult i64 [[INC2_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP1_LVER_ORIG]], label [[FOR_INNER_PREHEADER_LVER_ORIG]], label [[FOR_END_LOOPEXIT_LOOPEXIT:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader.ph:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER_PREHEADER:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner.preheader:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[I:%.*]] = phi i64 [ 0, [[FOR_INNER_PREHEADER_PH]] ], [ [[INC2:%.*]], [[FOR_OUTER:%.*]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_ARRAYIDX:%.*]] = getelementptr i32, ptr [[A]], i64 [[I]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_INNER:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.inner:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[J:%.*]] = phi i64 [ 0, [[FOR_INNER_PREHEADER]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[MUL:%.*]] = mul i64 [[I]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[A]], i64 [[MUL]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[ARRAYIDX:%.*]] = getelementptr i32, ptr [[GEP]], i64 [[J]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    store i32 0, ptr [[FLATTEN_ARRAYIDX]], align 4
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC1:%.*]] = add nuw i64 [[J]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP2:%.*]] = icmp ult i64 [[INC1]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_OUTER]]
+; CHECK-VERSION-OVER-WIDEN:       for.outer:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC2]] = add i64 [[I]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP1:%.*]] = icmp ult i64 [[INC2]], [[FLATTEN_TRIPCOUNT]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP1]], label [[FOR_INNER_PREHEADER]], label [[FOR_END_LOOPEXIT_LOOPEXIT1:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.end.loopexit.loopexit:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_END_LOOPEXIT:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.end.loopexit.loopexit1:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_END_LOOPEXIT]]
+; CHECK-VERSION-OVER-WIDEN:       for.end.loopexit:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_END]]
+; CHECK-VERSION-OVER-WIDEN:       for.end:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    ret void
+;
 entry:
   %cmp3 = icmp ult i64 0, %N
   br i1 %cmp3, label %for.outer.preheader, label %for.end
@@ -238,6 +351,79 @@ define void @d3_2(ptr %A, i32 %N, i32 %M) {
 ; CHECK:       for.cond.cleanup:
 ; CHECK-NEXT:    ret void
 ;
+; CHECK-VERSION-OVER-WIDEN-LABEL: define void @d3_2(
+; CHECK-VERSION-OVER-WIDEN-SAME: ptr [[A:%.*]], i32 [[N:%.*]], i32 [[M:%.*]]) {
+; CHECK-VERSION-OVER-WIDEN-NEXT:  entry:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP30:%.*]] = icmp sgt i32 [[N]], 0
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP30]], label [[FOR_COND1_PREHEADER_LR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond1.preheader.lr.ph:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[CMP625:%.*]] = icmp sgt i32 [[M]], 0
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_COND1_PREHEADER_US:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond1.preheader.us:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[K_031_US:%.*]] = phi i32 [ 0, [[FOR_COND1_PREHEADER_LR_PH]] ], [ [[INC13_US:%.*]], [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US:%.*]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[CMP625]], label [[FOR_COND5_PREHEADER_US_US_LVER_CHECK:%.*]], label [[FOR_COND5_PREHEADER_US43_PREHEADER:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond5.preheader.us43.preheader:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_LOOPEXIT50:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond5.preheader.us.us.lver.check:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_MUL:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[N]], i32 [[M]])
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_TRIPCOUNT:%.*]] = extractvalue { i32, i1 } [[FLATTEN_MUL]], 0
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[FLATTEN_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[FLATTEN_MUL]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[FLATTEN_OVERFLOW]], label [[FOR_COND5_PREHEADER_US_US_PH_LVER_ORIG:%.*]], label [[FOR_COND5_PREHEADER_US_US_PH:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond5.preheader.us.us.ph.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_COND5_PREHEADER_US_US_LVER_ORIG:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond5.preheader.us.us.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[I_028_US_US_LVER_ORIG:%.*]] = phi i32 [ [[INC10_US_US_LVER_ORIG:%.*]], [[FOR_COND5_FOR_COND_CLEANUP7_CRIT_EDGE_US_US_LVER_ORIG:%.*]] ], [ 0, [[FOR_COND5_PREHEADER_US_US_PH_LVER_ORIG]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[MUL_US_US_LVER_ORIG:%.*]] = mul nsw i32 [[I_028_US_US_LVER_ORIG]], [[M]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_BODY8_US_US_LVER_ORIG:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.body8.us.us.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[J_026_US_US_LVER_ORIG:%.*]] = phi i32 [ 0, [[FOR_COND5_PREHEADER_US_US_LVER_ORIG]] ], [ [[INC_US_US_LVER_ORIG:%.*]], [[FOR_BODY8_US_US_LVER_ORIG]] ]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[ADD_US_US_LVER_ORIG:%.*]] = add nsw i32 [[J_026_US_US_LVER_ORIG]], [[MUL_US_US_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[IDXPROM_US_US_LVER_ORIG:%.*]] = sext i32 [[ADD_US_US_LVER_ORIG]] to i64
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[ARRAYIDX_US_US_LVER_ORIG:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IDXPROM_US_US_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    tail call void @f(ptr [[ARRAYIDX_US_US_LVER_ORIG]])
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC_US_US_LVER_ORIG]] = add nuw nsw i32 [[J_026_US_US_LVER_ORIG]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[EXITCOND_LVER_ORIG:%.*]] = icmp ne i32 [[INC_US_US_LVER_ORIG]], [[M]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[EXITCOND_LVER_ORIG]], label [[FOR_BODY8_US_US_LVER_ORIG]], label [[FOR_COND5_FOR_COND_CLEANUP7_CRIT_EDGE_US_US_LVER_ORIG]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond5.for.cond.cleanup7_crit_edge.us.us.lver.orig:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[INC10_US_US_LVER_ORIG]] = add nuw nsw i32 [[I_028_US_US_LVER_ORIG]], 1
+; CHECK-VERSION-OVER-WIDEN-NEXT:    [[EXITCOND51_LVER_ORIG:%.*]] = icmp ne i32 [[INC10_US_US_LVER_ORIG]], [[N]]
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br i1 [[EXITCOND51_LVER_ORIG]], label [[FOR_COND5_PREHEADER_US_US_LVER_ORIG]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_LOOPEXIT_LOOPEXIT:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond5.preheader.us.us.ph:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_COND5_PREHEADER_US_US:%.*]]
+; CHECK-VERSION-OVER-WIDEN:       for.cond1.for.cond.cleanup3_crit_edge.us.loopexit.loopexit:
+; CHECK-VERSION-OVER-WIDEN-NEXT:    br label [[FOR_COND1_F...
```
[truncated]

Copilot AI (Contributor) left a comment
Pull Request Overview

This PR adds a new command-line option -loop-flatten-version-over-widen to the LoopFlatten pass that enables loop versioning with runtime overflow checks instead of widening induction variables. The change aims to improve auto-vectorization performance by avoiding AND mask instructions that can hinder vectorization effectiveness.

  • Introduces the new -loop-flatten-version-over-widen flag for optional loop versioning behavior
  • Refactors loop versioning logic into a separate VersionLoop function
  • Modifies CanWidenIV to conditionally use versioning when the new flag is enabled

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Files reviewed:

  • llvm/test/Transforms/LoopFlatten/loop-flatten-version.ll: Adds test cases with CHECK-VERSION-OVER-WIDEN expectations for the new versioning functionality
  • llvm/lib/Transforms/Scalar/LoopFlatten.cpp: Implements the new flag, refactors the versioning logic, and integrates the versioning option into CanWidenIV


@nasherm force-pushed the nashe/loop-flatten-refactor branch from 2757ed8 to 386c9c0 on November 3, 2025, 11:59
Change-Id: I7976117394354ec20fbd4398a245c06f73d64a41
Change-Id: I996ba838554494fe750fb0824844098d21ea8a52
@john-brawn-arm (Collaborator) commented:
It looks like this interacts strangely with loops that are guaranteed to overflow. If we have

```c
#define N 1000000
void fn(unsigned int *p) {
  for (unsigned int i = 0; i < N; i++)
    for (unsigned int j = 0; j < N; j++)
      p[i*N + j] += 1;
}
```

Then what happens with loop-flatten-version-over-widen=true is:

  • The loop gets versioned by loop flattening
  • Instcombine sees that the overflow check is always true and removes it
  • Because the branch condition is true, the flattened version gets removed
  • The end result is as if loop flattening never happened

It looks like the call to checkOverflow probably needs to happen before we decide whether to version due to VersionLoopsOverWiden.
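For reference, the overflow in this example is a compile-time fact (a quick standalone check, not part of the patch):

```cpp
#include <cstdint>

// With N = 1000000 the flattened trip count is N*N = 1e12, but the IVs are
// 32-bit: UINT32_MAX is roughly 4.29e9, so @llvm.umul.with.overflow.i32(N, N)
// always sets its overflow flag, the versioned branch folds to a constant,
// and the flattened copy becomes dead code.
static_assert(1000000ULL * 1000000ULL > UINT32_MAX,
              "the flattened trip count cannot fit in 32 bits");
```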

@amehsan (Contributor) commented Nov 3, 2025

> It looks like the call to checkOverflow probably needs to happen before we decide whether to version due to VersionLoopsOverWiden.

Don't we get this behavior if we just set loop-flatten-widen-iv to false and let loop-flatten-version-loops be true?

(EDIT: some changes may be needed, but it seems doable as versioning is already implemented.)

@sjoerdmeijer (Collaborator) commented:

The idea of versioning makes sense. When we looked at this and thought about how to ensure correctness, there were two options: widening the IV, and versioning. We chose the widening approach, as it was the easiest and enough for our use cases. If I recall correctly, it has one big disadvantage: we widen first, and the final decision to flatten or not is made later. This can result in widening the IV while the transformation may not succeed, which isn't great. This approach may avoid that. There's one thing that surprises me, though: I don't think I have seen, or can't remember, this AND instruction. I will look into things and this patch, but thought I'd ask this question here before I do that.

Comment on lines +861 to +867

```cpp
  Value *Call = Builder.CreateIntrinsic(Intrinsic::umul_with_overflow,
                                        FI.OuterTripCount->getType(),
                                        {FI.OuterTripCount, FI.InnerTripCount},
                                        /*FMFSource=*/nullptr, "flatten.mul");
  FI.NewTripCount = Builder.CreateExtractValue(Call, 0, "flatten.tripcount");
  Value *Overflow = Builder.CreateExtractValue(Call, 1, "flatten.overflow");
  Br->setCondition(Overflow);
```
A Contributor left a review comment:
Independent of the change here, but it seems like the loop-versioning interface could do with some refactoring to clean things up.

LAI in LoopFlatten is only used to pass to LoopVersioning, where it isn't actually used; LoopVersioning should provide an interface that doesn't require passing it.

Not sure if that would be possible, but perhaps the wrapping predicates could reuse the logic in PredicatedScalarEvolution as well?
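One possible shape for such an interface (purely illustrative; no such overload exists in LoopVersioning today):

```cpp
// Hypothetical sketch: let callers such as LoopFlatten version a loop and
// supply their own branch condition, with no LoopAccessInfo involved.
namespace llvm {
class Loop;
class LoopInfo;
class DominatorTree;
class ScalarEvolution;
class Value;
} // namespace llvm

namespace sketch {
class LoopVersioning {
public:
  // Clone L into "versioned" and "original" copies.
  LoopVersioning(llvm::Loop *L, llvm::LoopInfo *LI, llvm::DominatorTree *DT,
                 llvm::ScalarEvolution *SE);
  // Guard the copies on the caller's condition, e.g. the umul overflow flag.
  void versionLoop(llvm::Value *Cond);
};
} // namespace sketch
```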

Make patch an NFC refactor

Change-Id: I0155ae8e31ebf1e2f30cca89e201449a926dc192
@nasherm changed the title from "[LoopFlatten] Add option to version loops instead of widening IVs" to "[NFC][LoopFlatten] Lift loop versioning to function" on Nov 10, 2025
@nasherm (Contributor, Author) commented Nov 10, 2025

@amehsan pointed out that we can achieve the same goal as this flag by setting loop-flatten-widen-iv=false and loop-flatten-version-loops=true. I've verified that this works, which simplifies this patch to an NFC refactor. I've updated the description and title to reflect this.

By default, this should also resolve @john-brawn-arm's comment about the weird behavior when working with always-overflowing loops.
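For reference, that flag combination in the style of the test file's RUN lines (a sketch; both options already exist in the pass):

```llvm
; Sketch, mirroring the RUN lines in loop-flatten-version.ll: disable IV
; widening and rely on the existing versioning path instead.
; RUN: opt %s -S -passes='loop(loop-flatten),verify' \
; RUN:   -loop-flatten-widen-iv=false -loop-flatten-version-loops=true \
; RUN:   -verify-loop-info -verify-dom-info -verify-scev -o - | FileCheck %s
```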

Change-Id: Ibea6a2e2a18c080e20dced1a9d5ffd7f57b122e4
@nasherm closed this on Nov 20, 2025
