[VPlan] Delay adding canonical IV increment and exit branches. #82270

Open
fhahn wants to merge 4 commits into main
Conversation

@fhahn (Contributor) commented Feb 19, 2024

This patch delays adding the canonical IV increment, the computation of the next active lane mask, and the branch recipes that exit the vector loop.

During initial construction of a VPlan, only add the canonical IV phi and active-lane-mask phis if needed. Similarly, do not add the branches to exit the loop initially. Computing the next IV value, the next active-lane mask, and the exit branches are details that are only needed to simplify codegen (the execution of the individual recipes).

This makes the initial VPlans more abstract (and simpler), in that we initially leave out some details. Initial VPlans still have the canonical induction recipe, which provides the value of the canonical induction at every iteration, but we leave out the detail of how it is computed, which is not needed until code generation.

Similar reasoning applies to active lane mask increments and branch recipes. The vector loop region initially abstractly models the loop processing all vector iterations, without specifying how exactly exiting the region is handled. Again these details are only needed for code generation.

To introduce the missing pieces and complete the abstract VPlan, a new prepareToExecute transform is added and run just before executing the VPlan.

This is an attempt to further employ gradual lowering, as outlined in https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf and already applied for replicate region handling.
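
For a rough picture of the new lowering step, here is a condensed sketch of the added transform (taken from the patch below; the active-lane-mask path is omitted):

```c++
// Complete the abstract plan just before codegen: add the canonical IV
// increment and the latch branch that initial VPlan construction no longer
// emits. (DataAndControlFlowWithoutRuntimeCheck only matters for the omitted
// active-lane-mask path.)
void VPlanTransforms::prepareToExecute(
    VPlan &Plan, bool HasNUW, bool DataAndControlFlowWithoutRuntimeCheck) {
  auto *CanIV = Plan.getCanonicalIV();
  VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();
  VPBuilder Builder(EB);
  DebugLoc DL = CanIV->getDebugLoc();

  // index.next = index + VF * UF
  auto *Inc = Builder.createOverflowingOp(Instruction::Add,
                                          {CanIV, &Plan.getVFxUF()},
                                          {HasNUW, false}, DL, "index.next");
  CanIV->addOperand(Inc); // The phi now has its backedge value.

  // Exit the vector loop once the vector trip count is reached.
  Builder.createNaryOp(VPInstruction::BranchOnCount,
                       {Inc, &Plan.getVectorTripCount()}, DL);
}
```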

@llvmbot (Collaborator) commented Feb 19, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Florian Hahn (fhahn)

Changes

Patch is 88.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/82270.diff

27 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+9-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+69-40)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+21-19)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-masked-accesses.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-overflow-checks.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-reductions.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-unroll.ll (+26-26)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll (+10-10)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-gep.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/synthesize-mask-for-call.ll (-24)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll (+12-12)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/widen-call-with-intrinsic-or-libfunc.ll (-9)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (-8)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/icmp-uniforms.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-dot-printing.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-iv-transforms.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+15-15)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge-vf1.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-unused-interleave-group.ll (+1-1)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index eca901fcdae4ce..8b5590c3400f48 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7452,6 +7452,14 @@ LoopVectorizationPlanner::executePlan(
   LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF << ", UF=" << BestUF
                     << '\n');
 
+  TailFoldingStyle Style =
+      CM.getTailFoldingStyle(!isIndvarOverflowCheckKnownFalse(&CM, BestVF));
+  // When not folding the tail, we know that the induction increment will not
+  // overflow.
+  bool HasNUW = Style == TailFoldingStyle::None;
+  bool WithoutRuntimeCheck =
+      Style == TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck;
+  VPlanTransforms::prepareToExecute(BestVPlan, HasNUW, WithoutRuntimeCheck);
   if (!IsEpilogueVectorization)
     VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
 
@@ -8509,17 +8517,6 @@ static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
   VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();
   VPBasicBlock *Header = TopRegion->getEntryBasicBlock();
   Header->insert(CanonicalIVPHI, Header->begin());
-
-  VPBuilder Builder(TopRegion->getExitingBasicBlock());
-  // Add a VPInstruction to increment the scalar canonical IV by VF * UF.
-  auto *CanonicalIVIncrement = Builder.createOverflowingOp(
-      Instruction::Add, {CanonicalIVPHI, &Plan.getVFxUF()}, {HasNUW, false}, DL,
-      "index.next");
-  CanonicalIVPHI->addOperand(CanonicalIVIncrement);
-
-  // Add the BranchOnCount VPInstruction to the latch.
-  Builder.createNaryOp(VPInstruction::BranchOnCount,
-                       {CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
 }
 
 // Add exit values to \p Plan. VPLiveOuts are added for each LCSSA phi in the
@@ -9005,7 +9002,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       PreviousLink = RedRecipe;
     }
   }
-  Builder.setInsertPoint(&*LatchVPBB->begin());
+  Builder.setInsertPoint(LatchVPBB, LatchVPBB->begin());
   for (VPRecipeBase &R :
        Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
     VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 42e3b4a003511b..cfc4161adc0c82 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2363,7 +2363,8 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {
 
   VPRecipeBase *clone() override {
     auto *R = new VPCanonicalIVPHIRecipe(getOperand(0), getDebugLoc());
-    R->addOperand(getBackedgeValue());
+    if (getNumOperands() == 2)
+      R->addOperand(getBackedgeValue());
     return R;
   }
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 3d443421024202..965ea0386b9475 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1109,16 +1109,10 @@ void VPlanTransforms::optimize(VPlan &Plan, ScalarEvolution &SE) {
 static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
     VPlan &Plan, bool DataAndControlFlowWithoutRuntimeCheck) {
   VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();
-  VPBasicBlock *EB = TopRegion->getExitingBasicBlock();
   auto *CanonicalIVPHI = Plan.getCanonicalIV();
   VPValue *StartV = CanonicalIVPHI->getStartValue();
 
-  auto *CanonicalIVIncrement =
-      cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());
-  // TODO: Check if dropping the flags is needed if
-  // !DataAndControlFlowWithoutRuntimeCheck.
-  CanonicalIVIncrement->dropPoisonGeneratingFlags();
-  DebugLoc DL = CanonicalIVIncrement->getDebugLoc();
+  DebugLoc DL = CanonicalIVPHI->getDebugLoc();
   // We can't use StartV directly in the ActiveLaneMask VPInstruction, since
   // we have to take unrolling into account. Each part needs to start at
   //   Part * VF
@@ -1128,21 +1122,6 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
   // Create the ActiveLaneMask instruction using the correct start values.
   VPValue *TC = Plan.getTripCount();
 
-  VPValue *TripCount, *IncrementValue;
-  if (!DataAndControlFlowWithoutRuntimeCheck) {
-    // When the loop is guarded by a runtime overflow check for the loop
-    // induction variable increment by VF, we can increment the value before
-    // the get.active.lane mask and use the unmodified tripcount.
-    IncrementValue = CanonicalIVIncrement;
-    TripCount = TC;
-  } else {
-    // When avoiding a runtime check, the active.lane.mask inside the loop
-    // uses a modified trip count and the induction variable increment is
-    // done after the active.lane.mask intrinsic is called.
-    IncrementValue = CanonicalIVPHI;
-    TripCount = Builder.createNaryOp(VPInstruction::CalculateTripCountMinusVF,
-                                     {TC}, DL);
-  }
   auto *EntryIncrement = Builder.createOverflowingOp(
       VPInstruction::CanonicalIVIncrementForPart, {StartV}, {false, false}, DL,
       "index.part.next");
@@ -1156,24 +1135,6 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
   // preheader ActiveLaneMask instruction.
   auto LaneMaskPhi = new VPActiveLaneMaskPHIRecipe(EntryALM, DebugLoc());
   LaneMaskPhi->insertAfter(CanonicalIVPHI);
-
-  // Create the active lane mask for the next iteration of the loop before the
-  // original terminator.
-  VPRecipeBase *OriginalTerminator = EB->getTerminator();
-  Builder.setInsertPoint(OriginalTerminator);
-  auto *InLoopIncrement =
-      Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
-                                  {IncrementValue}, {false, false}, DL);
-  auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
-                                   {InLoopIncrement, TripCount}, DL,
-                                   "active.lane.mask.next");
-  LaneMaskPhi->addOperand(ALM);
-
-  // Replace the original terminator with BranchOnCond. We have to invert the
-  // mask here because a true condition means jumping to the exit block.
-  auto *NotMask = Builder.createNot(ALM, DL);
-  Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
-  OriginalTerminator->eraseFromParent();
   return LaneMaskPhi;
 }
 
@@ -1304,3 +1265,71 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
     }
   }
 }
+
+void VPlanTransforms::prepareToExecute(
+    VPlan &Plan, bool HasNUW, bool DataAndControlFlowWithoutRuntimeCheck) {
+  auto *CanIV = Plan.getCanonicalIV();
+
+  VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();
+  VPBuilder Builder(EB);
+  DebugLoc DL = CanIV->getDebugLoc();
+  // Add a VPInstruction to increment the scalar canonical IV by VF * UF.
+  auto *CanonicalIVIncrement =
+      Builder.createOverflowingOp(Instruction::Add, {CanIV, &Plan.getVFxUF()},
+                                  {HasNUW, false}, DL, "index.next");
+
+  CanIV->addOperand(CanonicalIVIncrement);
+
+  auto FoundLaneMaskPhi = find_if(
+      Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
+      [](VPRecipeBase &P) { return isa<VPActiveLaneMaskPHIRecipe>(P); });
+
+  if (FoundLaneMaskPhi ==
+      Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis().end()) {
+    // Add the BranchOnCount VPInstruction to the latch.
+    Builder.createNaryOp(VPInstruction::BranchOnCount,
+                         {CanonicalIVIncrement, &Plan.getVectorTripCount()},
+                         DL);
+    return;
+  }
+  auto *LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&*FoundLaneMaskPhi);
+  auto *VecPreheader =
+      cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSinglePredecessor());
+  Builder.setInsertPoint(VecPreheader);
+
+  VPValue *TC = Plan.getTripCount();
+
+  // TODO: Check if dropping the flags is needed if
+  // !DataAndControlFlowWithoutRuntimeCheck.
+  CanonicalIVIncrement->dropPoisonGeneratingFlags();
+  VPValue *TripCount, *IncrementValue;
+  if (!DataAndControlFlowWithoutRuntimeCheck) {
+    // When the loop is guarded by a runtime overflow check for the loop
+    // induction variable increment by VF, we can increment the value before
+    // the get.active.lane mask and use the unmodified tripcount.
+    IncrementValue = CanonicalIVIncrement;
+    TripCount = TC;
+  } else {
+    // When avoiding a runtime check, the active.lane.mask inside the loop
+    // uses a modified trip count and the induction variable increment is
+    // done after the active.lane.mask intrinsic is called.
+    IncrementValue = CanIV;
+    TripCount = Builder.createNaryOp(VPInstruction::CalculateTripCountMinusVF,
+                                     {TC}, DL);
+  }
+  // Create the active lane mask for the next iteration of the loop before the
+  // original terminator.
+  Builder.setInsertPoint(EB);
+  auto *InLoopIncrement =
+      Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
+                                  {IncrementValue}, {false, false}, DL);
+  auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
+                                   {InLoopIncrement, TripCount}, DL,
+                                   "active.lane.mask.next");
+  LaneMaskPhi->addOperand(ALM);
+
+  // Replace the original terminator with BranchOnCond. We have to invert the
+  // mask here because a true condition means jumping to the exit block.
+  auto *NotMask = Builder.createNot(ALM, DL);
+  Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index ff83c3f083b093..1ddbe769d4a132 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -98,6 +98,9 @@ struct VPlanTransforms {
   ///       VPlan directly.
   static void dropPoisonGeneratingRecipes(
       VPlan &Plan, function_ref<bool(BasicBlock *)> BlockNeedsPredication);
+
+  static void prepareToExecute(VPlan &Plan, bool HasNUW,
+                               bool DataAndControlFlowWithoutRuntimeCheck);
 };
 
 } // namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index d6b81543dbc9cc..22128e9dbb35a3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -51,12 +51,13 @@ static void verifyBlocksInRegion(const VPRegionBlock *Region) {
 
     auto *VPBB = dyn_cast<VPBasicBlock>(VPB);
     // Check block's condition bit.
-    if (VPB->getNumSuccessors() > 1 || (VPBB && VPBB->isExiting()))
-      assert(VPBB && VPBB->getTerminator() &&
-             "Block has multiple successors but doesn't "
-             "have a proper branch recipe!");
-    else
-      assert((!VPBB || !VPBB->getTerminator()) && "Unexpected branch recipe!");
+    /*    if (VPB->getNumSuccessors() > 1 || (VPBB && VPBB->isExiting()))*/
+    /*assert(VPBB && VPBB->getTerminator() &&*/
+    /*"Block has multiple successors but doesn't "*/
+    /*"have a proper branch recipe!");*/
+    /*else*/
+    /*assert((!VPBB || !VPBB->getTerminator()) && "Unexpected branch
+     * recipe!");*/
 
     // Check block's successors.
     const auto &Successors = VPB->getSuccessors();
@@ -260,19 +261,20 @@ bool VPlanVerifier::verifyPlanIsValid(const VPlan &Plan) {
     return false;
   }
 
-  if (Exiting->empty()) {
-    errs() << "VPlan vector loop exiting block must end with BranchOnCount or "
-              "BranchOnCond VPInstruction but is empty\n";
-    return false;
-  }
-
-  auto *LastInst = dyn_cast<VPInstruction>(std::prev(Exiting->end()));
-  if (!LastInst || (LastInst->getOpcode() != VPInstruction::BranchOnCount &&
-                    LastInst->getOpcode() != VPInstruction::BranchOnCond)) {
-    errs() << "VPlan vector loop exit must end with BranchOnCount or "
-              "BranchOnCond VPInstruction\n";
-    return false;
-  }
+  /*  if (Exiting->empty()) {*/
+  /*errs() << "VPlan vector loop exiting block must end with BranchOnCount or
+   * "*/
+  /*"BranchOnCond VPInstruction but is empty\n";*/
+  /*return false;*/
+  /*}*/
+
+  /*auto *LastInst = dyn_cast<VPInstruction>(std::prev(Exiting->end()));*/
+  /*if (!LastInst || (LastInst->getOpcode() != VPInstruction::BranchOnCount &&*/
+  /*LastInst->getOpcode() != VPInstruction::BranchOnCond)) {*/
+  /*errs() << "VPlan vector loop exit must end with BranchOnCount or "*/
+  /*"BranchOnCond VPInstruction\n";*/
+  /*return false;*/
+  /*}*/
 
   for (const VPRegionBlock *Region :
        VPBlockUtils::blocksOnly<const VPRegionBlock>(
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
index 44ace377ac7927..11ddb096e357e0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
@@ -18,12 +18,12 @@ define void @clamped_tc_8(ptr nocapture %dst, i32 %n, i64 %val){
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 8)
 ; CHECK-NEXT:    [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP8:%.*]] = mul i64 [[TMP7]], 8
 ; CHECK-NEXT:    [[TMP9:%.*]] = sub i64 8, [[TMP8]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp ugt i64 8, [[TMP8]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = select i1 [[TMP10]], i64 [[TMP9]], i64 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 8)
 ; CHECK-NEXT:    [[TMP12:%.*]] = call <vscale x 8 x i64> @llvm.experimental.stepvector.nxv8i64()
 ; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 8 x i64> [[TMP12]], zeroinitializer
 ; CHECK-NEXT:    [[TMP14:%.*]] = mul <vscale x 8 x i64> [[TMP13]], shufflevector (<vscale x 8 x i64> insertelement (<vscale x 8 x i64> poison, i64 1, i64 0), <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer)
@@ -115,12 +115,12 @@ define void @clamped_tc_max_8(ptr nocapture %dst, i32 %n, i64 %val){
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
 ; CHECK-NEXT:    [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP8:%.*]] = mul i64 [[TMP7]], 8
 ; CHECK-NEXT:    [[TMP9:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[TMP8]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp ugt i64 [[WIDE_TRIP_COUNT]], [[TMP8]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = select i1 [[TMP10]], i64 [[TMP9]], i64 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
 ; CHECK-NEXT:    [[TMP12:%.*]] = call <vscale x 8 x i64> @llvm.experimental.stepvector.nxv8i64()
 ; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 8 x i64> [[TMP12]], zeroinitializer
 ; CHECK-NEXT:    [[TMP14:%.*]] = mul <vscale x 8 x i64> [[TMP13]], shufflevector (<vscale x 8 x i64> insertelement (<vscale x 8 x i64> poison, i64 1, i64 0), <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll b/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
index fc67fb5aded6a8..11bae5068a180f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
@@ -138,12 +138,12 @@ define float @fadd_strict(ptr noalias nocapture readonly %a, i64 %n) #0 {
 ; CHECK-ORDERED-TF-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 8
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
 ; CHECK-ORDERED-TF-NEXT:    [[TMP7:%.*]] = sub i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = icmp ugt i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -733,11 +733,11 @@ define void @fadd_strict_interleave(ptr noalias nocapture readonly %a, ptr noali
 ; CHECK-ORDERED-TF-NEXT:    [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP23:%.*]] = mul i64 [[TMP22]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = mul i64 [[TMP8]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP10:%.*]] = sub i64 [[TMP2]], [[TMP9]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP11:%.*]] = icmp ugt i64 [[TMP2]], [[TMP9]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP12:%.*]] = select i1 [[TMP11]], i64 [[TMP10]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -993,12 +993,12 @@ define float @fadd_of_sum(ptr noalias nocapture readonly %a, ptr noalias nocaptu
 ; CHECK-ORDERED-TF-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP20:%.*]] = mul i64 [[TMP19]], 4
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = sub i64 [[N]], [[TMP7]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = icmp ugt i64 [[N]], [[TMP7]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP10:%.*]] = select i1 [[TMP9]], i64 [[TMP8]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -1234,12 +1234,12 @@ define float @fadd_conditional(ptr noalias nocapture readonly %a, ptr noalias no
 ; CHECK-ORDERED-TF-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP23:%.*]] = mul i64 [[TMP22]], 4
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP7:%.*]] = sub i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = icmp ugt i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ...
[truncated]

@nikolaypanchenko (Contributor) left a comment

I'm concerned that deferring this update to a later phase makes the VPlan ill-formed for xforms and for the VPlan-based cost model. That is, the canonical IV can be seen as a dead recipe by xforms, i.e. it requires special handling everywhere and could possibly break vendors' downstream code.
For the VPlan-based cost model, such an implicit representation will require extra code to properly model the cost (even though it's just 1 reg + 1 add).
What is the plan to handle these areas later?

@fhahn (Contributor, Author) commented Feb 25, 2024

Thank you very much for taking a look!

> I'm concerned that deferring this update to a later phase makes the VPlan ill-formed for xforms and for the VPlan-based cost model. That is, the canonical IV can be seen as a dead recipe by xforms, i.e. it requires special handling everywhere and could possibly break vendors' downstream code.

I don't think the form without the increment or branch-on-cond should be seen as ill-formed, just more abstract. The main idea is to gradually lower from a more abstract VPlan representation to a more concrete one that can be code-gen'd easily (as is also done for replicate regions, which only get introduced later). One advantage of VPlan should be that it allows us to model more complex constructs directly, especially early on.

At the early stages, various IV increments and how exactly the loop is exited do not really add any additional info (e.g. the canonical IV may or may not control exiting the loop). IIRC, when the recipes modeling the canonical IV and its increment were added, the increment was modeled to simplify code-gen, similar to how #82021 will simplify code-gen. Back then, VPlan was only used for code-gen, so it made sense at the time.

Since then the infrastructure has evolved quite a bit, and relatively recently we started to use gradual lowering to introduce complexity as needed at later stages.

IIUC the main benefit from #82021 is that it simplifies codegen, so I think it would make sense to consistently introduce those in preparation for codegen; modeling the increments for all inductions early on would unnecessarily make the plans more verbose, while only being needed for codegen. Just before codegen, if the increments are modeled explicitly, then we could also lower to generic scalar or vector phi nodes, instead of having multiple recipes to generate phis.

> For the VPlan-based cost model, such an implicit representation will require extra code to properly model the cost (even though it's just 1 reg + 1 add). What is the plan to handle these areas later?

It depends: if we see CanonicalIVPHIRecipe and VPWidenIntOrFpRecipe as complex recipes that take care of their increment as well, then computing the cost for both is natural. If the more verbose form is only introduced just before executing, there should be no need to account for costing the increment separately.
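
To make this concrete, here is a minimal sketch (based on the clone() change in this patch) of how code can query the canonical IV in the abstract form: the phi and its start value are always present, and the backedge value only appears once prepareToExecute has run:

```c++
VPCanonicalIVPHIRecipe *CanIV = Plan.getCanonicalIV();
VPValue *Start = CanIV->getStartValue(); // Always available.
// The increment is only attached during lowering, so check for it, mirroring
// the updated VPCanonicalIVPHIRecipe::clone() in VPlan.h.
if (CanIV->getNumOperands() == 2) {
  VPValue *Next = CanIV->getBackedgeValue();
  (void)Next;
}
```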

@ayalz (Collaborator) left a comment

Interesting addition, indeed following the outlined roadmap. Adding a couple of thoughts.

Comment on lines 145 to +146
!VPBB->getParent()->isReplicator())) {
if (!VPBB || !VPBB->getTerminator()) {
if (!IsAbstract && (!VPBB || !VPBB->getTerminator())) {

Perhaps rather than a global parameter, Is[RegionBackBranch]Abstract is more a property of the parental region - whether it ends with an explicit terminating branch or not - the latter applying to both replicating regions and non-replicating regions in "abstract loop-control" state.
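
A rough sketch of that idea (the hasAbstractLoopControl() query is hypothetical, not part of the patch):

```c++
// Whether a region's exiting block is expected to end in an explicit branch
// recipe: replicating regions never do, and loop regions only do once their
// abstract loop control has been lowered.
// (hasAbstractLoopControl() is a hypothetical query.)
static bool regionExpectsTerminator(const VPRegionBlock *Region) {
  return !Region->isReplicator() && !Region->hasAbstractLoopControl();
}
```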

@@ -1307,3 +1268,71 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
}
}
}

void VPlanTransforms::prepareToExecute(

Would be good to be more precise than "prepareToExecute()".

// loop.
static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
DebugLoc DL) {
// Add the required canonical IV

Suggested change
// Add the required canonical IV
// Add the required canonical IV.

static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
DebugLoc DL) {
// Add the required canonical IV
static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {

Perhaps instead of creating the final original canonical-IV recipe, leaving it temporarily with only a single operand, it would be better to employ another (sub)class, having a single operand, which is later replaced by the more explicit CanonicalIVPHI?
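
A hypothetical sketch of that alternative (helper name assumed, and the abstract single-operand recipe is stood in for by a plain VPHeaderPHIRecipe here; none of this is part of the patch): lowering would swap the abstract recipe for the explicit VPCanonicalIVPHIRecipe once the increment exists.

```c++
// Replace the single-operand abstract IV recipe with the explicit canonical
// IV phi once its backedge value (the increment) has been created.
static VPCanonicalIVPHIRecipe *
concretizeCanonicalIV(VPHeaderPHIRecipe *AbstractIV, VPValue *Increment) {
  auto *CanIV = new VPCanonicalIVPHIRecipe(AbstractIV->getStartValue(),
                                           AbstractIV->getDebugLoc());
  CanIV->addOperand(Increment);
  CanIV->insertBefore(AbstractIV);
  AbstractIV->replaceAllUsesWith(CanIV);
  AbstractIV->eraseFromParent();
  return CanIV;
}
```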

@nikolaypanchenko (Contributor)

> I don't think the form without the increment or branch-on-cond should be seen as ill-formed, just more abstract. The main idea is to gradually lower from a more abstract VPlan representation to a more concrete one that can be code-gen'd easily (as is also done for replicate regions, which only get introduced later). One advantage of VPlan should be that it allows us to model more complex constructs directly, especially early on.

Can you please share a VPlan design doc that has been discussed with the community? Perhaps I'm missing something, but right now my view of VPlan is that it is self-contained, i.e. it should be possible to get all the information required for a xform without accessing the underlying IR (except for live-ins and live-outs), and at any point it should be possible to generate code out of it without any extra manipulations of the VPlan.

> At the early stages, various IV increments and how exactly the loop is exited do not really add any additional info (e.g. the canonical IV may or may not control exiting the loop). IIRC, when the recipes modeling the canonical IV and its increment were added, the increment was modeled to simplify code-gen, similar to how #82021 will simplify code-gen. Back then, VPlan was only used for code-gen, so it made sense at the time.

> Since then the infrastructure has evolved quite a bit, and relatively recently we started to use gradual lowering to introduce complexity as needed at later stages.

Ignoring the cost-model part: for a VPlan that represents just one loop, this certainly simplifies xforms. But if we consider future work where peel/remainder (prolog/epilog/tail/cleanup) loops are also represented, that is going to complicate xforms (unrolling, interleaving), codegen, and the cost model, am I wrong? Technically that can be mitigated by https://reviews.llvm.org/D148581, but if we consider SEME loops, from my perspective, that becomes complicated too.

> IIUC the main benefit from #82021 is that it simplifies codegen, so I think it would make sense to consistently introduce those in preparation for codegen; modeling the increments for all inductions early on would unnecessarily make the plans more verbose, while only being needed for codegen. Just before codegen, if the increments are modeled explicitly, then we could also lower to generic scalar or vector phi nodes, instead of having multiple recipes to generate phis.

Eventually yes, to simplify codegen. But most importantly, for EVL-based vectorization that change helps to replace the WidenVFxUF that is used with WidenEVL. The same will be true for the remaining VPWidenPointerInductionRecipe, which would also be better decomposed.
I also want to point out that #82021 moves the trunc optimization to a xform, so the VPlan becomes more explicit, which IMO makes more sense. However, if VPlan is going to allow some implicitness, it's not so clear whether that is good.

> It depends: if we see CanonicalIVPHIRecipe and VPWidenIntOrFpRecipe as complex recipes that take care of their increment as well, then computing the cost for both is natural. If the more verbose form is only introduced just before executing, there should be no need to account for costing the increment separately.

Right, and that's my point: now all the places that need to be aware of the update will have extra logic to deal with it. Also, for VPWidenIntOrFpRecipe the heuristics become a bit more complicated / it is less clear how to estimate the broadcast of VF. For VPWidenPointerInductionRecipe it would also have to accommodate the onlyScalarsGenerated case.
For these vectorized IVs, is there a plan to represent broadcast(VF) in the preheader?

fhahn added a commit to fhahn/llvm-project that referenced this pull request Apr 22, 2024
This patch adds a new VPInstruction::HeaderMask opcode to model the
abstract header-mask used for tail-folding. It will be lowered depending
on target preference (either using active-lane-mask,
explicit-vector-length or a wide compare of the canonical IV and the
backedge-taken count).

Similarly to llvm#82270, it would be good to clarify/agree on the
terminology w.r.t. recipes/opcodes that cannot be code-gen'd directly
(i.e. require further gradual lowering).

NOTE: some tests are failing or need updating, due to widened IVs
being replaced by scalar steps when their only use was the earlier wide
compare. This could be fixed by adding a suitable wide canonical IV as
an operand to the header-mask recipe, which would exactly preserve the
original behavior. Alternatively, we could keep the current behavior of
the patch and update the tests, or introduce a wide induction PHI
instead of VPWidenCanonicalIVRecipe; currently we *only* use a wide IV
for VPWidenCanonicalIVRecipe if there was a suitable IV in the original
loop, *even* if the mask compare is the *only* wide use. Either never or
always using a wide PHI would be more consistent (or we could eventually
make a more informed cost-based decision).
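
As a rough illustration of the lowering this commit message describes, the active-lane-mask variant could look roughly like the sketch below (the helper name is assumed and per-part/unroll offsets are deliberately ignored; EVL and wide-compare lowerings would be analogous):

```c++
static void lowerHeaderMaskToActiveLaneMask(VPlan &Plan,
                                            VPInstruction *HeaderMask) {
  VPBuilder Builder;
  Builder.setInsertPoint(HeaderMask->getParent(), HeaderMask->getIterator());
  DebugLoc DL = HeaderMask->getDebugLoc();
  // mask = active.lane.mask(canonical IV, trip count)
  VPValue *ALM = Builder.createNaryOp(
      VPInstruction::ActiveLaneMask,
      {Plan.getCanonicalIV(), Plan.getTripCount()}, DL, "active.lane.mask");
  HeaderMask->replaceAllUsesWith(ALM);
  HeaderMask->eraseFromParent();
}
```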