[VPlan] Add new VPInstruction opcode for header mask. #89603

Open: wants to merge 5 commits into base: main

Conversation

@fhahn (Contributor) commented Apr 22, 2024

This patch adds a new VPInstruction::HeaderMask opcode to model the abstract header mask used for tail folding. It will be lowered depending on target preference (using either an active-lane-mask, an explicit-vector-length, or a wide compare of the canonical IV and the backedge-taken count).

Similarly to #82270, it would be good to clarify/agree on the terminology w.r.t. recipes/opcodes that cannot be code-gen'd directly (i.e., they require further gradual lowering).

Follow-up to #87816

NOTE: some tests are failing or needed updating, because widened IVs are replaced by scalar steps when their only use was the earlier wide compare. This could be fixed by adding a suitable wide canonical IV as an operand of the header-mask recipe, exactly preserving the original behavior. Alternatively, we could keep the patch's current behavior and update the tests, or introduce a wide induction PHI instead of VPWidenCanonicalIVRecipe; currently we only use a wide IV for VPWidenCanonicalIVRecipe if there was a suitable IV in the original loop, even if the mask compare is the only wide use. Either never or always using a wide PHI would be more consistent (or we could eventually make a more informed, cost-based decision).
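To make the lowering paths concrete, here is a rough C++ sketch of the dispatch this opcode enables (illustrative only: lowerHeaderMask is not a function in this patch, and the TailFoldingStyle mapping is an assumption):

static void lowerHeaderMask(VPlan &Plan, TailFoldingStyle Style) {
  // Find the single abstract VPInstruction::HeaderMask, if present.
  VPInstruction *HeaderMask = getHeaderMask(Plan);
  if (!HeaderMask)
    return; // Not tail folding; every iteration runs a full vector.

  switch (Style) {
  case TailFoldingStyle::DataAndControlFlow:
    // Lower to an active-lane-mask (VPlanTransforms::addActiveLaneMask).
    break;
  case TailFoldingStyle::DataWithEVL:
    // Fold the mask into EVL-based memory recipes and remove it
    // (VPlanTransforms::addExplicitVectorLength).
    break;
  default:
    // Default lowering (VPlanTransforms::lowerRecipes): compare a wide
    // canonical IV ULE against the backedge-taken count.
    break;
  }
}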

@llvmbot (Collaborator) commented Apr 22, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)



Patch is 86.23 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89603.diff

9 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+5-11)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+73-122)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+3)
  • (modified) llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll (+20-19)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll (+129-114)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-predselect.ll (+37-28)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+6-6)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 33c4decd58a6c2..70beb4b1bfd415 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8035,21 +8035,14 @@ void VPRecipeBuilder::createHeaderMask() {
     return;
   }
 
-  // Introduce the early-exit compare IV <= BTC to form header block mask.
-  // This is used instead of IV < TC because TC may wrap, unlike BTC. Start by
-  // constructing the desired canonical IV in the header block as its first
-  // non-phi instructions.
-
+  // Introduce an abstract header-mask VPInstruction. This will be lowered later
+  // depending on target preference.
   VPBasicBlock *HeaderVPBB = Plan.getVectorLoopRegion()->getEntryBasicBlock();
   auto NewInsertionPoint = HeaderVPBB->getFirstNonPhi();
-  auto *IV = new VPWidenCanonicalIVRecipe(Plan.getCanonicalIV());
-  HeaderVPBB->insert(IV, NewInsertionPoint);
-
   VPBuilder::InsertPointGuard Guard(Builder);
   Builder.setInsertPoint(HeaderVPBB, NewInsertionPoint);
-  VPValue *BlockMask = nullptr;
-  VPValue *BTC = Plan.getOrCreateBackedgeTakenCount();
-  BlockMask = Builder.createICmp(CmpInst::ICMP_ULE, IV, BTC);
+  VPValue *BlockMask =
+      Builder.createNaryOp(VPInstruction::HeaderMask, {Plan.getCanonicalIV()});
   BlockMaskCache[Header] = BlockMask;
 }
 
@@ -8555,6 +8548,7 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
       // TODO: try to put it close to addActiveLaneMask().
       if (CM.foldTailWithEVL())
         VPlanTransforms::addExplicitVectorLength(*Plan);
+      VPlanTransforms::lowerRecipes(*Plan);
       assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
       VPlans.push_back(std::move(Plan));
     }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index c74329a0bcc4ac..bd472d770803e0 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1180,6 +1180,9 @@ class VPInstruction : public VPRecipeWithIRFlags {
     // operand). Only generates scalar values (either for the first lane only or
     // for all lanes, depending on its uses).
     PtrAdd,
+    // An abstract representation of the vector loop's header mask, to be lowered
+    // later depending on target preference.
+    HeaderMask,
   };
 
 private:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 9ec422ec002c82..87bcea50b21572 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -132,6 +132,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::PtrAdd:
+    case VPInstruction::HeaderMask:
       return false;
     default:
       return true;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 007dc3f89b3fb9..6d643b81b2c88e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -434,44 +434,6 @@ static void removeRedundantInductionCasts(VPlan &Plan) {
   }
 }
 
-/// Try to replace VPWidenCanonicalIVRecipes with a widened canonical IV
-/// recipe, if it exists.
-static void removeRedundantCanonicalIVs(VPlan &Plan) {
-  VPCanonicalIVPHIRecipe *CanonicalIV = Plan.getCanonicalIV();
-  VPWidenCanonicalIVRecipe *WidenNewIV = nullptr;
-  for (VPUser *U : CanonicalIV->users()) {
-    WidenNewIV = dyn_cast<VPWidenCanonicalIVRecipe>(U);
-    if (WidenNewIV)
-      break;
-  }
-
-  if (!WidenNewIV)
-    return;
-
-  VPBasicBlock *HeaderVPBB = Plan.getVectorLoopRegion()->getEntryBasicBlock();
-  for (VPRecipeBase &Phi : HeaderVPBB->phis()) {
-    auto *WidenOriginalIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&Phi);
-
-    if (!WidenOriginalIV || !WidenOriginalIV->isCanonical() ||
-        WidenOriginalIV->getScalarType() != WidenNewIV->getScalarType())
-      continue;
-
-    // Replace WidenNewIV with WidenOriginalIV if WidenOriginalIV provides
-    // everything WidenNewIV's users need. That is, WidenOriginalIV will
-    // generate a vector phi or all users of WidenNewIV demand the first lane
-    // only.
-    if (any_of(WidenOriginalIV->users(),
-               [WidenOriginalIV](VPUser *U) {
-                 return !U->usesScalars(WidenOriginalIV);
-               }) ||
-        vputils::onlyFirstLaneUsed(WidenNewIV)) {
-      WidenNewIV->replaceAllUsesWith(WidenOriginalIV);
-      WidenNewIV->eraseFromParent();
-      return;
-    }
-  }
-}
-
 /// Returns true if \p R is dead and can be removed.
 static bool isDeadRecipe(VPRecipeBase &R) {
   using namespace llvm::PatternMatch;
@@ -1086,7 +1048,6 @@ void VPlanTransforms::truncateToMinimalBitwidths(
 }
 
 void VPlanTransforms::optimize(VPlan &Plan, ScalarEvolution &SE) {
-  removeRedundantCanonicalIVs(Plan);
   removeRedundantInductionCasts(Plan);
 
   simplifyRecipes(Plan, SE.getContext());
@@ -1203,52 +1164,33 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
   return LaneMaskPhi;
 }
 
-/// Collect all VPValues representing a header mask through the (ICMP_ULE,
-/// WideCanonicalIV, backedge-taken-count) pattern.
-/// TODO: Introduce explicit recipe for header-mask instead of searching
-/// for the header-mask pattern manually.
-static SmallVector<VPValue *> collectAllHeaderMasks(VPlan &Plan) {
-  SmallVector<VPValue *> WideCanonicalIVs;
-  auto *FoundWidenCanonicalIVUser =
-      find_if(Plan.getCanonicalIV()->users(),
-              [](VPUser *U) { return isa<VPWidenCanonicalIVRecipe>(U); });
-  assert(count_if(Plan.getCanonicalIV()->users(),
-                  [](VPUser *U) { return isa<VPWidenCanonicalIVRecipe>(U); }) <=
-             1 &&
-         "Must have at most one VPWideCanonicalIVRecipe");
-  if (FoundWidenCanonicalIVUser != Plan.getCanonicalIV()->users().end()) {
-    auto *WideCanonicalIV =
-        cast<VPWidenCanonicalIVRecipe>(*FoundWidenCanonicalIVUser);
-    WideCanonicalIVs.push_back(WideCanonicalIV);
-  }
-
-  // Also include VPWidenIntOrFpInductionRecipes that represent a widened
-  // version of the canonical induction.
+/// Return the header mask recipe of the VPlan, if there is one.
+static VPInstruction *getHeaderMask(VPlan &Plan) {
   VPBasicBlock *HeaderVPBB = Plan.getVectorLoopRegion()->getEntryBasicBlock();
-  for (VPRecipeBase &Phi : HeaderVPBB->phis()) {
-    auto *WidenOriginalIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&Phi);
-    if (WidenOriginalIV && WidenOriginalIV->isCanonical())
-      WideCanonicalIVs.push_back(WidenOriginalIV);
-  }
+  auto R = find_if(*HeaderVPBB, [](VPRecipeBase &R) {
+    using namespace llvm::VPlanPatternMatch;
+    return match(&R,
+                 m_VPInstruction<VPInstruction::HeaderMask>(m_VPValue()));
+  });
+  return R == HeaderVPBB->end() ? nullptr : cast<VPInstruction>(&*R);
+}
 
-  // Walk users of wide canonical IVs and collect to all compares of the form
-  // (ICMP_ULE, WideCanonicalIV, backedge-taken-count).
-  SmallVector<VPValue *> HeaderMasks;
-  VPValue *BTC = Plan.getOrCreateBackedgeTakenCount();
-  for (auto *Wide : WideCanonicalIVs) {
-    for (VPUser *U : SmallVector<VPUser *>(Wide->users())) {
-      auto *HeaderMask = dyn_cast<VPInstruction>(U);
-      if (!HeaderMask || HeaderMask->getOpcode() != Instruction::ICmp ||
-          HeaderMask->getPredicate() != CmpInst::ICMP_ULE ||
-          HeaderMask->getOperand(1) != BTC)
-        continue;
+static VPValue *getOrCreateWideCanonicalIV(VPlan &Plan,
+                                           VPRecipeBase *InsertPt) {
 
-      assert(HeaderMask->getOperand(0) == Wide &&
-             "WidenCanonicalIV must be the first operand of the compare");
-      HeaderMasks.push_back(HeaderMask);
-    }
+  VPBasicBlock *HeaderVPBB = Plan.getVectorLoopRegion()->getEntryBasicBlock();
+  for (VPRecipeBase &R : HeaderVPBB->phis()) {
+    auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&R);
+    if (!WideIV || !WideIV->isCanonical() ||
+        Plan.getCanonicalIV()->getScalarType() != WideIV->getScalarType())
+      continue;
+    return WideIV;
+    break;
   }
-  return HeaderMasks;
+
+  auto *IV = new VPWidenCanonicalIVRecipe(Plan.getCanonicalIV());
+  IV->insertBefore(InsertPt);
+  return IV;
 }
 
 void VPlanTransforms::addActiveLaneMask(
@@ -1258,30 +1200,23 @@ void VPlanTransforms::addActiveLaneMask(
           UseActiveLaneMaskForControlFlow) &&
          "DataAndControlFlowWithoutRuntimeCheck implies "
          "UseActiveLaneMaskForControlFlow");
-
-  auto FoundWidenCanonicalIVUser =
-      find_if(Plan.getCanonicalIV()->users(),
-              [](VPUser *U) { return isa<VPWidenCanonicalIVRecipe>(U); });
-  assert(FoundWidenCanonicalIVUser &&
-         "Must have widened canonical IV when tail folding!");
-  auto *WideCanonicalIV =
-      cast<VPWidenCanonicalIVRecipe>(*FoundWidenCanonicalIVUser);
+  VPValue *HeaderMask = getHeaderMask(Plan);
+  assert(HeaderMask && "Active-lane-mask not needed?");
   VPSingleDefRecipe *LaneMask;
   if (UseActiveLaneMaskForControlFlow) {
     LaneMask = addVPLaneMaskPhiAndUpdateExitBranch(
         Plan, DataAndControlFlowWithoutRuntimeCheck);
   } else {
-    VPBuilder B = VPBuilder::getToInsertAfter(WideCanonicalIV);
-    LaneMask = B.createNaryOp(VPInstruction::ActiveLaneMask,
-                              {WideCanonicalIV, Plan.getTripCount()}, nullptr,
-                              "active.lane.mask");
+    VPBasicBlock *HeaderVPBB = Plan.getVectorLoopRegion()->getEntryBasicBlock();
+    VPBuilder B;
+    B.setInsertPoint(HeaderVPBB, HeaderVPBB->getFirstNonPhi());
+    LaneMask = B.createNaryOp(
+        VPInstruction::ActiveLaneMask,
+        {getOrCreateWideCanonicalIV(Plan, &*HeaderVPBB->getFirstNonPhi()),
+         Plan.getTripCount()},
+        nullptr, "active.lane.mask");
   }
-
-  // Walk users of WideCanonicalIV and replace all compares of the form
-  // (ICMP_ULE, WideCanonicalIV, backedge-taken-count) with an
-  // active-lane-mask.
-  for (VPValue *HeaderMask : collectAllHeaderMasks(Plan))
-    HeaderMask->replaceAllUsesWith(LaneMask);
+  HeaderMask->replaceAllUsesWith(LaneMask);
 }
 
 /// Add a VPEVLBasedIVPHIRecipe and related recipes to \p Plan and
@@ -1307,6 +1242,10 @@ void VPlanTransforms::addActiveLaneMask(
 /// ...
 ///
 void VPlanTransforms::addExplicitVectorLength(VPlan &Plan) {
+  VPValue *HeaderMask = getHeaderMask(Plan);
+  if (!HeaderMask)
+    return;
+
   VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock();
   auto *CanonicalIVPHI = Plan.getCanonicalIV();
   VPValue *StartV = CanonicalIVPHI->getStartValue();
@@ -1336,31 +1275,30 @@ void VPlanTransforms::addExplicitVectorLength(VPlan &Plan) {
   NextEVLIV->insertBefore(CanonicalIVIncrement);
   EVLPhi->addOperand(NextEVLIV);
 
-  for (VPValue *HeaderMask : collectAllHeaderMasks(Plan)) {
-    for (VPUser *U : collectUsersRecursively(HeaderMask)) {
-      auto *MemR = dyn_cast<VPWidenMemoryRecipe>(U);
-      if (!MemR)
-        continue;
-      assert(!MemR->isReverse() &&
-             "Reversed memory operations not supported yet.");
-      VPValue *OrigMask = MemR->getMask();
-      assert(OrigMask && "Unmasked widen memory recipe when folding tail");
-      VPValue *NewMask = HeaderMask == OrigMask ? nullptr : OrigMask;
-      if (auto *L = dyn_cast<VPWidenLoadRecipe>(MemR)) {
-        auto *N = new VPWidenLoadEVLRecipe(L, VPEVL, NewMask);
-        N->insertBefore(L);
-        L->replaceAllUsesWith(N);
-        L->eraseFromParent();
-      } else if (auto *S = dyn_cast<VPWidenStoreRecipe>(MemR)) {
-        auto *N = new VPWidenStoreEVLRecipe(S, VPEVL, NewMask);
-        N->insertBefore(S);
-        S->eraseFromParent();
-      } else {
-        llvm_unreachable("unsupported recipe");
-      }
+  for (VPUser *U : collectUsersRecursively(HeaderMask)) {
+    auto *MemR = dyn_cast<VPWidenMemoryRecipe>(U);
+    if (!MemR)
+      continue;
+    assert(!MemR->isReverse() &&
+           "Reversed memory operations not supported yet.");
+    VPValue *OrigMask = MemR->getMask();
+    assert(OrigMask && "Unmasked widen memory recipe when folding tail");
+    VPValue *NewMask = HeaderMask == OrigMask ? nullptr : OrigMask;
+    if (auto *L = dyn_cast<VPWidenLoadRecipe>(MemR)) {
+      auto *N = new VPWidenLoadEVLRecipe(L, VPEVL, NewMask);
+      N->insertBefore(L);
+      L->replaceAllUsesWith(N);
+      L->eraseFromParent();
+    } else if (auto *S = dyn_cast<VPWidenStoreRecipe>(MemR)) {
+      auto *N = new VPWidenStoreEVLRecipe(S, VPEVL, NewMask);
+      N->insertBefore(S);
+      S->eraseFromParent();
+    } else {
+      llvm_unreachable("unsupported recipe");
     }
-    recursivelyDeleteDeadRecipes(HeaderMask);
   }
+  recursivelyDeleteDeadRecipes(HeaderMask);
+
   // Replace all uses of VPCanonicalIVPHIRecipe by
   // VPEVLBasedIVPHIRecipe except for the canonical IV increment.
   CanonicalIVPHI->replaceAllUsesWith(EVLPhi);
@@ -1465,3 +1403,16 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
     }
   }
 }
+
+void VPlanTransforms::lowerRecipes(VPlan &Plan) {
+  VPInstruction *HeaderMask = getHeaderMask(Plan);
+  if (!HeaderMask)
+    return;
+
+  VPValue *IV = getOrCreateWideCanonicalIV(Plan, HeaderMask);
+  VPBuilder Builder(HeaderMask);
+  VPValue *BTC = Plan.getOrCreateBackedgeTakenCount();
+  VPValue *M = Builder.createICmp(CmpInst::ICMP_ULE, IV, BTC);
+  HeaderMask->replaceAllUsesWith(M);
+  HeaderMask->eraseFromParent();
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 0cbc70713d9c10..86a29e0a11aafc 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -105,6 +105,9 @@ struct VPlanTransforms {
   /// VPCanonicalIVPHIRecipe is only used to control the loop after
   /// this transformation.
   static void addExplicitVectorLength(VPlan &Plan);
+
+  /// Lower abstract VPInstruction recipes to a concrete sequence of recipes for which code can be generated.
+  static void lowerRecipes(VPlan &Plan);
 };
 
 } // namespace llvm
diff --git a/llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll b/llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll
index 76ca2507b914cf..cc015ce465c472 100644
--- a/llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll
+++ b/llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll
@@ -37,7 +37,7 @@ define dso_local void @alignTC(ptr noalias nocapture %A, i32 %n) optsize {
 ; CHECK-NEXT:    store i32 13, ptr [[ARRAYIDX]], align 1
 ; CHECK-NEXT:    [[RIVPLUS1]] = add nuw nsw i32 [[RIV]], 1
 ; CHECK-NEXT:    [[COND:%.*]] = icmp eq i32 [[RIVPLUS1]], [[ALIGNEDTC]]
-; CHECK-NEXT:    br i1 [[COND]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT:    br i1 [[COND]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
 ;
@@ -158,13 +158,15 @@ define dso_local void @cannotProveAlignedTC(ptr noalias nocapture %A, i32 %p, i3
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 4
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TRIP_COUNT_MINUS_1:%.*]] = sub i32 [[N]], 1
-; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TRIP_COUNT_MINUS_1]], i64 0
-; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i32> poison, i32 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT1]], <4 x i32> poison, <4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
-; CHECK-NEXT:    [[TMP0:%.*]] = icmp ule <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE8:%.*]] ]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[INDEX]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[VEC_IV:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[TMP0:%.*]] = icmp ule <4 x i32> [[VEC_IV]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <4 x i1> [[TMP0]], i32 0
 ; CHECK-NEXT:    br i1 [[TMP1]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
 ; CHECK:       pred.store.if:
@@ -174,31 +176,30 @@ define dso_local void @cannotProveAlignedTC(ptr noalias nocapture %A, i32 %p, i3
 ; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE]]
 ; CHECK:       pred.store.continue:
 ; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <4 x i1> [[TMP0]], i32 1
-; CHECK-NEXT:    br i1 [[TMP4]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; CHECK:       pred.store.if1:
+; CHECK-NEXT:    br i1 [[TMP4]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
+; CHECK:       pred.store.if3:
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 [[TMP5]]
 ; CHECK-NEXT:    store i32 13, ptr [[TMP6]], align 1
-; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE2]]
-; CHECK:       pred.store.continue2:
+; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE4]]
+; CHECK:       pred.store.continue4:
 ; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <4 x i1> [[TMP0]], i32 2
-; CHECK-NEXT:    br i1 [[TMP7]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; CHECK:       pred.store.if3:
+; CHECK-NEXT:    br i1 [[TMP7]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
+; CHECK:       pred.store.if5:
 ; CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 [[TMP8]]
 ; CHECK-NEXT:    store i32 13, ptr [[TMP9]], align 1
-; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE4]]
-; CHECK:       pred.store.continue4:
+; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE6]]
+; CHECK:       pred.store.continue6:
 ; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <4 x i1> [[TMP0]], i32 3
-; CHECK-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6]]
-; CHECK:       pred.store.if5:
+; CHECK-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8]]
+; CHECK:       pred.store.if7:
 ; CHECK-NEXT:    [[TMP11:%.*]] = add i32 [[INDEX]], 3
 ; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[A]], i32 [[TMP11]]
 ; CHECK-NEXT:    store i32 13, ptr [[TMP12]], align 1
-; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE6]]
-; CHECK:       pred.store.continue6:
+; CHECK-NEXT:    br label [[PRED_STORE_CONTINUE8]]
+; CHECK:       pred.store.continue8:
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i32 [[INDEX]], 4
-; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], <i32 4, i32 4, i32 4, i32 4>
 ; CHECK-NEXT:    [[TMP13:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll b/llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
index e79f0983a63785..c18c2f27e0f437 100644
--- a/llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
+++ b/llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
@@ -11,9 +11,11 @@ define i32 @reduction_sum_single(ptr noalias nocapture %A) {
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE6:%.*]] ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i...
[truncated]

github-actions bot commented Apr 22, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@fhahn (Contributor Author) commented Apr 28, 2024

ping :)

@ayalz (Collaborator) left a comment

Nice cleanup, thanks for following up on this!
Adding several comments.

@@ -105,6 +105,10 @@ struct VPlanTransforms {
/// VPCanonicalIVPHIRecipe is only used to control the loop after
/// this transformation.
static void addExplicitVectorLength(VPlan &Plan);

/// Lower abstract VPInstruction recipes to a concrete sequence of recipes for
Collaborator:

Suggested change
/// Lower abstract VPInstruction recipes to a concrete sequence of recipes for
/// Lower abstract VPInstruction recipes to sequences of concrete recipes for

@@ -8555,6 +8548,7 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
// TODO: try to put it close to addActiveLaneMask().
if (CM.foldTailWithEVL())
VPlanTransforms::addExplicitVectorLength(*Plan);
VPlanTransforms::lowerRecipes(*Plan);
Collaborator:

(TODO) The three passes that lower the header mask (addActiveLaneMask, addExplicitVectorLength, lowerRecipes) should arguably be applied together, depending on tail folding style, inside VPlanTransforms::optimize().

Contributor Author:

Updated the TODO above, thanks!

@@ -1180,6 +1180,9 @@ class VPInstruction : public VPRecipeWithIRFlags {
// operand). Only generates scalar values (either for the first lane only or
// for all lanes, depending on its uses).
PtrAdd,
// An abstract representation of the vector loop's header mask, to be lowered
// later depending on target preference.
Collaborator:

Suggested change
// later depending on target preference.
// later depending on target preference. Relevant only when the header may have a partial mask, i.e., when tail folding. A mask known to always be full is represented by null, w/o a HeaderMask recipe. A header mask may not be empty.

Contributor Author:

Added, thanks!

@@ -1180,6 +1180,9 @@ class VPInstruction : public VPRecipeWithIRFlags {
// operand). Only generates scalar values (either for the first lane only or
// for all lanes, depending on its uses).
PtrAdd,
// An abstract representation of the vector loop's header mask, to be lowered
// later depending on target preference.
HeaderMask,
Collaborator:

nit: would be good to list in lex order, so better placed before PtrAdd.

Contributor Author:

Moved, thanks!

@@ -132,6 +132,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::PtrAdd:
case VPInstruction::HeaderMask:
Collaborator:

nit: would be good to list in lex order, so better placed before PtrAdd.

Contributor Author:

Done, thanks!

// Also include VPWidenIntOrFpInductionRecipes that represent a widened
// version of the canonical induction.
/// Return the header mask recipe of the VPlan, if there is one.
static VPInstruction *getHeaderMask(VPlan &Plan) {
Collaborator:

Some thoughts: have the loop region (1) provide a getHeaderMask() method which (always) returns a VPValue*, and (2) hold an abstract HeaderMask VPInstruction* field. The former initially retrieves the latter, and later retrieves the concrete header-mask-generating recipe. The latter should be unique per loop region. It may be inserted into the header VPBB (after the canonical IV recipe) if needed to make sure standard def/use dominance relations stay intact(?), but in general it also represents a bump which is placed after (header) phis and must not be executed, so better have the region take care of (printing) it?
A similar argument applies to the canonical IV, if/when introduced initially w/o its bump, to be lowered eventually to a regular induction header phi and add recipes.

Contributor Author:

have Loop Region (1) provide a getHeaderMask() method which (always) returns a VPValue*

Left as-is for now, as I think this would also require looking for the ActiveLaneMask that replaces the HeaderMask VPInstruction.

(2) hold an abstract HeaderMask VPInstruction* field. The former retrieves the latter initially, later retrieves concrete header mask generating recipe.

Adding this as a field puts an extra burden on code that replaces/removes the header mask, as it would have to update the field. It might be better to manage the header mask (and/or canonical IV) via a VPUser in VPRegionBlock; that way, RAUW of the header mask will also correctly update the reference in the region block. WDYT as a follow-up?

That leaves the EVL case, where we don't replace the mask directly, but update users of the mask to use EVL, possibly removing the mask altogether. This would require some special handling.

Collaborator:

(Following discussion w/ @aniragil): rather than introduce a designated abstract VPInstruction, it would suffice for a loop region's getHeaderMask() to provide a VPValue* - initially an abstract "live-in" VPValue the region holds, until it is replaced by a (concrete) recipe - ActiveLaneMask or a vector compare? This replacement can be done via setHeaderMask(). Whoever needs access to the canonical IV can get it directly rather than check the header mask's operand. Whoever materializes concrete recipes for generating the header mask can figure out where to place them directly, rather than use the abstract header mask as a concrete insertion point.
For EVL, setHeaderMask() effectively sets the header mask to null, after visiting all its users.
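A minimal sketch of what that interface might look like (hypothetical; VPRegionBlock has no such members in-tree):

class VPRegionBlock : public VPBlockBase {
  // Abstract "live-in" header mask, unique per region; replaced by a
  // concrete recipe (active-lane-mask or wide compare) during lowering,
  // or set to null for EVL once all users have been rewritten.
  VPValue *HeaderMask = nullptr;

public:
  VPValue *getHeaderMask() const { return HeaderMask; }

  void setHeaderMask(VPValue *NewMask) {
    if (HeaderMask && NewMask)
      HeaderMask->replaceAllUsesWith(NewMask);
    HeaderMask = NewMask;
  }
};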

Comment on lines 1181 to 1187
for (VPRecipeBase &R : HeaderVPBB->phis()) {
auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&R);
if (!WideIV || !WideIV->isCanonical() ||
Plan.getCanonicalIV()->getScalarType() != WideIV->getScalarType())
continue;
return WideIV;
break;
Collaborator:

Suggested change
for (VPRecipeBase &R : HeaderVPBB->phis()) {
auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&R);
if (!WideIV || !WideIV->isCanonical() ||
Plan.getCanonicalIV()->getScalarType() != WideIV->getScalarType())
continue;
return WideIV;
break;
Type *CanonicalType = Plan.getCanonicalIV()->getScalarType();
for (VPRecipeBase &R : HeaderVPBB->phis()) {
auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&R);
if (WideIV && WideIV->isCanonical() &&
WideIV->getScalarType() == CanonicalType)
return WideIV;

Contributor Author:

Updated (modulo type check), thanks!

for (VPRecipeBase &R : HeaderVPBB->phis()) {
auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&R);
if (!WideIV || !WideIV->isCanonical() ||
Plan.getCanonicalIV()->getScalarType() != WideIV->getScalarType())
Collaborator:

nit: have isCanonical() accept a (possibly optional/null) Type* parameter?

Contributor Author:

Split off to a48ebb8 by directly retrieving the canonical induction, which must be the first recipe in the same block (region header)

Comment on lines 1210 to 1211
VPBuilder B;
B.setInsertPoint(HeaderVPBB, HeaderVPBB->getFirstNonPhi());
Collaborator:

Suggested change
VPBuilder B;
B.setInsertPoint(HeaderVPBB, HeaderVPBB->getFirstNonPhi());
VPBuilder B(HeaderVPBB->getFirstNonPhi());

Contributor Author:

Done, thanks!

; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
Collaborator:

NOTE: some tests are failing or needed updating ...

This note in the commit message should be augmented with the selected approach - which is to update the tests, right? Would it be reasonable to first retain the current behavior, isolating this refactoring cleanup as an NFC, to be followed by the desired simplification / consistency functional change, if not too cumbersome?

fhahn added a commit that referenced this pull request May 3, 2024
Directly check the type of the wide induction matches the canonical
induction.

Refactor suggested in and in preparation for
#89603
@fhahn (Contributor Author) left a comment

Updated the PR, and updated the HeaderMask to have a suitable wide canonical IV as operand, if there is one, to avoid DCE and preserve the original test behavior.

With the recent EVL changes, we also need to adjust VPWidenCanonicalIVRecipe; it no longer just widens the canonical IV, it may also widen the EVL-based counter. So we need to loosen the restriction on the start value (which I did in this patch, to be split off separately once we converge on a good name for the new recipe; it could also be a VPInstruction opcode instead of a recipe).

Widening the EVL-based counter should be fine; we may execute an operation using the mask on more elements than necessary for the last two iterations. But as source and sink operations (loads/stores) are limited by EVL, it should not lead to correctness issues. At least I couldn't construct an example where it would be a problem.
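A minimal sketch of the loosened recipe usage (EVLPhi stands in for the result of the VPEVLBasedIVPHIRecipe; names assumed, not taken from the patch):

// Previously the recipe only accepted the canonical IV:
auto *WideCanIV = new VPWidenCanonicalIVRecipe(Plan.getCanonicalIV());

// With the start-value restriction loosened, the same recipe can also
// widen the EVL-based counter:
auto *WideEVLIV = new VPWidenCanonicalIVRecipe(EVLPhi);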



// If there is a header mask, check if WideIV is canonical IV with other
// wide users. If that is the case, use it as HeaderMask's operand, so it
// can be used when lowering the recipe.
Collaborator:

Is this part needed to retain the current behavior, and does it deserve a TODO to simplify it?

Will this be simpler if the abstract HeaderMask is a "live-in" VPValue of the loop region, free of any operands, i.e., using neither the scalar canonical IV nor a wide canonical/WideIV?

if (isa<VPWidenIntOrFpInductionRecipe, VPScalarIVStepsRecipe>(Op))
return Op;
// Check if there is a wide canonical IV that can be re-used.
if (auto *CanIV = dyn_cast<VPCanonicalIVPHIRecipe>(Op)) {
Collaborator:

CanIV is now useless.

Is the check needed - does Op need to be a VPCanonicalIVPHIRecipe for the following search and replace to apply?

}
}
return HeaderMasks;
auto *IV = new VPWidenCanonicalIVRecipe(Op);
Collaborator:

Is Op expected to be scalar here, as in VPCanonicalIVPHIRecipe?

@@ -2688,14 +2694,13 @@ class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
/// A Recipe for widening the canonical induction variable of the vector loop.
class VPWidenCanonicalIVRecipe : public VPSingleDefRecipe {
public:
VPWidenCanonicalIVRecipe(VPCanonicalIVPHIRecipe *CanonicalIV)
: VPSingleDefRecipe(VPDef::VPWidenCanonicalIVSC, {CanonicalIV}) {}
VPWidenCanonicalIVRecipe(VPValue *Start)
Collaborator:

It's still a [Scalar]CanonicalIV, right? Start may be confused with the invariant value feeding an IV from the preheader.

@@ -1897,22 +1901,21 @@ void VPExpandSCEVRecipe::print(raw_ostream &O, const Twine &Indent,
#endif

void VPWidenCanonicalIVRecipe::execute(VPTransformState &State) {
Value *CanonicalIV = State.get(getOperand(0), 0, /*IsScalar*/ true);
Type *STy = CanonicalIV->getType();
Value *Start = State.get(getOperand(0), 0, /*IsScalar*/ true);
Collaborator:

[Scalar]CanonicalIV? Admittedly used to build VStart below.

for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part) {
Value *VStep = createStepForVF(Builder, STy, VF, Part);
if (VF.isVector()) {
VStep = Builder.CreateVectorSplat(VF, VStep);
VStep =
Builder.CreateAdd(VStep, Builder.CreateStepVector(VStep->getType()));
}
Value *CanonicalVectorIV = Builder.CreateAdd(VStart, VStep, "vec.iv");
State.set(this, CanonicalVectorIV, Part);
Value *Res = Builder.CreateAdd(VStart, VStep, "vec.iv");
Collaborator:

Retain name - this still generates a canonical wide IV?

@@ -552,6 +515,16 @@ createScalarIVSteps(VPlan &Plan, InductionDescriptor::InductionKind Kind,
return Steps;
}

/// Return the header mask recipe of the VPlan, if there is one.
static VPInstruction *getHeaderMask(VPlan &Plan) {
Collaborator:

Having loop region provide getHeaderMask() should save this search for a singleton VPValue.

@@ -1306,6 +1253,10 @@ void VPlanTransforms::addActiveLaneMask(
/// ...
///
void VPlanTransforms::addExplicitVectorLength(VPlan &Plan) {
VPValue *HeaderMask = getHeaderMask(Plan);
if (!HeaderMask)
Collaborator:

If addExplicitVectorLength() is called, a header mask is expected to exist - assert instead of early exit?

