[VPlan] Move addExplicitVectorLength to tryToBuildVPlanWithVPRecipes #166164

Conversation
@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

Stacked on #166158

Currently we convert a VPlan to an EVL tail-folded one after the VPlan is built and optimized, which doesn't match how we handle regular tail folding. This addresses a long-standing TODO by performing it much earlier in the pipeline, before any optimizations are run, and simultaneously splits out optimizeMaskToEVL into a separate pass that runs during VPlanTransforms::optimize. This way the two parts of EVL tail folding are separated into what is needed for correctness and what is purely an optimization.

Because we now optimize the VPlan after the EVL recipes are added, some simplifications, e.g. replacing scalar-steps when UF=1, kick in for the initial VPlan.

Fixes #153144

Patch is 99.68 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/166164.diff

20 Files Affected:
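As background for the test changes further down, here is a small, self-contained C++ sketch of what an EVL tail-folded loop does at runtime: each iteration requests an explicit vector length clamped to the remaining trip count (the role played by `llvm.experimental.get.vector.length` in the RISC-V tests), so no scalar epilogue or masked remainder loop is needed. This only illustrates the concept; it is not code from the patch, and the helper names are placeholders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for llvm.experimental.get.vector.length: how many elements to
// process this iteration, clamped to both the remaining count (AVL) and the
// maximum vector length.
static size_t getVectorLength(size_t AVL, size_t MaxVL) {
  return std::min(AVL, MaxVL);
}

// Scalar emulation of an EVL tail-folded vector loop computing dst[i] += src[i].
void addLoopEVL(std::vector<int> &Dst, const std::vector<int> &Src,
                size_t MaxVL) {
  assert(Dst.size() == Src.size());
  size_t AVL = Dst.size(); // application vector length still to process
  size_t IV = 0;           // EVL-based induction variable
  while (AVL > 0) {
    size_t EVL = getVectorLength(AVL, MaxVL);
    // In the vectorized code this inner loop is a single vp.load/vp.add/
    // vp.store guarded by the EVL operand; lanes past EVL are untouched.
    for (size_t Lane = 0; Lane < EVL; ++Lane)
      Dst[IV + Lane] += Src[IV + Lane];
    IV += EVL;  // the IV advances by EVL, not by the full vector length
    AVL -= EVL; // corresponds to avl.next in the generated IR
  }
}
```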
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index e5c3f17860103..dbd811d641883 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8216,10 +8216,6 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
VPlanTransforms::runPass(VPlanTransforms::truncateToMinimalBitwidths,
*Plan, CM.getMinimalBitwidths());
VPlanTransforms::runPass(VPlanTransforms::optimize, *Plan);
- // TODO: try to put it close to addActiveLaneMask().
- if (CM.foldTailWithEVL())
- VPlanTransforms::runPass(VPlanTransforms::addExplicitVectorLength,
- *Plan, CM.getMaxSafeElements());
assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));
}
@@ -8483,6 +8479,9 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
}
VPlanTransforms::optimizeInductionExitUsers(*Plan, IVEndValues, *PSE.getSE());
+ if (CM.foldTailWithEVL())
+ VPlanTransforms::addExplicitVectorLength(*Plan, CM.getMaxSafeElements());
+
assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
return Plan;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index e1da070a1fb7f..2ff1d30d00e41 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -4118,6 +4118,11 @@ class LLVM_ABI_FOR_TEST VPRegionBlock : public VPBlockBase {
return const_cast<VPRegionBlock *>(this)->getCanonicalIV();
}
+ VPEVLBasedIVPHIRecipe *getEVLBasedIV() {
+ return dyn_cast<VPEVLBasedIVPHIRecipe>(
+ std::next(getCanonicalIV()->getIterator()));
+ }
+
/// Return the type of the canonical IV for loop regions.
Type *getCanonicalIVType() { return getCanonicalIV()->getScalarType(); }
const Type *getCanonicalIVType() const {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 9d9bb14530539..43ab56f226c7a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -610,9 +610,11 @@ createScalarIVSteps(VPlan &Plan, InductionDescriptor::InductionKind Kind,
VPBuilder &Builder) {
VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
VPBasicBlock *HeaderVPBB = LoopRegion->getEntryBasicBlock();
- VPCanonicalIVPHIRecipe *CanonicalIV = LoopRegion->getCanonicalIV();
- VPSingleDefRecipe *BaseIV = Builder.createDerivedIV(
- Kind, FPBinOp, StartV, CanonicalIV, Step, "offset.idx");
+ VPValue *IV = LoopRegion->getCanonicalIV();
+ if (auto *EVLIV = LoopRegion->getEVLBasedIV())
+ IV = EVLIV;
+ VPSingleDefRecipe *BaseIV =
+ Builder.createDerivedIV(Kind, FPBinOp, StartV, IV, Step, "offset.idx");
// Truncate base induction if needed.
VPTypeAnalysis TypeInfo(Plan);
@@ -2327,6 +2329,7 @@ void VPlanTransforms::optimize(VPlan &Plan) {
runPass(removeRedundantExpandSCEVRecipes, Plan);
runPass(simplifyRecipes, Plan);
runPass(removeBranchOnConst, Plan);
+ runPass(optimizeMasksToEVL, Plan);
runPass(removeDeadRecipes, Plan);
runPass(createAndOptimizeReplicateRegions, Plan);
@@ -2617,8 +2620,40 @@ static VPRecipeBase *optimizeMaskToEVL(VPValue *HeaderMask,
return nullptr;
}
-/// Replace recipes with their EVL variants.
-static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
+void VPlanTransforms::optimizeMasksToEVL(VPlan &Plan) {
+ // Find the EVL-based header mask if it exists: icmp ult step-vector, EVL
+ VPInstruction *HeaderMask = nullptr;
+ for (VPRecipeBase &R : *Plan.getVectorLoopRegion()->getEntryBasicBlock()) {
+ if (match(&R, m_ICmp(m_VPInstruction<VPInstruction::StepVector>(),
+ m_EVL(m_VPValue())))) {
+ HeaderMask = cast<VPInstruction>(&R);
+ break;
+ }
+ }
+ if (!HeaderMask)
+ return;
+
+ VPValue *EVL = HeaderMask->getOperand(1);
+
+ VPTypeAnalysis TypeInfo(Plan);
+
+ for (VPUser *U : collectUsersRecursively(HeaderMask)) {
+ VPRecipeBase *R = cast<VPRecipeBase>(U);
+ if (auto *NewR = optimizeMaskToEVL(HeaderMask, *R, TypeInfo, *EVL)) {
+ NewR->insertBefore(R);
+ for (auto [Old, New] :
+ zip_equal(R->definedValues(), NewR->definedValues()))
+ Old->replaceAllUsesWith(New);
+ // Erase dead stores, the rest will be removed by removeDeadRecipes.
+ if (R->getNumDefinedValues() == 0)
+ R->eraseFromParent();
+ }
+ }
+}
+
+/// After replacing the IV with a EVL-based IV, fixup recipes that use VF to use
+/// the EVL instead to avoid incorrect updates on the penultimate iteration.
+static void fixupVFUsersForEVL(VPlan &Plan, VPValue &EVL) {
VPTypeAnalysis TypeInfo(Plan);
VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
VPBasicBlock *Header = LoopRegion->getEntryBasicBlock();
@@ -2646,10 +2681,6 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
return isa<VPWidenPointerInductionRecipe>(U);
});
- // Defer erasing recipes till the end so that we don't invalidate the
- // VPTypeAnalysis cache.
- SmallVector<VPRecipeBase *> ToErase;
-
// Create a scalar phi to track the previous EVL if fixed-order recurrence is
// contained.
bool ContainsFORs =
@@ -2683,7 +2714,6 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
TypeInfo.inferScalarType(R.getVPSingleValue()), R.getDebugLoc());
VPSplice->insertBefore(&R);
R.getVPSingleValue()->replaceAllUsesWith(VPSplice);
- ToErase.push_back(&R);
}
}
}
@@ -2704,43 +2734,6 @@ static void transformRecipestoEVLRecipes(VPlan &Plan, VPValue &EVL) {
CmpInst::ICMP_ULT,
Builder.createNaryOp(VPInstruction::StepVector, {}, EVLType), &EVL);
HeaderMask->replaceAllUsesWith(EVLMask);
- ToErase.push_back(HeaderMask->getDefiningRecipe());
-
- // Try to optimize header mask recipes away to their EVL variants.
- // TODO: Split optimizeMaskToEVL out and move into
- // VPlanTransforms::optimize. transformRecipestoEVLRecipes should be run in
- // tryToBuildVPlanWithVPRecipes beforehand.
- for (VPUser *U : collectUsersRecursively(EVLMask)) {
- auto *CurRecipe = cast<VPRecipeBase>(U);
- VPRecipeBase *EVLRecipe =
- optimizeMaskToEVL(EVLMask, *CurRecipe, TypeInfo, EVL);
- if (!EVLRecipe)
- continue;
-
- unsigned NumDefVal = EVLRecipe->getNumDefinedValues();
- assert(NumDefVal == CurRecipe->getNumDefinedValues() &&
- "New recipe must define the same number of values as the "
- "original.");
- EVLRecipe->insertBefore(CurRecipe);
- if (isa<VPSingleDefRecipe, VPWidenLoadEVLRecipe, VPInterleaveEVLRecipe>(
- EVLRecipe)) {
- for (unsigned I = 0; I < NumDefVal; ++I) {
- VPValue *CurVPV = CurRecipe->getVPValue(I);
- CurVPV->replaceAllUsesWith(EVLRecipe->getVPValue(I));
- }
- }
- ToErase.push_back(CurRecipe);
- }
- // Remove dead EVL mask.
- if (EVLMask->getNumUsers() == 0)
- ToErase.push_back(EVLMask->getDefiningRecipe());
-
- for (VPRecipeBase *R : reverse(ToErase)) {
- SmallVector<VPValue *> PossiblyDead(R->operands());
- R->eraseFromParent();
- for (VPValue *Op : PossiblyDead)
- recursivelyDeleteDeadRecipes(Op);
- }
}
/// Add a VPEVLBasedIVPHIRecipe and related recipes to \p Plan and
@@ -2838,7 +2831,7 @@ void VPlanTransforms::addExplicitVectorLength(
DebugLoc::getCompilerGenerated(), "avl.next");
AVLPhi->addOperand(NextAVL);
- transformRecipestoEVLRecipes(Plan, *VPEVL);
+ fixupVFUsersForEVL(Plan, *VPEVL);
// Replace all uses of VPCanonicalIVPHIRecipe by
// VPEVLBasedIVPHIRecipe except for the canonical IV increment.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index b28559b620e13..f474f61c5d8d3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -377,6 +377,17 @@ struct VPlanTransforms {
/// users in the original exit block using the VPIRInstruction wrapping to the
/// LCSSA phi.
static void addExitUsersForFirstOrderRecurrences(VPlan &Plan, VFRange &Range);
+
+ /// If the loop is EVL tail folded, try and optimize any recipes that use a
+ /// EVL based header mask to a VP intrinsic, e.g:
+ ///
+ /// %mask = icmp step-vector, EVL
+ /// %load = load %ptr, %mask
+ ///
+ /// ->
+ ///
+ /// %load = vp.load %ptr, EVL
+ static void optimizeMasksToEVL(VPlan &Plan);
};
} // namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index 91734a10cb2c8..fc80b022ffad9 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -310,6 +310,12 @@ bool VPlanVerifier::verifyVPBasicBlock(const VPBasicBlock *VPBB) {
break;
}
}
+ if (const auto *EVLPhi = dyn_cast<VPEVLBasedIVPHIRecipe>(&R)) {
+ if (!isa<VPCanonicalIVPHIRecipe>(std::prev(EVLPhi->getIterator()))) {
+ errs() << "EVL-based IV is not immediately after canonical IV\n";
+ return false;
+ }
+ }
}
auto *IRBB = dyn_cast<VPIRBasicBlock>(VPBB);
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/dead-ops-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/dead-ops-cost.ll
index f25b86d3b20c2..183bebe818f7d 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/dead-ops-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/dead-ops-cost.ll
@@ -361,12 +361,12 @@ define void @gather_interleave_group_with_dead_insert_pos(i64 %N, ptr noalias %s
; CHECK-NEXT: [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[AVL:%.*]] = phi i64 [ [[TMP2]], %[[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[EVL_BASED_IV]], 2
; CHECK-NEXT: [[TMP10:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
; CHECK-NEXT: [[TMP16:%.*]] = zext i32 [[TMP10]] to i64
; CHECK-NEXT: [[TMP12:%.*]] = mul i64 2, [[TMP16]]
; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP12]], i64 0
; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[EVL_BASED_IV]], 2
; CHECK-NEXT: [[TMP22:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[OFFSET_IDX]]
; CHECK-NEXT: [[INTERLEAVE_EVL:%.*]] = mul nuw nsw i32 [[TMP10]], 2
; CHECK-NEXT: [[WIDE_MASKED_VEC:%.*]] = call <vscale x 8 x i8> @llvm.vp.load.nxv8i8.p0(ptr align 1 [[TMP22]], <vscale x 8 x i1> splat (i1 true), i32 [[INTERLEAVE_EVL]])
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/divrem.ll b/llvm/test/Transforms/LoopVectorize/RISCV/divrem.ll
index 01b4502308c95..214f8068c8043 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/divrem.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/divrem.ll
@@ -270,6 +270,7 @@ define void @predicated_udiv(ptr noalias nocapture %a, i64 %v, i64 %n) {
; CHECK: vector.ph:
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[V:%.*]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP7:%.*]] = call <vscale x 2 x i32> @llvm.stepvector.nxv2i32()
; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <vscale x 2 x i64> [[BROADCAST_SPLAT]], zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
@@ -278,7 +279,6 @@ define void @predicated_udiv(ptr noalias nocapture %a, i64 %v, i64 %n) {
; CHECK-NEXT: [[TMP12:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 2, i1 true)
; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i32> poison, i32 [[TMP12]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP7:%.*]] = call <vscale x 2 x i32> @llvm.stepvector.nxv2i32()
; CHECK-NEXT: [[TMP15:%.*]] = icmp ult <vscale x 2 x i32> [[TMP7]], [[BROADCAST_SPLAT2]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 [[TMP8]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP12]])
@@ -351,6 +351,7 @@ define void @predicated_sdiv(ptr noalias nocapture %a, i64 %v, i64 %n) {
; CHECK: vector.ph:
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[V:%.*]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP7:%.*]] = call <vscale x 2 x i32> @llvm.stepvector.nxv2i32()
; CHECK-NEXT: [[TMP6:%.*]] = icmp ne <vscale x 2 x i64> [[BROADCAST_SPLAT]], zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
@@ -359,7 +360,6 @@ define void @predicated_sdiv(ptr noalias nocapture %a, i64 %v, i64 %n) {
; CHECK-NEXT: [[TMP12:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 2, i1 true)
; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i32> poison, i32 [[TMP12]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP7:%.*]] = call <vscale x 2 x i32> @llvm.stepvector.nxv2i32()
; CHECK-NEXT: [[TMP15:%.*]] = icmp ult <vscale x 2 x i32> [[TMP7]], [[BROADCAST_SPLAT2]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 [[TMP8]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP12]])
@@ -570,6 +570,7 @@ define void @predicated_sdiv_by_minus_one(ptr noalias nocapture %a, i64 %n) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -577,7 +578,6 @@ define void @predicated_sdiv_by_minus_one(ptr noalias nocapture %a, i64 %n) {
; CHECK-NEXT: [[TMP12:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 16, i1 true)
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[TMP12]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
; CHECK-NEXT: [[TMP15:%.*]] = icmp ult <vscale x 16 x i32> [[TMP6]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[A:%.*]], i64 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 [[TMP7]], <vscale x 16 x i1> splat (i1 true), i32 [[TMP12]])
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-bf16.ll b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-bf16.ll
index 097f05d222cf6..52f9ef2805bff 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-bf16.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-bf16.ll
@@ -5,7 +5,7 @@ define void @add(ptr noalias nocapture readonly %src1, ptr noalias nocapture rea
; CHECK-LABEL: add
; CHECK: LV(REG): VF = vscale x 4
; CHECK-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 4 registers
; CHECK-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-f16.ll b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-f16.ll
index 8bbfdf39a0624..100c2d123c0ba 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-f16.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-f16.ll
@@ -6,14 +6,14 @@ define void @add(ptr noalias nocapture readonly %src1, ptr noalias nocapture rea
; ZVFH-LABEL: add
; ZVFH: LV(REG): VF = vscale x 4
; ZVFH-NEXT: LV(REG): Found max usage: 2 item
-; ZVFH-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; ZVFH-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; ZVFH-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 2 registers
; ZVFH-NEXT: LV(REG): Found invariant usage: 1 item
; ZVFH-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
; ZVFHMIN-LABEL: add
; ZVFHMIN: LV(REG): VF = vscale x 4
; ZVFHMIN-NEXT: LV(REG): Found max usage: 2 item
-; ZVFHMIN-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; ZVFHMIN-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; ZVFHMIN-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 4 registers
; ZVFHMIN-NEXT: LV(REG): Found invariant usage: 1 item
; ZVFHMIN-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-maxbandwidth.ll b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-maxbandwidth.ll
index 6bb0d64314d3e..fbe28b3bf2bc4 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-maxbandwidth.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage-maxbandwidth.ll
@@ -4,7 +4,7 @@
define i32 @dotp(ptr %a, ptr %b) {
; CHECK-REGS-VP: LV(REG): VF = vscale x 16
; CHECK-REGS-VP-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-REGS-VP-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-REGS-VP-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK-REGS-VP-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 24 registers
; CHECK-REGS-VP-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-REGS-VP-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage.ll b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage.ll
index 99139da67bb78..591df1abe06d2 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/reg-usage.ll
@@ -31,28 +31,28 @@ define void @add(ptr noalias nocapture readonly %src1, ptr noalias nocapture rea
; CHECK-LMUL1-LABEL: add
; CHECK-LMUL1: LV(REG): VF = vscale x 2
; CHECK-LMUL1-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 2 registers
; CHECK-LMUL1-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
; CHECK-LMUL2-LABEL: add
; CHECK-LMUL2: LV(REG): VF = vscale x 4
; CHECK-LMUL2-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 4 registers
; CHECK-LMUL2-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
; CHECK-LMUL4-LABEL: add
; CHECK-LMUL4: LV(REG): VF = vscale x 8
; CHECK-LMUL4-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-LMUL4-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-LMUL4-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK...
[truncated]
@@ -31,28 +31,28 @@ define void @add(ptr noalias nocapture readonly %src1, ptr noalias nocapture rea
; CHECK-LMUL1-LABEL: add
; CHECK-LMUL1: LV(REG): VF = vscale x 2
; CHECK-LMUL1-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 2 registers
; CHECK-LMUL1-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-LMUL1-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
; CHECK-LMUL2-LABEL: add
; CHECK-LMUL2: LV(REG): VF = vscale x 4
; CHECK-LMUL2-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::VRRC, 4 registers
; CHECK-LMUL2-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-LMUL2-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
; CHECK-LMUL4-LABEL: add
; CHECK-LMUL4: LV(REG): VF = vscale x 8
; CHECK-LMUL4-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-LMUL4-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 6 registers
+; CHECK-LMUL4-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 5 registers
; CHECK...
[truncated]
| ; IF-EVL-OUTLOOP-NEXT: EMIT-SCALAR vp<[[EVL:%.+]]> = EXPLICIT-VECTOR-LENGTH vp<[[AVL]]>
| ; IF-EVL-OUTLOOP-NEXT: vp<[[ST:%[0-9]+]]> = SCALAR-STEPS vp<[[EVL_PHI]]>, ir<1>, vp<[[EVL]]>
| ; IF-EVL-OUTLOOP-NEXT: CLONE ir<[[GEP1:%.+]]> = getelementptr inbounds ir<%a>, vp<[[ST]]>
| ; IF-EVL-OUTLOOP-NEXT: vp<[[PTR1:%[0-9]+]]> = vector-pointer ir<[[GEP1]]>
Do you know why we did not remove the redundant vector-pointers before? It is not immediately clear how this is a consequence of moving the transform.
We did remove the redundant vector pointers before, just during the second call to simplifyRecipes right before execution, which is after the initial VPlan is debug printed.
This change also means we now do it before the cost model, IIUC, so it's not NFC.
| VPValue *IV = LoopRegion->getCanonicalIV();
| if (auto *EVLIV = LoopRegion->getEVLBasedIV())
|   IV = EVLIV;
Hmm, so now potentially more transformations need to handle EVL based IVs?
Yes, I went through and audited all the uses of getCanonicalIV, and createScalarIVSteps/legalizeAndOptimizeInductions is the only place that is now exposed to EVL based IVs.
FWIW there's a bunch of other transformations that already deal with the canonical IV after the EVL IV is added, but I think they're all correct:
- narrowInterleaveGroups changes the canonical IV but it bails if it sees any non-canonical-IV PHI in the vector region
- preparePlanForEpilogueVectorLoop calls getCanonicalIV but we won't ever have an epilogue with EVL tail folding
Actually I’d like to propose applying the EVL transform (DataWithEVL) to loops with an epilogue, so the remainder drops from (ScalarTC % VF) to one. I don’t yet have data to quantify the benefit or a concrete proposal.
Reducing the iterations in the scalar epilogue just sounds like a good idea.
Ideally I hope all other passes can be EVL/tail-folding-style agnostic, but that implies we have to run addExplicitVectorLength and optimizations post loop-region dissolution.
It is just a rough idea; I haven’t evaluated the trade-offs yet.
| VPEVLBasedIVPHIRecipe *getEVLBasedIV() {
|   return dyn_cast<VPEVLBasedIVPHIRecipe>(
|       std::next(getCanonicalIV()->getIterator()));
| }
Not sure this should be exposed in VPRegionBlock for a single user.
Removed in f4e52c5
| // Find the EVL-based header mask if it exists: icmp ult step-vector, EVL
| VPInstruction *HeaderMask = nullptr;
| for (VPRecipeBase &R : *Plan.getVectorLoopRegion()->getEntryBasicBlock()) {
|   if (match(&R, m_ICmp(m_VPInstruction<VPInstruction::StepVector>(),
|                 m_EVL(m_VPValue())))) {
|     HeaderMask = cast<VPInstruction>(&R);
|     break;
|   }
| }
| if (!HeaderMask)
|   return;
|
| VPValue *EVL = HeaderMask->getOperand(1);
Do you have a plan to create a helper function like static VPValue *findEVLMask(VPlan &Plan)?
If not, could we change this:
| // Find the EVL-based header mask if it exists: icmp ult step-vector, EVL
| VPInstruction *HeaderMask = nullptr;
| for (VPRecipeBase &R : *Plan.getVectorLoopRegion()->getEntryBasicBlock()) {
|   if (match(&R, m_ICmp(m_VPInstruction<VPInstruction::StepVector>(),
|                 m_EVL(m_VPValue())))) {
|     HeaderMask = cast<VPInstruction>(&R);
|     break;
|   }
| }
| if (!HeaderMask)
|   return;
| VPValue *EVL = HeaderMask->getOperand(1);
to this:
| // Find the EVL-based header mask if it exists: icmp ult step-vector, EVL
| VPInstruction *HeaderMask = nullptr;
| VPValue *EVL;
| for (VPRecipeBase &R : *Plan.getVectorLoopRegion()->getEntryBasicBlock()) {
|   if (match(&R, m_ICmp(m_VPInstruction<VPInstruction::StepVector>(),
|                 m_EVL(m_VPValue(EVL))))) {
|     HeaderMask = cast<VPInstruction>(&R);
|     break;
|   }
| }
| if (!HeaderMask)
|   return;
I wasn't planning on creating a helper function. I think the change you suggested matches the AVL though, not the EVL. I think we need something like bind_and_match_ty from PatternMatch.h, where we can both match and capture a value.
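For illustration, here is a minimal, self-contained sketch of that kind of "match and capture" combinator; the Value, KindPattern, and m_Capture names are toy stand-ins, not VPlan's or PatternMatch.h's real API:

#include <cassert>

struct Value { int Kind; };

// Toy pattern that matches any Value of a particular kind.
struct KindPattern {
  int Kind;
  bool match(const Value *V) const { return V && V->Kind == Kind; }
};

// The combinator: try the inner pattern, and bind the value if it matched.
template <typename PatternT> struct BindAndMatch {
  const Value *&Bound;
  PatternT Inner;
  bool match(const Value *V) const {
    if (!Inner.match(V))
      return false;
    Bound = V;
    return true;
  }
};

template <typename PatternT>
BindAndMatch<PatternT> m_Capture(const Value *&Bound, PatternT Inner) {
  return {Bound, Inner};
}

int main() {
  Value EVL{42};
  const Value *Captured = nullptr;
  // In spirit like m_EVL(m_VPValue(Captured)): check the operand's kind and
  // hand the matched value back to the caller in one step.
  assert(m_Capture(Captured, KindPattern{42}).match(&EVL));
  assert(Captured == &EVL);
}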
BTW, do we need to update vputils::isHeaderMask?
I don't think so; isHeaderMask is only used by findHeaderMask, which in turn is only used by transformRecipestoEVLRecipes and addActiveLaneMask.
Stacked on llvm#166158

Currently we convert a VPlan to an EVL tail folded one after the VPlan is built and optimized, which doesn't match how we handle regular tail folding. This addresses a long-standing TODO by performing it much earlier in the pipeline, before any optimizations are run, and simultaneously splits out optimizeMaskToEVL into a separate pass to be run during VPlanTransforms::optimize. This way the two parts of EVL tail folding are separated into those needed for correctness and those that are an optimization.

- We don't need to remove the old recipes ourselves anymore and can leave it to removeDeadRecipes
- createScalarIVSteps needs to be updated to use the EVL based IV if it exists, so a helper method was added to VPlan to extract it
- VPlanVerifier was updated to check that the EVL based IV always immediately follows the canonical IV

Because we now optimize the VPlan after the EVL stuff is added, some simplifications, e.g. replacing a scalar-steps when UF=1, kick in for the initial VPlan.

Fixes llvm#153144
Force-pushed from f4e52c5 to 6ac7780
| VPHeaderPHIRecipe *IV = LoopRegion->getCanonicalIV();
| if (auto *EVLIV =
|         dyn_cast<VPEVLBasedIVPHIRecipe>(std::next(IV->getIterator())))
Do you think we can avoid this by deferring canonical IV replacement until canonicalizeEVLLoops?
I don't think that's a good idea because it means we would have an incorrect VPlan throughout the optimization pipeline.
Part of the motivation for this PR is to have everything use the EVL based IV as soon as it's added, so we don't accidentally have recipes using the canonical IV and producing incorrect results on the penultimate iteration.
We could probably add a method to VPRegionBlock that abstracts over the EVL or canonical IV, like getEffectiveIV, but that probably requires more discussion so I'd like to leave it to another PR if possible.
Why would it be incorrect?
If the movement means every optimization must be careful about whether to use the canonical IV or the EVL based IV, and adding new users to the canonical IV could cause incorrect transformations, then I am not sure that is the best direction forward.
Why would it be incorrect?
Once we change the header mask to an EVL based mask (a.k.a. variable stepping), if a widened recipe still uses the canonical IV it will operate on the wrong lanes in the penultimate iteration.
So the conversion of the header mask to the EVL based mask needs to be done in tandem with replacing all uses of the canonical IV with the EVL based IV.
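To make that concrete, a small hypothetical illustration in plain C++ (not LLVM code): the numbers are made up, and the even remainder split is just one behaviour a target's get.vector.length is allowed to pick.

#include <cstdio>

int main() {
  const unsigned VF = 4, TC = 6;
  unsigned CanonicalIV = 0; // advances by VF, as a non-EVL-aware recipe assumes
  unsigned EVLIV = 0;       // advances by whatever EVL was actually returned
  for (unsigned AVL = TC; AVL > 0;) {
    // Hypothetical EVL: split the remaining elements evenly over the last
    // two iterations, so the penultimate iteration gets EVL < VF.
    unsigned EVL = AVL <= VF ? AVL : (AVL + 1) / 2;
    if (EVL > VF)
      EVL = VF;
    std::printf("canonical IV %u, EVL-based IV %u, EVL %u\n", CanonicalIV,
                EVLIV, EVL);
    CanonicalIV += VF;
    EVLIV += EVL;
    AVL -= EVL;
  }
  // For TC=6, VF=4 this prints "0, 0, 3" then "4, 3, 3": the penultimate
  // iteration only processed 3 lanes, so a widened recipe still addressed
  // via the canonical IV starts the final iteration at lane 4 instead of 3.
  return 0;
}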
If the movement means every optimization must be careful whether to use the canonical IV or EVL based IV, and adding new users to the canonical IV could cause incorrect transformations
This is already something we need to be careful about today, even without this patch. E.g. narrowInterleaveGroups uses the canonical IV and runs after addExplicitVectorLength. It just so happens to bail when it sees any non-canonical IV phis at the moment, but in the future we presumably will need to handle EVL based IVs etc.
It crossed my mind that maybe we should just call addExplicitVectorLength as late as possible, but I can see two potential issues:
- If we move past the point where we compute the cost then the cost would be inaccurate, because we would no longer see that the header mask is optimised away/we use VP intrinsics
- We miss out on any simplifications that are exposed via the EVL transform
In my opinion it's simplest to have the EVL based loop early on, instead of having a mix of some transforms being EVL-aware and some unaware.
We should probably also audit users of getCanonicalIV and make sure they're using some API that returns either the canonical IV or EVL based IV.
Hope that explanation makes sense; I'm open to other thoughts and suggestions.
So that llvm#149706 doesn't need to worry about EVL recipes
I've moved optimizeMasksToEVL later so it should now be the last optimization before costing, in 5344cc9. I noticed that in #149706 narrowInterleaveGroups will be moved earlier, so making sure optimizeMasksToEVL runs after it means it shouldn't need to worry about EVL widened recipes (but it still needs to account for EVL based IVs as before).