-
Notifications
You must be signed in to change notification settings - Fork 11.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VPlan] Consistently use (Part, 0) for first lane scalar values #80271
Conversation
A VPInstruction only has its first lane used if all users use its first lane only. Use vputils::onlyFirstLaneUsed to continue checking the recipe's users to handle more cases. Besides allowing additional introduction of scalar steps when interleaving in some cases, this also enables using an Add VPInstruction to model the increment.
At the moment, some VPInstructions create only a single scalar value, but use VPTransformatState's 'vector' storage for this value. Those values are effectively uniform-per-VF (or in some cases uniform-across-VF-and-UF). Using the vector/per-part storage doesn't interact well with other recipes, that more accurately using (Part, Lane) to look up scalar values and prevents VPInstructions creating scalars from interacting with other recipes working with scalars. This PR tries to unify handling of scalars by using (Part, 0) for scalar values where only the first lane is demanded. This allows using VPInstructions with other recipes like VPScalarCastRecipe and is also needed when using VPInstructions in more cases otuside the vector loop region to generate scalars. The patch is still a bit rough around the edges, but hopefully serves as start for a discussion how to model more scalar recipes. A potential alternative would be to split off the opcodes that generate scalars only to a separate recipe.
@llvm/pr-subscribers-llvm-transforms Author: Florian Hahn (fhahn) ChangesAt the moment, some VPInstructions create only a single scalar value, but use VPTransformatState's 'vector' storage for this value. Those values are effectively uniform-per-VF (or in some cases uniform-across-VF-and-UF). Using the vector/per-part storage doesn't interact well with other recipes, that more accurately using (Part, Lane) to look up scalar values and prevents VPInstructions creating scalars from interacting with other recipes working with scalars. This PR tries to unify handling of scalars by using (Part, 0) for scalar values where only the first lane is demanded. This allows using VPInstructions with other recipes like VPScalarCastRecipe and is also needed when using VPInstructions in more cases otuside the vector loop region to generate scalars. The patch is still a bit rough around the edges, but hopefully serves as start for a discussion how to model more scalar recipes. A potential alternative would be to split off the opcodes that generate scalars only to a separate recipe. Patch is 22.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/80271.diff 5 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 1ca2cfef447f6..7ed07fe5f413a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -213,8 +213,13 @@ VPBasicBlock::iterator VPBasicBlock::getFirstNonPhi() {
}
Value *VPTransformState::get(VPValue *Def, const VPIteration &Instance) {
- if (Def->isLiveIn())
- return Def->getLiveInIRValue();
+ if (Def->isLiveIn()) {
+ if (Value *V = Def->getLiveInIRValue())
+ return V;
+ if (hasScalarValue(Def, VPIteration(0, 0))) {
+ return Data.PerPartScalars[Def][0][0];
+ }
+ }
if (hasScalarValue(Def, Instance)) {
return Data
@@ -794,7 +799,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
// FIXME: Model VF * UF computation completely in VPlan.
State.set(&VFxUF,
createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF),
- 0);
+ VPIteration(0, 0));
// When vectorizing the epilogue loop, the canonical induction start value
// needs to be changed from zero to the value after the main vector loop.
@@ -883,8 +888,11 @@ void VPlan::execute(VPTransformState *State) {
for (unsigned Part = 0; Part < LastPartForNewPhi; ++Part) {
Value *Phi = State->get(PhiR, Part);
- Value *Val = State->get(PhiR->getBackedgeValue(),
- SinglePartNeeded ? State->UF - 1 : Part);
+ Value *Val =
+ isa<VPCanonicalIVPHIRecipe>(PhiR)
+ ? State->get(PhiR->getBackedgeValue(), VPIteration(Part, 0))
+ : State->get(PhiR->getBackedgeValue(),
+ SinglePartNeeded ? State->UF - 1 : Part);
cast<PHINode>(Phi)->addIncoming(Val, VectorLatchBB);
}
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 30dc521947b3b..97035146a2f4d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1388,6 +1388,13 @@ class VPScalarCastRecipe : public VPSingleDefRecipe {
/// Returns the result type of the cast.
Type *getResultType() const { return ResultTy; }
+
+ bool onlyFirstLaneUsed(const VPValue *Op) const override {
+ // At the moment, only scalar codegen is implemented.
+ assert(is_contained(operands(), Op) &&
+ "Op must be an operand of the recipe");
+ return true;
+ }
};
/// A recipe for widening Call instructions.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 21b8d1eb77bf9..77f2cf899b085 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -279,11 +279,17 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
Builder.SetCurrentDebugLocation(getDebugLoc());
if (Instruction::isBinaryOp(getOpcode())) {
+ bool OnlyFirstLaneUsed = vputils::onlyFirstLaneUsed(this);
if (Part != 0 && vputils::onlyFirstPartUsed(this))
- return State.get(this, 0);
-
- Value *A = State.get(getOperand(0), Part);
- Value *B = State.get(getOperand(1), Part);
+ return OnlyFirstLaneUsed ? State.get(this, VPIteration(0, 0))
+ : State.get(this, 0);
+
+ Value *A = OnlyFirstLaneUsed
+ ? State.get(getOperand(0), VPIteration(Part, 0))
+ : State.get(getOperand(0), Part);
+ Value *B = OnlyFirstLaneUsed
+ ? State.get(getOperand(1), VPIteration(Part, 0))
+ : State.get(getOperand(1), Part);
auto *Res =
Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(), A, B, Name);
if (auto *I = dyn_cast<Instruction>(Res))
@@ -385,8 +391,8 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
if (Part != 0)
return nullptr;
// First create the compare.
- Value *IV = State.get(getOperand(0), Part);
- Value *TC = State.get(getOperand(1), Part);
+ Value *IV = State.get(getOperand(0), VPIteration(0, 0));
+ Value *TC = State.get(getOperand(1), VPIteration(0, 0));
Value *Cond = Builder.CreateICmpEQ(IV, TC);
// Now create the branch.
@@ -407,7 +413,7 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
}
case VPInstruction::ComputeReductionResult: {
if (Part != 0)
- return State.get(this, 0);
+ return State.get(this, VPIteration(0, 0));
// FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
// and will be removed by breaking up the recipe further.
@@ -512,7 +518,17 @@ void VPInstruction::execute(VPTransformState &State) {
if (!hasResult())
continue;
assert(GeneratedValue && "generateInstruction must produce a value");
- State.set(this, GeneratedValue, Part);
+ if (GeneratedValue->getType()->isVectorTy())
+ State.set(this, GeneratedValue, Part);
+ else {
+ if (getOpcode() == VPInstruction::ComputeReductionResult) {
+ State.set(this, GeneratedValue, VPIteration(Part, 0));
+ } else {
+ assert((State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) &&
+ "scalar value but not only first lane used");
+ State.set(this, GeneratedValue, VPIteration(Part, 0));
+ }
+ }
}
}
bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
@@ -525,11 +541,13 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
return false;
case Instruction::ICmp:
return vputils::onlyFirstLaneUsed(this);
+ case VPInstruction::ComputeReductionResult:
+ return true;
case VPInstruction::ActiveLaneMask:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::BranchOnCount:
- return getOperand(0) == Op;
+ return true;
};
llvm_unreachable("switch should return");
}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll b/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
index 1e79c3e1e8dc2..2ef55742ffc0b 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
@@ -99,7 +99,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
; TFA_INTERLEAVE-NEXT: br label [[VECTOR_BODY:%.*]]
; TFA_INTERLEAVE: vector.body:
-; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[TMP9:%.*]] = getelementptr i64, ptr [[B:%.*]], i64 [[INDEX]]
@@ -116,8 +116,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]]
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]])
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
; TFA_INTERLEAVE-NEXT: [[TMP20:%.*]] = mul i64 [[TMP19]], 2
; TFA_INTERLEAVE-NEXT: [[TMP21:%.*]] = add i64 [[INDEX_NEXT]], [[TMP20]]
@@ -254,7 +253,7 @@ define void @test_if_then(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
; TFA_INTERLEAVE-NEXT: br label [[VECTOR_BODY:%.*]]
; TFA_INTERLEAVE: vector.body:
-; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT5:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT6:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[INDEX]]
@@ -283,8 +282,7 @@ define void @test_if_then(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[TMP28:%.*]] = getelementptr inbounds i64, ptr [[TMP25]], i64 [[TMP27]]
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI]], ptr [[TMP25]], i32 8, <vscale x 2 x i1> [[TMP23]])
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI4]], ptr [[TMP28]], i32 8, <vscale x 2 x i1> [[TMP24]])
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT5]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[TMP29:%.*]] = call i64 @llvm.vscale.i64()
; TFA_INTERLEAVE-NEXT: [[TMP30:%.*]] = mul i64 [[TMP29]], 2
; TFA_INTERLEAVE-NEXT: [[TMP31:%.*]] = add i64 [[INDEX_NEXT]], [[TMP30]]
@@ -437,7 +435,7 @@ define void @test_widen_if_then_else(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
; TFA_INTERLEAVE-NEXT: br label [[VECTOR_BODY:%.*]]
; TFA_INTERLEAVE: vector.body:
-; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT5:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT6:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[INDEX]]
@@ -468,8 +466,7 @@ define void @test_widen_if_then_else(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[TMP30:%.*]] = getelementptr inbounds i64, ptr [[TMP27]], i64 [[TMP29]]
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI]], ptr [[TMP27]], i32 8, <vscale x 2 x i1> [[TMP25]])
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI4]], ptr [[TMP30]], i32 8, <vscale x 2 x i1> [[TMP26]])
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT5]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[TMP31:%.*]] = call i64 @llvm.vscale.i64()
; TFA_INTERLEAVE-NEXT: [[TMP32:%.*]] = mul i64 [[TMP31]], 2
; TFA_INTERLEAVE-NEXT: [[TMP33:%.*]] = add i64 [[INDEX_NEXT]], [[TMP32]]
@@ -771,7 +768,7 @@ define void @test_widen_optmask(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
; TFA_INTERLEAVE-NEXT: br label [[VECTOR_BODY:%.*]]
; TFA_INTERLEAVE: vector.body:
-; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[TMP9:%.*]] = getelementptr i64, ptr [[B:%.*]], i64 [[INDEX]]
@@ -788,8 +785,7 @@ define void @test_widen_optmask(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]]
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]])
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
; TFA_INTERLEAVE-NEXT: [[TMP20:%.*]] = mul i64 [[TMP19]], 2
; TFA_INTERLEAVE-NEXT: [[TMP21:%.*]] = add i64 [[INDEX_NEXT]], [[TMP20]]
@@ -970,7 +966,7 @@ define double @test_widen_fmuladd_and_call(ptr noalias %a, ptr readnone %b, doub
; TFA_INTERLEAVE-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x double> [[BROADCAST_SPLATINSERT]], <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
; TFA_INTERLEAVE-NEXT: br label [[VECTOR_BODY:%.*]]
; TFA_INTERLEAVE: vector.body:
-; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[VECTOR_BODY]] ]
; TFA_INTERLEAVE-NEXT: [[VEC_PHI:%.*]] = phi double [ 0.000000e+00, [[ENTRY]] ], [ [[TMP26:%.*]], [[VECTOR_BODY]] ]
@@ -996,8 +992,7 @@ define double @test_widen_fmuladd_and_call(ptr noalias %a, ptr readnone %b, doub
; TFA_INTERLEAVE-NEXT: [[TMP24:%.*]] = call double @llvm.vector.reduce.fadd.nxv2f64(double [[VEC_PHI]], <vscale x 2 x double> [[TMP23]])
; TFA_INTERLEAVE-NEXT: [[TMP25:%.*]] = select <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 2 x double> [[TMP14]], <vscale x 2 x double> shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double -0.000000e+00, i64 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer)
; TFA_INTERLEAVE-NEXT: [[TMP26]] = call double @llvm.vector.reduce.fadd.nxv2f64(double [[TMP24]], <vscale x 2 x double> [[TMP25]])
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[TMP27:%.*]] = call i64 @llvm.vscale.i64()
; TFA_INTERLEAVE-NEXT: [[TMP28:%.*]] = mul i64 [[TMP27]], 2
; TFA_INTERLEAVE-NEXT: [[TMP29:%.*]] = add i64 [[INDEX_NEXT]], [[TMP28]]
diff --git a/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
index b451d4b4e5462..f0a5fc1592532 100644
--- a/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
@@ -3271,47 +3271,45 @@ define i32 @sink_into_replication_region_multiple(ptr %x, i32 %y) {
; UNROLL-NO-VF-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i32 [[TMP1]], 1
; UNROLL-NO-VF-NEXT: br label [[VECTOR_BODY:%.*]]
; UNROLL-NO-VF: vector.body:
-; UNROLL-NO-VF-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE7:%.*]] ]
-; UNROLL-NO-VF-NEXT: [[VECTOR_RECUR:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[PRED_STORE_CONTINUE7]] ]
-; UNROLL-NO-VF-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[PRED_STORE_CONTINUE7]] ]
-; UNROLL-NO-VF-NEXT: [[VEC_PHI2:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP11:%.*]], [[PRED_STORE_CONTINUE7]] ]
+; UNROLL-NO-VF-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
+; UNROLL-NO-VF-NEXT: [[VECTOR_RECUR:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP11:%.*]], [[PRED_STORE_CONTINUE6]] ]
+; UNROLL-NO-VF-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP12:%.*]], [[PRED_STORE_CONTINUE6]] ]
+; UNROLL-NO-VF-NEXT: [[VEC_PHI2:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP13:%.*]], [[PRED_STORE_CONTINUE6]] ]
; UNROLL-NO-VF-NEXT: [[OFFSET_IDX:%.*]] = sub i32 [[Y]], [[INDEX]]
; UNROLL-NO-VF-NEXT: [[TMP2:%.*]] = add i32 [[OFFSET_IDX]], 0
; UNROLL-NO-VF-NEXT: [[TMP3:%.*]] = add i32 [[OFFSET_IDX]], -1
-; UNROLL-NO-VF-NEXT: [[VEC_IV:%.*]] = add i32 [[INDEX]], 0
-; UNROLL-NO-VF-NEXT: [[VEC_IV3:%.*]] = add i32 [[INDEX]], 1
-; UNROLL-NO-VF-NEXT: [[TMP4:%.*]] = icmp ule i32 [[VEC_IV]], [[TRIP_COUNT_MINUS_1]]
-; UNROLL-NO-VF-NEXT: [[TMP5:%.*]] = icmp ule i32 [[VEC_IV3]], [[TRIP_COUNT_MINUS_1]]
-; UNROLL-NO-VF-NEXT: br i1 [[TMP4]], label [[PRED_UDIV_IF:%.*]], label [[PRED_UDIV_CONTINUE:%.*]]
+; UNROLL-NO-VF-NEXT: [[TMP4:%.*]] = add i32 [[INDEX]], 0
+; UNROLL-NO-VF-NEXT: [[TMP5:%.*]] = add i32 [[INDEX]], 1
+; UNROLL-NO-VF-NEXT: [[TMP6:%.*]] = icmp ule i32 [[TMP4]], [[TRIP_COUNT_MINUS_1]]
+; UNROLL-NO-VF-NEXT: [[TMP7:%.*]] = icmp ule i32 [[TMP5]], [[TRIP_COUNT_MINUS_1]]
+; UNROLL-NO-VF-NEXT: br i1 [[TMP6]], label [[PRED_UDIV_IF:%.*]], label [[PRED_UDIV_CONTINUE:%.*]]
; UNROLL-NO-VF: pred.udiv.if:
-; UNROLL-NO-VF-NEXT: [[TMP6:%.*]] = udiv i32 219220132, [[TMP2]]
+; UNROLL-NO-VF-NEXT: [[TMP8:%.*]] = udiv i32 219220132, [[TMP2]]
; UNROLL-NO-VF-NEXT: br label [[PRED_UDIV_CONTINUE]]
; UNROLL-NO-VF: pred.udiv.continue:
-; UNROLL-NO-VF-NEXT: [[TMP7:%.*]] = phi i32 [ poison, [[VECTOR_BODY]] ], [ [[TMP6]], [[PRED_UDIV_IF]] ]
-; UNROLL-NO-VF-NEXT: br i1 [[TMP5]], label [[PRED_UDIV_IF4:%.*]], label [[PRED_UDIV_CONTINUE5:%.*]]
-; UNROLL-NO-VF: pred.udiv.if4:
-; UNROLL-NO-VF-NEXT: [[TMP8:%.*]] = udiv i32 219220132, [[TMP3]]
-; UNROLL-NO-VF-NEXT: br label [[PRED_UDIV_CONTINUE5]]
-; UNROLL-NO-VF: pred.udiv.continue5:
-; UNROLL-NO-VF-NEXT: [[TMP9]] = phi i32 [ poison, [[PRED_UDIV_CONTINUE]] ], [ [[TMP8]], [[PRED_UDIV_IF4]] ]
-; UNROLL-NO-VF-NEXT: [[TMP10]] = add i32 [[VEC_PHI]], [[VECTOR_RECUR]]
-; UNROLL-NO-VF-NEXT: [[TMP11]] = add i32 [[VEC_PHI2]], [[TMP7]]
-; UNROLL-NO-VF-NEXT: br i1 [[TMP4]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
+; UNROLL-NO-VF-NEXT: [[TMP9:%.*]] = phi i32 [ poison, [[VECTOR_BODY]] ], [ [[TMP8]], [[PRED_UDIV_IF]] ]
+; UNROLL-NO-VF-NEXT: br i1 [[TMP7]], label [[PRED_UDIV_IF3:%.*]], label [[PRED_UDIV_CONTINUE4:%.*]]
+; UNROLL-NO-VF: pred.udiv.if3:
+; UNROLL-NO-VF-NEXT: [[TMP10:%.*]] = udiv i32 219220132, [[TMP3]]
+; UNROLL-NO-VF-NEXT: br label [[PRED_UDIV_CONTINUE4]]
+; UNROLL-NO-VF: pred.udiv.continue4:
+; UNROLL-NO-VF-NEXT: [[TMP11]] = phi i32 [ poison, [[PRED_UDIV_CONTINUE]] ], [ [[TMP10]], [[PRED_UDIV_IF3]] ]
+; UNROLL-NO-VF-NEXT: [[TMP12]] = add i32 [[VEC_PHI]], [[VECTOR_RECUR]]
+; UNROLL-NO-VF-NEXT: [[TMP13]] = add i32 [[VEC_PHI2]], [[TMP9]]
+; UNROLL-NO-VF-NEXT: br i1 [[TMP6]], label [[PRED_STORE_IF:%.*]]...
[truncated]
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, some VPInstructions create only a single scalar value, but use VPTransformatState's 'vector' storage for this value. Those values are effectively uniform-per-VF (or in some cases uniform-across-VF-and-UF). Using the vector/per-part storage doesn't interact well with other recipes, that more accurately using (Part, Lane) to look up scalar values and prevents VPInstructions creating scalars from interacting with other recipes working with scalars.
This PR tries to unify handling of scalars by using (Part, 0) for scalar values where only the first lane is demanded. This allows using VPInstructions with other recipes like VPScalarCastRecipe and is also needed when using VPInstructions in more cases otuside the vector loop region to generate scalars.
Perhaps it is the "other recipes" who should be fixed, to more consistently store their single scalar in per-part storage rather than lane zero? The current design of State is to hold UF per-part Values whenever possible, be they vectors or (uniform) scalars, and otherwise holds VF*UF per-lane Values, for every VPValue. The hasVectorValue() and hasScalarValue() indicators should better be renamed hasValuePerPart() and hasValuePerLane(), respectively.
Unrolling loop-regions in VPlan by UF would simplify State to hold a single per-part Value or VF per-lane Values. Further unrolling replicating-regions and replicating-recipes by VF would simplify State to hold a single Value per VPValue.
The current get() functions are overly complex, as they bridge gaps by packing VF scalar elements into a vector on-demand for a scalar producer to feed a vector consumer, by extracting VF scalar elements from a vector producer to feed a scalar-replicating consumer, and more. Introducing explicit pack and extract recipes in VPlan could simplify its code-gen and State, prior to unrollings.
@@ -1389,6 +1389,13 @@ class VPScalarCastRecipe : public VPSingleDefRecipe { | |||
|
|||
/// Returns the result type of the cast. | |||
Type *getResultType() const { return ResultTy; } | |||
|
|||
bool onlyFirstLaneUsed(const VPValue *Op) const override { | |||
// At the moment, only scalar codegen is implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// At the moment, only scalar codegen is implemented. | |
// At the moment, only uniform codegen is implemented. |
?
scalar codegen in general may use other lanes.
Can this improvement of onlyFirstLaneUsed() be pushed independently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
Can split off if preferred, the code just won't be exercised by any tests until this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fine, better avoid introducing unexercised code, here and below.
@@ -526,12 +542,13 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const { | |||
case Instruction::ICmp: | |||
// TODO: Cover additional opcodes. | |||
return vputils::onlyFirstLaneUsed(this); | |||
case VPInstruction::ComputeReductionResult: | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: fallthrough to join other true
opcodes below?
Can these two improvements of onlyFirstLaneUsed() (additional opcodes and additional operands) be pushed independently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed ComputeReductionResult change here as it is no needed. Can split off if preferred, the code just won't be exercised by any tests until this patch.
@@ -116,8 +116,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 { | |||
; TFA_INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]] | |||
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]]) | |||
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]]) | |||
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]] | |||
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]] | |||
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only implication visible in existing tests below is cse'ing of the Canonical IV bump when active lane masks are used. Would other test(s) be relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse ping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed that one, sorry!
This was an accidental change and has been undone now. To catch this, onlyFirstPartUsed
needs to also support CanonicalIVIncrementForPart
, which can be adjusted separately.
if (Def->isLiveIn()) | ||
return Def->getLiveInIRValue(); | ||
if (Def->isLiveIn()) { | ||
if (Value *V = Def->getLiveInIRValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A LiveIn Def must have a LiveInIRValue, is this if
an assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily at the moment, LiveIns don't have a defining recipe, but they also may not have a LiveInIRValue as their value only becomes known later, for example VFxUF
, BackedgeTakenCount
or VectorTripCount
, which are set in the state in prepareToExecute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, thanks for reminding. This may deserve a better representation to avoid confusion.
An alternative way of handling non-IR-valued live-ins, is to complement them with one asap, i.e., have prepareToExecute() call setUnderlyingValue() on every such "symbolic invariant" variant of "live-in", instead of filling State. That way they will behave as any other live-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks! This is only possible with the current patch, so I included the changes here. The original code here has been restored.
if (getOpcode() == VPInstruction::ComputeReductionResult) { | ||
State.set(this, GeneratedValue, VPIteration(Part, 0)); | ||
} else { | ||
assert((State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) && | ||
"scalar value but not only first lane used"); | ||
State.set(this, GeneratedValue, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can simplify into a single State.set() with the condition folded into the assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
✅ With the latest revision this PR passed the C/C++ code formatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it is the "other recipes" who should be fixed, to more consistently store their single scalar in per-part storage rather than lane zero? The current design of State is to hold UF per-part Values whenever possible, be they vectors or (uniform) scalars, and otherwise holds VF*UF per-lane Values, for every VPValue. The hasVectorValue() and hasScalarValue() indicators should better be renamed hasValuePerPart() and hasValuePerLane(), respectively.
That would be another option, but a potential problem could be recipes that only have their first lane used, but one of their operands would only define a vector value per-part. With this patch, this will work as accessing the per-lane value will create an extract from the per-part vector.
Unrolling loop-regions in VPlan by UF would simplify State to hold a single per-part Value or VF per-lane Values. Further unrolling replicating-regions and replicating-recipes by VF would simplify State to hold a single Value per VPValue.
Agreed, hopefully I should be able to share a patch for that by next week. It will require introducing additional VPInstructions operating on scalars, so this patch and #80273 should help to get us there.
The current get() functions are overly complex, as they bridge gaps by packing VF scalar elements into a vector on-demand for a scalar producer to feed a vector consumer, by extracting VF scalar elements from a vector producer to feed a scalar-replicating consumer, and more. Introducing explicit pack and extract recipes in VPlan could simplify its code-gen and State, prior to unrollings.
That would be an alternative path, but I am quite close to getting ready to share unrolling as VPlan transformer, so I'd prefer to introduce explicit insert/extracts after explicit unrolling.
if (Def->isLiveIn()) | ||
return Def->getLiveInIRValue(); | ||
if (Def->isLiveIn()) { | ||
if (Value *V = Def->getLiveInIRValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily at the moment, LiveIns don't have a defining recipe, but they also may not have a LiveInIRValue as their value only becomes known later, for example VFxUF
, BackedgeTakenCount
or VectorTripCount
, which are set in the state in prepareToExecute.
if (getOpcode() == VPInstruction::ComputeReductionResult) { | ||
State.set(this, GeneratedValue, VPIteration(Part, 0)); | ||
} else { | ||
assert((State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) && | ||
"scalar value but not only first lane used"); | ||
State.set(this, GeneratedValue, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
@@ -526,12 +542,13 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const { | |||
case Instruction::ICmp: | |||
// TODO: Cover additional opcodes. | |||
return vputils::onlyFirstLaneUsed(this); | |||
case VPInstruction::ComputeReductionResult: | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed ComputeReductionResult change here as it is no needed. Can split off if preferred, the code just won't be exercised by any tests until this patch.
@@ -1389,6 +1389,13 @@ class VPScalarCastRecipe : public VPSingleDefRecipe { | |||
|
|||
/// Returns the result type of the cast. | |||
Type *getResultType() const { return ResultTy; } | |||
|
|||
bool onlyFirstLaneUsed(const VPValue *Op) const override { | |||
// At the moment, only scalar codegen is implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
Can split off if preferred, the code just won't be exercised by any tests until this patch.
Would it be better to stick with the current design, and fix such cases of missing extracts?
Sure, by all means; the comment was more future looking. |
…form-scalar-lanes
…form-scalar-lanes
…-lanes' into users/fhahn/vplan-uniform-scalar-lanes
Ahh, only other recipe that may generate uniform(-"after-vectorization") scalar values besides VPInstruction is Replicate recipe - which already stores its scalar in lane zero rather than per-part... OK, let's go with storing values-per-part in lane-zero, consistently. Suggest to wrap in a |
@@ -530,8 +543,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const { | |||
case VPInstruction::CalculateTripCountMinusVF: | |||
case VPInstruction::CanonicalIVIncrementForPart: | |||
case VPInstruction::BranchOnCount: | |||
// TODO: Cover additional operands. | |||
return getOperand(0) == Op; | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, can probably be pushed separately, if testable. (written before the separate push ;-)
CalculateTripCountMinusVF and CanonicalIVIncrementForPart have a single operand, so NFC for them.
ActiveLaneMask and BranchOnCount have two operands, the (first lane of the) IV and the scalar uniform trip count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left as is for now, as this only is needed for this patch at the moment.
@@ -116,8 +116,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 { | |||
; TFA_INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]] | |||
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]]) | |||
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]]) | |||
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]] | |||
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]] | |||
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse ping.
if (GeneratedValue->getType()->isVectorTy()) | ||
State.set(this, GeneratedValue, Part); | ||
else { | ||
assert((getOpcode() == VPInstruction::ComputeReductionResult || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain/Why is ComputeReductionResult excluded, it generates a scalar Value but may have lanes other than first used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note that it generates a scalar but has it's last lane accessed via the generic VPLiveOut.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, right.
This in particular would be more clearly expressed by doing
State.set(this, GeneratedValue, Part, true /* scalar */);
rather than explicitly storing the single last lane directly in the first lane zero via
State.set(this, GeneratedValue, VPIteration(Part, 0));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and added variant of set
taking an IsScalar
argument.
@@ -530,8 +543,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const { | |||
case VPInstruction::CalculateTripCountMinusVF: | |||
case VPInstruction::CanonicalIVIncrementForPart: | |||
case VPInstruction::BranchOnCount: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add case ComputeReductionResult?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ComputeReductionResult combines the partial reduction vector values, so it should use more than the first lane per operand. It produces a single scalar value though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, right. Got confused with onlyFirstLaneDefined.
if (GeneratedValue->getType()->isVectorTy()) | ||
State.set(this, GeneratedValue, Part); | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (GeneratedValue->getType()->isVectorTy()) | |
State.set(this, GeneratedValue, Part); | |
else { | |
if (GeneratedValue->getType()->isVectorTy()) { | |
State.set(this, GeneratedValue, Part); | |
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added braces, thanks
@@ -883,8 +888,11 @@ void VPlan::execute(VPTransformState *State) { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool IsScalarNeeded = isa<VPCanonicalIVPHIRecipe>(PhiR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks!
Value *Val = | ||
isa<VPCanonicalIVPHIRecipe>(PhiR) | ||
? State->get(PhiR->getBackedgeValue(), VPIteration(Part, 0)) | ||
: State->get(PhiR->getBackedgeValue(), | ||
SinglePartNeeded ? State->UF - 1 : Part); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Value *Val = | |
isa<VPCanonicalIVPHIRecipe>(PhiR) | |
? State->get(PhiR->getBackedgeValue(), VPIteration(Part, 0)) | |
: State->get(PhiR->getBackedgeValue(), | |
SinglePartNeeded ? State->UF - 1 : Part); | |
unsigned PartNeeded = SinglePartNeeded ? State->UF - 1 : Part; | |
Value *Val = State->get(PhiR->getBackedgeValue(), PartNeeded, IsScalarNeeded); |
using API suggested below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks
if (Def->isLiveIn()) | ||
return Def->getLiveInIRValue(); | ||
if (Def->isLiveIn()) { | ||
if (Value *V = Def->getLiveInIRValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, thanks for reminding. This may deserve a better representation to avoid confusion.
An alternative way of handling non-IR-valued live-ins, is to complement them with one asap, i.e., have prepareToExecute() call setUnderlyingValue() on every such "symbolic invariant" variant of "live-in", instead of filling State. That way they will behave as any other live-in.
return V; | ||
if (hasScalarValue(Def, VPIteration(0, 0))) { | ||
return Data.PerPartScalars[Def][0][0]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else?
should this if be an assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code has been removed, thanks
@@ -788,13 +793,13 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV, | |||
} | |||
|
|||
for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part) | |||
State.set(&VectorTripCount, VectorTripCountV, Part); | |||
State.set(&VectorTripCount, VectorTripCountV, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Independent of this patch: would be good to be consistent if/when a scalar that is invariant or uniform-across-parts is set in lane zero of all parts (e.g., VectorTripCount) or in lane zero of part zero only (e.g., VFxUF).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the underlying value is set as suggested above, so neither are stored in State.
…form-scalar-lanes
…form-scalar-lanes
if (GeneratedValue->getType()->isVectorTy()) | ||
State.set(this, GeneratedValue, Part); | ||
else { | ||
assert((getOpcode() == VPInstruction::ComputeReductionResult || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, right.
This in particular would be more clearly expressed by doing
State.set(this, GeneratedValue, Part, true /* scalar */);
rather than explicitly storing the single last lane directly in the first lane zero via
State.set(this, GeneratedValue, VPIteration(Part, 0));
@@ -9127,7 +9127,7 @@ void VPWidenPointerInductionRecipe::execute(VPTransformState &State) { | |||
"Unexpected type."); | |||
|
|||
auto *IVR = getParent()->getPlan()->getCanonicalIV(); | |||
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0)); | |||
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, VPIteration(0, 0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, VPIteration(0, 0))); | |
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0, true /* scalar */)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -9243,7 +9243,7 @@ void VPInterleaveRecipe::execute(VPTransformState &State) { | |||
|
|||
void VPReductionRecipe::execute(VPTransformState &State) { | |||
assert(!State.Instance && "Reduction being replicated."); | |||
Value *PrevInChain = State.get(getChainOp(), 0); | |||
Value *PrevInChain = State.get(getChainOp(), VPIteration(0, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Value *PrevInChain = State.get(getChainOp(), VPIteration(0, 0)); | |
Value *PrevInChain = State.get(getChainOp(), 0, true /* scalar */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -9278,7 +9277,7 @@ void VPReductionRecipe::execute(VPTransformState &State) { | |||
NewVecOp); | |||
PrevInChain = NewRed; | |||
} else { | |||
PrevInChain = State.get(getChainOp(), Part); | |||
PrevInChain = State.get(getChainOp(), VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrevInChain = State.get(getChainOp(), VPIteration(Part, 0)); | |
PrevInChain = State.get(getChainOp(), Part, true /* scalar */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -9289,7 +9288,7 @@ void VPReductionRecipe::execute(VPTransformState &State) { | |||
else | |||
NextInChain = State.Builder.CreateBinOp( | |||
(Instruction::BinaryOps)RdxDesc.getOpcode(Kind), NewRed, PrevInChain); | |||
State.set(this, NextInChain, Part); | |||
State.set(this, NextInChain, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State.set(this, NextInChain, VPIteration(Part, 0)); | |
State.set(this, NextInChain, Part, true /* scalar */); |
same API of get
should apply to set
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -1344,7 +1353,7 @@ void VPVectorPointerRecipe ::execute(VPTransformState &State) { | |||
PartPtr = Builder.CreateGEP(IndexedTy, Ptr, Increment, "", InBounds); | |||
} | |||
|
|||
State.set(this, PartPtr, Part); | |||
State.set(this, PartPtr, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State.set(this, PartPtr, VPIteration(Part, 0)); | |
State.set(this, PartPtr, Part, true /* scalar */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -1640,7 +1649,7 @@ void VPCanonicalIVPHIRecipe::execute(VPTransformState &State) { | |||
EntryPart->addIncoming(Start, VectorPH); | |||
EntryPart->setDebugLoc(getDebugLoc()); | |||
for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part) | |||
State.set(this, EntryPart, Part); | |||
State.set(this, EntryPart, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State.set(this, EntryPart, VPIteration(Part, 0)); | |
State.set(this, EntryPart, Part, true /* scalar */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -1711,7 +1720,7 @@ void VPExpandSCEVRecipe::print(raw_ostream &O, const Twine &Indent, | |||
#endif | |||
|
|||
void VPWidenCanonicalIVRecipe::execute(VPTransformState &State) { | |||
Value *CanonicalIV = State.get(getOperand(0), 0); | |||
Value *CanonicalIV = State.get(getOperand(0), VPIteration(0, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Value *CanonicalIV = State.get(getOperand(0), VPIteration(0, 0)); | |
Value *CanonicalIV = State.get(getOperand(0), 0, true /* scalar */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -1779,6 +1788,7 @@ void VPFirstOrderRecurrencePHIRecipe::print(raw_ostream &O, const Twine &Indent, | |||
#endif | |||
|
|||
void VPReductionPHIRecipe::execute(VPTransformState &State) { | |||
// TODO: Store scalar value for in-loop reductions as {Part, 0}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, thanks!
if (IsInLoop) | ||
State.set(this, EntryPart, VPIteration(Part, 0)); | ||
else | ||
State.set(this, EntryPart, Part); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (IsInLoop) | |
State.set(this, EntryPart, VPIteration(Part, 0)); | |
else | |
State.set(this, EntryPart, Part); | |
State.set(this, EntryPart, Part, IsInLoop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for accommodating, adding a couple of last minor nits.
/// IsScalar is false. If \p IsScalar is true, set the scalar in (Part, 0). | ||
void set(VPValue *Def, Value *V, unsigned Part, bool IsScalar = false) { | ||
if (IsScalar) { | ||
set(Def, V, VPIteration(Part, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can early return, leaving the rest intact, consistent with get() above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
if (GeneratedValue->getType()->isVectorTy()) { | ||
State.set(this, GeneratedValue, Part); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can fold into a single
if (GeneratedValue->getType()->isVectorTy()) { | |
State.set(this, GeneratedValue, Part); | |
bool isVector = GeneratedValue->getType()->isVectorTy(); | |
State.set(this, GeneratedValue, Part, !isVector); | |
assert((isVector || getOpcode() == VPInstruction::ComputeReductionResult || | |
State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) && | |
"scalar value but not only first lane used"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted, thanks!
Add a new PtrAdd opcode to VPInstruction that corresponds to IRBuilder::CreatePtrAdd, which creates a GEP with source element type i8. This is then used to model scalarizing VPWidenPointerInductionRecipe by introducing scalar-steps to model the index increment followed by a PtrAdd. Note that PtrAdd needs to be able to generate code for only the first lane or for all lanes. This may warrant introducing a separate recipe for scalarizing that can be created without relying on the underlying IR. Depends on #80271 PR: #83068
At the moment, some VPInstructions create only a single scalar value, but use VPTransformatState's 'vector' storage for this value. Those values are effectively uniform-per-VF (or in some cases uniform-across-VF-and-UF). Using the vector/per-part storage doesn't interact well with other recipes, that more accurately using (Part, Lane) to look up scalar values and prevents VPInstructions creating scalars from interacting with other recipes working with scalars.
This PR tries to unify handling of scalars by using (Part, 0) for scalar values where only the first lane is demanded. This allows using VPInstructions with other recipes like VPScalarCastRecipe and is also needed when using VPInstructions in more cases otuside the vector loop region to generate scalars.
The patch is still a bit rough around the edges, but hopefully serves as start for a discussion how to model more scalar recipes. A potential alternative would be to split off the opcodes that generate scalars only to a separate recipe.
Depends on #80269