
[VPlan] Consistently use (Part, 0) for first lane scalar values #80271

Merged
merged 25 commits into main on Feb 26, 2024

Conversation

fhahn
Contributor

@fhahn fhahn commented Feb 1, 2024

At the moment, some VPInstructions create only a single scalar value, but use VPTransformState's 'vector' storage for this value. Those values are effectively uniform-per-VF (or in some cases uniform-across-VF-and-UF). Using the vector/per-part storage doesn't interact well with other recipes, which more accurately use (Part, Lane) to look up scalar values, and it prevents VPInstructions that create scalars from interacting with other recipes working with scalars.

This PR tries to unify the handling of scalars by using (Part, 0) for scalar values where only the first lane is demanded. This allows using VPInstructions with other recipes like VPScalarCastRecipe and is also needed when using VPInstructions in more cases outside the vector loop region to generate scalars.

The patch is still a bit rough around the edges, but hopefully serves as a starting point for a discussion on how to model more scalar recipes. A potential alternative would be to split the opcodes that generate only scalars off into a separate recipe.

Depends on #80269
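
To make the storage convention concrete, here is a minimal sketch (a hypothetical recipe and names, not code from this patch) of a recipe producing one uniform scalar per unroll part and storing it at (Part, 0) via VPTransformState rather than in the per-part 'vector' slot:

// Hypothetical uniform recipe: one scalar per part, identical across lanes.
void MyUniformAddRecipe::execute(VPTransformState &State) {
  for (unsigned Part = 0; Part < State.UF; ++Part) {
    // Operands are only needed for their first lane.
    Value *A = State.get(getOperand(0), VPIteration(Part, 0));
    Value *B = State.get(getOperand(1), VPIteration(Part, 0));
    Value *Sum = State.Builder.CreateAdd(A, B, "uniform.add");
    // Previously such scalars went into the per-part "vector" storage via
    // State.set(this, Sum, Part); with this PR they consistently live at
    // lane 0 of each part:
    State.set(this, Sum, VPIteration(Part, 0));
  }
}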

A VPInstruction only has its first lane used if all users use its first
lane only. Use vputils::onlyFirstLaneUsed to continue checking the
recipe's users to handle more cases.

Besides allowing additional introduction of scalar steps when
interleaving in some cases, this also enables using an Add VPInstruction
to model the increment.
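
The check described above can be pictured as follows (a sketch assuming the VPUser::onlyFirstLaneUsed interface shown in the diff below; the actual vputils helper may differ in details):

// A VPValue only has its first lane used if every user reports that it only
// uses the first lane of this particular operand.
bool onlyFirstLaneUsedSketch(VPValue *Def) {
  return llvm::all_of(Def->users(), [Def](VPUser *U) {
    return U->onlyFirstLaneUsed(Def);
  });
}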
@llvmbot
Collaborator

llvmbot commented Feb 1, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes



Patch is 22.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/80271.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+13-5)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+7)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+27-9)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll (+10-15)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+30-32)
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 1ca2cfef447f6..7ed07fe5f413a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -213,8 +213,13 @@ VPBasicBlock::iterator VPBasicBlock::getFirstNonPhi() {
 }
 
 Value *VPTransformState::get(VPValue *Def, const VPIteration &Instance) {
-  if (Def->isLiveIn())
-    return Def->getLiveInIRValue();
+  if (Def->isLiveIn()) {
+    if (Value *V = Def->getLiveInIRValue())
+      return V;
+    if (hasScalarValue(Def, VPIteration(0, 0))) {
+      return Data.PerPartScalars[Def][0][0];
+    }
+  }
 
   if (hasScalarValue(Def, Instance)) {
     return Data
@@ -794,7 +799,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
   // FIXME: Model VF * UF computation completely in VPlan.
   State.set(&VFxUF,
             createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF),
-            0);
+            VPIteration(0, 0));
 
   // When vectorizing the epilogue loop, the canonical induction start value
   // needs to be changed from zero to the value after the main vector loop.
@@ -883,8 +888,11 @@ void VPlan::execute(VPTransformState *State) {
 
     for (unsigned Part = 0; Part < LastPartForNewPhi; ++Part) {
       Value *Phi = State->get(PhiR, Part);
-      Value *Val = State->get(PhiR->getBackedgeValue(),
-                              SinglePartNeeded ? State->UF - 1 : Part);
+      Value *Val =
+          isa<VPCanonicalIVPHIRecipe>(PhiR)
+              ? State->get(PhiR->getBackedgeValue(), VPIteration(Part, 0))
+              : State->get(PhiR->getBackedgeValue(),
+                           SinglePartNeeded ? State->UF - 1 : Part);
       cast<PHINode>(Phi)->addIncoming(Val, VectorLatchBB);
     }
   }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 30dc521947b3b..97035146a2f4d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1388,6 +1388,13 @@ class VPScalarCastRecipe : public VPSingleDefRecipe {
 
   /// Returns the result type of the cast.
   Type *getResultType() const { return ResultTy; }
+
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    // At the moment, only scalar codegen is implemented.
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
 };
 
 /// A recipe for widening Call instructions.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 21b8d1eb77bf9..77f2cf899b085 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -279,11 +279,17 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
   Builder.SetCurrentDebugLocation(getDebugLoc());
 
   if (Instruction::isBinaryOp(getOpcode())) {
+    bool OnlyFirstLaneUsed = vputils::onlyFirstLaneUsed(this);
     if (Part != 0 && vputils::onlyFirstPartUsed(this))
-      return State.get(this, 0);
-
-    Value *A = State.get(getOperand(0), Part);
-    Value *B = State.get(getOperand(1), Part);
+      return OnlyFirstLaneUsed ? State.get(this, VPIteration(0, 0))
+                               : State.get(this, 0);
+
+    Value *A = OnlyFirstLaneUsed
+                   ? State.get(getOperand(0), VPIteration(Part, 0))
+                   : State.get(getOperand(0), Part);
+    Value *B = OnlyFirstLaneUsed
+                   ? State.get(getOperand(1), VPIteration(Part, 0))
+                   : State.get(getOperand(1), Part);
     auto *Res =
         Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(), A, B, Name);
     if (auto *I = dyn_cast<Instruction>(Res))
@@ -385,8 +391,8 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
     if (Part != 0)
       return nullptr;
     // First create the compare.
-    Value *IV = State.get(getOperand(0), Part);
-    Value *TC = State.get(getOperand(1), Part);
+    Value *IV = State.get(getOperand(0), VPIteration(0, 0));
+    Value *TC = State.get(getOperand(1), VPIteration(0, 0));
     Value *Cond = Builder.CreateICmpEQ(IV, TC);
 
     // Now create the branch.
@@ -407,7 +413,7 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
   }
   case VPInstruction::ComputeReductionResult: {
     if (Part != 0)
-      return State.get(this, 0);
+      return State.get(this, VPIteration(0, 0));
 
     // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
     // and will be removed by breaking up the recipe further.
@@ -512,7 +518,17 @@ void VPInstruction::execute(VPTransformState &State) {
     if (!hasResult())
       continue;
     assert(GeneratedValue && "generateInstruction must produce a value");
-    State.set(this, GeneratedValue, Part);
+    if (GeneratedValue->getType()->isVectorTy())
+      State.set(this, GeneratedValue, Part);
+    else {
+      if (getOpcode() == VPInstruction::ComputeReductionResult) {
+        State.set(this, GeneratedValue, VPIteration(Part, 0));
+      } else {
+        assert((State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) &&
+               "scalar value but not only first lane used");
+        State.set(this, GeneratedValue, VPIteration(Part, 0));
+      }
+    }
   }
 }
 bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
@@ -525,11 +541,13 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
     return false;
   case Instruction::ICmp:
     return vputils::onlyFirstLaneUsed(this);
+  case VPInstruction::ComputeReductionResult:
+    return true;
   case VPInstruction::ActiveLaneMask:
   case VPInstruction::CalculateTripCountMinusVF:
   case VPInstruction::CanonicalIVIncrementForPart:
   case VPInstruction::BranchOnCount:
-    return getOperand(0) == Op;
+    return true;
   };
   llvm_unreachable("switch should return");
 }
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll b/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
index 1e79c3e1e8dc2..2ef55742ffc0b 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
@@ -99,7 +99,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
 ; TFA_INTERLEAVE-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; TFA_INTERLEAVE:       vector.body:
-; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[TMP9:%.*]] = getelementptr i64, ptr [[B:%.*]], i64 [[INDEX]]
@@ -116,8 +116,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]]
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]])
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; TFA_INTERLEAVE-NEXT:    [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
 ; TFA_INTERLEAVE-NEXT:    [[TMP20:%.*]] = mul i64 [[TMP19]], 2
 ; TFA_INTERLEAVE-NEXT:    [[TMP21:%.*]] = add i64 [[INDEX_NEXT]], [[TMP20]]
@@ -254,7 +253,7 @@ define void @test_if_then(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
 ; TFA_INTERLEAVE-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; TFA_INTERLEAVE:       vector.body:
-; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT5:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT6:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[INDEX]]
@@ -283,8 +282,7 @@ define void @test_if_then(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[TMP28:%.*]] = getelementptr inbounds i64, ptr [[TMP25]], i64 [[TMP27]]
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI]], ptr [[TMP25]], i32 8, <vscale x 2 x i1> [[TMP23]])
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI4]], ptr [[TMP28]], i32 8, <vscale x 2 x i1> [[TMP24]])
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT5]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; TFA_INTERLEAVE-NEXT:    [[TMP29:%.*]] = call i64 @llvm.vscale.i64()
 ; TFA_INTERLEAVE-NEXT:    [[TMP30:%.*]] = mul i64 [[TMP29]], 2
 ; TFA_INTERLEAVE-NEXT:    [[TMP31:%.*]] = add i64 [[INDEX_NEXT]], [[TMP30]]
@@ -437,7 +435,7 @@ define void @test_widen_if_then_else(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
 ; TFA_INTERLEAVE-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; TFA_INTERLEAVE:       vector.body:
-; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT5:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT6:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[INDEX]]
@@ -468,8 +466,7 @@ define void @test_widen_if_then_else(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[TMP30:%.*]] = getelementptr inbounds i64, ptr [[TMP27]], i64 [[TMP29]]
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI]], ptr [[TMP27]], i32 8, <vscale x 2 x i1> [[TMP25]])
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[PREDPHI4]], ptr [[TMP30]], i32 8, <vscale x 2 x i1> [[TMP26]])
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT5]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; TFA_INTERLEAVE-NEXT:    [[TMP31:%.*]] = call i64 @llvm.vscale.i64()
 ; TFA_INTERLEAVE-NEXT:    [[TMP32:%.*]] = mul i64 [[TMP31]], 2
 ; TFA_INTERLEAVE-NEXT:    [[TMP33:%.*]] = add i64 [[INDEX_NEXT]], [[TMP32]]
@@ -771,7 +768,7 @@ define void @test_widen_optmask(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP8]], i64 1025)
 ; TFA_INTERLEAVE-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; TFA_INTERLEAVE:       vector.body:
-; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[TMP9:%.*]] = getelementptr i64, ptr [[B:%.*]], i64 [[INDEX]]
@@ -788,8 +785,7 @@ define void @test_widen_optmask(ptr noalias %a, ptr readnone %b) #4 {
 ; TFA_INTERLEAVE-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]]
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
 ; TFA_INTERLEAVE-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]])
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; TFA_INTERLEAVE-NEXT:    [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
 ; TFA_INTERLEAVE-NEXT:    [[TMP20:%.*]] = mul i64 [[TMP19]], 2
 ; TFA_INTERLEAVE-NEXT:    [[TMP21:%.*]] = add i64 [[INDEX_NEXT]], [[TMP20]]
@@ -970,7 +966,7 @@ define double @test_widen_fmuladd_and_call(ptr noalias %a, ptr readnone %b, doub
 ; TFA_INTERLEAVE-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x double> [[BROADCAST_SPLATINSERT]], <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
 ; TFA_INTERLEAVE-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; TFA_INTERLEAVE:       vector.body:
-; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[VECTOR_BODY]] ]
 ; TFA_INTERLEAVE-NEXT:    [[VEC_PHI:%.*]] = phi double [ 0.000000e+00, [[ENTRY]] ], [ [[TMP26:%.*]], [[VECTOR_BODY]] ]
@@ -996,8 +992,7 @@ define double @test_widen_fmuladd_and_call(ptr noalias %a, ptr readnone %b, doub
 ; TFA_INTERLEAVE-NEXT:    [[TMP24:%.*]] = call double @llvm.vector.reduce.fadd.nxv2f64(double [[VEC_PHI]], <vscale x 2 x double> [[TMP23]])
 ; TFA_INTERLEAVE-NEXT:    [[TMP25:%.*]] = select <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 2 x double> [[TMP14]], <vscale x 2 x double> shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double -0.000000e+00, i64 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer)
 ; TFA_INTERLEAVE-NEXT:    [[TMP26]] = call double @llvm.vector.reduce.fadd.nxv2f64(double [[TMP24]], <vscale x 2 x double> [[TMP25]])
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
-; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
+; TFA_INTERLEAVE-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; TFA_INTERLEAVE-NEXT:    [[TMP27:%.*]] = call i64 @llvm.vscale.i64()
 ; TFA_INTERLEAVE-NEXT:    [[TMP28:%.*]] = mul i64 [[TMP27]], 2
 ; TFA_INTERLEAVE-NEXT:    [[TMP29:%.*]] = add i64 [[INDEX_NEXT]], [[TMP28]]
diff --git a/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
index b451d4b4e5462..f0a5fc1592532 100644
--- a/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
@@ -3271,47 +3271,45 @@ define i32 @sink_into_replication_region_multiple(ptr %x, i32 %y) {
 ; UNROLL-NO-VF-NEXT:    [[TRIP_COUNT_MINUS_1:%.*]] = sub i32 [[TMP1]], 1
 ; UNROLL-NO-VF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; UNROLL-NO-VF:       vector.body:
-; UNROLL-NO-VF-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE7:%.*]] ]
-; UNROLL-NO-VF-NEXT:    [[VECTOR_RECUR:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[PRED_STORE_CONTINUE7]] ]
-; UNROLL-NO-VF-NEXT:    [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[PRED_STORE_CONTINUE7]] ]
-; UNROLL-NO-VF-NEXT:    [[VEC_PHI2:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP11:%.*]], [[PRED_STORE_CONTINUE7]] ]
+; UNROLL-NO-VF-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
+; UNROLL-NO-VF-NEXT:    [[VECTOR_RECUR:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP11:%.*]], [[PRED_STORE_CONTINUE6]] ]
+; UNROLL-NO-VF-NEXT:    [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP12:%.*]], [[PRED_STORE_CONTINUE6]] ]
+; UNROLL-NO-VF-NEXT:    [[VEC_PHI2:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP13:%.*]], [[PRED_STORE_CONTINUE6]] ]
 ; UNROLL-NO-VF-NEXT:    [[OFFSET_IDX:%.*]] = sub i32 [[Y]], [[INDEX]]
 ; UNROLL-NO-VF-NEXT:    [[TMP2:%.*]] = add i32 [[OFFSET_IDX]], 0
 ; UNROLL-NO-VF-NEXT:    [[TMP3:%.*]] = add i32 [[OFFSET_IDX]], -1
-; UNROLL-NO-VF-NEXT:    [[VEC_IV:%.*]] = add i32 [[INDEX]], 0
-; UNROLL-NO-VF-NEXT:    [[VEC_IV3:%.*]] = add i32 [[INDEX]], 1
-; UNROLL-NO-VF-NEXT:    [[TMP4:%.*]] = icmp ule i32 [[VEC_IV]], [[TRIP_COUNT_MINUS_1]]
-; UNROLL-NO-VF-NEXT:    [[TMP5:%.*]] = icmp ule i32 [[VEC_IV3]], [[TRIP_COUNT_MINUS_1]]
-; UNROLL-NO-VF-NEXT:    br i1 [[TMP4]], label [[PRED_UDIV_IF:%.*]], label [[PRED_UDIV_CONTINUE:%.*]]
+; UNROLL-NO-VF-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
+; UNROLL-NO-VF-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
+; UNROLL-NO-VF-NEXT:    [[TMP6:%.*]] = icmp ule i32 [[TMP4]], [[TRIP_COUNT_MINUS_1]]
+; UNROLL-NO-VF-NEXT:    [[TMP7:%.*]] = icmp ule i32 [[TMP5]], [[TRIP_COUNT_MINUS_1]]
+; UNROLL-NO-VF-NEXT:    br i1 [[TMP6]], label [[PRED_UDIV_IF:%.*]], label [[PRED_UDIV_CONTINUE:%.*]]
 ; UNROLL-NO-VF:       pred.udiv.if:
-; UNROLL-NO-VF-NEXT:    [[TMP6:%.*]] = udiv i32 219220132, [[TMP2]]
+; UNROLL-NO-VF-NEXT:    [[TMP8:%.*]] = udiv i32 219220132, [[TMP2]]
 ; UNROLL-NO-VF-NEXT:    br label [[PRED_UDIV_CONTINUE]]
 ; UNROLL-NO-VF:       pred.udiv.continue:
-; UNROLL-NO-VF-NEXT:    [[TMP7:%.*]] = phi i32 [ poison, [[VECTOR_BODY]] ], [ [[TMP6]], [[PRED_UDIV_IF]] ]
-; UNROLL-NO-VF-NEXT:    br i1 [[TMP5]], label [[PRED_UDIV_IF4:%.*]], label [[PRED_UDIV_CONTINUE5:%.*]]
-; UNROLL-NO-VF:       pred.udiv.if4:
-; UNROLL-NO-VF-NEXT:    [[TMP8:%.*]] = udiv i32 219220132, [[TMP3]]
-; UNROLL-NO-VF-NEXT:    br label [[PRED_UDIV_CONTINUE5]]
-; UNROLL-NO-VF:       pred.udiv.continue5:
-; UNROLL-NO-VF-NEXT:    [[TMP9]] = phi i32 [ poison, [[PRED_UDIV_CONTINUE]] ], [ [[TMP8]], [[PRED_UDIV_IF4]] ]
-; UNROLL-NO-VF-NEXT:    [[TMP10]] = add i32 [[VEC_PHI]], [[VECTOR_RECUR]]
-; UNROLL-NO-VF-NEXT:    [[TMP11]] = add i32 [[VEC_PHI2]], [[TMP7]]
-; UNROLL-NO-VF-NEXT:    br i1 [[TMP4]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
+; UNROLL-NO-VF-NEXT:    [[TMP9:%.*]] = phi i32 [ poison, [[VECTOR_BODY]] ], [ [[TMP8]], [[PRED_UDIV_IF]] ]
+; UNROLL-NO-VF-NEXT:    br i1 [[TMP7]], label [[PRED_UDIV_IF3:%.*]], label [[PRED_UDIV_CONTINUE4:%.*]]
+; UNROLL-NO-VF:       pred.udiv.if3:
+; UNROLL-NO-VF-NEXT:    [[TMP10:%.*]] = udiv i32 219220132, [[TMP3]]
+; UNROLL-NO-VF-NEXT:    br label [[PRED_UDIV_CONTINUE4]]
+; UNROLL-NO-VF:       pred.udiv.continue4:
+; UNROLL-NO-VF-NEXT:    [[TMP11]] = phi i32 [ poison, [[PRED_UDIV_CONTINUE]] ], [ [[TMP10]], [[PRED_UDIV_IF3]] ]
+; UNROLL-NO-VF-NEXT:    [[TMP12]] = add i32 [[VEC_PHI]], [[VECTOR_RECUR]]
+; UNROLL-NO-VF-NEXT:    [[TMP13]] = add i32 [[VEC_PHI2]], [[TMP9]]
+; UNROLL-NO-VF-NEXT:    br i1 [[TMP6]], label [[PRED_STORE_IF:%.*]]...
[truncated]

Base automatically changed from users/fhahn/vplan-vpinst-onlyfirstlaneused to main February 3, 2024 16:19
Collaborator

@ayalz ayalz left a comment

At the moment, some VPInstructions create only a single scalar value, but use VPTransformState's 'vector' storage for this value. Those values are effectively uniform-per-VF (or in some cases uniform-across-VF-and-UF). Using the vector/per-part storage doesn't interact well with other recipes, which more accurately use (Part, Lane) to look up scalar values, and it prevents VPInstructions that create scalars from interacting with other recipes working with scalars.

This PR tries to unify the handling of scalars by using (Part, 0) for scalar values where only the first lane is demanded. This allows using VPInstructions with other recipes like VPScalarCastRecipe and is also needed when using VPInstructions in more cases outside the vector loop region to generate scalars.

Perhaps it is the "other recipes" who should be fixed, to more consistently store their single scalar in per-part storage rather than lane zero? The current design of State is to hold UF per-part Values whenever possible, be they vectors or (uniform) scalars, and otherwise holds VF*UF per-lane Values, for every VPValue. The hasVectorValue() and hasScalarValue() indicators should better be renamed hasValuePerPart() and hasValuePerLane(), respectively.

Unrolling loop-regions in VPlan by UF would simplify State to hold a single per-part Value or VF per-lane Values. Further unrolling replicating-regions and replicating-recipes by VF would simplify State to hold a single Value per VPValue.

The current get() functions are overly complex, as they bridge gaps by packing VF scalar elements into a vector on-demand for a scalar producer to feed a vector consumer, by extracting VF scalar elements from a vector producer to feed a scalar-replicating consumer, and more. Introducing explicit pack and extract recipes in VPlan could simplify its code-gen and State, prior to unrollings.
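
For readers less familiar with VPTransformState, the lane-extraction side of that bridging looks roughly like this (a simplified, hypothetical helper; the real get() in VPlan.cpp additionally handles caching, scalable vectors and predicated instances):

// Return the requested lane of Def for the given part: use a recorded scalar
// if one exists, otherwise extract the lane from the per-part vector value.
Value *getLaneSketch(VPTransformState &State, VPValue *Def, unsigned Part,
                     unsigned Lane) {
  VPIteration Instance(Part, Lane);
  if (State.hasScalarValue(Def, Instance))
    return State.get(Def, Instance);
  Value *VecPart = State.get(Def, Part); // per-part (vector) storage
  return State.Builder.CreateExtractElement(VecPart,
                                            State.Builder.getInt32(Lane));
}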

@@ -1389,6 +1389,13 @@ class VPScalarCastRecipe : public VPSingleDefRecipe {

/// Returns the result type of the cast.
Type *getResultType() const { return ResultTy; }

bool onlyFirstLaneUsed(const VPValue *Op) const override {
// At the moment, only scalar codegen is implemented.
Collaborator

Suggested change
// At the moment, only scalar codegen is implemented.
// At the moment, only uniform codegen is implemented.

?
scalar codegen in general may use other lanes.

Can this improvement of onlyFirstLaneUsed() be pushed independently?

Contributor Author

Updated!

Can split it off if preferred; the code just won't be exercised by any tests until this patch.

Collaborator

Ah, fine, better avoid introducing unexercised code, here and below.

@@ -526,12 +542,13 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
case Instruction::ICmp:
// TODO: Cover additional opcodes.
return vputils::onlyFirstLaneUsed(this);
case VPInstruction::ComputeReductionResult:
return true;
Collaborator

nit: fallthrough to join other true opcodes below?

Can these two improvements of onlyFirstLaneUsed() (additional opcodes and additional operands) be pushed independently?

Contributor Author

Removed the ComputeReductionResult change here as it is not needed. Can split it off if preferred; the code just won't be exercised by any tests until this patch.

@@ -116,8 +116,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]]
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]])
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
Collaborator

The only implication visible in the existing tests below is CSE'ing of the canonical IV bump when active lane masks are used. Would other test(s) be relevant?

Collaborator

reverse ping.

Contributor Author

Missed that one, sorry!

This was an accidental change and has been undone now. To catch this, onlyFirstPartUsed needs to also support CanonicalIVIncrementForPart, which can be adjusted separately.

if (Def->isLiveIn())
return Def->getLiveInIRValue();
if (Def->isLiveIn()) {
if (Value *V = Def->getLiveInIRValue())
Collaborator

A LiveIn Def must have a LiveInIRValue; should this if be an assert?

Contributor Author

Not necessarily at the moment: LiveIns don't have a defining recipe, but they may also not have a LiveInIRValue, as their value only becomes known later, for example VFxUF, BackedgeTakenCount or VectorTripCount, which are set in the state in prepareToExecute.

Collaborator

Ahh, thanks for reminding. This may deserve a better representation to avoid confusion.

An alternative way of handling non-IR-valued live-ins is to complement them with one asap, i.e., have prepareToExecute() call setUnderlyingValue() on every such "symbolic invariant" variant of "live-in", instead of filling State. That way they will behave as any other live-in.
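
A sketch of that suggestion inside VPlan::prepareToExecute (an assumption about its shape, based on the names mentioned in this thread and the existing createStepForVF call in the diff):

// Give the "symbolic" live-ins a concrete IR value as soon as it is known,
// instead of storing it in State, so they behave like any other live-in.
VFxUF.setUnderlyingValue(
    createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF));
VectorTripCount.setUnderlyingValue(VectorTripCountV);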

Contributor Author

Updated, thanks! This is only possible with the current patch, so I included the changes here. The original code here has been restored.

Comment on lines 524 to 529
if (getOpcode() == VPInstruction::ComputeReductionResult) {
State.set(this, GeneratedValue, VPIteration(Part, 0));
} else {
assert((State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) &&
"scalar value but not only first lane used");
State.set(this, GeneratedValue, VPIteration(Part, 0));
Collaborator

nit: can simplify into a single State.set() with the condition folded into the assert.

Contributor Author

Done, thanks!


github-actions bot commented Feb 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor Author

@fhahn fhahn left a comment

Perhaps it is the "other recipes" who should be fixed, to more consistently store their single scalar in per-part storage rather than lane zero? The current design of State is to hold UF per-part Values whenever possible, be they vectors or (uniform) scalars, and otherwise holds VF*UF per-lane Values, for every VPValue. The hasVectorValue() and hasScalarValue() indicators should better be renamed hasValuePerPart() and hasValuePerLane(), respectively.

That would be another option, but a potential problem could be recipes that only have their first lane used, but one of their operands would only define a vector value per-part. With this patch, this will work as accessing the per-lane value will create an extract from the per-part vector.

Unrolling loop-regions in VPlan by UF would simplify State to hold a single per-part Value or VF per-lane Values. Further unrolling replicating-regions and replicating-recipes by VF would simplify State to hold a single Value per VPValue.

Agreed, hopefully I should be able to share a patch for that by next week. It will require introducing additional VPInstructions operating on scalars, so this patch and #80273 should help to get us there.

The current get() functions are overly complex, as they bridge gaps by packing VF scalar elements into a vector on-demand for a scalar producer to feed a vector consumer, by extracting VF scalar elements from a vector producer to feed a scalar-replicating consumer, and more. Introducing explicit pack and extract recipes in VPlan could simplify its code-gen and State, prior to unrollings.

That would be an alternative path, but I am quite close to getting ready to share unrolling as VPlan transformer, so I'd prefer to introduce explicit insert/extracts after explicit unrolling.

if (Def->isLiveIn())
return Def->getLiveInIRValue();
if (Def->isLiveIn()) {
if (Value *V = Def->getLiveInIRValue())
Contributor Author

Not necessarily at the moment: LiveIns don't have a defining recipe, but they may also not have a LiveInIRValue, as their value only becomes known later, for example VFxUF, BackedgeTakenCount or VectorTripCount, which are set in the state in prepareToExecute.

Comment on lines 524 to 529
if (getOpcode() == VPInstruction::ComputeReductionResult) {
State.set(this, GeneratedValue, VPIteration(Part, 0));
} else {
assert((State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) &&
"scalar value but not only first lane used");
State.set(this, GeneratedValue, VPIteration(Part, 0));
Contributor Author

Done, thanks!

@@ -526,12 +542,13 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
case Instruction::ICmp:
// TODO: Cover additional opcodes.
return vputils::onlyFirstLaneUsed(this);
case VPInstruction::ComputeReductionResult:
return true;
Contributor Author

Removed the ComputeReductionResult change here as it is not needed. Can split it off if preferred; the code just won't be exercised by any tests until this patch.

@@ -1389,6 +1389,13 @@ class VPScalarCastRecipe : public VPSingleDefRecipe {

/// Returns the result type of the cast.
Type *getResultType() const { return ResultTy; }

bool onlyFirstLaneUsed(const VPValue *Op) const override {
// At the moment, only scalar codegen is implemented.
Contributor Author

Updated!

Can split it off if preferred; the code just won't be exercised by any tests until this patch.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Feb 14, 2024
@ayalz
Collaborator

ayalz commented Feb 15, 2024

Perhaps it is the "other recipes" who should be fixed, to more consistently store their single scalar in per-part storage rather than lane zero? The current design of State is to hold UF per-part Values whenever possible, be they vectors or (uniform) scalars, and otherwise holds VF*UF per-lane Values, for every VPValue. The hasVectorValue() and hasScalarValue() indicators should better be renamed hasValuePerPart() and hasValuePerLane(), respectively.

That would be another option, but a potential problem could be recipes that only have their first lane used, but one of their operands would only define a vector value per-part. With this patch, this will work as accessing the per-lane value will create an extract from the per-part vector.

Would it be better to stick with the current design, and fix such cases of missing extracts?

Unrolling loop-regions in VPlan by UF would simplify State to hold a single per-part Value or VF per-lane Values. Further unrolling replicating-regions and replicating-recipes by VF would simplify State to hold a single Value per VPValue.

Agreed, hopefully I should be able to share a patch for that by next week. It will require introducing additional VPInstructions operating on scalars, so this patch and #80273 should help to get us there.

The current get() functions are overly complex, as they bridge gaps by packing VF scalar elements into a vector on-demand for a scalar producer to feed a vector consumer, by extracting VF scalar elements from a vector producer to feed a scalar-replicating consumer, and more. Introducing explicit pack and extract recipes in VPlan could simplify its code-gen and State, prior to unrollings.

That would be an alternative path, but I am quite close to getting ready to share unrolling as VPlan transformer, so I'd prefer to introduce explicit insert/extracts after explicit unrolling.

Sure, by all means; the comment was more future looking.

@ayalz
Collaborator

ayalz commented Feb 18, 2024

Perhaps it is the "other recipes" who should be fixed, to more consistently store their single scalar in per-part storage rather than lane zero? The current design of State is to hold UF per-part Values whenever possible, be they vectors or (uniform) scalars, and otherwise holds VF*UF per-lane Values, for every VPValue. The hasVectorValue() and hasScalarValue() indicators should better be renamed hasValuePerPart() and hasValuePerLane(), respectively.

That would be another option, but a potential problem could be recipes that only have their first lane used, but one of their operands would only define a vector value per-part. With this patch, this will work as accessing the per-lane value will create an extract from the per-part vector.

Would it be better to stick with the current design, and fix such cases of missing extracts?

Ahh, the only other recipe that may generate uniform ("after-vectorization") scalar values besides VPInstruction is the Replicate recipe - which already stores its scalar in lane zero rather than per-part...

OK, let's go with storing values-per-part in lane-zero, consistently. Suggest to wrap in a State.get(VPValue *value, unsigned part, bool IsScalarPerPart) API to make it clearer and easier to change later.
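
Such a wrapper could look like the following sketch (the set() overload the patch eventually adds appears verbatim further down; the get() counterpart here is an assumption following the same pattern):

// Convenience accessor: IsScalar selects the single scalar stored at
// (Part, 0) instead of the per-part (vector) value.
Value *getPerPartSketch(VPTransformState &State, VPValue *Def, unsigned Part,
                        bool IsScalar) {
  return IsScalar ? State.get(Def, VPIteration(Part, 0))
                  : State.get(Def, Part);
}

// Call sites then read as in the suggestions below, e.g.
//   PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0, /*IsScalar*/ true));
//   State.set(this, NextInChain, Part, /*IsScalar*/ true);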

@@ -530,8 +543,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::BranchOnCount:
// TODO: Cover additional operands.
return getOperand(0) == Op;
return true;
Collaborator

This is fine, can probably be pushed separately, if testable. (written before the separate push ;-)
CalculateTripCountMinusVF and CanonicalIVIncrementForPart have a single operand, so NFC for them.
ActiveLaneMask and BranchOnCount have two operands, the (first lane of the) IV and the scalar uniform trip count.

Contributor Author

Left as is for now, as this only is needed for this patch at the moment.

@@ -116,8 +116,7 @@ define void @test_widen(ptr noalias %a, ptr readnone %b) #4 {
; TFA_INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[TMP15]], i64 [[TMP17]]
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP13]], ptr [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
; TFA_INTERLEAVE-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP14]], ptr [[TMP18]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK2]])
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT:%.*]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX]], [[TMP6]]
; TFA_INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
Collaborator

reverse ping.

if (GeneratedValue->getType()->isVectorTy())
State.set(this, GeneratedValue, Part);
else {
assert((getOpcode() == VPInstruction::ComputeReductionResult ||
Collaborator

Explain/Why is ComputeReductionResult excluded, it generates a scalar Value but may have lanes other than first used?

Contributor Author

Added a note that it generates a scalar but has its last lane accessed via the generic VPLiveOut.

Collaborator

Ahh, right.
This in particular would be more clearly expressed by doing
State.set(this, GeneratedValue, Part, true /* scalar */);
rather than explicitly storing the single last lane directly in the first lane zero via
State.set(this, GeneratedValue, VPIteration(Part, 0));

Contributor Author

Updated and added variant of set taking an IsScalar argument.

@@ -530,8 +543,7 @@ bool VPInstruction::onlyFirstLaneUsed(const VPValue *Op) const {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::BranchOnCount:
Collaborator

Add case ComputeReductionResult?

Contributor Author

ComputeReductionResult combines the partial reduction vector values, so it should use more than the first lane per operand. It produces a single scalar value though.
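
In other words, the opcode consumes whole vectors but defines one scalar, roughly as in this heavily simplified, hypothetical illustration (the real lowering also handles the recurrence kind, start value and fast-math flags):

// Fold each part's partial reduction vector into a single running scalar.
Value *computeReductionResultSketch(IRBuilderBase &Builder, Value *Start,
                                    ArrayRef<Value *> PartialVectors) {
  Value *Acc = Start;
  for (Value *PartVec : PartialVectors)
    Acc = Builder.CreateFAddReduce(Acc, PartVec); // llvm.vector.reduce.fadd
  return Acc; // a single scalar, later stored at (Part, 0) by the recipe
}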

Collaborator

Ahh, right. Got confused with onlyFirstLaneDefined.

Comment on lines 521 to 523
if (GeneratedValue->getType()->isVectorTy())
State.set(this, GeneratedValue, Part);
else {
Collaborator

Suggested change
if (GeneratedValue->getType()->isVectorTy())
State.set(this, GeneratedValue, Part);
else {
if (GeneratedValue->getType()->isVectorTy()) {
State.set(this, GeneratedValue, Part);
} else {

Contributor Author

added braces, thanks

@@ -883,8 +888,11 @@ void VPlan::execute(VPTransformState *State) {

Collaborator

Suggested change
bool IsScalarNeeded = isa<VPCanonicalIVPHIRecipe>(PhiR);

Contributor Author

Added, thanks!

Comment on lines 891 to 895
Value *Val =
isa<VPCanonicalIVPHIRecipe>(PhiR)
? State->get(PhiR->getBackedgeValue(), VPIteration(Part, 0))
: State->get(PhiR->getBackedgeValue(),
SinglePartNeeded ? State->UF - 1 : Part);
Collaborator

Suggested change
Value *Val =
isa<VPCanonicalIVPHIRecipe>(PhiR)
? State->get(PhiR->getBackedgeValue(), VPIteration(Part, 0))
: State->get(PhiR->getBackedgeValue(),
SinglePartNeeded ? State->UF - 1 : Part);
unsigned PartNeeded = SinglePartNeeded ? State->UF - 1 : Part;
Value *Val = State->get(PhiR->getBackedgeValue(), PartNeeded, IsScalarNeeded);

using API suggested below.

Contributor Author

Added, thanks

if (Def->isLiveIn())
return Def->getLiveInIRValue();
if (Def->isLiveIn()) {
if (Value *V = Def->getLiveInIRValue())
Collaborator

Ahh, thanks for reminding. This may deserve a better representation to avoid confusion.

An alternative way of handling non-IR-valued live-ins is to complement them with one asap, i.e., have prepareToExecute() call setUnderlyingValue() on every such "symbolic invariant" variant of "live-in", instead of filling State. That way they will behave as any other live-in.

return V;
if (hasScalarValue(Def, VPIteration(0, 0))) {
return Data.PerPartScalars[Def][0][0];
}
Collaborator

else?
should this if be an assert?

Contributor Author

Code has been removed, thanks

@@ -788,13 +793,13 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
}

for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part)
State.set(&VectorTripCount, VectorTripCountV, Part);
State.set(&VectorTripCount, VectorTripCountV, VPIteration(Part, 0));
Collaborator

Independent of this patch: would be good to be consistent if/when a scalar that is invariant or uniform-across-parts is set in lane zero of all parts (e.g., VectorTripCount) or in lane zero of part zero only (e.g., VFxUF).

Contributor Author

Now the underlying value is set as suggested above, so neither are stored in State.

if (GeneratedValue->getType()->isVectorTy())
State.set(this, GeneratedValue, Part);
else {
assert((getOpcode() == VPInstruction::ComputeReductionResult ||
Collaborator

Ahh, right.
This in particular would be more clearly expressed by doing
State.set(this, GeneratedValue, Part, true /* scalar */);
rather than explicitly storing the single last lane directly in the first lane zero via
State.set(this, GeneratedValue, VPIteration(Part, 0));

@@ -9127,7 +9127,7 @@ void VPWidenPointerInductionRecipe::execute(VPTransformState &State) {
"Unexpected type.");

auto *IVR = getParent()->getPlan()->getCanonicalIV();
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0));
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, VPIteration(0, 0)));
Collaborator

Suggested change
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, VPIteration(0, 0)));
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0, true /* scalar */));

Contributor Author

Updated, thanks!

@@ -9243,7 +9243,7 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {

void VPReductionRecipe::execute(VPTransformState &State) {
assert(!State.Instance && "Reduction being replicated.");
Value *PrevInChain = State.get(getChainOp(), 0);
Value *PrevInChain = State.get(getChainOp(), VPIteration(0, 0));
Collaborator

Suggested change
Value *PrevInChain = State.get(getChainOp(), VPIteration(0, 0));
Value *PrevInChain = State.get(getChainOp(), 0, true /* scalar */);

Contributor Author

Updated, thanks!

@@ -9278,7 +9277,7 @@ void VPReductionRecipe::execute(VPTransformState &State) {
NewVecOp);
PrevInChain = NewRed;
} else {
PrevInChain = State.get(getChainOp(), Part);
PrevInChain = State.get(getChainOp(), VPIteration(Part, 0));
Collaborator

Suggested change
PrevInChain = State.get(getChainOp(), VPIteration(Part, 0));
PrevInChain = State.get(getChainOp(), Part, true /* scalar */);

Contributor Author

Updated, thanks!

@@ -9289,7 +9288,7 @@ void VPReductionRecipe::execute(VPTransformState &State) {
else
NextInChain = State.Builder.CreateBinOp(
(Instruction::BinaryOps)RdxDesc.getOpcode(Kind), NewRed, PrevInChain);
State.set(this, NextInChain, Part);
State.set(this, NextInChain, VPIteration(Part, 0));
Collaborator

Suggested change
State.set(this, NextInChain, VPIteration(Part, 0));
State.set(this, NextInChain, Part, true /* scalar */);

same API of get should apply to set as well.

Contributor Author

Updated, thanks!

@@ -1344,7 +1353,7 @@ void VPVectorPointerRecipe ::execute(VPTransformState &State) {
PartPtr = Builder.CreateGEP(IndexedTy, Ptr, Increment, "", InBounds);
}

State.set(this, PartPtr, Part);
State.set(this, PartPtr, VPIteration(Part, 0));
Collaborator

Suggested change
State.set(this, PartPtr, VPIteration(Part, 0));
State.set(this, PartPtr, Part, true /* scalar */);

Contributor Author

Updated, thanks!

@@ -1640,7 +1649,7 @@ void VPCanonicalIVPHIRecipe::execute(VPTransformState &State) {
EntryPart->addIncoming(Start, VectorPH);
EntryPart->setDebugLoc(getDebugLoc());
for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part)
State.set(this, EntryPart, Part);
State.set(this, EntryPart, VPIteration(Part, 0));
Collaborator

Suggested change
State.set(this, EntryPart, VPIteration(Part, 0));
State.set(this, EntryPart, Part, true /* scalar */);

Contributor Author

Updated, thanks!

@@ -1711,7 +1720,7 @@ void VPExpandSCEVRecipe::print(raw_ostream &O, const Twine &Indent,
#endif

void VPWidenCanonicalIVRecipe::execute(VPTransformState &State) {
Value *CanonicalIV = State.get(getOperand(0), 0);
Value *CanonicalIV = State.get(getOperand(0), VPIteration(0, 0));
Collaborator

Suggested change
Value *CanonicalIV = State.get(getOperand(0), VPIteration(0, 0));
Value *CanonicalIV = State.get(getOperand(0), 0, true /* scalar */);

Contributor Author

Updated, thanks!

@@ -1779,6 +1788,7 @@ void VPFirstOrderRecurrencePHIRecipe::print(raw_ostream &O, const Twine &Indent,
#endif

void VPReductionPHIRecipe::execute(VPTransformState &State) {
// TODO: Store scalar value for in-loop reductions as {Part, 0}.
Collaborator

TODO done?

Contributor Author

Removed, thanks!

Comment on lines 1814 to 1817
if (IsInLoop)
State.set(this, EntryPart, VPIteration(Part, 0));
else
State.set(this, EntryPart, Part);
Collaborator

Suggested change
if (IsInLoop)
State.set(this, EntryPart, VPIteration(Part, 0));
else
State.set(this, EntryPart, Part);
State.set(this, EntryPart, Part, IsInLoop);

Contributor Author

Updated, thanks!

Collaborator

@ayalz ayalz left a comment

Looks good to me, thanks for accommodating, adding a couple of last minor nits.

/// IsScalar is false. If \p IsScalar is true, set the scalar in (Part, 0).
void set(VPValue *Def, Value *V, unsigned Part, bool IsScalar = false) {
if (IsScalar) {
set(Def, V, VPIteration(Part, 0));
Collaborator

nit: can early return, leaving the rest intact, consistent with get() above.

Contributor Author

Done, thanks!

Comment on lines 516 to 517
if (GeneratedValue->getType()->isVectorTy()) {
State.set(this, GeneratedValue, Part);
Collaborator

nit: can fold into a single

Suggested change
if (GeneratedValue->getType()->isVectorTy()) {
State.set(this, GeneratedValue, Part);
bool isVector = GeneratedValue->getType()->isVectorTy();
State.set(this, GeneratedValue, Part, !isVector);
assert((isVector || getOpcode() == VPInstruction::ComputeReductionResult ||
State.VF.isScalar() || vputils::onlyFirstLaneUsed(this)) &&
"scalar value but not only first lane used");

Contributor Author

Adjusted, thanks!

@fhahn fhahn merged commit 911055e into main Feb 26, 2024
3 of 4 checks passed
@fhahn fhahn deleted the users/fhahn/vplan-uniform-scalar-lanes branch February 26, 2024 19:06
fhahn added a commit that referenced this pull request Mar 26, 2024
Add a new PtrAdd opcode to VPInstruction that corresponds to
IRBuilder::CreatePtrAdd, which creates a GEP with source element type
i8.

This is then used to model scalarizing VPWidenPointerInductionRecipe by
introducing scalar-steps to model the index increment followed by a
PtrAdd.

Note that PtrAdd needs to be able to generate code for only the first
lane or for all lanes. This may warrant introducing a separate recipe
for scalarizing that can be created without relying on the underlying
IR.

Depends on #80271

PR: #83068
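
For reference, the lowering described above amounts to an i8-typed GEP, i.e. adding a raw byte offset to a pointer (a hypothetical helper for illustration; per the commit message, IRBuilder::CreatePtrAdd emits the same form):

// PtrAdd(Ptr, Offset) corresponds to: getelementptr i8, ptr %Ptr, %Offset
Value *emitPtrAddSketch(IRBuilderBase &Builder, Value *Ptr, Value *Offset) {
  return Builder.CreateGEP(Builder.getInt8Ty(), Ptr, {Offset}, "ptr.add");
}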