Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VPlan] Model FOR extract of exit value in VPlan. #93395

Merged
merged 5 commits into from
Jun 3, 2024

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented May 26, 2024

This patch introduces a new ExtractFromEnd VPInstruction opcode to extract the value of a FOR for users outside the loop (i.e. in the scalar loop's exits). This moves the first part of fixing first order recurrences to VPlan, and removes some additional code to patch up live-outs, which is now handled automatically.

The majority of test changes is due to changes in the order of which the extracts are generated now. As we are now using VPTransformState to generate the extracts, we may be able to re-use existing extracts in the loop body in some cases. For scalable vectors, in some cases we now have to compute the runtime VF twice, as each extract is now independent, but those should be trivial to clean up for later passes (and in line with other places in the code that also liberally re-compute runtime VFs).

@llvmbot
Copy link
Collaborator

llvmbot commented May 26, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch introduces a new ExtractRecurrenceResult VPInstruction opcode to extract the value of a FOR for users outside the loop (i.e. in the scalar loop's exits). This moves the first part of fixing first order recurrences to VPlan, and removes some additional code to patch up live-outs, which is now handled automatically.

The majority of test changes is due to changes in the order of which the extracts are generated now. As we are now using VPTransformState to generate the extracts, we may be able to re-use existing extracts in the loop body in some cases. For scalable vectors, in some cases we now have to compute the runtime VF twice, as each extract is now independent, but those should be trivial to clean up for later passes (and in line with other places in the code that also liberally re-compute runtime VFs).


Patch is 36.50 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/93395.diff

10 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (-38)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+5-3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+27)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+8)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll (+28-28)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+11-13)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll (+8-4)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+2-1)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 48981a6bd39e3..71ee690a8fabd 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3536,44 +3536,6 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(
         Builder.CreateExtractElement(Incoming, LastIdx, "vector.recur.extract");
   }
 
-  auto RecurSplice = cast<VPInstruction>(*PhiR->user_begin());
-  assert(PhiR->getNumUsers() == 1 &&
-         RecurSplice->getOpcode() ==
-             VPInstruction::FirstOrderRecurrenceSplice &&
-         "recurrence phi must have a single user: FirstOrderRecurrenceSplice");
-  SmallVector<VPLiveOut *> LiveOuts;
-  for (VPUser *U : RecurSplice->users())
-    if (auto *LiveOut = dyn_cast<VPLiveOut>(U))
-      LiveOuts.push_back(LiveOut);
-
-  if (!LiveOuts.empty()) {
-    // Extract the second last element in the middle block if the
-    // Phi is used outside the loop. We need to extract the phi itself
-    // and not the last element (the phi update in the current iteration). This
-    // will be the value when jumping to the exit block from the
-    // LoopMiddleBlock, when the scalar loop is not run at all.
-    Value *ExtractForPhiUsedOutsideLoop = nullptr;
-    if (VF.isVector()) {
-      auto *Idx = Builder.CreateSub(RuntimeVF, ConstantInt::get(IdxTy, 2));
-      ExtractForPhiUsedOutsideLoop = Builder.CreateExtractElement(
-          Incoming, Idx, "vector.recur.extract.for.phi");
-    } else {
-      assert(UF > 1 && "VF and UF cannot both be 1");
-      // When loop is unrolled without vectorizing, initialize
-      // ExtractForPhiUsedOutsideLoop with the value just prior to unrolled
-      // value of `Incoming`. This is analogous to the vectorized case above:
-      // extracting the second last element when VF > 1.
-      ExtractForPhiUsedOutsideLoop = State.get(PreviousDef, UF - 2);
-    }
-
-    for (VPLiveOut *LiveOut : LiveOuts) {
-      assert(!Cost->requiresScalarEpilogue(VF.isVector()));
-      PHINode *LCSSAPhi = LiveOut->getPhi();
-      LCSSAPhi->addIncoming(ExtractForPhiUsedOutsideLoop, LoopMiddleBlock);
-      State.Plan->removeLiveOut(LCSSAPhi);
-    }
-  }
-
   // Fix the initial value of the original recurrence in the scalar loop.
   Builder.SetInsertPoint(LoopScalarPreHeader, LoopScalarPreHeader->begin());
   PHINode *Phi = cast<PHINode>(PhiR->getUnderlyingValue());
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 3aee17921086d..91cbd459675d9 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -167,8 +167,8 @@ class VPLane {
 
   static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }
 
-  static VPLane getLastLaneForVF(const ElementCount &VF) {
-    unsigned LaneOffset = VF.getKnownMinValue() - 1;
+  static VPLane getLastLaneForVF(const ElementCount &VF, unsigned Offset = 1) {
+    unsigned LaneOffset = VF.getKnownMinValue() - Offset;
     Kind LaneKind;
     if (VF.isScalable())
       // In this case 'LaneOffset' refers to the offset from the start of the
@@ -1179,6 +1179,7 @@ class VPInstruction : public VPRecipeWithIRFlags {
     BranchOnCount,
     BranchOnCond,
     ComputeReductionResult,
+    ExtractRecurrenceResult,
     LogicalAnd, // Non-poison propagating logical And.
     // Add an offset in bytes (second operand) to a base pointer (first
     // operand). Only generates scalar values (either for the first lane only or
@@ -3612,7 +3613,8 @@ inline bool isUniformAfterVectorization(VPValue *VPV) {
   if (auto *GEP = dyn_cast<VPWidenGEPRecipe>(Def))
     return all_of(GEP->operands(), isUniformAfterVectorization);
   if (auto *VPI = dyn_cast<VPInstruction>(Def))
-    return VPI->getOpcode() == VPInstruction::ComputeReductionResult;
+    return VPI->getOpcode() == VPInstruction::ComputeReductionResult ||
+           VPI->getOpcode() == VPInstruction::ExtractRecurrenceResult;
   return false;
 }
 } // end namespace vputils
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 5eb99ffd1e10e..82e3bd530a8d6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -137,6 +137,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
     case VPInstruction::Not:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
+    case VPInstruction::ExtractRecurrenceResult:
     case VPInstruction::LogicalAnd:
     case VPInstruction::PtrAdd:
       return false;
@@ -300,6 +301,7 @@ bool VPInstruction::canGenerateScalarForFirstLane() const {
   case VPInstruction::CalculateTripCountMinusVF:
   case VPInstruction::CanonicalIVIncrementForPart:
   case VPInstruction::ComputeReductionResult:
+  case VPInstruction::ExtractRecurrenceResult:
   case VPInstruction::PtrAdd:
   case VPInstruction::ExplicitVectorLength:
     return true;
@@ -558,6 +560,27 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
 
     return ReducedPartRdx;
   }
+  case VPInstruction::ExtractRecurrenceResult: {
+    if (Part != 0)
+      return State.get(this, 0, /*IsScalar*/ true);
+
+    // Extract the second last element in the middle block for users outside the
+    // loop.
+    Value *Res;
+    if (State.VF.isVector()) {
+      Res = State.get(
+          getOperand(0),
+          VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF, 2)));
+    } else {
+      assert(State.UF > 1 && "VF and UF cannot both be 1");
+      // When loop is unrolled without vectorizing, retrieve the value just
+      // prior to the final unrolled value. This is analogous to the vectorized
+      // case above: extracting the second last element when VF > 1.
+      Res = State.get(getOperand(0), State.UF - 2);
+    }
+    Res->setName(Name);
+    return Res;
+  }
   case VPInstruction::LogicalAnd: {
     Value *A = State.get(getOperand(0), Part);
     Value *B = State.get(getOperand(1), Part);
@@ -598,6 +621,7 @@ void VPInstruction::execute(VPTransformState &State) {
   bool GeneratesPerFirstLaneOnly =
       canGenerateScalarForFirstLane() &&
       (vputils::onlyFirstLaneUsed(this) ||
+       getOpcode() == VPInstruction::ExtractRecurrenceResult ||
        getOpcode() == VPInstruction::ComputeReductionResult);
   bool GeneratesPerAllLanes = doesGeneratePerAllLanes();
   for (unsigned Part = 0; Part < State.UF; ++Part) {
@@ -692,6 +716,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::BranchOnCount:
     O << "branch-on-count";
     break;
+  case VPInstruction::ExtractRecurrenceResult:
+    O << "extract-recurrence-result";
+    break;
   case VPInstruction::ComputeReductionResult:
     O << "compute-reduction-result";
     break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 422579ea8b84f..f4f43ed15615e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -812,6 +812,8 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
     if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
       RecurrencePhis.push_back(FOR);
 
+  VPBuilder MiddleBuilder(
+      cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
   for (VPFirstOrderRecurrencePHIRecipe *FOR : RecurrencePhis) {
     SmallPtrSet<VPFirstOrderRecurrencePHIRecipe *, 4> SeenPhis;
     VPRecipeBase *Previous = FOR->getBackedgeValue()->getDefiningRecipe();
@@ -843,6 +845,12 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
     // Set the first operand of RecurSplice to FOR again, after replacing
     // all users.
     RecurSplice->setOperand(0, FOR);
+
+    auto *Result = cast<VPInstruction>(MiddleBuilder.createNaryOp(
+        VPInstruction::ExtractRecurrenceResult, {FOR->getBackedgeValue()}, {},
+        "vector.recur.extract.for.phi"));
+    RecurSplice->replaceUsesWithIf(
+        Result, [](VPUser &U, unsigned) { return isa<VPLiveOut>(&U); });
   }
   return true;
 }
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
index 4d9c850abdf3d..162fb9c802dd2 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
@@ -126,9 +126,9 @@ define i64 @pointer_induction_only(ptr %start, ptr %end) {
 ; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:       middle.block:
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
 ; CHECK-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
@@ -191,8 +191,8 @@ define i64 @int_and_pointer_iv(ptr %start, i32 %N) {
 ; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP5]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP5]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP5]], i32 3
 ; CHECK-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
 ; CHECK-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
@@ -332,9 +332,9 @@ define i64 @test_ptr_ivs_and_widened_ivs(ptr %src, i32 %N) {
 ; DEFAULT-NEXT:    [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; DEFAULT-NEXT:    br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
 ; DEFAULT:       middle.block:
+; DEFAULT-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP15]], i32 2
 ; DEFAULT-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; DEFAULT-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP15]], i32 3
-; DEFAULT-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP15]], i32 2
 ; DEFAULT-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; DEFAULT:       scalar.ph:
 ; DEFAULT-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
index f2f58f89eb95d..718148a67fcc8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
@@ -870,9 +870,9 @@ define i8 @add_phifail2(ptr noalias nocapture readonly %p, ptr noalias nocapture
 ; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
 ; CHECK:       middle.block:
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <16 x i32> [[TMP6]], i32 14
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <16 x i32> [[TMP6]], i32 15
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <16 x i32> [[TMP6]], i32 14
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
 ; CHECK-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll b/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll
index c663d2b15b587..a1173c6b46a2c 100644
--- a/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll
+++ b/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll
@@ -18,10 +18,10 @@ define i16 @test_chained_first_order_recurrences_1(ptr %ptr) {
 ; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP8]], label %middle.block, label %vector.body
 ; CHECK:       middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ;
 entry:
   br label %loop
@@ -61,10 +61,10 @@ define i16 @test_chained_first_order_recurrences_2(ptr %ptr) {
 ; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP8]], label %middle.block, label %vector.body, !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:      middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI3:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ;
 entry:
   br label %loop
@@ -107,12 +107,12 @@ define i16 @test_chained_first_order_recurrences_3(ptr %ptr) {
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:      middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ;
 entry:
   br label %loop
@@ -219,12 +219,12 @@ define i16 @test_chained_first_order_recurrences_3_reordered_1(ptr %ptr) {
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:      middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ;
 entry:
   br label %loop
@@ -270,12 +270,12 @@ define i16 @test_chained_first_order_recurrences_3_reordered_2(ptr %ptr) {
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:      middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ;
 entry:
   br label %loop
@@ -321,12 +321,12 @@ define i16 @test_chained_first_order_recurrences_3_for2_no_other_uses(ptr %ptr)
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:      middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ;
 entry:
   br label %loop
@@ -371,12 +371,12 @@ define i16 @test_chained_first_order_recurrences_3_for1_for2_no_other_uses(ptr %
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:      middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
 ;
 entry:
   br label %loop
@@ -420,10 +420,10 @@ define double @test_chained_first_order_recurrence_sink_users_1(ptr %ptr) {
 ; CHECK-NEXT:    [[TMP9...
[truncated]

fhahn added a commit to fhahn/llvm-project that referenced this pull request May 26, 2024
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is
needed at the moment to conveniently let fixFixedOrderRecurrence still
handle creating and patching up the phis in the scalar pre-header and
header. As ExtractRecurrenceResume won't have any users yet, it is
marked as having side-effects until the phi in the scalar pre-header is
also created and managed by VPlan.

Depends on llvm#93395
Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice step populating middle block with recipes!

@@ -167,8 +167,8 @@ class VPLane {

static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }

static VPLane getLastLaneForVF(const ElementCount &VF) {
unsigned LaneOffset = VF.getKnownMinValue() - 1;
static VPLane getLastLaneForVF(const ElementCount &VF, unsigned Offset = 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last is no longer accurate. Plus worth asserting that Offset is not larger than VF.
How about adding getPenultimateLaneForVF(VF) instead, which together with getLastLaneForVF(VF) call a private getLaneFromEnd(VF, OffsetFromEnd)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generalized the patch to use ExtractFromEnd, added public getLaneFromEnd for that use case

@@ -558,6 +560,27 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {

return ReducedPartRdx;
}
case VPInstruction::ExtractRecurrenceResult: {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a VPInstruction seems independent of Recurrence Result, and could hopefully be used elsewhere if generalized(?).
How about calling it ExtractPenultimateElement? Or ExtractElementFromEnd with another operand conveying which element to extract, from the end...
Admittedly hard to see reuse opportunities atm, but better focus on what it does than what it's used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generalized to ExtractFromEnd and will update #93396 to use it to extract the resume value as well

if (Part != 0)
return State.get(this, 0, /*IsScalar*/ true);

// Extract the second last element in the middle block for users outside the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Extract the second last element in the middle block for users outside the
// Extract the penultimate element from the operand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is now gone after generalizing to ExtractFromEnd

assert(State.UF > 1 && "VF and UF cannot both be 1");
// When loop is unrolled without vectorizing, retrieve the value just
// prior to the final unrolled value. This is analogous to the vectorized
// case above: extracting the second last element when VF > 1.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// case above: extracting the second last element when VF > 1.
// case above: extracting the penultimate element when VF > 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been generalized to ExtractFromEnd, so the comment is gone

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment still appears? Anyhow, "else" case corresponds to retrieving part UF - Offset "vertically".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

// case above: extracting the second last element when VF > 1.
Res = State.get(getOperand(0), State.UF - 2);
}
Res->setName(Name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Name was found useless and destined to removal some patches ago: #81411 (review)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to retain the original name for the IR values, which still seems useful to keep the IR readable and reduce the test changes (in particular it makes it easier to trace back to where the IR instruction is coming from)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are available via Underlying Values? This Name was originally designed to facilitate new names, and as such appear redundant. (Issue independent of this patch.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for cases where there is no underlying value, which is the case for this opcode and (most?) custom VPInstruction opcodes.

if (State.VF.isVector()) {
Res = State.get(
getOperand(0),
VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF, 2)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nicer if State.VF - 2 could return the desired VPlane, analogous to State.UF - 1 returning the desired part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done via implementing operator- for VPLane as follow-up if desired.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operator- would naturally return an ElementCount. Perhaps operator[] e.g., VF[0], VF[-1], VF[-2] to obtain first, last and penultimate VPlanes, respectively (and possibly VF[1] for 2nd if useful), although admittedly first is computes regardless of VF.

@@ -598,6 +621,7 @@ void VPInstruction::execute(VPTransformState &State) {
bool GeneratesPerFirstLaneOnly =
canGenerateScalarForFirstLane() &&
(vputils::onlyFirstLaneUsed(this) ||
getOpcode() == VPInstruction::ExtractRecurrenceResult ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps worth introducing an isReduction trait to classify vector-to-scalar recipes including ComputeReductionResult and Extract*'s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do, but reduction may not necessarily be a good fit for ExtractFromEnd. For now, they both generate a single scalar, either by reducing or extracting. How about hasSingleScalarResult?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps isVectorToScalar: trying to capture the inverse behavior of broadcast, rather than only the type of the result, which may depend on the types of the operands - the sum of uniform / single scalar per part values also produces a uniform / single scalar per part value.

@@ -1223,11 +1222,11 @@ define i32 @PR30183(i32 %pre_load, ptr %a, ptr %b, i64 %n) {
; SINK-AFTER-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP10]]
; SINK-AFTER-NEXT: [[TMP15:%.*]] = load i32, ptr [[TMP11]], align 4
; SINK-AFTER-NEXT: [[TMP16:%.*]] = load i32, ptr [[TMP12]], align 4
; SINK-AFTER-NEXT: [[TMP17:%.*]] = load i32, ptr [[TMP13]], align 4
; SINK-AFTER-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = load i32, ptr [[TMP13]], align 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This VECTOR_RECUR_EXTRACT_FOR_PHI is a load? Another occurrence above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a consequence of using State.get() to generate extracts, which will re-use existing extracts. (also mentioned in the commit message)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be hopefully clarified when ExtractAllLanes is introduced to represent in VPlan vector-to-all-scalar extractions explicitly rather than having State.get() deal with them on demand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@@ -812,6 +812,8 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
RecurrencePhis.push_back(FOR);

VPBuilder MiddleBuilder(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth renaming Builder here to, say, LoopBuilder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

@fhahn fhahn force-pushed the vplan-compute-for-result branch from 8b4bffc to 07a955d Compare June 1, 2024 11:26
Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and generalized to ExtracFromEnd opcode, thanks!

// case above: extracting the second last element when VF > 1.
Res = State.get(getOperand(0), State.UF - 2);
}
Res->setName(Name);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to retain the original name for the IR values, which still seems useful to keep the IR readable and reduce the test changes (in particular it makes it easier to trace back to where the IR instruction is coming from)

assert(State.UF > 1 && "VF and UF cannot both be 1");
// When loop is unrolled without vectorizing, retrieve the value just
// prior to the final unrolled value. This is analogous to the vectorized
// case above: extracting the second last element when VF > 1.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been generalized to ExtractFromEnd, so the comment is gone

if (State.VF.isVector()) {
Res = State.get(
getOperand(0),
VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF, 2)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done via implementing operator- for VPLane as follow-up if desired.

if (Part != 0)
return State.get(this, 0, /*IsScalar*/ true);

// Extract the second last element in the middle block for users outside the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is now gone after generalizing to ExtractFromEnd

@@ -558,6 +560,27 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {

return ReducedPartRdx;
}
case VPInstruction::ExtractRecurrenceResult: {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generalized to ExtractFromEnd and will update #93396 to use it to extract the resume value as well

@@ -167,8 +167,8 @@ class VPLane {

static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }

static VPLane getLastLaneForVF(const ElementCount &VF) {
unsigned LaneOffset = VF.getKnownMinValue() - 1;
static VPLane getLastLaneForVF(const ElementCount &VF, unsigned Offset = 1) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generalized the patch to use ExtractFromEnd, added public getLaneFromEnd for that use case

@@ -598,6 +621,7 @@ void VPInstruction::execute(VPTransformState &State) {
bool GeneratesPerFirstLaneOnly =
canGenerateScalarForFirstLane() &&
(vputils::onlyFirstLaneUsed(this) ||
getOpcode() == VPInstruction::ExtractRecurrenceResult ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do, but reduction may not necessarily be a good fit for ExtractFromEnd. For now, they both generate a single scalar, either by reducing or extracting. How about hasSingleScalarResult?

@@ -812,6 +812,8 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
RecurrencePhis.push_back(FOR);

VPBuilder MiddleBuilder(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

@@ -1223,11 +1222,11 @@ define i32 @PR30183(i32 %pre_load, ptr %a, ptr %b, i64 %n) {
; SINK-AFTER-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP10]]
; SINK-AFTER-NEXT: [[TMP15:%.*]] = load i32, ptr [[TMP11]], align 4
; SINK-AFTER-NEXT: [[TMP16:%.*]] = load i32, ptr [[TMP12]], align 4
; SINK-AFTER-NEXT: [[TMP17:%.*]] = load i32, ptr [[TMP13]], align 4
; SINK-AFTER-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = load i32, ptr [[TMP13]], align 4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a consequence of using State.get() to generate extracts, which will re-use existing extracts. (also mentioned in the commit message)

fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 1, 2024
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is
needed at the moment to conveniently let fixFixedOrderRecurrence still
handle creating and patching up the phis in the scalar pre-header and
header. As ExtractRecurrenceResume won't have any users yet, it is
marked as having side-effects until the phi in the scalar pre-header is
also created and managed by VPlan.

Depends on llvm#93395
@@ -167,8 +167,8 @@ class VPLane {

static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }

static VPLane getLastLaneForVF(const ElementCount &VF) {
unsigned LaneOffset = VF.getKnownMinValue() - 1;
static VPLane getLaneFromEnd(const ElementCount &VF, unsigned Offset) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works "horizontally" for Offsets between 1 and VF, inclusive, if VF >= 2,
and "vertically" for Offsets between 1 and UF, inclusive, if VF = 1, implying UF >= 2.
I.e., this works in general only for Offset = 1 and Offset = 2 across possible {VF, UF} combinations.
May be good to keep private and expose only public last and penultimate cases. And/or assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth supporting any lane w.r.t. to he constraints that it is <= VF in the vector case or <= UF for the scalar case. Added an assert for now. This also avoids explicitly checking the lane and delegating to multiple helper functions.

if (Part != 0)
return State.get(this, 0, /*IsScalar*/ true);

auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document somewhere that 2nd operand of ExtractFromEnd must be a constant, either 1 or 2.

This patch exercises Offset = 2 and follow-up #93396 exercises Offset = 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to where the opcode is defined, thanks. I kept the code general for now, with the constraints that the lane to extract must be <= VF when vectorizing or <= UF otherwise.

auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
unsigned Offset = CI->getZExtValue();

// Extract lane VF - Offset in the from the operand.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Extract lane VF - Offset in the from the operand.
// Extract lane VF - Offset from the operand.

Better place under the "if", as it applies to the horizontal "then" case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved and updated, thanks!

// case above: extracting the second last element when VF > 1.
Res = State.get(getOperand(0), State.UF - 2);
}
Res->setName(Name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are available via Underlying Values? This Name was originally designed to facilitate new names, and as such appear redundant. (Issue independent of this patch.)

@@ -1223,11 +1222,11 @@ define i32 @PR30183(i32 %pre_load, ptr %a, ptr %b, i64 %n) {
; SINK-AFTER-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP10]]
; SINK-AFTER-NEXT: [[TMP15:%.*]] = load i32, ptr [[TMP11]], align 4
; SINK-AFTER-NEXT: [[TMP16:%.*]] = load i32, ptr [[TMP12]], align 4
; SINK-AFTER-NEXT: [[TMP17:%.*]] = load i32, ptr [[TMP13]], align 4
; SINK-AFTER-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = load i32, ptr [[TMP13]], align 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be hopefully clarified when ExtractAllLanes is introduced to represent in VPlan vector-to-all-scalar extractions explicitly rather than having State.get() deal with them on demand.

if (State.VF.isVector()) {
Res = State.get(
getOperand(0),
VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF, 2)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operator- would naturally return an ElementCount. Perhaps operator[] e.g., VF[0], VF[-1], VF[-2] to obtain first, last and penultimate VPlanes, respectively (and possibly VF[1] for 2nd if useful), although admittedly first is computes regardless of VF.

@@ -598,6 +621,7 @@ void VPInstruction::execute(VPTransformState &State) {
bool GeneratesPerFirstLaneOnly =
canGenerateScalarForFirstLane() &&
(vputils::onlyFirstLaneUsed(this) ||
getOpcode() == VPInstruction::ExtractRecurrenceResult ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps isVectorToScalar: trying to capture the inverse behavior of broadcast, rather than only the type of the result, which may depend on the types of the operands - the sum of uniform / single scalar per part values also produces a uniform / single scalar per part value.

assert(State.UF > 1 && "VF and UF cannot both be 1");
// When loop is unrolled without vectorizing, retrieve the value just
// prior to the final unrolled value. This is analogous to the vectorized
// case above: extracting the second last element when VF > 1.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment still appears? Anyhow, "else" case corresponds to retrieving part UF - Offset "vertically".

Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed latest comments, thanks!

Added new isVectorToScalar as suggested.

Operator- would naturally return an ElementCount
An issue with that may be that we would need to create a VPLane from the element count first, to apply it's operator-, which may be almost as verbose as currently, which is a bit more explicit.

if (Part != 0)
return State.get(this, 0, /*IsScalar*/ true);

auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to where the opcode is defined, thanks. I kept the code general for now, with the constraints that the lane to extract must be <= VF when vectorizing or <= UF otherwise.

@@ -167,8 +167,8 @@ class VPLane {

static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }

static VPLane getLastLaneForVF(const ElementCount &VF) {
unsigned LaneOffset = VF.getKnownMinValue() - 1;
static VPLane getLaneFromEnd(const ElementCount &VF, unsigned Offset) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth supporting any lane w.r.t. to he constraints that it is <= VF in the vector case or <= UF for the scalar case. Added an assert for now. This also avoids explicitly checking the lane and delegating to multiple helper functions.

auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
unsigned Offset = CI->getZExtValue();

// Extract lane VF - Offset in the from the operand.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved and updated, thanks!

assert(State.UF > 1 && "VF and UF cannot both be 1");
// When loop is unrolled without vectorizing, retrieve the value just
// prior to the final unrolled value. This is analogous to the vectorized
// case above: extracting the second last element when VF > 1.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

// case above: extracting the second last element when VF > 1.
Res = State.get(getOperand(0), State.UF - 2);
}
Res->setName(Name);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for cases where there is no underlying value, which is the case for this opcode and (most?) custom VPInstruction opcodes.

@@ -1223,11 +1222,11 @@ define i32 @PR30183(i32 %pre_load, ptr %a, ptr %b, i64 %n) {
; SINK-AFTER-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP10]]
; SINK-AFTER-NEXT: [[TMP15:%.*]] = load i32, ptr [[TMP11]], align 4
; SINK-AFTER-NEXT: [[TMP16:%.*]] = load i32, ptr [[TMP12]], align 4
; SINK-AFTER-NEXT: [[TMP17:%.*]] = load i32, ptr [[TMP13]], align 4
; SINK-AFTER-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = load i32, ptr [[TMP13]], align 4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@fhahn fhahn force-pushed the vplan-compute-for-result branch from 07a955d to faf35ac Compare June 3, 2024 09:19
Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for accommodating! Adding some final suggestions.

@@ -1182,6 +1188,11 @@ class VPInstruction : public VPRecipeWithIRFlags {
BranchOnCount,
BranchOnCond,
ComputeReductionResult,
// Takes the VPValue to extract from as first operand and the lane to
// extract from as second operand. The second operand must be a constant and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// extract from as second operand. The second operand must be a constant and
// extract as second operand, counting from the end starting with 1 for last. The second operand must be a positive constant and

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!

@@ -1182,6 +1188,11 @@ class VPInstruction : public VPRecipeWithIRFlags {
BranchOnCount,
BranchOnCond,
ComputeReductionResult,
// Takes the VPValue to extract from as first operand and the lane to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Takes the VPValue to extract from as first operand and the lane to
// Takes the VPValue to extract from as first operand and the lane or part to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!

@@ -1182,6 +1188,11 @@ class VPInstruction : public VPRecipeWithIRFlags {
BranchOnCount,
BranchOnCond,
ComputeReductionResult,
// Takes the VPValue to extract from as first operand and the lane to
// extract from as second operand. The second operand must be a constant and
// <= VF when extracting from a vector or <= UF when extracting from a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// <= VF when extracting from a vector or <= UF when extracting from a
// <= VF when extracting from a vector or <= UF when extracting from an unrolled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!

@@ -1219,6 +1230,10 @@ class VPInstruction : public VPRecipeWithIRFlags {
/// value for lane \p Lane.
Value *generatePerLane(VPTransformState &State, const VPIteration &Lane);

/// Returns true if this VPInstruction converts a vector value to a scalar,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns true if this VPInstruction converts a vector value to a scalar,
/// Returns true if this VPInstruction produces a scalar value from a vector,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!

@@ -137,6 +137,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPInstruction::Not:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::ExtractFromEnd:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed in this patch, given the LiveOut user, or only for the follow-up patch extracting the last element?
Worth a TODO? Temporarily to keep alive until VPUsers are introduced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed for the patch, as it always adds the extract and RAUW outside the loop, relying on DCE to clean up unneeded extracts.

Copy link
Collaborator

@ayalz ayalz Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this marks ExtractFromEnd as not having side effects.


auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
unsigned Offset = CI->getZExtValue();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(Offset > 0 && "Offset from end must be positive");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

Comment on lines 3675 to 3676
return VPI->getOpcode() == VPInstruction::ComputeReductionResult ||
VPI->getOpcode() == VPInstruction::ExtractFromEnd;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return VPI->getOpcode() == VPInstruction::ComputeReductionResult ||
VPI->getOpcode() == VPInstruction::ExtractFromEnd;
return VPI->isVectorToScalar();

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

static VPLane getLastLaneForVF(const ElementCount &VF) {
unsigned LaneOffset = VF.getKnownMinValue() - 1;
static VPLane getLaneFromEnd(const ElementCount &VF, unsigned Offset) {
assert(Offset <= VF.getKnownMinValue() &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(Offset <= VF.getKnownMinValue() &&
assert(Offset > 0 && Offset <= VF.getKnownMinValue() &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

VPInstruction::ExtractFromEnd,
{FOR->getBackedgeValue(),
Plan.getOrAddLiveIn(ConstantInt::get(IntTy, 2))},
{}, "vector.recur.extract.for.phi"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, this is the unique Name assigned to this VPInstruction instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

fhahn added 5 commits June 3, 2024 15:41
This patch introduces a new ExtractRecurrenceResult VPInstruction opcode
to extract the value of a FOR for users outside the loop (i.e. in the
scalar loop's exits). This moves the first part of fixing first order
recurrences to VPlan, and removes some additional code to patch up
live-outs, which is now handled automatically.

The majority of test changes is due to changes in the order of which the
extracts are generated now. As we are now using VPTransformState to
generate the extracts, we may be able to re-use existing extracts in the
loop body in some cases. For scalable vectors, in some cases we now have
to compute the runtime VF twice, as each extract is now independent, but
those should be trivial to clean up for later passes (and in line with
other places in the code that also liberally re-compute runtime VFs).
@fhahn fhahn force-pushed the vplan-compute-for-result branch from faf35ac to fb5997b Compare June 3, 2024 14:58
@fhahn fhahn merged commit 07b3301 into llvm:main Jun 3, 2024
7 checks passed
@fhahn fhahn deleted the vplan-compute-for-result branch June 3, 2024 19:20
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 3, 2024
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is
needed at the moment to conveniently let fixFixedOrderRecurrence still
handle creating and patching up the phis in the scalar pre-header and
header. As ExtractRecurrenceResume won't have any users yet, it is
marked as having side-effects until the phi in the scalar pre-header is
also created and managed by VPlan.

Depends on llvm#93395
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 4, 2024
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is
needed at the moment to conveniently let fixFixedOrderRecurrence still
handle creating and patching up the phis in the scalar pre-header and
header. As ExtractRecurrenceResume won't have any users yet, it is
marked as having side-effects until the phi in the scalar pre-header is
also created and managed by VPlan.

Depends on llvm#93395
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 4, 2024
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is
needed at the moment to conveniently let fixFixedOrderRecurrence still
handle creating and patching up the phis in the scalar pre-header and
header. As ExtractRecurrenceResume won't have any users yet, it is
marked as having side-effects until the phi in the scalar pre-header is
also created and managed by VPlan.

Depends on llvm#93395
fhahn added a commit to fhahn/llvm-project that referenced this pull request Jun 5, 2024
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is
needed at the moment to conveniently let fixFixedOrderRecurrence still
handle creating and patching up the phis in the scalar pre-header and
header. As ExtractRecurrenceResume won't have any users yet, it is
marked as having side-effects until the phi in the scalar pre-header is
also created and managed by VPlan.

Depends on llvm#93395
fhahn added a commit that referenced this pull request Jun 5, 2024
This patch uses the ExtractFromEnd VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.

It adds a new live-out that temporarily wraps the FOR phi in the scalar
loop. fixFixedOrderRecurrence will process live outs for fixed order
recurrence phis by creating a new phi node in the scalar preheader, 
using the generated value for the live-out as incoming value from the
middle block and the original start value as incoming value for the
other edge. Creation of the phi in the preheader, as well as updating
the phi in the scalar loop will also be moved to VPlan in the future,
eventually retiring fixFixedOrderRecurrence

Depends on #93395

PR: #93396
fhahn added a commit that referenced this pull request Jul 4, 2024
The assertion added as part of #93395
surfaced cases where first-order recurrences are vectorized with
<vscale x 1 x ..>. If vscale is 1, then we are unable to extract the
penultimate value (second to last lane). Previously this case got
mis-compiled, trying to extract from an invalid lane (-1)
https://llvm.godbolt.org/z/3adzYYcf9.

Fixes #97452.
kbluck pushed a commit to kbluck/llvm-project that referenced this pull request Jul 6, 2024
The assertion added as part of llvm#93395
surfaced cases where first-order recurrences are vectorized with
<vscale x 1 x ..>. If vscale is 1, then we are unable to extract the
penultimate value (second to last lane). Previously this case got
mis-compiled, trying to extract from an invalid lane (-1)
https://llvm.godbolt.org/z/3adzYYcf9.

Fixes llvm#97452.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants