[VPlan] Introduce ComputeReductionResult VPInstruction opcode. #70253

fhahn · 2023-10-25T21:00:47Z

This patch introduces a new ComputeReductionResult opcode to compute the final reduction result in the middle block. The code from fixReduction has been moved to ComputeReductionResult, after some earlier cleanup changes to model parts of fixReduction explicitly elsewhere as needed.

The recipe may be broken down further in the future.

Note that the ComputeReductionResult at the moment also creates the phi node to merge the reduction result from the trip count check and the middle block, to be used as resume value for the scalar remainder loop. Once we have a VPValue for the reduction result, this can also be modeled explicitly and moved out of the recipe.

llvmbot · 2023-10-25T21:02:24Z

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch introduces a new ComputeReductionResult opcode to compute the final reduction result in the middle block. The code from fixReduction has been moved to ComputeReductionResult, after some earlier cleanup changes to model parts of fixReduction explicitly elsewhere as needed.

The recipe may be broken down further in the future.

Note that the ComputeReductionResult at the moment also creates the phi node to merge the reduction result from the trip count check and the middle block, to be used as resume value for the scalar remainder loop. Once we have a VPValue for the reduction result, this can also be modeled explicitly and moved out of the recipe.

Patch is 28.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70253.diff

7 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+7-4)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+64-190)
(modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+1)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2-1)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+136)
(modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+2-1)
(modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+5-2)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 577ce8000de27b9..a4ead0523a2fb70 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -347,10 +347,13 @@ class LoopVectorizationPlanner {
   /// vectorization re-using plans for both the main and epilogue vector loops.
   /// It should be removed once the re-use issue has been fixed.
   /// \p ExpandedSCEVs is passed during execution of the plan for epilogue loop
-  /// to re-use expansion results generated during main plan execution. Returns
-  /// a mapping of SCEVs to their expanded IR values. Note that this is a
-  /// temporary workaround needed due to the current epilogue handling.
-  DenseMap<const SCEV *, Value *>
+  /// to re-use expansion results generated during main plan execution.
+  ///
+  /// Returns a mapping of SCEVs to their expanded IR values and a mapping for
+  /// the reduction resume values. Note that this is a temporary workaround
+  /// needed due to the current epilogue handling.
+  std::pair<DenseMap<const SCEV *, Value *>,
+            DenseMap<const RecurrenceDescriptor *, Value *>>
   executePlan(ElementCount VF, unsigned UF, VPlan &BestPlan,
               InnerLoopVectorizer &LB, DominatorTree *DT,
               bool IsEpilogueVectorization,
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 0807d2a7e5a2671..13267ec5017c181 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -566,10 +566,6 @@ class InnerLoopVectorizer {
   /// able to vectorize with strict in-order reductions for the given RdxDesc.
   bool useOrderedReductions(const RecurrenceDescriptor &RdxDesc);
 
-  // Returns the resume value (bc.merge.rdx) for a reduction as
-  // generated by fixReduction.
-  PHINode *getReductionResumeValue(const RecurrenceDescriptor &RdxDesc);
-
   /// Create a new phi node for the induction variable \p OrigPhi to resume
   /// iteration count in the scalar epilogue, from where the vectorized loop
   /// left off. \p Step is the SCEV-expanded induction step to use. In cases
@@ -1151,14 +1147,6 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
   }
 }
 
-PHINode *InnerLoopVectorizer::getReductionResumeValue(
-    const RecurrenceDescriptor &RdxDesc) {
-  auto It = ReductionResumeValues.find(&RdxDesc);
-  assert(It != ReductionResumeValues.end() &&
-         "Expected to find a resume value for the reduction.");
-  return It->second;
-}
-
 namespace llvm {
 
 // Loop vectorization cost-model hints how the scalar epilogue loop should be
@@ -3617,11 +3605,6 @@ void InnerLoopVectorizer::fixCrossIterationPHIs(VPTransformState &State) {
   VPBasicBlock *Header =
       State.Plan->getVectorLoopRegion()->getEntryBasicBlock();
 
-  for (VPRecipeBase &R : Header->phis()) {
-    if (auto *ReductionPhi = dyn_cast<VPReductionPHIRecipe>(&R))
-      fixReduction(ReductionPhi, State);
-  }
-
   for (VPRecipeBase &R : Header->phis()) {
     if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
       fixFixedOrderRecurrence(FOR, State);
@@ -3747,169 +3730,6 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(
   Phi->setName("scalar.recur");
 }
 
-void InnerLoopVectorizer::fixReduction(VPReductionPHIRecipe *PhiR,
-                                       VPTransformState &State) {
-  PHINode *OrigPhi = cast<PHINode>(PhiR->getUnderlyingValue());
-  // Get it's reduction variable descriptor.
-  assert(Legal->isReductionVariable(OrigPhi) &&
-         "Unable to find the reduction variable");
-  const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
-
-  RecurKind RK = RdxDesc.getRecurrenceKind();
-  TrackingVH<Value> ReductionStartValue = RdxDesc.getRecurrenceStartValue();
-  Instruction *LoopExitInst = RdxDesc.getLoopExitInstr();
-  if (auto *I = dyn_cast<Instruction>(&*ReductionStartValue))
-    State.setDebugLocFrom(I->getDebugLoc());
-
-  VPValue *LoopExitInstDef = PhiR->getBackedgeValue();
-
-  // Before each round, move the insertion point right between
-  // the PHIs and the values we are going to write.
-  // This allows us to write both PHINodes and the extractelement
-  // instructions.
-  Builder.SetInsertPoint(LoopMiddleBlock,
-                         LoopMiddleBlock->getFirstInsertionPt());
-
-  State.setDebugLocFrom(LoopExitInst->getDebugLoc());
-
-  Type *PhiTy = OrigPhi->getType();
-  // If tail is folded by masking, the vector value to leave the loop should be
-  // a Select choosing between the vectorized LoopExitInst and vectorized Phi,
-  // instead of the former. For an inloop reduction the reduction will already
-  // be predicated, and does not need to be handled here.
-  if (Cost->foldTailByMasking() && !PhiR->isInLoop()) {
-    VPValue *Def = nullptr;
-    for (VPUser *U : LoopExitInstDef->users()) {
-      auto *S = dyn_cast<VPInstruction>(U);
-      if (S && S->getOpcode() == Instruction::Select) {
-        Def = S;
-        break;
-      }
-    }
-    if (Def)
-      LoopExitInstDef = Def;
-  }
-
-  VectorParts RdxParts(UF);
-  for (unsigned Part = 0; Part < UF; ++Part)
-    RdxParts[Part] = State.get(LoopExitInstDef, Part);
-
-  // If the vector reduction can be performed in a smaller type, we truncate
-  // then extend the loop exit value to enable InstCombine to evaluate the
-  // entire expression in the smaller type.
-  if (VF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
-    Builder.SetInsertPoint(LoopMiddleBlock,
-                           LoopMiddleBlock->getFirstInsertionPt());
-    Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF);
-    for (unsigned Part = 0; Part < UF; ++Part) {
-      RdxParts[Part] = Builder.CreateTrunc(RdxParts[Part], RdxVecTy);
-    }
-  }
-
-  // Reduce all of the unrolled parts into a single vector.
-  Value *ReducedPartRdx = RdxParts[0];
-  unsigned Op = RecurrenceDescriptor::getOpcode(RK);
-
-  // The middle block terminator has already been assigned a DebugLoc here (the
-  // OrigLoop's single latch terminator). We want the whole middle block to
-  // appear to execute on this line because: (a) it is all compiler generated,
-  // (b) these instructions are always executed after evaluating the latch
-  // conditional branch, and (c) other passes may add new predecessors which
-  // terminate on this line. This is the easiest way to ensure we don't
-  // accidentally cause an extra step back into the loop while debugging.
-  State.setDebugLocFrom(LoopMiddleBlock->getTerminator()->getDebugLoc());
-  if (PhiR->isOrdered())
-    ReducedPartRdx = RdxParts[UF - 1];
-  else {
-    // Floating-point operations should have some FMF to enable the reduction.
-    IRBuilderBase::FastMathFlagGuard FMFG(Builder);
-    Builder.setFastMathFlags(RdxDesc.getFastMathFlags());
-    for (unsigned Part = 1; Part < UF; ++Part) {
-      Value *RdxPart = RdxParts[Part];
-      if (Op != Instruction::ICmp && Op != Instruction::FCmp)
-        ReducedPartRdx = Builder.CreateBinOp(
-            (Instruction::BinaryOps)Op, RdxPart, ReducedPartRdx, "bin.rdx");
-      else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(RK))
-        ReducedPartRdx = createAnyOfOp(Builder, ReductionStartValue, RK,
-                                       ReducedPartRdx, RdxPart);
-      else
-        ReducedPartRdx = createMinMaxOp(Builder, RK, ReducedPartRdx, RdxPart);
-    }
-  }
-
-  // Create the reduction after the loop. Note that inloop reductions create the
-  // target reduction in the loop using a Reduction recipe.
-  if (VF.isVector() && !PhiR->isInLoop()) {
-    ReducedPartRdx =
-        createTargetReduction(Builder, RdxDesc, ReducedPartRdx, OrigPhi);
-    // If the reduction can be performed in a smaller type, we need to extend
-    // the reduction to the wider type before we branch to the original loop.
-    if (PhiTy != RdxDesc.getRecurrenceType())
-      ReducedPartRdx = RdxDesc.isSigned()
-                           ? Builder.CreateSExt(ReducedPartRdx, PhiTy)
-                           : Builder.CreateZExt(ReducedPartRdx, PhiTy);
-  }
-
-  PHINode *ResumePhi =
-      dyn_cast<PHINode>(PhiR->getStartValue()->getUnderlyingValue());
-
-  // Create a phi node that merges control-flow from the backedge-taken check
-  // block and the middle block.
-  PHINode *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",
-                                        LoopScalarPreHeader->getTerminator());
-
-  // If we are fixing reductions in the epilogue loop then we should already
-  // have created a bc.merge.rdx Phi after the main vector body. Ensure that
-  // we carry over the incoming values correctly.
-  for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
-    if (Incoming == LoopMiddleBlock)
-      BCBlockPhi->addIncoming(ReducedPartRdx, Incoming);
-    else if (ResumePhi && llvm::is_contained(ResumePhi->blocks(), Incoming))
-      BCBlockPhi->addIncoming(ResumePhi->getIncomingValueForBlock(Incoming),
-                              Incoming);
-    else
-      BCBlockPhi->addIncoming(ReductionStartValue, Incoming);
-  }
-
-  // Set the resume value for this reduction
-  ReductionResumeValues.insert({&RdxDesc, BCBlockPhi});
-
-  // If there were stores of the reduction value to a uniform memory address
-  // inside the loop, create the final store here.
-  if (StoreInst *SI = RdxDesc.IntermediateStore) {
-    StoreInst *NewSI =
-        Builder.CreateAlignedStore(ReducedPartRdx, SI->getPointerOperand(),
-                                   SI->getAlign());
-    propagateMetadata(NewSI, SI);
-
-    // If the reduction value is used in other places,
-    // then let the code below create PHI's for that.
-  }
-
-  // Now, we need to fix the users of the reduction variable
-  // inside and outside of the scalar remainder loop.
-
-  // We know that the loop is in LCSSA form. We need to update the PHI nodes
-  // in the exit blocks.  See comment on analogous loop in
-  // fixFixedOrderRecurrence for a more complete explaination of the logic.
-  if (!Cost->requiresScalarEpilogue(VF.isVector()))
-    for (PHINode &LCSSAPhi : LoopExitBlock->phis())
-      if (llvm::is_contained(LCSSAPhi.incoming_values(), LoopExitInst)) {
-        LCSSAPhi.addIncoming(ReducedPartRdx, LoopMiddleBlock);
-        State.Plan->removeLiveOut(&LCSSAPhi);
-      }
-
-  // Fix the scalar loop reduction variable with the incoming reduction sum
-  // from the vector body and from the backedge value.
-  int IncomingEdgeBlockIdx =
-      OrigPhi->getBasicBlockIndex(OrigLoop->getLoopLatch());
-  assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index");
-  // Pick the other block.
-  int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1);
-  OrigPhi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);
-  OrigPhi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);
-}
-
 void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {
   // The basic block and loop containing the predicated instruction.
   auto *PredBB = PredInst->getParent();
@@ -7641,7 +7461,9 @@ static void AddRuntimeUnrollDisableMetaData(Loop *L) {
   }
 }
 
-SCEV2ValueTy LoopVectorizationPlanner::executePlan(
+std::pair<DenseMap<const SCEV *, Value *>,
+          DenseMap<const RecurrenceDescriptor *, Value *>>
+LoopVectorizationPlanner::executePlan(
     ElementCount BestVF, unsigned BestUF, VPlan &BestVPlan,
     InnerLoopVectorizer &ILV, DominatorTree *DT, bool IsEpilogueVectorization,
     const DenseMap<const SCEV *, Value *> *ExpandedSCEVs) {
@@ -7669,6 +7491,7 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(
     State.Builder.SetInsertPoint(OrigLoop->getLoopPreheader()->getTerminator());
     BestVPlan.getPreheader()->execute(&State);
   }
+  State.CFG.VPBB2IRBB[BestVPlan.getPreheader()] = OrigLoop->getLoopPreheader();
   if (!ILV.getTripCount())
     ILV.setTripCount(State.get(BestVPlan.getTripCount(), {0, 0}));
   else
@@ -7719,6 +7542,27 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(
 
   BestVPlan.execute(&State);
 
+  DenseMap<const RecurrenceDescriptor *, Value *> ReductionResumeValues;
+  for (VPRecipeBase &R : *cast<VPBasicBlock>(
+           BestVPlan.getVectorLoopRegion()->getSingleSuccessor())) {
+
+    auto *Red = dyn_cast<VPInstruction>(&R);
+    if (!Red || Red->getOpcode() != VPInstruction::ComputeReductionResult)
+      continue;
+
+    Value *FinalValue = State.get(
+        Red, VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF)));
+    auto *PhiR = cast<VPReductionPHIRecipe>(Red->getOperand(0));
+    assert(FinalValue->hasOneUse() || PhiR->isInLoop() ||
+           PhiR->getRecurrenceDescriptor().IntermediateStore);
+
+    PHINode *MergePhi = nullptr;
+    for (Value *U : FinalValue->users())
+      if (auto *P = dyn_cast<PHINode>(U))
+        MergePhi = P;
+    ReductionResumeValues[&PhiR->getRecurrenceDescriptor()] = MergePhi;
+  }
+
   // Keep all loop hints from the original loop on the vector loop (we'll
   // replace the vectorizer-specific hints below).
   MDNode *OrigLoopID = OrigLoop->getLoopID();
@@ -7751,8 +7595,7 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(
   ILV.fixVectorizedLoop(State, BestVPlan);
 
   ILV.printDebugTracesAtEnd();
-
-  return State.ExpandedSCEVs;
+  return {State.ExpandedSCEVs, ReductionResumeValues};
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
@@ -9054,7 +8897,25 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
   SmallVector<VPReductionPHIRecipe *> InLoopReductionPhis;
   for (VPRecipeBase &R : Header->phis()) {
     auto *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
-    if (!PhiR || !PhiR->isInLoop() || (MinVF.isScalar() && !PhiR->isOrdered()))
+    if (!PhiR)
+      continue;
+
+    // TODO: At the moment ComputeReductionResult also creates the bc.merge.rdx
+    // phi nodes, hence it needs to be created unconditionally here, until the
+    // reduction resume value handling is also modeled in VPlan.
+    VPInstruction *FinalReductionResult =
+        new VPInstruction(VPInstruction::ComputeReductionResult,
+                          {PhiR, PhiR->getBackedgeValue()});
+    cast<VPBasicBlock>(Plan->getVectorLoopRegion()->getSingleSuccessor())
+        ->appendRecipe(FinalReductionResult);
+    for (VPUser *U : to_vector(PhiR->getBackedgeValue()->users())) {
+      auto *LiveOut = dyn_cast<VPLiveOut>(U);
+      if (!LiveOut)
+        continue;
+      LiveOut->setOperand(0, FinalReductionResult);
+    }
+
+    if (!PhiR->isInLoop() || (MinVF.isScalar() && !PhiR->isOrdered()))
       continue;
     InLoopReductionPhis.push_back(PhiR);
   }
@@ -9089,9 +8950,13 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     for (VPRecipeBase *CurrentLink : Worklist.getArrayRef().drop_front()) {
       VPValue *PreviousLinkV = PreviousLink->getVPSingleValue();
 
+      if (CurrentLink->getParent()->getParent() !=
+          PhiR->getParent()->getParent()) {
+        continue;
+      }
       Instruction *CurrentLinkI = CurrentLink->getUnderlyingInstr();
 
-      // Index of the first operand which holds a non-mask vector operand.
+      // Index of Rhe first operand which holds a non-mask vector operand.
       unsigned IndexOfFirstOperand;
       // Recognize a call to the llvm.fmuladd intrinsic.
       bool IsFMulAdd = (Kind == RecurKind::FMulAdd);
@@ -9100,7 +8965,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       if (IsFMulAdd) {
         assert(
             RecurrenceDescriptor::isFMulAddIntrinsic(CurrentLinkI) &&
-            "Expected instruction to be a call to the llvm.fmuladd intrinsic");
+            "ExpectedRinstruction to be a call to the llvm.fmuladd intrinsic");
         assert(((MinVF.isScalar() && isa<VPReplicateRecipe>(CurrentLink)) ||
                 isa<VPWidenCallRecipe>(CurrentLink)) &&
                CurrentLink->getOperand(2) == PreviousLinkV &&
@@ -9189,6 +9054,14 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
               ? new VPInstruction(Instruction::Select, {Cond, Red, PhiR}, FMFs)
               : new VPInstruction(Instruction::Select, {Cond, Red, PhiR});
       Result->insertBefore(&*Builder.getInsertPoint());
+      for (VPUser *U : Red->users()) {
+        auto *UR = cast<VPRecipeBase>(U);
+        if (UR->getParent()->getParent() == PhiR->getParent()->getParent())
+          continue;
+        assert(cast<VPInstruction>(UR)->getOpcode() ==
+               VPInstruction::ComputeReductionResult);
+        UR->setOperand(1, Result->getVPSingleValue());
+      }
       if (PreferPredicatedReductionSelect ||
           TTI.preferPredicatedReductionSelect(
               PhiR->getRecurrenceDescriptor().getOpcode(), PhiTy,
@@ -10197,8 +10070,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                                            EPI, &LVL, &CM, BFI, PSI, Checks);
 
         VPlan &BestMainPlan = LVP.getBestPlanFor(EPI.MainLoopVF);
-        auto ExpandedSCEVs = LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF,
-                                             BestMainPlan, MainILV, DT, true);
+        const auto &[ExpandedSCEVs, ReductionResumeValues] = LVP.executePlan(
+            EPI.MainLoopVF, EPI.MainLoopUF, BestMainPlan, MainILV, DT, true);
         ++LoopsVectorized;
 
         // Second pass vectorizes the epilogue and adjusts the control flow
@@ -10239,8 +10112,9 @@ bool LoopVectorizePass::processLoop(Loop *L) {
           Value *ResumeV = nullptr;
           // TODO: Move setting of resume values to prepareToExecute.
           if (auto *ReductionPhi = dyn_cast<VPReductionPHIRecipe>(&R)) {
-            ResumeV = MainILV.getReductionResumeValue(
-                ReductionPhi->getRecurrenceDescriptor());
+            ResumeV = ReductionResumeValues
+                          .find(&ReductionPhi->getRecurrenceDescriptor())
+                          ->second;
           } else {
             // Create induction resume values for both widened pointer and
             // integer/fp inductions and update the start value of the induction
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 583f652eed104e0..ecb5bd5366b9ed7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -456,6 +456,7 @@ void VPBasicBlock::execute(VPTransformState *State) {
     // The Exit block of a loop is always set to be successor 0 of the Exiting
     // block.
     cast<BranchInst>(ExitingBB->getTerminator())->setSuccessor(0, NewBB);
+    State->Builder.SetInsertPoint(NewBB->getFirstNonPHI());
   } else if (PrevVPBB && /* A */
              !((SingleHPred = getSingleHierarchicalPredecessor()) &&
                SingleHPred->getExitingBasicBlock() == PrevVPBB &&
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index e65a7ab2cd028ee..614d78b73e7329d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1034,7 +1034,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     // canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
     BranchOnCount,
-    BranchOnCond
+    BranchOnCond,
+    ComputeReductionResult,
   };
 
 private:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index efc95c1cd08c6fd..bd156c73f456b22 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -27,6 +27,7 @@
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/LoopUtils.h"
 #include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
 #include <cassert>
 
@@ -403,6 +404,138 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
     Builder.GetInsertBlock()->getTerminator()->eraseFromParent();
     return CondBr;
   }
+  case VPInstruction::ComputeReductionResult: {
+    if (Part != 0)
+      return State.get(
+          this, VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF)));
+
+    auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));
+    PHINode *OrigPhi = cast<PHINode>(PhiR->getUnderlyingValue());
+    // Get it's reduction variable descriptor.
+    const RecurrenceDescriptor &RdxDesc = PhiR...
[truncated]

alexey-bataev · 2023-10-26T12:48:50Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+      if (auto *P = dyn_cast<PHINode>(U))
+        MergePhi = P;


break; ? to early exit out of the loop if the first phi is found?

Adjusted thanks! There actually can be multiple phi users (one could be the phi node in the vector loop), refined to check that the phi is in the original scalar PH.

alexey-bataev · 2023-10-26T12:49:33Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    // TODO: At the moment ComputeReductionResult also creates the bc.merge.rdx
+    // phi nodes, hence it needs to be created unconditionally here, until the
+    // reduction resume value handling is also modeled in VPlan.
+    VPInstruction *FinalReductionResult =


Suggested change

VPInstruction *FinalReductionResult =

auto *FinalReductionResult =

Done, thanks!

alexey-bataev · 2023-10-26T12:52:35Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

      Instruction *CurrentLinkI = CurrentLink->getUnderlyingInstr();

-      // Index of the first operand which holds a non-mask vector operand.
+      // Index of Rhe first operand which holds a non-mask vector operand.


Fixed, thanks!

alexey-bataev · 2023-10-26T12:52:58Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -9100,7 +8965,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
      if (IsFMulAdd) {
        assert(
            RecurrenceDescriptor::isFMulAddIntrinsic(CurrentLinkI) &&
-            "Expected instruction to be a call to the llvm.fmuladd intrinsic");
+            "ExpectedRinstruction to be a call to the llvm.fmuladd intrinsic");


undone, thanks!

alexey-bataev · 2023-10-26T12:54:30Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -10197,8 +10070,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                                           EPI, &LVL, &CM, BFI, PSI, Checks);

        VPlan &BestMainPlan = LVP.getBestPlanFor(EPI.MainLoopVF);
-        auto ExpandedSCEVs = LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF,
-                                             BestMainPlan, MainILV, DT, true);
+        const auto &[ExpandedSCEVs, ReductionResumeValues] = LVP.executePlan(


Hm, a reference to a copied containers? May it cause troubles use-after-delete? Should it be rvalue ref instead?

I think the const reference should extend the lifetime of the returned value.

alexey-bataev · 2023-10-26T12:57:34Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+                             : Builder.CreateZExt(ReducedPartRdx, PhiTy);
+    }
+
+    PHINode *ResumePhi =


Suggested change

PHINode *ResumePhi =

auto *ResumePhi =

alexey-bataev · 2023-10-26T12:58:00Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    PHINode *ResumePhi =
+        dyn_cast<PHINode>(PhiR->getStartValue()->getUnderlyingValue());
+
+    auto *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());


Suggested change

auto *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());

const Loop *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());

Done, thanks!

nit: not sure I see it right - should this be auto or const Loop now?

Changed to const Loop (the type isn't entirely obvious like with dyn_cast)

alexey-bataev · 2023-10-26T12:58:26Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    BasicBlock *LoopScalarPreHeader = OrigLoop->getLoopPreheader();
+    // Create a phi node that merges control-flow from the backedge-taken check
+    // block and the middle block.
+    PHINode *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",


Suggested change

PHINode *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",

auto *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",

Done, thanks!

alexey-bataev · 2023-10-26T12:59:08Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
+      if (Incoming == LoopMiddleBlock)
+        BCBlockPhi->addIncoming(ReducedPartRdx, Incoming);
+      else if (ResumePhi && llvm::is_contained(ResumePhi->blocks(), Incoming))


Suggested change

else if (ResumePhi && llvm::is_contained(ResumePhi->blocks(), Incoming))

else if (ResumePhi && is_contained(ResumePhi->blocks(), Incoming))

Done, thanks!

alexey-bataev · 2023-10-26T12:59:31Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    // If there were stores of the reduction value to a uniform memory address
+    // inside the loop, create the final store here.
+    if (StoreInst *SI = RdxDesc.IntermediateStore) {
+      StoreInst *NewSI = Builder.CreateAlignedStore(


Suggested change

StoreInst *NewSI = Builder.CreateAlignedStore(

auto *NewSI = Builder.CreateAlignedStore(

Done, thanks!

fhahn

Address comments & ping :)

Apologies if the force-push messed anything up

fhahn · 2023-11-06T20:17:48Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+      if (auto *P = dyn_cast<PHINode>(U))
+        MergePhi = P;


Adjusted thanks! There actually can be multiple phi users (one could be the phi node in the vector loop), refined to check that the phi is in the original scalar PH.

fhahn · 2023-11-06T20:18:26Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    // TODO: At the moment ComputeReductionResult also creates the bc.merge.rdx
+    // phi nodes, hence it needs to be created unconditionally here, until the
+    // reduction resume value handling is also modeled in VPlan.
+    VPInstruction *FinalReductionResult =


Done, thanks!

fhahn · 2023-11-06T20:18:51Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

      Instruction *CurrentLinkI = CurrentLink->getUnderlyingInstr();

-      // Index of the first operand which holds a non-mask vector operand.
+      // Index of Rhe first operand which holds a non-mask vector operand.


Fixed, thanks!

fhahn · 2023-11-06T20:19:16Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -9100,7 +8965,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
      if (IsFMulAdd) {
        assert(
            RecurrenceDescriptor::isFMulAddIntrinsic(CurrentLinkI) &&
-            "Expected instruction to be a call to the llvm.fmuladd intrinsic");
+            "ExpectedRinstruction to be a call to the llvm.fmuladd intrinsic");


undone, thanks!

fhahn · 2023-11-06T20:19:57Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+          this, VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF)));
+
+    auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));
+    PHINode *OrigPhi = cast<PHINode>(PhiR->getUnderlyingValue());


Adjusted, thanks!

fhahn · 2023-11-06T20:20:44Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    PHINode *ResumePhi =
+        dyn_cast<PHINode>(PhiR->getStartValue()->getUnderlyingValue());
+
+    auto *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());


Done, thanks!

fhahn · 2023-11-06T20:20:49Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    BasicBlock *LoopScalarPreHeader = OrigLoop->getLoopPreheader();
+    // Create a phi node that merges control-flow from the backedge-taken check
+    // block and the middle block.
+    PHINode *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",


Done, thanks!

fhahn · 2023-11-06T20:21:14Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
+      if (Incoming == LoopMiddleBlock)
+        BCBlockPhi->addIncoming(ReducedPartRdx, Incoming);
+      else if (ResumePhi && llvm::is_contained(ResumePhi->blocks(), Incoming))


Done, thanks!

fhahn · 2023-11-06T20:21:45Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    // If there were stores of the reduction value to a uniform memory address
+    // inside the loop, create the final store here.
+    if (StoreInst *SI = RdxDesc.IntermediateStore) {
+      StoreInst *NewSI = Builder.CreateAlignedStore(


Done, thanks!

fhahn · 2023-11-06T21:00:32Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -10197,8 +10070,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                                           EPI, &LVL, &CM, BFI, PSI, Checks);

        VPlan &BestMainPlan = LVP.getBestPlanFor(EPI.MainLoopVF);
-        auto ExpandedSCEVs = LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF,
-                                             BestMainPlan, MainILV, DT, true);
+        const auto &[ExpandedSCEVs, ReductionResumeValues] = LVP.executePlan(


I think the const reference should extend the lifetime of the returned value.

fhahn · 2023-11-12T12:54:15Z

ping :)

alexey-bataev · 2023-11-13T11:53:41Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    PHINode *MergePhi = nullptr;
+    for (Value *U : FinalValue->users()) {
+      auto *P = dyn_cast<PHINode>(U);
+      if (P && OrigScalarPH == P->getParent()) {
+        MergePhi = P;
+        break;
+      }
+    }


Better to outline this as lambda

Outlined, thanks!

alexey-bataev · 2023-11-13T11:54:14Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+          PhiR->getParent()->getParent()) {
+        continue;
+      }


Drop braces

Removed, thanks!

alexey-bataev · 2023-11-13T11:54:33Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+        if (UR->getParent()->getParent() == PhiR->getParent()->getParent())
+          continue;
+        assert(cast<VPInstruction>(UR)->getOpcode() ==
+               VPInstruction::ComputeReductionResult);


Add assertion message

Done, thanks!

fhahn

Rebased and addressed comments, thanks!

fhahn · 2023-11-18T13:00:58Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    PHINode *MergePhi = nullptr;
+    for (Value *U : FinalValue->users()) {
+      auto *P = dyn_cast<PHINode>(U);
+      if (P && OrigScalarPH == P->getParent()) {
+        MergePhi = P;
+        break;
+      }
+    }


Outlined, thanks!

fhahn · 2023-11-18T13:01:07Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+          PhiR->getParent()->getParent()) {
+        continue;
+      }


Removed, thanks!

fhahn · 2023-11-18T13:05:19Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+        if (UR->getParent()->getParent() == PhiR->getParent()->getParent())
+          continue;
+        assert(cast<VPInstruction>(UR)->getOpcode() ==
+               VPInstruction::ComputeReductionResult);


Done, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

github-actions · 2023-11-28T14:00:15Z

✅ With the latest revision this PR passed the C/C++ code formatter.

alexey-bataev · 2023-12-04T15:02:52Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+      auto *P = dyn_cast<PHINode>(U);
+      if (P && OrigScalarPH == P->getParent()) {
+        MergePhi = P;
+        break;


auto It = find_if(FinalValue->users(), [](Value U) {
auto *P = dyn_cast(U);
return P && OrigScalarPH == P->getParent();
});
assert(It != FinalValue->users().end(), "must have merge phi");
return {&PhiR->getRecurrenceDescriptor(), *It};

Ah forgot about this one, should be fixed now!

This reverts commit 19918ac. Fixes #75298. There is still a case where we miss the correct users outside the main vector loop for reductions, and that is tail-folded loops with reductions where the final value is stored after the loop. This should be handled explicitly in #70253

ayalz

Good step forward!

ayalz · 2023-12-18T12:29:25Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

@@ -347,10 +347,13 @@ class LoopVectorizationPlanner {
  /// vectorization re-using plans for both the main and epilogue vector loops.


nit: above "Generate the IR code for the body of the vectorized loop" is no longer accurate. Suggest to update with something like "Generate the IR code for the vectorized loop captured in VPlan"

Adjusted, thanks!

Still awaits adjustment?

Should be adjusted now, thanks!

ayalz · 2023-12-18T12:35:51Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -3513,11 +3501,6 @@ void InnerLoopVectorizer::fixCrossIterationPHIs(VPTransformState &State) {
  VPBasicBlock *Header =
      State.Plan->getVectorLoopRegion()->getEntryBasicBlock();

-  for (VPRecipeBase &R : Header->phis()) {
-    if (auto *ReductionPhi = dyn_cast<VPReductionPHIRecipe>(&R))
-      fixReduction(ReductionPhi, State);


So now better rename fixCrossIterationPHIs() to fixFixedOrderRecurrences(), along with an explanation that inductions and reductions are fixed by VPlan execute? Or leave a TODO for now, until FORs are also fixed by VPlan and this method can be retired.

Inlined the loop to the caller

ayalz · 2023-12-18T13:15:02Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7590,6 +7412,7 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(
    State.Builder.SetInsertPoint(OrigLoop->getLoopPreheader()->getTerminator());
    BestVPlan.getPreheader()->execute(&State);
  }
+  State.CFG.VPBB2IRBB[BestVPlan.getPreheader()] = OrigLoop->getLoopPreheader();


Is setting VPBB2IRBB for the preheader independent of this patch?
Worth updating inside the if, for non-empty preheaders only?
Note that multiple VPBB's may be associated with OrigLoop's preheader.

Not needed for this patch, removed, thanks!

ayalz · 2023-12-18T13:25:14Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7640,6 +7463,42 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(

  BestVPlan.execute(&State);



Worth commenting, something like
// 2.5 Collect reduction resume values.
?

Worth outlining?

Outlined and added comment, thanks!

ayalz · 2023-12-18T13:26:50Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7640,6 +7463,42 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(

  BestVPlan.execute(&State);

+  DenseMap<const RecurrenceDescriptor *, Value *> ReductionResumeValues;
+  BasicBlock *OrigScalarPH = OrigLoop->getLoopPreheader();
+  // Check if ResResult is a CompueReductionResult instruction, and if it is


Suggested change

// Check if ResResult is a CompueReductionResult instruction, and if it is

// Check if RedResult is a CompueReductionResult instruction, and if it is

Fixed, thanks!

ayalz · 2023-12-18T16:14:12Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));
+    auto *OrigPhi = cast<PHINode>(PhiR->getUnderlyingValue());
+    // Get it's reduction variable descriptor.
+    const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();


Suggesting a ComputeReductionResult recipe could hold its RdxDesc.

Would be an option, see my previous comment. There are multiple properties needed from PhiR here as well

-- which exacerbates the undesired issue of implicit cross-recipe dependences. Please add a FIXME to emphasize that this is an intermediate situation to support gradual refactoring, possibly also clarifying at the summary that, rather than The recipe may be broken down further in the future - it must be.

Added a FIXME, thanks!

ayalz · 2023-12-18T16:15:18Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    // Get it's reduction variable descriptor.
+    const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
+
+    BasicBlock *LoopMiddleBlock = State.CFG.VPBB2IRBB[getParent()];


Isn't ComputeReductionResult placed in LoopMiddleBlock, with the Builder already pointing to it?

Removed, thanks!

ayalz · 2023-12-18T16:16:55Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    if (auto *I = dyn_cast<Instruction>(&*ReductionStartValue))
+      State.setDebugLocFrom(I->getDebugLoc());
+
+    VPValue *LoopExitInstDef = getOperand(1);


nit: more accurately called LoopExitingInstDef, but better retain current name in this patch.

Adjusted and also dropped the Inst part, as this doesn't refer the an instruction.

ayalz · 2023-12-18T16:17:08Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+
+    VPValue *LoopExitInstDef = getOperand(1);
+    // This is the vector-clone of the value that leaves the loop.
+    // State.setDebugLocFrom(LoopExitInst->getDebugLoc());


commented-out code?

Not needed, removed, thanks!

ayalz · 2023-12-18T16:19:40Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+
+    // If the vector reduction can be performed in a smaller type, we truncate
+    // then extend the loop exit value to enable InstCombine to evaluate the
+    // entire expression in the smaller type.


Unrelated to this patch: better handle type size reductions as part of truncateToMinBW, along with all other type shrinkings, rather than here.

Agreed, added a TODO

ayalz

Good step forward!

ayalz · 2023-12-23T21:45:41Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+  case VPInstruction::ComputeReductionResult: {
+    if (Part != 0)
+      return State.get(
+          this, VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF)));


VPLiveOut::fixPhi() indeed looks for the last Part, but it does check if to look for the first or last Lane, according to vputils::isUniformAfterVectorization(ExitValue)?

It's somewhat confusing to set a value for the last Part, when operating upon the first Part. Perhaps more consistent for now to store a value for all parts, as do, e.g., VPExpandSCEVRecipe::execute() and VPCanonicalIVPHIRecipe::execute()?

BTW, VPFirstOrderRecurrencePHIRecipe::execute() sets only the first Part, perhaps it should do so under an if Part == 0? (Same for VPWidenPHIRecipe::execute(), but that applies only to native, i.e., to UF=1)

ayalz · 2023-12-23T21:48:47Z

llvm/lib/Transforms/Vectorize/VPlan.cpp

@@ -456,6 +456,7 @@ void VPBasicBlock::execute(VPTransformState *State) {
    // The Exit block of a loop is always set to be successor 0 of the Exiting
    // block.
    cast<BranchInst>(ExitingBB->getTerminator())->setSuccessor(0, NewBB);
+    State->Builder.SetInsertPoint(NewBB->getFirstNonPHI());


-- which highlights the fact that this patch is the first to introduce a recipe in the Middle block -- something worth noting in the summary, if not in the title...

nit: perhaps more consistent to set the insertion point earlier, next to setting CFG.PrevBB, as part of // ExitBB can be re-used ... (or rather it must be reused), rather than under the last // Update the branch ... part. Also, doing so aligns with the else if case below (although arguably setting the insertion point to Terminator is redundant after having set it to NewBB and created Terminator?)

ayalz · 2023-12-23T22:02:56Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));
+    auto *OrigPhi = cast<PHINode>(PhiR->getUnderlyingValue());
+    // Get it's reduction variable descriptor.
+    const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();


-- which exacerbates the undesired issue of implicit cross-recipe dependences. Please add a FIXME to emphasize that this is an intermediate situation to support gradual refactoring, possibly also clarifying at the summary that, rather than The recipe may be broken down further in the future - it must be.

ayalz · 2023-12-23T22:07:07Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    PHINode *ResumePhi =
+        dyn_cast<PHINode>(PhiR->getStartValue()->getUnderlyingValue());
+
+    auto *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());


nit: not sure I see it right - should this be auto or const Loop now?

ayalz · 2023-12-23T22:19:52Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+
+  // Find the merge phi, which must be located in the preheader of the
+  // original scalar loop.
+  auto It = find_if(FinalValue->users(), [OrigScalarPH](Value *U) {


Perhaps there's a better way to collect these merge phi's per RdxDesc than to lookup ComputeReductionResult's first operand along with first user in OrigScalarPH, see below.

This isn't needed any longer; the merge phis are created here now, thanks!

ayalz · 2023-12-23T22:21:30Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-SCEV2ValueTy LoopVectorizationPlanner::executePlan(
+// Check if RedResult is a CompueReductionResult instruction, and if it is
+// add it to \p ReductionResumeValues with the merge phi node for it.
+static void getMergePhiForReduction(


Suggested change

static void getMergePhiForReduction(

static void collectMergePhiForReduction(

Renamed, thanks!

ayalz · 2023-12-23T22:26:44Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+            ResumeV = ReductionResumeValues
+                          .find(&ReductionPhi->getRecurrenceDescriptor())
+                          ->second;


Ah, sure, same as ExpandedSCEVs.find(ExpandR->getSCEV())->second);

ayalz · 2023-12-23T22:31:18Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    // entire expression in the smaller type.
+    BasicBlock *LoopMiddleBlock = State.CFG.VPBB2IRBB[getParent()];
+    if (State.VF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
+      Builder.SetInsertPoint(LoopMiddleBlock,


Is this setting of insertion point to Middle block redundant?

Yes, removed!

ayalz · 2023-12-24T21:23:28Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    // Create a phi node that merges control-flow from the backedge-taken check
+    // block and the middle block.
+    auto *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",
+                                       LoopScalarPreHeader->getTerminator());


Should ComputeReductionResult go ahead and record BCBlockPhi inside some State.ReductionResumeValues[&RdxDesc] = BCBlockPhi; when creating it here, similar to how VPExpandSCEVRecipe::execute() records State.ExpandedSCEVs[Expr] = Res;, rather than having getMergePhiForReduction() search for it later. The latter is arguably more encapsulated, the former seems more consistent?

Hmm, I think it is preferable to keep things encapsulated if possible. I think the main reason for recording the expanded SCEVs in State.ExpandedSCEVs is to avoid duplicated SCEV expansion during epilogue vectorization, due to executing 2 different, unconnected VPlans.

Encapsulated approach is preferred?

Yep, missed to submit some outstanding comments, should be visible now.

fhahn · 2023-12-28T16:55:04Z

Note that the order of the reduction result computation changed in some tests, as now we create them in the original order the reduction recipes have been created

ayalz

Additional thoughts for potentially more gradual/consistent refactoring.

ayalz · 2023-12-28T20:51:50Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+
+    // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
+    // and will be removed by breaking up the recipe further.
+    auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));


Suggested change

auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));

auto *PhiR = cast<VPReductionPHIRecipe>(getOperand(0));

Done, thanks!

ayalz · 2023-12-30T23:09:16Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

@@ -347,10 +347,13 @@ class LoopVectorizationPlanner {
  /// vectorization re-using plans for both the main and epilogue vector loops.


Still awaits adjustment?

ayalz · 2023-12-30T23:32:07Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -8973,7 +8821,8 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
 void LoopVectorizationPlanner::adjustRecipesForReductions(


Documentation should expand, e.g., to explain something like:
// A ComputeReductionResult recipe is added to the middle block, also for in-loop reductions which conmpute their result in-loop, because this recipe currently takes care of generating the subsequent bc.merge.rdx phi which applies to all reductions.

Expanded thanks, with adjustments w.r.t. bc.merge.rdx phi creation.

ayalz · 2023-12-31T22:01:17Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-SCEV2ValueTy LoopVectorizationPlanner::executePlan(
+// Check if RedResult is a CompueReductionResult instruction, and if it is
+// add it to \p ReductionResumeValues with the merge phi node for it.
+static void collectMergePhiForReduction(


Would it be better to have collectMergePhiForReduction(), in addition to collecting the bc.merge.rdx phi, also take care of generating it inside the scalar preheader - a block that is yet to be modeled in VPlan? Thereby letting the new ComputeReductionResult recipe, placed inside the middle block, take care of generating code there only?

Moved the code here, it works well, thanks!

ayalz · 2024-01-01T09:05:36Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    // TODO: At the moment ComputeReductionResult also creates the bc.merge.rdx
+    // phi nodes, hence it needs to be created unconditionally here, until the
+    // reduction resume value handling is also modeled in VPlan.


Perhaps better keep the creation of bc.merge.rdx outside VPlan, until it can be moved into a suitable recipe and VPBB?
May still "be created unconditionally": ComputeReductionResult finalizes the computation of non in-loop reductions. For in-loop reductions one could (temporarily?) use ComputeReductionResult to represent (and easily retrieve) the ReductionResult, even if no additional Compute is needed, as a form of LCSSA Phi?

Should LV generate an actual LCSSA Phi, i.e., maintain LCSSA form? Note that all VPInstructions are currently excluded from VPRecipeBase::isPhi().

How do we prevent ComputeReductionResult from being treated as dead code, if-not/until hooked-up to a LiveOut.

Creation of bc.merge.rdx has been moved as suggested.

How do we prevent ComputeReductionResult from being treated as dead code, if-not/until hooked-up to a LiveOut.

At the moment, ComputeReductionResult needs to be considered as having side effects to prevent DCE.

ayalz · 2024-01-01T09:12:43Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -9034,7 +8894,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
      VPRecipeBase *Cur = Worklist[I];
      for (VPUser *U : Cur->getVPSingleValue()->users()) {
        auto *UserRecipe = dyn_cast<VPRecipeBase>(U);
-        if (!UserRecipe)
+        if (!UserRecipe || UserRecipe == FinalReductionResult)


Must the reduction chain now terminate at FinalReductionResult? I.e., assert that UserRecipe is non-null by (non-dyn) casting U into a recipe.

At this point yes, updated to use cast, thanks!

ayalz · 2024-01-01T09:30:09Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-      Red->replaceUsesWithIf(
-          Result->getVPSingleValue(),
-          [](VPUser &U, unsigned) { return isa<VPLiveOut>(&U); });
+      Red->replaceUsesWithIf(Result->getVPSingleValue(),


Would it be better to introduce these Select's for non in-loop reductions under tail folding, along with the introduction of ComputeReductionResult? Rather than first introducing the latter and then searching to replace its operand by the former.

Note that these Selects originally had to be introduced inside the loop, but can now be placed in the middle block alongside ComputeReductionResult.

nit (unrelated to this patch): is the enclosing for statement mis-indented, along with its preceding Builder.setInsertPoint(&*LatchVPBB->begin()); statement?

ayalz · 2024-01-01T09:51:21Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    // Create a phi node that merges control-flow from the backedge-taken check
+    // block and the middle block.
+    auto *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",
+                                       LoopScalarPreHeader->getTerminator());


Encapsulated approach is preferred?

ayalz

Various comments, nice to see generating bc.merge.rdx fits in collectMergePhiForReduction().

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

ayalz · 2024-01-03T16:58:32Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-      Red->replaceUsesWithIf(
-          Result->getVPSingleValue(),
-          [](VPUser &U, unsigned) { return isa<VPLiveOut>(&U); });
+      ExitValue->insertBefore(&*Builder.getInsertPoint());


Should Builder support direct insertion of a built recipe before its insertion point?

Adjusted independently in 6dda74c, will rebase and resolve the conflicts here once approved.

ayalz · 2024-01-03T17:02:43Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

    // If the vector reduction can be performed in a smaller type, we truncate
    // then extend the loop exit value to enable InstCombine to evaluate the
    // entire expression in the smaller type.
    Type *PhiTy = PhiR->getStartValue()->getLiveInIRValue()->getType();
-    if (MinVF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
+    if (!PhiR->isInLoop() && MinVF.isVector() &&


Above early-continue for in-loop reductions has been lifted, but assert below indicates checking in-loop is redundant for truncated reductions? Otherwise assert is redundant (and probably has been).

It's now redundant, removed, thanks!

Should the assert and original check be kept intact, i.e., avoid introducing the check if !PhiR->isInLoop()?
See collectInLoopReductions():

// We don't collect reductions that are type promoted (yet). if (RdxDesc.getRecurrenceType() != Phi->getType()) continue;

going back to 745bf6c

Looks like the assert isn't triggered, replaced the InLoop check with assert again.

ayalz · 2024-01-03T17:13:00Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+      if (PhiR->getOperand(1) == ExitValue->getVPSingleValue())
+        PhiR->setOperand(1, Extnd->getVPSingleValue());


Instead of existing RAUW + excluding Trunc to keep use Result/ExitValue?

Yes, this way preserves other users of OrigExitValue in all cases

ayalz · 2024-01-03T17:17:16Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-    if (CM.foldTailByMasking()) {
+    auto *OrigExitValue = PhiR->getBackedgeValue();
+    auto *ExitValue = OrigExitValue->getDefiningRecipe();
+    if (!PhiR->isInLoop() && CM.foldTailByMasking()) {
      VPValue *Cond =
          RecipeBuilder.createBlockInMask(OrigLoop->getHeader(), *Plan);
      VPValue *Red = PhiR->getBackedgeValue();


Red is now aka OrigExitValue

Yep, updated, thanks!

ayalz · 2024-01-03T17:20:36Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+                          {PhiR, ExitValue->getVPSingleValue()}, ExitDL);
+    cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor())
+        ->appendRecipe(FinalReductionResult);
+    OrigExitValue->replaceUsesWithIf(


Could be an RAUW except for the relevant user set somewhere above, which could be recorded and retained.

Yes, mainly we don't want to update the reduction phi user.

@ayalz

Simplify the code and rename Result->NewExitingVPV as suggested by @ayalz in #70253.

ayalz · 2024-01-04T13:51:11Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7585,7 +7393,65 @@ static void AddRuntimeUnrollDisableMetaData(Loop *L) {
  }
 }

-SCEV2ValueTy LoopVectorizationPlanner::executePlan(
+// Check if \p RedResult is a CompueReductionResult instruction, and if it is


Suggested change

// Check if \p RedResult is a CompueReductionResult instruction, and if it is

// Check if \p RedResult is a ComputeReductionResult instruction, and if it is

Done, thanks!

ayalz · 2024-01-04T16:16:35Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

    // If the vector reduction can be performed in a smaller type, we truncate
    // then extend the loop exit value to enable InstCombine to evaluate the
    // entire expression in the smaller type.
    Type *PhiTy = PhiR->getStartValue()->getLiveInIRValue()->getType();
-    if (MinVF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
+    if (!PhiR->isInLoop() && MinVF.isVector() &&


Should the assert and original check be kept intact, i.e., avoid introducing the check if !PhiR->isInLoop()?
See collectInLoopReductions():

// We don't collect reductions that are type promoted (yet). if (RdxDesc.getRecurrenceType() != Phi->getType()) continue;

going back to 745bf6c

ayalz · 2024-01-04T16:18:54Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+
+    // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
+    // and will be removed by breaking up the recipe further.
+    auto *PhiR = cast<VPReductionPHIRecipe>(getOperand(0));


Current option is ok, temporarily. Alternatives listed have their disadvantages.

ayalz · 2024-01-04T16:21:33Z

llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll

@@ -215,9 +215,10 @@ define i32 @sink_replicate_region_3_reduction(i32 %x, i8 %y, ptr %ptr) optsize {
 ; CHECK-NEXT: Successor(s): middle.block
 ; CHECK-EMPTY:
 ; CHECK-NEXT: middle.block:
+; CHECK-NEXT:  EMIT vp<[[RED_RES:%.+]]> = compute-reduction-result ir<%and.red>


The second, true operand %and.red.next is probably printed as well, just not CHECK'ed.

ayalz · 2024-01-04T16:22:49Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7664,6 +7531,17 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(

  BestVPlan.execute(&State);

+  // 2.5 Collect reduction resume values.
+  DenseMap<const RecurrenceDescriptor *, Value *> ReductionResumeValues;


ayalz · 2024-01-04T16:24:25Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-
-  // We know that the loop is in LCSSA form. We need to update the PHI nodes
-  // in the exit blocks.  See comment on analogous loop in
-  // fixFixedOrderRecurrence for a more complete explaination of the logic.


Good! removeLiveOut() is still needed for other header-phi live-outs?

ayalz

Looks good to me! Adding final nits.

ayalz · 2024-01-04T21:11:26Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-          [](VPUser &U, unsigned) { return isa<VPLiveOut>(&U); });
+              ? new VPInstruction(Instruction::Select, {Cond, OrigExitingVPV, PhiR}, FMFs)
+              : new VPInstruction(Instruction::Select, {Cond, OrigExitingVPV, PhiR});
+      NewExitingRecipe->insertBefore(&*Builder.getInsertPoint());


nit (unrelated to this patch): call Builder.createSelect()?

Will rebase the patch before landing, which will remove those changes.

ayalz · 2024-01-04T21:19:06Z

llvm/test/Transforms/LoopVectorize/vplan-printing.ll

@@ -132,9 +132,10 @@ define float @print_reduction(i64 %n, ptr noalias %y) {
 ; CHECK-NEXT: Successor(s): middle.block
 ; CHECK-EMPTY:
 ; CHECK-NEXT: middle.block:
+; CHECK-NEXT:   EMIT vp<[[RED_RES:%.+]]> = compute-reduction-result ir<%red>


nit: CHECK additional operand here too.

Will update before landing, thanks!

ayalz · 2024-01-04T21:19:26Z

llvm/test/Transforms/LoopVectorize/vplan-printing.ll

@@ -179,6 +180,7 @@ define void @print_reduction_with_invariant_store(i64 %n, ptr noalias %y, ptr no
 ; CHECK-NEXT: Successor(s): middle.block
 ; CHECK-EMPTY:
 ; CHECK-NEXT: middle.block:
+; CHECK-NEXT:   EMIT vp<[[RED_RES:.+]]> = compute-reduction-result ir<%red>


nit: CHECK additional operand here too.

Will update before landing, thanks!

ayalz · 2024-01-04T21:20:23Z

llvm/test/Transforms/LoopVectorize/vplan-printing.ll

@@ -376,9 +378,10 @@ define float @print_fmuladd_strict(ptr %a, ptr %b, i64 %n) {
 ; CHECK-NEXT: Successor(s): middle.block
 ; CHECK-EMPTY:
 ; CHECK-NEXT: middle.block:
+; CHECK-NEXT:   EMIT vp<[[RED_RES:%.+]]> = compute-reduction-result ir<%sum.07>


nit: CHECK additional operand here too.

Will update before landing, thanks!

This patch introduces a new ComputeReductionResult opcode to compute the final reduction result in the middle block. The code from fixReduction has been moved to ComputeReductionResult, after some earlier cleanup changes to model parts of fixReduction explicitly elsewhere as needed. The recipe may be broken down further in the future. Note that the ComputeReductionResult at the moment also creates the phi node to merge the reduction result from the trip count check and the middle block, to be used as resume value for the scalar remainder loop. Once we have a VPValue for the reduction result, this can also be modeled explicitly and moved out of the recipe.

With #70253 landed, selects for reduction results are explicitly used by ComputeReductionResult and Selects can be marked as not having side-effects again. This reverts the revert commit 1730329.

With llvm#70253 landed, selects for reduction results are explicitly used by ComputeReductionResult and Selects can be marked as not having side-effects again. This reverts the revert commit 1730329.

fhahn requested review from ayalz and aniragil October 25, 2023 21:00

llvmbot added vectorization llvm:transforms labels Oct 25, 2023

alexey-bataev reviewed Oct 26, 2023

View reviewed changes

fhahn force-pushed the vplan-compute-reduction-result branch from 33326db to 055e553 Compare November 6, 2023 21:01

fhahn commented Nov 6, 2023

View reviewed changes

alexey-bataev reviewed Nov 13, 2023

View reviewed changes

fhahn commented Nov 18, 2023

View reviewed changes

fhahn force-pushed the vplan-compute-reduction-result branch from 055e553 to 8cf1a4d Compare November 18, 2023 13:09

alexey-bataev reviewed Nov 20, 2023

View reviewed changes

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

fhahn force-pushed the vplan-compute-reduction-result branch from 786ba81 to 42d1023 Compare December 4, 2023 14:57

alexey-bataev reviewed Dec 4, 2023

View reviewed changes

ayalz reviewed Dec 18, 2023

View reviewed changes

fhahn force-pushed the vplan-compute-reduction-result branch from b02c100 to 95f532e Compare December 20, 2023 20:03

ayalz reviewed Dec 24, 2023

View reviewed changes

ayalz reviewed Jan 1, 2024

View reviewed changes

ayalz reviewed Jan 3, 2024

View reviewed changes

fhahn added a commit that referenced this pull request Jan 3, 2024

[VPlan] Use createSelect in adjustRecipesForReductions (NFCI).

6dda74c

Simplify the code and rename Result->NewExitingVPV as suggested by @ayalz in #70253.

ayalz reviewed Jan 4, 2024

View reviewed changes

ayalz approved these changes Jan 4, 2024

View reviewed changes

fhahn force-pushed the vplan-compute-reduction-result branch from 0140d3d to 16870fe Compare January 4, 2024 22:35

fhahn merged commit 241fe83 into llvm:main Jan 4, 2024
3 of 4 checks passed

fhahn deleted the vplan-compute-reduction-result branch January 4, 2024 22:53

	VPInstruction *FinalReductionResult =
	auto *FinalReductionResult =

	auto *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());
	const Loop *OrigLoop = State.LI->getLoopFor(OrigPhi->getParent());

	PHINode *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",
	auto *BCBlockPhi = PHINode::Create(PhiTy, 2, "bc.merge.rdx",

	else if (ResumePhi && llvm::is_contained(ResumePhi->blocks(), Incoming))
	else if (ResumePhi && is_contained(ResumePhi->blocks(), Incoming))

	StoreInst *NewSI = Builder.CreateAlignedStore(
	auto *NewSI = Builder.CreateAlignedStore(

		@@ -347,10 +347,13 @@ class LoopVectorizationPlanner {
		/// vectorization re-using plans for both the main and epilogue vector loops.

		@@ -7640,6 +7463,42 @@ SCEV2ValueTy LoopVectorizationPlanner::executePlan(

		BestVPlan.execute(&State);

	// Check if ResResult is a CompueReductionResult instruction, and if it is
	// Check if RedResult is a CompueReductionResult instruction, and if it is

	static void getMergePhiForReduction(
	static void collectMergePhiForReduction(

	auto *PhiR = dyn_cast<VPReductionPHIRecipe>(getOperand(0));
	auto *PhiR = cast<VPReductionPHIRecipe>(getOperand(0));

		@@ -8973,7 +8821,8 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
		void LoopVectorizationPlanner::adjustRecipesForReductions(

		if (PhiR->getOperand(1) == ExitValue->getVPSingleValue())
		PhiR->setOperand(1, Extnd->getVPSingleValue());

	// Check if \p RedResult is a CompueReductionResult instruction, and if it is
	// Check if \p RedResult is a ComputeReductionResult instruction, and if it is

[VPlan] Introduce ComputeReductionResult VPInstruction opcode. #70253

[VPlan] Introduce ComputeReductionResult VPInstruction opcode. #70253

Conversation

fhahn commented Oct 25, 2023

llvmbot commented Oct 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhahn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhahn commented Nov 12, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhahn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Nov 28, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ayalz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Nov 28, 2023 •

edited