[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor #158690

lukel97 · 2025-09-15T17:25:53Z

In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we unprofitably loop vectorize on RISC-V.

The loop looks something like:

  for (int i = 0; i < n; i++) {
    if (x0[i] == a)
      if (x1[i] == b)
        if (x2[i] == c)
          // do stuff...
  }

Because it's so deeply nested the actual inner level of the loop rarely gets executed. However we still deem it profitable to vectorize, which due to the if-conversion means we now always execute the body.

This stems from the fact that getPredBlockCostDivisor currently assumes that blocks have 50% chance of being executed as a heuristic.

We can fix this by using BlockFrequencyInfo, which gives a more accurate estimate of the innermost block being executed 12.5% of the time. We can then calculate the probability as HeaderFrequency / BlockFrequency.

Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.

Whilst there's a lot of changes in the in-tree tests, this doesn't affect llvm-test-suite or SPEC CPU 2017 that much:

On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized on llvm-test-suite/SPEC CPU 2017.
On x86-64 -flto -O3 with PGO there's 0.9%/0% less geomean loops vectorized on llvm-test-suite/SPEC CPU 2017.

llvmbot · 2025-09-15T17:26:25Z

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

In 531.deepsjeng_r on SPEC CPU 2017 there's a loop that we unprofitably loop vectorize.

The loop looks something like:

  for (int i = 0; i &lt; n; i++) {
    if (x0[i] == a)
      if (x1[i] == b)
	if (x2[i] == c)
	  // do stuff...
  }

Because it's so deeply nested the actual inner level of the loop rarely gets executed. However we still deem it profitable vectorize, which due to the if-conversion means we now always execute the body.

This stems from the fact that getPredBlockCostDivisor currently assumes that blocks have 50% chance of being executed as a heuristic.

We can fix this by using BlockFrequencyInfo, which gives a more accurate estimate of the innermost block being executed 12.5% of the time. We can then calculate the probability as HeaderFrequency / BlockFrequency.

Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.

This is a WIP, as there is an issue with BlockFrequencyInfo not being invalidated with LTO which leads to incorrect results that #157888 aims to fix. I'm posting this early to give a motivating example that illustrates the issue in llvm/test/Transforms/PhaseOrdering/loop-vectorize-bfi.ll

There are also more test changes that I haven't included for now.

Patch is 24.56 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158690.diff

6 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+41-9)
(modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+3-1)
(modified) llvm/lib/Transforms/Vectorize/VPlanHelpers.h (+3-15)
(added) llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll (+157)
(added) llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll (+125)
(added) llvm/test/Transforms/PhaseOrdering/loop-vectorize-bfi.ll (+34)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 640a98c622f80..088c35b23563d 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -578,8 +578,10 @@ class InnerLoopVectorizer {
   /// The profitablity analysis.
   LoopVectorizationCostModel *Cost;
 
-  /// BFI and PSI are used to check for profile guided size optimizations.
+  /// Used to calculate the probability of predicated blocks in
+  /// getPredBlockCostDivisor.
   BlockFrequencyInfo *BFI;
+  /// Used to check for profile guided size optimizations.
   ProfileSummaryInfo *PSI;
 
   /// Structure to hold information about generated runtime checks, responsible
@@ -900,7 +902,7 @@ class LoopVectorizationCostModel {
                              InterleavedAccessInfo &IAI,
                              ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI)
       : ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE), LI(LI), Legal(Legal),
-        TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), TheFunction(F),
+        TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), BFI(BFI), TheFunction(F),
         Hints(Hints), InterleaveInfo(IAI) {
     if (TTI.supportsScalableVectors() || ForceTargetSupportsScalableVectors)
       initializeVScaleForTuning();
@@ -1249,6 +1251,17 @@ class LoopVectorizationCostModel {
   /// Superset of instructions that return true for isScalarWithPredication.
   bool isPredicatedInst(Instruction *I) const;
 
+  /// A helper function that returns how much we should divide the cost of a
+  /// predicated block by. Typically this is the reciprocal of the block
+  /// probability, i.e. if we return X we are assuming the predicated block will
+  /// execute once for every X iterations of the loop header so the block should
+  /// only contribute 1/X of its cost to the total cost calculation, but when
+  /// optimizing for code size it will just be 1 as code size costs don't depend
+  /// on execution probabilities.
+  inline unsigned
+  getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+                          const BasicBlock *BB) const;
+
   /// Return the costs for our two available strategies for lowering a
   /// div/rem operation which requires speculating at least one lane.
   /// First result is for scalarization (will be invalid for scalable
@@ -1711,6 +1724,8 @@ class LoopVectorizationCostModel {
   /// Interface to emit optimization remarks.
   OptimizationRemarkEmitter *ORE;
 
+  const BlockFrequencyInfo *BFI;
+
   const Function *TheFunction;
 
   /// Loop Vectorize Hint.
@@ -2863,6 +2878,19 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
   }
 }
 
+unsigned LoopVectorizationCostModel::getPredBlockCostDivisor(
+    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) const {
+  if (CostKind == TTI::TCK_CodeSize)
+    return 1;
+
+  uint64_t HeaderFreq = BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
+  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
+  assert(HeaderFreq >= BBFreq &&
+         "Header has smaller block freq than dominated BB?");
+  return BFI->getBlockFreq(TheLoop->getHeader()).getFrequency() /
+         BFI->getBlockFreq(BB).getFrequency();
+}
+
 std::pair<InstructionCost, InstructionCost>
 LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
                                                     ElementCount VF) const {
@@ -2899,7 +2927,8 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
     // Scale the cost by the probability of executing the predicated blocks.
     // This assumes the predicated block for each vector lane is equally
     // likely.
-    ScalarizationCost = ScalarizationCost / getPredBlockCostDivisor(CostKind);
+    ScalarizationCost =
+        ScalarizationCost / getPredBlockCostDivisor(CostKind, I->getParent());
   }
   InstructionCost SafeDivisorCost = 0;
 
@@ -5031,7 +5060,7 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
       }
 
     // Scale the total scalar cost by block probability.
-    ScalarCost /= getPredBlockCostDivisor(CostKind);
+    ScalarCost /= getPredBlockCostDivisor(CostKind, PredInst->getParent());
 
     // Compute the discount. A non-negative discount means the vector version
     // of the instruction costs more, and scalarizing would be beneficial.
@@ -5084,7 +5113,7 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
     // cost by the probability of executing it. blockNeedsPredication from
     // Legal is used so as to not include all blocks in tail folded loops.
     if (VF.isScalar() && Legal->blockNeedsPredication(BB))
-      BlockCost /= getPredBlockCostDivisor(CostKind);
+      BlockCost /= getPredBlockCostDivisor(CostKind, BB);
 
     Cost += BlockCost;
   }
@@ -5162,7 +5191,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
   // conditional branches, but may not be executed for each vector lane. Scale
   // the cost by the probability of executing the predicated block.
   if (isPredicatedInst(I)) {
-    Cost /= getPredBlockCostDivisor(CostKind);
+    Cost /= getPredBlockCostDivisor(CostKind, I->getParent());
 
     // Add the cost of an i1 extract and a branch
     auto *VecI1Ty =
@@ -6710,6 +6739,11 @@ bool VPCostContext::skipCostComputation(Instruction *UI, bool IsVector) const {
          SkipCostComputation.contains(UI);
 }
 
+unsigned VPCostContext::getPredBlockCostDivisor(
+    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) const {
+  return CM.getPredBlockCostDivisor(CostKind, BB);
+}
+
 InstructionCost
 LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
                                           VPCostContext &CostCtx) const {
@@ -10273,9 +10307,7 @@ PreservedAnalyses LoopVectorizePass::run(Function &F,
 
   auto &MAMProxy = AM.getResult<ModuleAnalysisManagerFunctionProxy>(F);
   PSI = MAMProxy.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
-  BFI = nullptr;
-  if (PSI && PSI->hasProfileSummary())
-    BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
+  BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
   LoopVectorizeResult Result = runImpl(F);
   if (!Result.MadeAnyChange)
     return PreservedAnalyses::all();
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 30a3a01ddd949..e2dffdd129c24 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -855,7 +855,9 @@ InstructionCost VPRegionBlock::cost(ElementCount VF, VPCostContext &Ctx) {
   // For the scalar case, we may not always execute the original predicated
   // block, Thus, scale the block's cost by the probability of executing it.
   if (VF.isScalar())
-    return ThenCost / getPredBlockCostDivisor(Ctx.CostKind);
+    if (auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Then))
+      return ThenCost / Ctx.getPredBlockCostDivisor(Ctx.CostKind,
+                                                    VPIRBB->getIRBasicBlock());
 
   return ThenCost;
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index fe59774b7c838..0e9d6e47c740d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -50,21 +50,6 @@ Value *getRuntimeVF(IRBuilderBase &B, Type *Ty, ElementCount VF);
 Value *createStepForVF(IRBuilderBase &B, Type *Ty, ElementCount VF,
                        int64_t Step);
 
-/// A helper function that returns how much we should divide the cost of a
-/// predicated block by. Typically this is the reciprocal of the block
-/// probability, i.e. if we return X we are assuming the predicated block will
-/// execute once for every X iterations of the loop header so the block should
-/// only contribute 1/X of its cost to the total cost calculation, but when
-/// optimizing for code size it will just be 1 as code size costs don't depend
-/// on execution probabilities.
-///
-/// TODO: We should use actual block probability here, if available. Currently,
-///       we always assume predicated blocks have a 50% chance of executing.
-inline unsigned
-getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
-  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
-}
-
 /// A range of powers-of-2 vectorization factors with fixed start and
 /// adjustable end. The range includes start and excludes end, e.g.,:
 /// [1, 16) = {1, 2, 4, 8}
@@ -378,6 +363,9 @@ struct VPCostContext {
   InstructionCost getScalarizationOverhead(Type *ResultTy,
                                            ArrayRef<const VPValue *> Operands,
                                            ElementCount VF);
+
+  unsigned getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+                                   const BasicBlock *BB) const;
 };
 
 /// This class can be used to assign names to VPValues. For VPValues without
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll
new file mode 100644
index 0000000000000..e683f19b6effa
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll
@@ -0,0 +1,157 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -p loop-vectorize -mtriple=aarch64 -mattr=+sve -S %s | FileCheck %s
+
+define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @nested(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[X:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_0]]:
+; CHECK-NEXT:    br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_1]]:
+; CHECK-NEXT:    [[GEP0:%.*]] = getelementptr i64, ptr [[P0]], i32 [[X]]
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[GEP0]], align 8
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[P1]], i32 [[X]]
+; CHECK-NEXT:    [[Y:%.*]] = load i64, ptr [[GEP1]], align 8
+; CHECK-NEXT:    [[Z:%.*]] = udiv i64 [[X1]], [[Y]]
+; CHECK-NEXT:    store i64 [[Z]], ptr [[GEP1]], align 8
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[IV_NEXT]] = add i32 [[X]], 1
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i32 [[IV_NEXT]], 1024
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c0, label %then.0, label %latch
+
+then.0:
+  br i1 %c1, label %then.1, label %latch
+
+then.1:
+  %gep0 = getelementptr i64, ptr %p0, i32 %iv
+  %x = load i64, ptr %gep0
+  %gep1 = getelementptr i64, ptr %p1, i32 %iv
+  %y = load i64, ptr %gep1
+  %z = udiv i64 %x, %y
+  store i64 %z, ptr %gep1
+  br label %latch
+
+latch:
+  %iv.next = add i32 %iv, 1
+  %done = icmp eq i32 %iv.next, 1024
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+define void @always_taken(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @always_taken(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT:    [[TMP1:%.*]] = shl nuw i32 [[TMP0]], 2
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 1024, [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP4:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT:    [[TMP5:%.*]] = mul nuw i32 [[TMP4]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i32 1024, [[TMP5]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i32 1024, [[N_MOD_VF]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C0]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = select <vscale x 2 x i1> [[BROADCAST_SPLAT2]], <vscale x 2 x i1> [[BROADCAST_SPLAT]], <vscale x 2 x i1> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr i64, ptr [[P0]], i32 [[INDEX]]
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP7:%.*]] = shl nuw i64 [[TMP8]], 1
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr i64, ptr [[TMP10]], i64 [[TMP7]]
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP10]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD3:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP20]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr i64, ptr [[P1]], i32 [[INDEX]]
+; CHECK-NEXT:    [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP11:%.*]] = shl nuw i64 [[TMP13]], 1
+; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr i64, ptr [[TMP9]], i64 [[TMP11]]
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD4:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP9]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP12]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT:    [[TMP21:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> [[WIDE_MASKED_LOAD4]], <vscale x 2 x i64> splat (i64 1)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> [[WIDE_MASKED_LOAD5]], <vscale x 2 x i64> splat (i64 1)
+; CHECK-NEXT:    [[TMP15:%.*]] = udiv <vscale x 2 x i64> [[WIDE_MASKED_LOAD]], [[TMP21]]
+; CHECK-NEXT:    [[TMP22:%.*]] = udiv <vscale x 2 x i64> [[WIDE_MASKED_LOAD3]], [[TMP14]]
+; CHECK-NEXT:    [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP18:%.*]] = shl nuw i64 [[TMP17]], 1
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr i64, ptr [[TMP9]], i64 [[TMP18]]
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP15]], ptr [[TMP9]], i32 8, <vscale x 2 x i1> [[TMP6]])
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP22]], ptr [[TMP19]], i32 8, <vscale x 2 x i1> [[TMP6]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP5]]
+; CHECK-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i32 1024, [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT1:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]], !prof [[PROF3:![0-9]+]]
+; CHECK:       [[THEN_0]]:
+; CHECK-NEXT:    br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]], !prof [[PROF3]]
+; CHECK:       [[THEN_1]]:
+; CHECK-NEXT:    [[GEP0:%.*]] = getelementptr i64, ptr [[P0]], i32 [[IV1]]
+; CHECK-NEXT:    [[X:%.*]] = load i64, ptr [[GEP0]], align 8
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[P1]], i32 [[IV1]]
+; CHECK-NEXT:    [[Y:%.*]] = load i64, ptr [[GEP1]], align 8
+; CHECK-NEXT:    [[Z:%.*]] = udiv i64 [[X]], [[Y]]
+; CHECK-NEXT:    store i64 [[Z]], ptr [[GEP1]], align 8
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[IV_NEXT1]] = add i32 [[IV1]], 1
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i32 [[IV_NEXT1]], 1024
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c0, label %then.0, label %latch, !prof !4
+
+then.0:
+  br i1 %c1, label %then.1, label %latch, !prof !4
+
+then.1:
+  %gep0 = getelementptr i64, ptr %p0, i32 %iv
+  %x = load i64, ptr %gep0
+  %gep1 = getelementptr i64, ptr %p1, i32 %iv
+  %y = load i64, ptr %gep1
+  %z = udiv i64 %x, %y
+  store i64 %z, ptr %gep1
+  br label %latch
+
+latch:
+  %iv.next = add i32 %iv, 1
+  %done = icmp eq i32 %iv.next, 1024
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+!4 = !{!"branch_weights", i32 1, i32 0}
+
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll b/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll
new file mode 100644
index 0000000000000..23f2624a207b3
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll
@@ -0,0 +1,125 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt < %s -S -p loop-vectorize -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @nested(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT:    br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_0]]:
+; CHECK-NEXT:    br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]]
+; CHECK:       [[THEN_1]]:
+; CHECK-NEXT:    [[GEP2:%.*]] = getelementptr i32, ptr [[P0]], i32 [[IV1]]
+; CHECK-NEXT:    [[X:%.*]] = load i32, ptr [[GEP2]], align 4
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i32, ptr [[P1]], i32 [[X]]
+; CHECK-NEXT:    store i32 0, ptr [[GEP1]], align 4
+; CHECK-NEXT:    br label %[[LATCH]]
+; CHECK:       [[LATCH]]:
+; CHECK-NEXT:    [[IV_NEXT]] = add i32 [[IV1]], 1
+; CHECK-NEXT:    [[DONE:%.*]] = icmp eq i32 [[IV_NEXT]], 1024
+; CHECK-NEXT:    br i1 [[DONE]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+  br i1 %c0, label %then.0, label %latch
+
+then.0:
+  br i1 %c1, label %then.1, label %latch
+
+then.1:
+  %gep0 = getelementptr i32, ptr %p0, i32 %iv
+  %x = load i32, ptr %gep0
+  %gep1 = getelementptr i32, ptr %p1, i32 %x
+  store i32 0, ptr %gep1
+  br label %latch
+
+latch:
+  %iv.next = add i32 %iv, 1
+  %done = icmp eq i32 %iv.next, 1024
+  br i1 %done, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+define void @always_taken(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @always_taken(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[C1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_...
[truncated]

I ran into this crash when llvm#158690 caused a loop with a struct call to be vectorized. If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe. If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize. There's two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!) The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertVlaueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.

lukel97 · 2025-09-23T12:11:06Z

llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll

  %var0 = getelementptr inbounds i64, ptr %a, i64 %i
  %var2 = load i64, ptr %var0, align 4
-  %cond0 = icmp sgt i64 %var2, 0
+  %cond0 = icmp sgt i64 %var2, 1


This constant was changed to keep the old branch probability the same and keep the block scalarized, since with BFI icmp sgt %x, 0 is predicted to be slightly > 50%.

lukel97 · 2025-09-23T12:16:49Z

llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll

+; COMMON-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7
+; COMMON-NEXT:    br i1 [[EC]], label %[[SCALAR_PH1:.*]], label %[[EXIT1]]
+; COMMON:       [[SCALAR_PH1]]:
+; COMMON-NEXT:    ret void


Previously all the predicated scalar stores were discounted by x0.5 in computePredInstDiscount. BFI now correctly returns that this block is always executed so the VF=1 plan is no longer discounted.

lukel97 · 2025-09-23T12:29:37Z

llvm/test/Transforms/LoopVectorize/AArch64/early_exit_costs.ll

  %l = load i64, ptr %gep.src, align 1
  %t = trunc i64 %l to i1
-  br i1 %t, label %exit.0, label %loop.latch
+  br i1 %t, label %exit.0, label %loop.latch, !prof !0


Added to retain the old branch probability. With BFI, the probability of exit.0 being taken is computed as 3% (why that is, I'm not sure) and the IV increment isn't discounted in the VF=1 plan.

lukel97 · 2025-09-23T12:32:41Z

llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll

-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
-; CHECK-NEXT:    [[VECTOR_RECUR:%.*]] = phi <4 x i32> [ <i32 poison, i32 poison, i32 poison, i32 4>, [[VECTOR_PH]] ], [ [[VEC_IND]], [[PRED_STORE_CONTINUE6]] ]
-; CHECK-NEXT:    [[TMP0:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[VEC_IND]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
-; CHECK-NEXT:    [[TMP1:%.*]] = icmp ule <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]


Previously blocks that were predicated because of tail folding were discounted by 0.5x by computePredInstDiscount because they were technically predicated. However in practice they're always going to be executed except for the last iteration. BFI correctly returns that the probability is ~100% for these instructions which removes the discount.

lukel97 · 2025-09-23T12:34:02Z

llvm/test/Transforms/LoopVectorize/X86/masked_load_store.ll

+; AVX1-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 -3
+; AVX1-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4, !alias.scope [[META18:![0-9]+]]
+; AVX1-NEXT:    [[REVERSE:%.*]] = shufflevector <4 x i32> [[WIDE_LOAD]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; AVX1-NEXT:    [[TMP3:%.*]] = icmp sgt <4 x i32> [[REVERSE]], zeroinitializer


BPI computes >= 0 to be slightly >50%, which influences the cost for AVX1 here.

lukel97 · 2025-09-23T12:36:17Z

llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll

 ; COST:       [[LOOP_HEADER]]:
-; COST-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; COST-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
 ; COST-NEXT:    [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1


The total probability of all these switch destinations being executed is 1. BFI correctly divides up the probabilities, but previously each destination was given a 50% chance which meant the VF=1 VPlan was overly discounted.

lukel97 · 2025-09-23T12:37:58Z

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-store-accesses-with-gaps.ll

-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP8:%.*]] = extractelement <4 x i64> [[TMP2]], i64 1
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP9:%.*]] = getelementptr inbounds nuw i16, ptr [[POINTS]], i64 [[TMP8]]
-; ENABLED_MASKED_STRIDED-NEXT:    [[TMP10:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i64 1
-; ENABLED_MASKED_STRIDED-NEXT:    store i16 [[TMP10]], ptr [[TMP9]], align 2


Predication due to tail folding no longer accidentally discounts

lukel97 · 2025-09-23T12:52:36Z

The invalid results being returned from BlockFrequencyInfo has been fixed in #157888 so this should be ready to review now. There is one crash to do with struct return types that #160274 fixes.

fhahn · 2025-09-23T19:26:26Z

llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll

 ;
 ; PRED-LABEL: define void @iv_trunc(
 ; PRED-SAME: i32 [[X:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
-; PRED-NEXT:  [[ENTRY:.*:]]


can we add branch probabilities to keep the generated code/original checks for the cost?

fhahn · 2025-09-23T19:27:48Z

llvm/test/Transforms/LoopVectorize/ARM/optsize_minsize.ll

-; DEFAULT-NEXT:    [[TMP7:%.*]] = add <16 x i8> [[TMP4]], [[TMP6]]
-; DEFAULT-NEXT:    [[TMP8:%.*]] = extractelement <16 x i1> [[TMP0]], i32 0
-; DEFAULT-NEXT:    br i1 [[TMP8]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; DEFAULT:       [[PRED_STORE_IF]]:


This is due to folding the tail, right? I think we could also assign a probability much closer to 1 w/o BFI, splitting up in 2 separate changes would make it easier to isolate issues.

… blocks Split off from llvm#158690. Currently if an instruction needs predicated due to tail folding, it will also have a predicated discount applied to it in multiple places. This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last. This fixes it by checking if the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.

I ran into this crash when #158690 caused a loop with a struct call to be vectorized. If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe. If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize. There's two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!) The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertVlaueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.

SamTebbs33 · 2025-10-02T10:00:55Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  uint64_t HeaderFreq = BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
+  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
+  assert(HeaderFreq >= BBFreq &&
+         "Header has smaller block freq than dominated BB?");
+  return BFI->getBlockFreq(TheLoop->getHeader()).getFrequency() /
+         BFI->getBlockFreq(BB).getFrequency();


You already have HeaderFreq and BBFreq so I don't think you need to call getBlockFreq again when returning.

SamTebbs33 · 2025-10-02T10:02:50Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

         SkipCostComputation.contains(UI);
 }

+unsigned VPCostContext::getPredBlockCostDivisor(


Is this needed when the caller could just use Ctx.CM.getPredBlockCostDivisor?

I ran into this crash when llvm#158690 caused a loop with a struct call to be vectorized. If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe. If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize. There's two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!) The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertVlaueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.

llvmbot added backend:RISC-V vectorizers llvm:transforms labels Sep 15, 2025

lukel97 marked this pull request as draft September 15, 2025 17:26

lukel97 force-pushed the loop-vectorize/bfi branch from c191ed6 to a8810d5 Compare September 17, 2025 12:56

Precommit tests

994ac69

lukel97 force-pushed the loop-vectorize/bfi branch from a8810d5 to ccf4343 Compare September 23, 2025 04:22

[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor

319b4e7

lukel97 mentioned this pull request Sep 23, 2025

[VPlan] Fix packed replication of struct types #160274

Merged

lukel97 force-pushed the loop-vectorize/bfi branch from ccf4343 to 319b4e7 Compare September 23, 2025 11:56

Add comment to test

0b4a76e

lukel97 commented Sep 23, 2025

View reviewed changes

lukel97 changed the title ~~[WIP][VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor~~ [VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor Sep 23, 2025

lukel97 requested review from fhahn, david-arm, SamTebbs33 and MacDue September 23, 2025 12:47

lukel97 marked this pull request as ready for review September 23, 2025 12:51

fhahn reviewed Sep 23, 2025

View reviewed changes

lukel97 mentioned this pull request Sep 24, 2025

[VPlan] Don't apply predication discount to non-originally-predicated blocks #160449

Open

SamTebbs33 reviewed Oct 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor #158690

[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor #158690

lukel97 commented Sep 15, 2025 •

edited

Loading

Uh oh!

llvmbot commented Sep 15, 2025 •

edited

Loading

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 Sep 23, 2025

Uh oh!

lukel97 commented Sep 23, 2025

Uh oh!

fhahn Sep 23, 2025

Uh oh!

fhahn Sep 23, 2025

Uh oh!

SamTebbs33 Oct 2, 2025

Uh oh!

SamTebbs33 Oct 2, 2025

Uh oh!

Uh oh!

[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor #158690

Are you sure you want to change the base?

[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor #158690

Conversation

lukel97 commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lukel97 commented Sep 23, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lukel97 commented Sep 15, 2025 •

edited

Loading

llvmbot commented Sep 15, 2025 •

edited

Loading