-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor #158690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-backend-risc-v Author: Luke Lau (lukel97) ChangesIn 531.deepsjeng_r on SPEC CPU 2017 there's a loop that we unprofitably loop vectorize. The loop looks something like: for (int i = 0; i < n; i++) {
if (x0[i] == a)
if (x1[i] == b)
if (x2[i] == c)
// do stuff...
} Because it's so deeply nested the actual inner level of the loop rarely gets executed. However we still deem it profitable vectorize, which due to the if-conversion means we now always execute the body. This stems from the fact that We can fix this by using BlockFrequencyInfo, which gives a more accurate estimate of the innermost block being executed 12.5% of the time. We can then calculate the probability as Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V. This is a WIP, as there is an issue with BlockFrequencyInfo not being invalidated with LTO which leads to incorrect results that #157888 aims to fix. I'm posting this early to give a motivating example that illustrates the issue in There are also more test changes that I haven't included for now. Patch is 24.56 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158690.diff 6 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 640a98c622f80..088c35b23563d 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -578,8 +578,10 @@ class InnerLoopVectorizer {
/// The profitablity analysis.
LoopVectorizationCostModel *Cost;
- /// BFI and PSI are used to check for profile guided size optimizations.
+ /// Used to calculate the probability of predicated blocks in
+ /// getPredBlockCostDivisor.
BlockFrequencyInfo *BFI;
+ /// Used to check for profile guided size optimizations.
ProfileSummaryInfo *PSI;
/// Structure to hold information about generated runtime checks, responsible
@@ -900,7 +902,7 @@ class LoopVectorizationCostModel {
InterleavedAccessInfo &IAI,
ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI)
: ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE), LI(LI), Legal(Legal),
- TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), TheFunction(F),
+ TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), BFI(BFI), TheFunction(F),
Hints(Hints), InterleaveInfo(IAI) {
if (TTI.supportsScalableVectors() || ForceTargetSupportsScalableVectors)
initializeVScaleForTuning();
@@ -1249,6 +1251,17 @@ class LoopVectorizationCostModel {
/// Superset of instructions that return true for isScalarWithPredication.
bool isPredicatedInst(Instruction *I) const;
+ /// A helper function that returns how much we should divide the cost of a
+ /// predicated block by. Typically this is the reciprocal of the block
+ /// probability, i.e. if we return X we are assuming the predicated block will
+ /// execute once for every X iterations of the loop header so the block should
+ /// only contribute 1/X of its cost to the total cost calculation, but when
+ /// optimizing for code size it will just be 1 as code size costs don't depend
+ /// on execution probabilities.
+ inline unsigned
+ getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+ const BasicBlock *BB) const;
+
/// Return the costs for our two available strategies for lowering a
/// div/rem operation which requires speculating at least one lane.
/// First result is for scalarization (will be invalid for scalable
@@ -1711,6 +1724,8 @@ class LoopVectorizationCostModel {
/// Interface to emit optimization remarks.
OptimizationRemarkEmitter *ORE;
+ const BlockFrequencyInfo *BFI;
+
const Function *TheFunction;
/// Loop Vectorize Hint.
@@ -2863,6 +2878,19 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
}
}
+unsigned LoopVectorizationCostModel::getPredBlockCostDivisor(
+ TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) const {
+ if (CostKind == TTI::TCK_CodeSize)
+ return 1;
+
+ uint64_t HeaderFreq = BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
+ uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
+ assert(HeaderFreq >= BBFreq &&
+ "Header has smaller block freq than dominated BB?");
+ return BFI->getBlockFreq(TheLoop->getHeader()).getFrequency() /
+ BFI->getBlockFreq(BB).getFrequency();
+}
+
std::pair<InstructionCost, InstructionCost>
LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
ElementCount VF) const {
@@ -2899,7 +2927,8 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
// Scale the cost by the probability of executing the predicated blocks.
// This assumes the predicated block for each vector lane is equally
// likely.
- ScalarizationCost = ScalarizationCost / getPredBlockCostDivisor(CostKind);
+ ScalarizationCost =
+ ScalarizationCost / getPredBlockCostDivisor(CostKind, I->getParent());
}
InstructionCost SafeDivisorCost = 0;
@@ -5031,7 +5060,7 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
}
// Scale the total scalar cost by block probability.
- ScalarCost /= getPredBlockCostDivisor(CostKind);
+ ScalarCost /= getPredBlockCostDivisor(CostKind, PredInst->getParent());
// Compute the discount. A non-negative discount means the vector version
// of the instruction costs more, and scalarizing would be beneficial.
@@ -5084,7 +5113,7 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
// cost by the probability of executing it. blockNeedsPredication from
// Legal is used so as to not include all blocks in tail folded loops.
if (VF.isScalar() && Legal->blockNeedsPredication(BB))
- BlockCost /= getPredBlockCostDivisor(CostKind);
+ BlockCost /= getPredBlockCostDivisor(CostKind, BB);
Cost += BlockCost;
}
@@ -5162,7 +5191,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.
if (isPredicatedInst(I)) {
- Cost /= getPredBlockCostDivisor(CostKind);
+ Cost /= getPredBlockCostDivisor(CostKind, I->getParent());
// Add the cost of an i1 extract and a branch
auto *VecI1Ty =
@@ -6710,6 +6739,11 @@ bool VPCostContext::skipCostComputation(Instruction *UI, bool IsVector) const {
SkipCostComputation.contains(UI);
}
+unsigned VPCostContext::getPredBlockCostDivisor(
+ TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) const {
+ return CM.getPredBlockCostDivisor(CostKind, BB);
+}
+
InstructionCost
LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
VPCostContext &CostCtx) const {
@@ -10273,9 +10307,7 @@ PreservedAnalyses LoopVectorizePass::run(Function &F,
auto &MAMProxy = AM.getResult<ModuleAnalysisManagerFunctionProxy>(F);
PSI = MAMProxy.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
- BFI = nullptr;
- if (PSI && PSI->hasProfileSummary())
- BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
+ BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
LoopVectorizeResult Result = runImpl(F);
if (!Result.MadeAnyChange)
return PreservedAnalyses::all();
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 30a3a01ddd949..e2dffdd129c24 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -855,7 +855,9 @@ InstructionCost VPRegionBlock::cost(ElementCount VF, VPCostContext &Ctx) {
// For the scalar case, we may not always execute the original predicated
// block, Thus, scale the block's cost by the probability of executing it.
if (VF.isScalar())
- return ThenCost / getPredBlockCostDivisor(Ctx.CostKind);
+ if (auto *VPIRBB = dyn_cast<VPIRBasicBlock>(Then))
+ return ThenCost / Ctx.getPredBlockCostDivisor(Ctx.CostKind,
+ VPIRBB->getIRBasicBlock());
return ThenCost;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
index fe59774b7c838..0e9d6e47c740d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanHelpers.h
@@ -50,21 +50,6 @@ Value *getRuntimeVF(IRBuilderBase &B, Type *Ty, ElementCount VF);
Value *createStepForVF(IRBuilderBase &B, Type *Ty, ElementCount VF,
int64_t Step);
-/// A helper function that returns how much we should divide the cost of a
-/// predicated block by. Typically this is the reciprocal of the block
-/// probability, i.e. if we return X we are assuming the predicated block will
-/// execute once for every X iterations of the loop header so the block should
-/// only contribute 1/X of its cost to the total cost calculation, but when
-/// optimizing for code size it will just be 1 as code size costs don't depend
-/// on execution probabilities.
-///
-/// TODO: We should use actual block probability here, if available. Currently,
-/// we always assume predicated blocks have a 50% chance of executing.
-inline unsigned
-getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
- return CostKind == TTI::TCK_CodeSize ? 1 : 2;
-}
-
/// A range of powers-of-2 vectorization factors with fixed start and
/// adjustable end. The range includes start and excludes end, e.g.,:
/// [1, 16) = {1, 2, 4, 8}
@@ -378,6 +363,9 @@ struct VPCostContext {
InstructionCost getScalarizationOverhead(Type *ResultTy,
ArrayRef<const VPValue *> Operands,
ElementCount VF);
+
+ unsigned getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind,
+ const BasicBlock *BB) const;
};
/// This class can be used to assign names to VPValues. For VPValues without
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll
new file mode 100644
index 0000000000000..e683f19b6effa
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-predicated-costs.ll
@@ -0,0 +1,157 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -p loop-vectorize -mtriple=aarch64 -mattr=+sve -S %s | FileCheck %s
+
+define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @nested(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[X:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT: br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]]
+; CHECK: [[THEN_0]]:
+; CHECK-NEXT: br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]]
+; CHECK: [[THEN_1]]:
+; CHECK-NEXT: [[GEP0:%.*]] = getelementptr i64, ptr [[P0]], i32 [[X]]
+; CHECK-NEXT: [[X1:%.*]] = load i64, ptr [[GEP0]], align 8
+; CHECK-NEXT: [[GEP1:%.*]] = getelementptr i64, ptr [[P1]], i32 [[X]]
+; CHECK-NEXT: [[Y:%.*]] = load i64, ptr [[GEP1]], align 8
+; CHECK-NEXT: [[Z:%.*]] = udiv i64 [[X1]], [[Y]]
+; CHECK-NEXT: store i64 [[Z]], ptr [[GEP1]], align 8
+; CHECK-NEXT: br label %[[LATCH]]
+; CHECK: [[LATCH]]:
+; CHECK-NEXT: [[IV_NEXT]] = add i32 [[X]], 1
+; CHECK-NEXT: [[DONE:%.*]] = icmp eq i32 [[IV_NEXT]], 1024
+; CHECK-NEXT: br i1 [[DONE]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+ br i1 %c0, label %then.0, label %latch
+
+then.0:
+ br i1 %c1, label %then.1, label %latch
+
+then.1:
+ %gep0 = getelementptr i64, ptr %p0, i32 %iv
+ %x = load i64, ptr %gep0
+ %gep1 = getelementptr i64, ptr %p1, i32 %iv
+ %y = load i64, ptr %gep1
+ %z = udiv i64 %x, %y
+ store i64 %z, ptr %gep1
+ br label %latch
+
+latch:
+ %iv.next = add i32 %iv, 1
+ %done = icmp eq i32 %iv.next, 1024
+ br i1 %done, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+define void @always_taken(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @always_taken(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT: [[TMP1:%.*]] = shl nuw i32 [[TMP0]], 2
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 1024, [[TMP1]]
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT: [[TMP5:%.*]] = mul nuw i32 [[TMP4]], 4
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 1024, [[TMP5]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 1024, [[N_MOD_VF]]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[C0]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i1> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP6:%.*]] = select <vscale x 2 x i1> [[BROADCAST_SPLAT2]], <vscale x 2 x i1> [[BROADCAST_SPLAT]], <vscale x 2 x i1> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i64, ptr [[P0]], i32 [[INDEX]]
+; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP7:%.*]] = shl nuw i64 [[TMP8]], 1
+; CHECK-NEXT: [[TMP20:%.*]] = getelementptr i64, ptr [[TMP10]], i64 [[TMP7]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP10]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD3:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP20]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr i64, ptr [[P1]], i32 [[INDEX]]
+; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP11:%.*]] = shl nuw i64 [[TMP13]], 1
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr i64, ptr [[TMP9]], i64 [[TMP11]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD4:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP9]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP12]], i32 8, <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> poison)
+; CHECK-NEXT: [[TMP21:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> [[WIDE_MASKED_LOAD4]], <vscale x 2 x i64> splat (i64 1)
+; CHECK-NEXT: [[TMP14:%.*]] = select <vscale x 2 x i1> [[TMP6]], <vscale x 2 x i64> [[WIDE_MASKED_LOAD5]], <vscale x 2 x i64> splat (i64 1)
+; CHECK-NEXT: [[TMP15:%.*]] = udiv <vscale x 2 x i64> [[WIDE_MASKED_LOAD]], [[TMP21]]
+; CHECK-NEXT: [[TMP22:%.*]] = udiv <vscale x 2 x i64> [[WIDE_MASKED_LOAD3]], [[TMP14]]
+; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP18:%.*]] = shl nuw i64 [[TMP17]], 1
+; CHECK-NEXT: [[TMP19:%.*]] = getelementptr i64, ptr [[TMP9]], i64 [[TMP18]]
+; CHECK-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP15]], ptr [[TMP9]], i32 8, <vscale x 2 x i1> [[TMP6]])
+; CHECK-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP22]], ptr [[TMP19]], i32 8, <vscale x 2 x i1> [[TMP6]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP5]]
+; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 1024, [[N_VEC]]
+; CHECK-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[IV1:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT1:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT: br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]], !prof [[PROF3:![0-9]+]]
+; CHECK: [[THEN_0]]:
+; CHECK-NEXT: br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]], !prof [[PROF3]]
+; CHECK: [[THEN_1]]:
+; CHECK-NEXT: [[GEP0:%.*]] = getelementptr i64, ptr [[P0]], i32 [[IV1]]
+; CHECK-NEXT: [[X:%.*]] = load i64, ptr [[GEP0]], align 8
+; CHECK-NEXT: [[GEP1:%.*]] = getelementptr i64, ptr [[P1]], i32 [[IV1]]
+; CHECK-NEXT: [[Y:%.*]] = load i64, ptr [[GEP1]], align 8
+; CHECK-NEXT: [[Z:%.*]] = udiv i64 [[X]], [[Y]]
+; CHECK-NEXT: store i64 [[Z]], ptr [[GEP1]], align 8
+; CHECK-NEXT: br label %[[LATCH]]
+; CHECK: [[LATCH]]:
+; CHECK-NEXT: [[IV_NEXT1]] = add i32 [[IV1]], 1
+; CHECK-NEXT: [[DONE:%.*]] = icmp eq i32 [[IV_NEXT1]], 1024
+; CHECK-NEXT: br i1 [[DONE]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+ br i1 %c0, label %then.0, label %latch, !prof !4
+
+then.0:
+ br i1 %c1, label %then.1, label %latch, !prof !4
+
+then.1:
+ %gep0 = getelementptr i64, ptr %p0, i32 %iv
+ %x = load i64, ptr %gep0
+ %gep1 = getelementptr i64, ptr %p1, i32 %iv
+ %y = load i64, ptr %gep1
+ %z = udiv i64 %x, %y
+ store i64 %z, ptr %gep1
+ br label %latch
+
+latch:
+ %iv.next = add i32 %iv, 1
+ %done = icmp eq i32 %iv.next, 1024
+ br i1 %done, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+!4 = !{!"branch_weights", i32 1, i32 0}
+
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll b/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll
new file mode 100644
index 0000000000000..23f2624a207b3
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/predicated-costs.ll
@@ -0,0 +1,125 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt < %s -S -p loop-vectorize -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @nested(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[IV1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LATCH:.*]] ]
+; CHECK-NEXT: br i1 [[C0]], label %[[THEN_0:.*]], label %[[LATCH]]
+; CHECK: [[THEN_0]]:
+; CHECK-NEXT: br i1 [[C1]], label %[[THEN_1:.*]], label %[[LATCH]]
+; CHECK: [[THEN_1]]:
+; CHECK-NEXT: [[GEP2:%.*]] = getelementptr i32, ptr [[P0]], i32 [[IV1]]
+; CHECK-NEXT: [[X:%.*]] = load i32, ptr [[GEP2]], align 4
+; CHECK-NEXT: [[GEP1:%.*]] = getelementptr i32, ptr [[P1]], i32 [[X]]
+; CHECK-NEXT: store i32 0, ptr [[GEP1]], align 4
+; CHECK-NEXT: br label %[[LATCH]]
+; CHECK: [[LATCH]]:
+; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV1]], 1
+; CHECK-NEXT: [[DONE:%.*]] = icmp eq i32 [[IV_NEXT]], 1024
+; CHECK-NEXT: br i1 [[DONE]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
+ br i1 %c0, label %then.0, label %latch
+
+then.0:
+ br i1 %c1, label %then.1, label %latch
+
+then.1:
+ %gep0 = getelementptr i32, ptr %p0, i32 %iv
+ %x = load i32, ptr %gep0
+ %gep1 = getelementptr i32, ptr %p1, i32 %x
+ store i32 0, ptr %gep1
+ br label %latch
+
+latch:
+ %iv.next = add i32 %iv, 1
+ %done = icmp eq i32 %iv.next, 1024
+ br i1 %done, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+define void @always_taken(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
+; CHECK-LABEL: define void @always_taken(
+; CHECK-SAME: ptr noalias [[P0:%.*]], ptr noalias [[P1:%.*]], i1 [[C0:%.*]], i1 [[C1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[C1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i1> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_...
[truncated]
|
c191ed6
to
a8810d5
Compare
a8810d5
to
ccf4343
Compare
I ran into this crash when llvm#158690 caused a loop with a struct call to be vectorized. If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe. If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize. There's two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!) The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertVlaueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.
ccf4343
to
319b4e7
Compare
%var0 = getelementptr inbounds i64, ptr %a, i64 %i | ||
%var2 = load i64, ptr %var0, align 4 | ||
%cond0 = icmp sgt i64 %var2, 0 | ||
%cond0 = icmp sgt i64 %var2, 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant was changed to keep the old branch probability the same and keep the block scalarized, since with BFI icmp sgt %x, 0
is predicted to be slightly > 50%.
; COMMON-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7 | ||
; COMMON-NEXT: br i1 [[EC]], label %[[SCALAR_PH1:.*]], label %[[EXIT1]] | ||
; COMMON: [[SCALAR_PH1]]: | ||
; COMMON-NEXT: ret void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously all the predicated scalar stores were discounted by x0.5 in computePredInstDiscount. BFI now correctly returns that this block is always executed so the VF=1 plan is no longer discounted.
%l = load i64, ptr %gep.src, align 1 | ||
%t = trunc i64 %l to i1 | ||
br i1 %t, label %exit.0, label %loop.latch | ||
br i1 %t, label %exit.0, label %loop.latch, !prof !0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to retain the old branch probability. With BFI, the probability of exit.0 being taken is computed as 3% (why that is, I'm not sure) and the IV increment isn't discounted in the VF=1 plan.
; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ] | ||
; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <4 x i32> [ <i32 poison, i32 poison, i32 poison, i32 4>, [[VECTOR_PH]] ], [ [[VEC_IND]], [[PRED_STORE_CONTINUE6]] ] | ||
; CHECK-NEXT: [[TMP0:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[VEC_IND]], <4 x i32> <i32 3, i32 4, i32 5, i32 6> | ||
; CHECK-NEXT: [[TMP1:%.*]] = icmp ule <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously blocks that were predicated because of tail folding were discounted by 0.5x by computePredInstDiscount
because they were technically predicated. However in practice they're always going to be executed except for the last iteration. BFI correctly returns that the probability is ~100% for these instructions which removes the discount.
; AVX1-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 -3 | ||
; AVX1-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4, !alias.scope [[META18:![0-9]+]] | ||
; AVX1-NEXT: [[REVERSE:%.*]] = shufflevector <4 x i32> [[WIDE_LOAD]], <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0> | ||
; AVX1-NEXT: [[TMP3:%.*]] = icmp sgt <4 x i32> [[REVERSE]], zeroinitializer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BPI computes >= 0
to be slightly >50%, which influences the cost for AVX1 here.
; COST: [[LOOP_HEADER]]: | ||
; COST-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ] | ||
; COST-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ] | ||
; COST-NEXT: [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The total probability of all these switch destinations being executed is 1. BFI correctly divides up the probabilities, but previously each destination was given a 50% chance which meant the VF=1 VPlan was overly discounted.
; ENABLED_MASKED_STRIDED-NEXT: [[TMP8:%.*]] = extractelement <4 x i64> [[TMP2]], i64 1 | ||
; ENABLED_MASKED_STRIDED-NEXT: [[TMP9:%.*]] = getelementptr inbounds nuw i16, ptr [[POINTS]], i64 [[TMP8]] | ||
; ENABLED_MASKED_STRIDED-NEXT: [[TMP10:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i64 1 | ||
; ENABLED_MASKED_STRIDED-NEXT: store i16 [[TMP10]], ptr [[TMP9]], align 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Predication due to tail folding no longer accidentally discounts
; | ||
; PRED-LABEL: define void @iv_trunc( | ||
; PRED-SAME: i32 [[X:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] { | ||
; PRED-NEXT: [[ENTRY:.*:]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add branch probabilities to keep the generated code/original checks for the cost?
; DEFAULT-NEXT: [[TMP7:%.*]] = add <16 x i8> [[TMP4]], [[TMP6]] | ||
; DEFAULT-NEXT: [[TMP8:%.*]] = extractelement <16 x i1> [[TMP0]], i32 0 | ||
; DEFAULT-NEXT: br i1 [[TMP8]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]] | ||
; DEFAULT: [[PRED_STORE_IF]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to folding the tail, right? I think we could also assign a probability much closer to 1 w/o BFI, splitting up in 2 separate changes would make it easier to isolate issues.
… blocks Split off from llvm#158690. Currently if an instruction needs predicated due to tail folding, it will also have a predicated discount applied to it in multiple places. This is likely inaccurate because we can expect a tail folded instruction to be executed on every iteration bar the last. This fixes it by checking if the instruction/block was originally predicated, and in doing so prevents vectorization with tail folding where we would have had to scalarize the memory op anyway.
I ran into this crash when #158690 caused a loop with a struct call to be vectorized. If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe. If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize. There's two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!) The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertVlaueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.
uint64_t HeaderFreq = BFI->getBlockFreq(TheLoop->getHeader()).getFrequency(); | ||
uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency(); | ||
assert(HeaderFreq >= BBFreq && | ||
"Header has smaller block freq than dominated BB?"); | ||
return BFI->getBlockFreq(TheLoop->getHeader()).getFrequency() / | ||
BFI->getBlockFreq(BB).getFrequency(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already have HeaderFreq
and BBFreq
so I don't think you need to call getBlockFreq
again when returning.
SkipCostComputation.contains(UI); | ||
} | ||
|
||
unsigned VPCostContext::getPredBlockCostDivisor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed when the caller could just use Ctx.CM.getPredBlockCostDivisor
?
I ran into this crash when llvm#158690 caused a loop with a struct call to be vectorized. If we have a replicate recipe in a branch-on-mask predicated region that's used by a widened recipe in another block then it will be packed together with the other lanes via a VPPredInstPHIRecipe. If we're replicating a call with a struct return type then we currently crash. The code that handles structs in packScalarIntoVectorizedValue seemed to be untested at least on test/Transforms/LoopVectorize. There's two places that need to be fixed. The poison value that the scalar is packed into needs to use toVectorizedTy to correctly handle structs (not to be confused with toVectorTy!) The other is that VPPredInstPHIRecipe expects its operand to be an InsertElementInstr when stringing together the different lanes. For structs this will be an InsertVlaueInstr, and the value for the previous lane will be at the back of a chain of InsertValueInstrs.
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we unprofitably loop vectorize on RISC-V.
The loop looks something like:
Because it's so deeply nested the actual inner level of the loop rarely gets executed. However we still deem it profitable to vectorize, which due to the if-conversion means we now always execute the body.
This stems from the fact that
getPredBlockCostDivisor
currently assumes that blocks have 50% chance of being executed as a heuristic.We can fix this by using BlockFrequencyInfo, which gives a more accurate estimate of the innermost block being executed 12.5% of the time. We can then calculate the probability as
HeaderFrequency / BlockFrequency
.Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.
Whilst there's a lot of changes in the in-tree tests, this doesn't affect llvm-test-suite or SPEC CPU 2017 that much: