LV: clamp VF with TC only when scalar epilogue is needed #91253
Conversation
@llvm/pr-subscribers-llvm-transforms
Author: Ramkumar Ramachandra (artagnon)
Changes
9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the PowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time, as there was no tail-folding support. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, it didn't go far enough, and we can do better, as there is existing code to clamp the trip count and do tail-folding.
Fixes #82626.
Full diff: https://github.com/llvm/llvm-project/pull/91253.diff
3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3be0102bea3e34..2e358c19bf5c2a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4776,16 +4776,16 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
if (MaxTripCount > 0 && requiresScalarEpilogue(true))
MaxTripCount -= 1;
- if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
- (!FoldTailByMasking || isPowerOf2_32(MaxTripCount))) {
+ if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC) {
// If upper bound loop trip count (TC) is known at compile time there is no
// point in choosing VF greater than TC (as done in the loop below). Select
// maximum power of two which doesn't exceed TC. If MaxVectorElementCount is
// scalable, we only fall back on a fixed VF when the TC is less than or
// equal to the known number of lanes.
- auto ClampedUpperTripCount = llvm::bit_floor(MaxTripCount);
- LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
- "exceeding the constant trip count: "
+ auto ClampedUpperTripCount = FoldTailByMasking
+ ? llvm::bit_ceil(MaxTripCount)
+ : llvm::bit_floor(MaxTripCount);
+ LLVM_DEBUG(dbgs() << "LV: Clamping the trip count to a power of two: "
<< ClampedUpperTripCount << "\n");
return ElementCount::get(
ClampedUpperTripCount,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
index 5b90456a4f458b..882ab7ef8c1380 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
@@ -11,9 +11,6 @@ target triple = "aarch64"
; so will be masked.
; COST: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %0 = load
; COST: LV: Selecting VF: 1.
define i32 @test(ptr nocapture noundef readonly %pInVec, ptr nocapture noundef readonly %pInA1, ptr nocapture noundef readonly %pInA2, ptr nocapture noundef readonly %pInA3, ptr nocapture noundef readonly %pInA4, i32 noundef %numCols) {
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
index 7ccbc98d26567c..2d1ad770e7808f 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
@@ -48,29 +48,29 @@ define void @trip3_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 16
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 3, [[TMP4]]
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 4
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 3)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP7]], i64 3)
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP9]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
+; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
+; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT: call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
@@ -119,29 +119,29 @@ define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 16
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 8
; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 5, [[TMP4]]
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 8
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 5)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[TMP7]], i64 5)
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP9]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 8 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 8 x i8> insertelement (<vscale x 8 x i8> poison, i8 1, i64 0), <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer)
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 8 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT: call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:
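For illustration, here is a minimal standalone sketch of the clamping behavior the LoopVectorize.cpp hunk above switches to, using C++20 std::bit_floor/std::bit_ceil as stand-ins for the llvm::bit_floor/llvm::bit_ceil calls; the trip-count value of 5 is made up for the example.

#include <bit>
#include <cstdint>
#include <iostream>

// Sketch of the clamping rule from the hunk above: when tail-folding, round
// the known trip count up to the next power of two (the masked tail covers
// the overshoot); with a scalar epilogue, round it down so the vector body
// never exceeds the trip count.
static uint32_t clampTripCount(uint32_t MaxTripCount, bool FoldTailByMasking) {
  return FoldTailByMasking ? std::bit_ceil(MaxTripCount)
                           : std::bit_floor(MaxTripCount);
}

int main() {
  std::cout << clampTripCount(5, /*FoldTailByMasking=*/true) << '\n';  // 8
  std::cout << clampTripCount(5, /*FoldTailByMasking=*/false) << '\n'; // 4
}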
Gentle ping.
Gentle ping. The logic of this patch and the corresponding test changes should be straightforward.
Just checking, did you do some performance runs with this? Can you share some results?
I thought the change was a logical fix, and that the test updates made it clear what exactly the change was doing. I didn't think it was necessary to do performance benchmarking for this patch; besides, I don't have access to the necessary infrastructure at the moment.
Hi. One quick question: if there is a target that can vectorize naturally at, say, v16i8, but has more trouble vectorizing at v8i8 or v4i8, then why should the vectorizer not be considering larger vector factors? So long as it's a single iteration, could the higher factor not be lower cost? I doubt it should matter a lot, but MVE has 128-bit vector lengths and smaller factors can be less natural for it. Does the cost model not already handle picking the best factor?
VPlan already picks the best VF based on the CostModel. As a rule of thumb, a larger VF is more expensive than a smaller one, but there are several reasons why a larger VF could be cheaper: it depends on the target and the instruction itself. Here's an example from the tree:
; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i64 = call <4 x i64> @llvm.bswap.v4i64(<4 x i64> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v3i32 = call <3 x i32> @llvm.bswap.v3i32(<3 x i32> undef)
Now, the patch itself is to clamp the VF with the trip count only when a scalar epilogue is needed.
The function of getMaximizedVFForTarget is to pick a power of two that doesn't exceed the known trip count. In the case of a scalar epilogue, we correctly take the bit_floor of the trip count.
Sorry, I didn't really want to be a blocker; I've been doing that too much lately. But there shouldn't be a problem in using 3 out of 8 vector lanes in a predicated v8i16, if it still uses a single vector iteration. It is still the same number of operations thanks to the predication. Why limit it to v4i16 if the target, like MVE, is better at 128-bit vectorization than 64-bit?
See the following example from the tree for masked loads:
; CHECK-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)
VPlan makes the cost-modeling decisions, which could return a much higher cost for a larger VF, as in the example above.
Hi. I've added a low-TC MVE test under #93181, and verified that there are no changes due to this patch.
I'll try and add a test case. There are some examples in https://godbolt.org/z/3aa81Yaqj of things that I thought might be more expensive for smaller vector lengths, even if the costs for X86 don't always show it (to be fair, I think that might be more about the loads/stores than the instructions between them in that case). Those are unpredicated, and some of the codegen could definitely be better, but if you imagine predicated loads/stores as well, too small a vector could be difficult to codegen. My understanding is that the code in
Okay, I'll wait for your test case then.
Thanks for the suggestion! See #93300.
Sorry for the delay - I added a quick test case in 46541a3.
I can indeed confirm that your concerns were valid, and that this test breaks (it selects VF of 1) with the patch. Have to think about this some more.
I've thought through the problem some more, experimented with different changes to see the failing tests, and have a proper fix now. I can confirm that the changed RISCV tests have the same cost as before. The patch is more of a cleanup patch now, as #93300 is the main patch.
Force-pushed from b93a9c3 to 434f865.
Rebase and ping.
Gentle ping. Previously, the RISC-V test file was non-uniform for no good reason.
Gentle ping.
I don't believe this will have much effect on MVE any more. @fhahn (or others) any opinions on the remaining code?
9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the PowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time, as there was no tail-folding support. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, it didn't go far enough, and we can do better, as there is existing code to clamp the trip count and do tail-folding. Fixes llvm#82626.
Rebase and ping. There is an additional positive test change after the rebase.
Gentle ping.
Gentle ping.
@fhahn Could you kindly review this PR, which has been open for a long time?
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 8 x i8> [[TMP9]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <vscale x 8 x i8> [[TMP12]], ptr [[TMP11]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]] |
Are any of the test changes profitable? I'd assume at least for this one, vectorizing with VF 16 as done previously is more profitable, as we know that we execute exactly one vector iteration.
Isn't the number of iterations the same, as the fixed-width 16 has now been changed to vscale x 8, where vscale is computed as 2? On targets supporting scalable vectors, I think scalable-vector vectorization is slightly more profitable than fixed-width (see the profitability function in LV, which slightly prefers scalable over fixed).
Also, consider that this patch has removed an ugly and unnecessary special-case: this test output is now in line with the rest of the test outputs in this file.
But we need an extra minimum-iterations check and code to compute the runtime VF. If 16 x i8 fits in a vector register, I'd assume that there's no benefit from using scalable vectors here.
It's not my area of expertise; maybe @preames or @nikolaypanchenko could chime in on whether this is desirable.
On RISC-V this minimum iteration check is going to be expensive, since the register size is unknown at compile time. For instance, this is what is generated before the change:
trip16_i8: # @trip16_i8
# %bb.0: # %entry
vsetivli zero, 16, e8, m1, ta, ma
vle8.v v8, (a1)
vle8.v v9, (a0)
vadd.vv v8, v8, v8
vadd.vv v8, v8, v9
vse8.v v8, (a0)
ret
After the change:
.type trip16_i8,@function
trip16_i8: # @trip16_i8
# %bb.0: # %entry
addi sp, sp, -32
sd ra, 24(sp) # 8-byte Folded Spill
sd s0, 16(sp) # 8-byte Folded Spill
sd s1, 8(sp) # 8-byte Folded Spill
csrr a2, vlenb
srli a2, a2, 3
li a3, 2
mv s1, a1
mv s0, a0
bgeu a3, a2, .LBB4_2
# %bb.1:
li a1, 0
j .LBB4_3
.LBB4_2: # %vector.ph
csrr a0, vlenb
li a1, 3
call __muldi3
vl1r.v v8, (s1)
vl1r.v v9, (s0)
andi a1, a0, 16
vsetvli a0, zero, e8, m1, ta, ma
vadd.vv v8, v8, v8
vadd.vv v8, v8, v9
vs1r.v v8, (s0)
bnez a1, .LBB4_5
.LBB4_3: # %for.body.preheader
add a0, s0, a1
add s1, s1, a1
addi s0, s0, 16
.LBB4_4: # %for.body
# =>This Inner Loop Header: Depth=1
lbu a1, 0(s1)
lbu a2, 0(a0)
slli a1, a1, 1
add a1, a1, a2
sb a1, 0(a0)
addi a0, a0, 1
addi s1, s1, 1
bne a0, s0, .LBB4_4
.LBB4_5: # %for.end
ld ra, 24(sp) # 8-byte Folded Reload
ld s0, 16(sp) # 8-byte Folded Reload
ld s1, 8(sp) # 8-byte Folded Reload
addi sp, sp, 32
ret
However, the long-term plan is to enable EVL vectorization for RISC-V. As it doesn't require this extra check, the generated code will be good:
trip16_i8: # @trip16_i8
# %bb.0: # %entry
vsetivli zero, 16, e8, m1, ta, ma
vle8.v v8, (a1)
vle8.v v9, (a0)
vadd.vv v8, v8, v8
vadd.vv v8, v8, v9
vse8.v v8, (a0)
ret
Thanks for the explanation! The current code with isPowerOf2_32 seemed wrong to me, purely from a logical standpoint: it is technically the job of isMoreProfitableThan to determine the best VPlan with the best VF, with access to the CostModel. The code I removed seemed to be doing an ad-hoc override of the VF, and its existence seems to be an accident of history (see the commit message). However, as EVL isn't enabled for RISC-V today, this change is a regression, and LV isn't ready for this cleanup yet. I will close this PR.
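As a rough illustration of the kind of decision being delegated to the cost model in the comment above, a toy comparison of two candidate VFs by cost per lane might look like the following. This is a simplified sketch, not LV's actual isMoreProfitableThan implementation; the Candidate type and the cost numbers are hypothetical.

#include <cstdint>
#include <iostream>

// Hypothetical candidate: a vectorization factor together with the estimated
// cost of one vector iteration at that factor.
struct Candidate {
  uint64_t Width; // number of lanes (VF)
  uint64_t Cost;  // estimated cost of one vector iteration
};

// A is preferred over B if its cost per lane is lower. Cross-multiplying
// avoids division: Cost_A / Width_A < Cost_B / Width_B.
static bool isCheaperPerLane(const Candidate &A, const Candidate &B) {
  return A.Cost * B.Width < B.Cost * A.Width;
}

int main() {
  // Illustrative numbers only: VF 16 costs 30 per iteration, VF 8 costs 10.
  Candidate VF16{16, 30}, VF8{8, 10};
  std::cout << (isCheaperPerLane(VF16, VF8) ? "prefer VF 16" : "prefer VF 8")
            << '\n'; // prints "prefer VF 8" (10/8 < 30/16)
}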
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1
; CHECK-NEXT: [[TMP3:%.*]] = shl <8 x i8> [[WIDE_LOAD]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i8>, ptr [[TMP5]], align 1
; CHECK-NEXT: [[TMP6:%.*]] = add <8 x i8> [[TMP3]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <8 x i8> [[TMP6]], ptr [[TMP5]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 24
; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i8>, ptr [[TMP8]], align 1
; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 4 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 4 x i8> [[TMP9]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <vscale x 4 x i8> [[TMP12]], ptr [[TMP11]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]] |
So, for instance, this test change is profitable: we changed a fixed-width 8 (requiring three iterations) to a scalable-vector 4 (with vscale computed as 6, hence requiring one iteration).
RISC-V uses vscale x N to encode LMUL. In this case, vscale x 4 x i8 represents half of a vector register of i8, i.e. with the default minimum vector register size (VLEN) of 128, that loop is going to have 3 vector iterations. However, if the target or an option specifies a bigger VLEN, the number of vector iterations will decrease.
That said, this specific case is not clear:
- when VLEN = 128, this will perform worse than before (see the other comment)
- when VLEN >= 256, it might perform better; 256 might be a borderline case, though.
I'm speaking from the hardware we have, but on some targets it might even be better to choose full vector + scalar.
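To make the arithmetic concrete, here is a small sketch, assuming the usual RISC-V mapping of vscale to VLEN/64 and the trip count of 24 from the test above; the VLEN values are illustrative.

#include <cstdint>
#include <iostream>

// For RVV, LLVM models vscale as VLEN / 64, so <vscale x 4 x i8> provides
// 4 * (VLEN / 64) lanes per vector iteration. With a scalar epilogue, the
// vector body runs floor(TC / lanes) iterations and the remainder runs scalar.
static void report(uint64_t VLEN, uint64_t TC) {
  uint64_t Lanes = 4 * (VLEN / 64); // lanes of <vscale x 4 x i8>
  std::cout << "VLEN=" << VLEN << ": " << Lanes << " lanes/iteration, "
            << TC / Lanes << " vector iterations, " << TC % Lanes
            << " scalar iterations\n";
}

int main() {
  report(128, 24); // 8 lanes  -> 3 vector iterations
  report(256, 24); // 16 lanes -> 1 vector iteration + 8 scalar
  report(384, 24); // 24 lanes -> 1 vector iteration, no scalar tail
}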
LV isn't ready for this cleanup yet, and regressions are observed with this change.
9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the PowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, this condition is broken, as VPlan already picks the best VF when tail-folding, and the clamping code is only applicable in the case of a scalar epilogue.
Fixes #82626.