
LV: clamp VF with TC only when scalar epilogue is needed #91253

Closed

Conversation

@artagnon (Contributor) commented May 6, 2024

9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the isPowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, this condition is broken, as VPlan already picks the best VF when tail-folding, and the clamping code is only applicable in the case of a scalar epilogue.

Fixes #82626.

@llvmbot commented May 6, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the isPowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time, as there was no tail-folding support. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, it didn't go far enough, and we can do better, as there is existing code to clamp the trip count and do tail-folding.

Fixes #82626.


Full diff: https://github.com/llvm/llvm-project/pull/91253.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll (+18-18)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3be0102bea3e34..2e358c19bf5c2a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4776,16 +4776,16 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
   if (MaxTripCount > 0 && requiresScalarEpilogue(true))
     MaxTripCount -= 1;
 
-  if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
-      (!FoldTailByMasking || isPowerOf2_32(MaxTripCount))) {
+  if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC) {
     // If upper bound loop trip count (TC) is known at compile time there is no
     // point in choosing VF greater than TC (as done in the loop below). Select
     // maximum power of two which doesn't exceed TC. If MaxVectorElementCount is
     // scalable, we only fall back on a fixed VF when the TC is less than or
     // equal to the known number of lanes.
-    auto ClampedUpperTripCount = llvm::bit_floor(MaxTripCount);
-    LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
-                         "exceeding the constant trip count: "
+    auto ClampedUpperTripCount = FoldTailByMasking
+                                     ? llvm::bit_ceil(MaxTripCount)
+                                     : llvm::bit_floor(MaxTripCount);
+    LLVM_DEBUG(dbgs() << "LV: Clamping the trip count to a power of two: "
                       << ClampedUpperTripCount << "\n");
     return ElementCount::get(
         ClampedUpperTripCount,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
index 5b90456a4f458b..882ab7ef8c1380 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
@@ -11,9 +11,6 @@ target triple = "aarch64"
 ; so will be masked.
 
 ; COST: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   %0 = load
 ; COST: LV: Selecting VF: 1.
 
 define i32 @test(ptr nocapture noundef readonly %pInVec, ptr nocapture noundef readonly %pInA1, ptr nocapture noundef readonly %pInA2, ptr nocapture noundef readonly %pInA3, ptr nocapture noundef readonly %pInA4, i32 noundef %numCols) {
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
index 7ccbc98d26567c..2d1ad770e7808f 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
@@ -48,29 +48,29 @@ define void @trip3_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 16
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 4
 ; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 4
 ; CHECK-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP3]], 1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 3, [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 4
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 3)
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP7]], i64 3)
 ; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP9]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
+; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
 ; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT:    call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
+; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 4 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT:    call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; CHECK-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       middle.block:
@@ -119,29 +119,29 @@ define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 16
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 8
 ; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 8
 ; CHECK-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP3]], 1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 5, [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 5)
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[TMP7]], i64 5)
 ; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP9]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 8 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 8 x i8> insertelement (<vscale x 8 x i8> poison, i8 1, i64 0), <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer)
 ; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT:    call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 8 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT:    call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; CHECK-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:       middle.block:

@artagnon (Contributor Author)

Gentle ping.

@artagnon artagnon changed the title LoopVectorize: optimize VF for low TC, when tail-folding [LV] optimize VF for low TC, when tail-folding May 14, 2024
@artagnon artagnon removed the request for review from MaskRay May 20, 2024 09:48
@artagnon (Contributor Author)

Gentle ping. The logic of this patch and the corresponding test changes should be straightforward.

@sjoerdmeijer (Collaborator)

Just checking, did you do some performance runs with this? Can you share some results?

@artagnon (Contributor Author)

Just checking, did you do some performance runs with this? Can you share some results?

I thought the change was a logical fix, and that the test updates made it clear what exactly the change was doing. I didn't think it was necessary to do performance benchmarking for this patch; besides, I don't have access to the necessary infrastructure at the moment.

@davemgreen (Collaborator)

Hi. One quick question: if there is a target that can vectorize naturally at, say, v16i8, but has more trouble vectorizing at v8i8 or v4i8, then why should the vectorizer not be considering larger vector factors? So long as it's a single iteration, could the higher factor not be lower cost?

I doubt it should matter a lot, but MVE has 128-bit vector lengths and smaller factors can be less natural for it. Does the cost model not already handle picking the best factor?

@artagnon (Contributor Author)

Hi. One quick question - If there is a target that can vectorize naturally at, say v16i8, but has more trouble vectorizing at v8i8 or v4i8, then why should the vectorizer not be considering larger vector factors? So long as it's a single iteration, could the higher factor not be lower cost?

I doubt it should matter a lot, but MVE has 128bit vector lengths and smaller factors can be less natural for it. Does the cost models not already handle picking the best factor?

VPlan already picks the best VF based on the CostModel. As a rule of thumb, a larger VF is more expensive than a smaller one, but there are several reasons why a larger VF could be cheaper: it depends on the target and the instruction itself. Here's an example from the tree:

; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i64 = call <4 x i64> @llvm.bswap.v4i64(<4 x i64> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v3i32 = call <3 x i32> @llvm.bswap.v3i32(<3 x i32> undef)

Now, the patch itself touches getMaximizedVFForTarget, which is called by computeFeasibleMaxVF, which in turn gets called in two cases:

  1. We need to generate a scalar epilogue.
  2. We need to fold the tail by masking.

The function of getMaximizedVFForTarget is to pick a VF that is no larger than necessary to fit the elements. At the end of the function, all cost modeling decisions are invalidated. When the trip count is known, and is less than or equal to the size of the widest register with the minimum element count, returning a VF that doesn't take the trip count into account is wasteful, bordering on a logical error.

In the case of a scalar epilogue, we correctly take the bit_floor of the TC. However, due to different people fixing different things in the code historically, we don't take the TC into account when we're folding the tail by masking. This patch fixes that.
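
For illustration, here is a minimal standalone sketch of the two clamping behaviours described above, using C++20's std::bit_floor/std::bit_ceil in place of LLVM's helpers; the function name and values are hypothetical, not the actual LV code:

#include <bit>
#include <cstdint>
#include <iostream>

// With a scalar epilogue, round the known trip count *down* to a power of two:
// the leftover iterations run in the scalar epilogue. When folding the tail by
// masking, rounding *up* still covers every element in a single masked pass.
uint32_t clampVFByTripCount(uint32_t MaxTripCount, bool FoldTailByMasking) {
  return FoldTailByMasking ? std::bit_ceil(MaxTripCount)
                           : std::bit_floor(MaxTripCount);
}

int main() {
  std::cout << clampVFByTripCount(5, /*FoldTailByMasking=*/false) << '\n'; // 4
  std::cout << clampVFByTripCount(5, /*FoldTailByMasking=*/true) << '\n';  // 8
}

This mirrors the bit_floor/bit_ceil distinction in the diff above.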

@davemgreen (Collaborator)

Sorry, I didn't really want to be a blocker; I've been doing that too much lately.

But there shouldn't be a problem in using 3 out of 8 vector lanes in a predicated v8i16, if it still uses a single vector iteration. It is still the same number of operations thanks to the predication. Why limit it to v4i16 if the target, like MVE, is better at 128-bit vectorization than 64-bit?
On paper I'm not sure there should be an advantage to limiting the VF for predicated loops, unless it was previously picking VFs that involved more instructions (more than the natural vector width), which then feels like it should be a cost-modelling issue.

@artagnon (Contributor Author) commented May 23, 2024

See the following example from the tree for masked loads:

; CHECK-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)

VPlan makes the cost-modeling decisions, which could return CM_ScalarEpilogueAllowed for a scalar epilogue, or CM_ScalarEpilogueNotNeededUsePredicate for tail-folding. Unless I'm very much mistaken, getMaximizedVFForTarget is a rule-of-thumb approximation that does not query the CostModel. I think VPlan would have already determined that MVE with a low TC is unprofitable to vectorize, and this code will never be called. If you have a suitable test case, that would be good. Let me just craft a test for MVE; I will post it soon.

@artagnon (Contributor Author)

Hi. I've added a low-TC MVE test under #93181, and verified that there are no changes due to this patch.

@davemgreen (Collaborator)

See the following example from the tree for masked loads:

; CHECK-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)

VPlan makes the cost-modeling decisions which could return CM_ScalarEpilogueAllowed for scalar-epilogue, or CM_ScalarEpilogueNotNeededUsePredicate for tail-folding. Unless I'm very much mistaken, getMaximizedVFForTarget is a rule-of-thumb approximation, that does not query the CostModel. I think VPlan would have already determined that MVE with low TC is unprofitable to vectorize, and this code will never be called. If you have a suitable test case, that would be good. Let me just craft a test for MVE: will post it soon.

I'll try and add a test case. There are some examples in https://godbolt.org/z/3aa81Yaqj of things that I thought might be more expensive for smaller vector lengths, even if the costs for X86 don't always show it (to be fair, I think that might be more about the loads/stores than the instructions in between in that case). Those are unpredicated, and some of the codegen could definitely be better, but if you imagine predicated loads/stores too, a small vector could be difficult to codegen.
MVE certainly has some peculiarities with it being low-power, but I don't think it is especially different other than its heavy use of (tail-)predicated vectorization. Short vectors in general do not always get looked at as much as longer ones, as they usually come up less.

My understanding is that the code in

return CM.foldTailByMasking() ? VectorCost * divideCeil(MaxTripCount, VF)
handles the costs, and will account for the trip count so long as it is known (and the vectors are fixed-width). Perhaps that could be extended to scalable vectors in some way too?
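
As a rough, self-contained illustration of that comparison (an assumed shape, not the actual LV code): scaling the per-iteration vector cost by ceil(TC / VF) is what lets the cost model account for a known low trip count. The per-iteration costs below reuse the masked-load cost figures quoted earlier in the thread, purely as example numbers:

#include <cstdint>
#include <iostream>

// Hypothetical sketch: total tail-folded cost = per-iteration vector cost
// times the number of masked vector iterations needed to cover the trip count.
uint64_t divideCeil(uint64_t Numerator, uint64_t Denominator) {
  return (Numerator + Denominator - 1) / Denominator;
}

uint64_t tailFoldedCost(uint64_t VectorCost, uint64_t MaxTripCount, uint64_t VF) {
  return VectorCost * divideCeil(MaxTripCount, VF);
}

int main() {
  const uint64_t TC = 5; // a low trip count, as in the trip5_i8 test above
  std::cout << tailFoldedCost(9, TC, 2) << '\n';  // 9 * 3 = 27
  std::cout << tailFoldedCost(19, TC, 4) << '\n'; // 19 * 2 = 38
  std::cout << tailFoldedCost(39, TC, 8) << '\n'; // 39 * 1 = 39
}

With these particular costs the narrowest VF still wins, which illustrates the point: the comparison, rather than an unconditional clamp, gets to decide.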

@artagnon (Contributor Author)

I'll try and add a test case. There are some examples in https://godbolt.org/z/3aa81Yaqj of things that I thought might be more expensive for smaller vector lengths, even if the costs for X86 don't always show it (to be fair I think that might be more about the loads/stores than the instructions between in that case). Those are unpredicated, and some of the codegen could definitely be better, but if you imagine with predicated loads/stores too small vector could be difficult to codegen. MVE certainly has some peculiarities with it being low-power, but I don't think it is especially different other than it is a heavy use of (tail) predicated vectorization. Short vectors in general do not always get looked at as much as longer ones as they usually come up less.

Okay, I'll wait for your test case then.

My understanding is that the code in

return CM.foldTailByMasking() ? VectorCost * divideCeil(MaxTripCount, VF)

handles the costs, and will account for the tripcount so long it is known (and the vectors are fixed-width). Perhaps that could be extended to scalable vectors in some way too?

Thanks for the suggestion! See #93300.

@davemgreen (Collaborator)

Sorry for the delay - I added a quick test case in 46541a3.

@artagnon (Contributor Author)

Sorry for the delay - I added a quick test case in 46541a3.

I can indeed confirm that your concerns were valid, and that this test breaks (it selects a VF of 1) with the patch. I'll have to think about this some more.

@artagnon artagnon marked this pull request as draft May 28, 2024 10:10
@artagnon artagnon changed the title [LV] optimize VF for low TC, when tail-folding LV: clamp VF with TC only when scalar epilogue is needed May 28, 2024
@artagnon artagnon marked this pull request as ready for review May 28, 2024 16:37
@artagnon (Contributor Author)

I've thought through the problem some more, experimented with different changes to see which tests fail, and have a proper fix now. I can confirm that the changed RISC-V tests have the same cost as before. This patch is now more of a cleanup, as #93300 is the main patch.

@artagnon artagnon force-pushed the lv-82626 branch 2 times, most recently from b93a9c3 to 434f865 Compare June 4, 2024 10:54
@artagnon (Contributor Author) commented Jun 4, 2024

Rebase and ping.

@artagnon (Contributor Author)

Gentle ping. Previously, the RISC-V test file was non-uniform for no good reason.

@artagnon (Contributor Author)

Gentle ping.

@davemgreen (Collaborator) left a comment

I don't believe this will have much effect on MVE any more. @fhahn (or others), any opinions on the remaining code?

9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip
count) was the first commit to add the isPowerOf2_32() condition on the
trip count to what is now getMaximizedVFForTarget(). It made sense at
the time, as there was no tail-folding support. Much later, 2025e09
([LV] Make sure VF doesn't exceed compile time known TC) came along to
patch this with an extra condition on FoldTailByMasking, in order to
ensure that the VF doesn't exceed the trip count. However, it didn't go
far enough, and we can do better, as there is existing code to clamp
the trip count and do tail-folding.

Fixes llvm#82626.
@artagnon (Contributor Author)

Rebase and ping. There is an additional positive test change after the rebase.

@artagnon (Contributor Author)

Gentle ping.

@artagnon (Contributor Author) commented Sep 4, 2024

Gentle ping.

@artagnon (Contributor Author)

@fhahn Could you kindly review this PR that has been open for a long time?

; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 8 x i8> [[TMP9]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <vscale x 8 x i8> [[TMP12]], ptr [[TMP11]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
Contributor

Are any of the test changes profitable? I'd assume that, at least for this one, vectorizing with VF 16 as done previously is more profitable, as we know that we execute exactly one vector iteration?

Contributor Author

Isn't the number of iterations the same, as fixed-width 16 has now been changed to vscale x 8, where vscale is computed as 2? On targets supporting scalable vectors, I think scalable-vector vectorization is slightly more profitable than fixed-width (see the profitability function in LV, which slightly prefers scalable over fixed).

Also, consider that this patch has removed an ugly and unnecessary special-case: this test output is now in line with the rest of the test outputs in this file.
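
For concreteness, a tiny standalone check of that lane-count claim, under the assumption that RISC-V computes vscale as VLEN / 64 (so vscale = 2 at the minimum VLEN of 128):

#include <cstdint>
#include <iostream>

int main() {
  const uint64_t VLEN = 128;         // assumed minimum vector register size
  const uint64_t vscale = VLEN / 64; // assumed RISC-V relation
  std::cout << "fixed VF 16:       16 lanes per iteration\n";
  std::cout << "<vscale x 8 x i8>: " << vscale * 8
            << " lanes per iteration\n"; // also 16 when vscale = 2
}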

Contributor

But we need an extra minimum-iterations check and code to compute the runtime VF. If 16 x i8 fits in a vector register, I'd assume that there's no benefit from using scalable vectors here.

It's not my area of expertise; maybe @preames or @nikolaypanchenko could chime in on whether this is desirable.

Contributor

On RISC-V this minimum-iterations check is going to be expensive, since the register size is unknown. For instance, this is what is generated before the change:

trip16_i8:                              # @trip16_i8
# %bb.0:                                # %entry
  vsetivli  zero, 16, e8, m1, ta, ma
  vle8.v  v8, (a1)
  vle8.v  v9, (a0)
  vadd.vv v8, v8, v8
  vadd.vv v8, v8, v9
  vse8.v  v8, (a0)
  ret      

And after the change:

  .type trip16_i8,@function 
trip16_i8:# @trip16_i8
# %bb.0:  # %entry 
  addi  sp, sp, -32
  sd  ra, 24(sp) # 8-byte Folded Spill  
  sd  s0, 16(sp) # 8-byte Folded Spill  
  sd  s1, 8(sp)  # 8-byte Folded Spill  
  csrr  a2, vlenb  
  srli  a2, a2, 3  
  li  a3, 2  
  mv  s1, a1 
  mv  s0, a0 
  bgeu  a3, a2, .LBB4_2  
# %bb.1:  
  li  a1, 0  
  j .LBB4_3  
.LBB4_2:  # %vector.ph
  csrr  a0, vlenb  
  li  a1, 3  
  call  __muldi3
  vl1r.v  v8, (s1) 
  vl1r.v  v9, (s0) 
  andi  a1, a0, 16 
  vsetvli a0, zero, e8, m1, ta, ma
  vadd.vv v8, v8, v8  
  vadd.vv v8, v8, v9  
  vs1r.v  v8, (s0) 
  bnez  a1, .LBB4_5
.LBB4_3:  # %for.body.preheader
  add a0, s0, a1
  add s1, s1, a1
  addi  s0, s0, 16
.LBB4_4:  # %for.body
 # =>This Inner Loop Header: Depth=1
  lbu a1, 0(s1)
  lbu a2, 0(a0)
  slli  a1, a1, 1
  add a1, a1, a2
  sb  a1, 0(a0)
  addi  a0, a0, 1
  addi  s1, s1, 1
  bne a0, s0, .LBB4_4
.LBB4_5:  # %for.end
  ld  ra, 24(sp) # 8-byte Folded Reload
  ld  s0, 16(sp) # 8-byte Folded Reload
  ld  s1, 8(sp)  # 8-byte Folded Reload
  addi  sp, sp, 32
  ret

However, the long-term plan is to enable EVL vectorization for RISC-V. As it doesn't require this extra check, the generated code will be good:

trip16_i8:                              # @trip16_i8 
# %bb.0:                                # %entry
  vsetivli  zero, 16, e8, m1, ta, ma
  vle8.v  v8, (a1)
  vle8.v  v9, (a0)
  vadd.vv v8, v8, v8
  vadd.vv v8, v8, v9
  vse8.v  v8, (a0)
  ret 

Contributor Author

Thanks for the explanation! The current code with isPowerOf2_32 seemed wrong to me, purely from a logical standpoint: it is technically the job of isMoreProfitableThan to determine the best VPlan with the best VF, with access to the CostModel, and the code I removed seemed to be doing an ad-hoc override of the VF, whose existence seems to be an accident of history (see the commit message). However, as EVL isn't enabled for RISC-V today, this change is a regression, and LV isn't ready for this cleanup yet. I will close this PR.

Comment on lines -381 to +422
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1
; CHECK-NEXT: [[TMP3:%.*]] = shl <8 x i8> [[WIDE_LOAD]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i8>, ptr [[TMP5]], align 1
; CHECK-NEXT: [[TMP6:%.*]] = add <8 x i8> [[TMP3]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <8 x i8> [[TMP6]], ptr [[TMP5]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 24
; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i8>, ptr [[TMP8]], align 1
; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 4 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 4 x i8> [[TMP9]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <vscale x 4 x i8> [[TMP12]], ptr [[TMP11]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
Contributor Author

So, for instance, this test change is profitable: we changed a fixed-width VF of 8 (requiring three iterations) to a scalable vscale x 4 (with vscale computed as 6, hence requiring one iteration).

Contributor

RISC-V uses vscale x N to encode LMUL. In this case <vscale x 4 x i8> represents half of a vector register of i8, i.e. with the default minimum vector register size (VLEN) = 128, that loop is going to have 3 vector iterations.
However, if the target or an option specifies a bigger VLEN, the number of vector iterations will decrease.

That said, this specific case is not clear:

  1. when VLEN = 128, this will perform worse than before (see the other comment)
  2. when VLEN >= 256, it might perform better; 256 might be borderline, though.

I'm speaking from the hardware we have, but on some targets it might be even better to choose a full vector + scalar.
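
A small worked sketch of those iteration counts, again assuming vscale = VLEN / 64 and using the trip count of 24 from the test quoted above (a hypothetical helper, not LV code):

#include <cstdint>
#include <iostream>

int main() {
  const uint64_t TC = 24; // trip count of the test quoted above
  for (uint64_t VLEN : {128, 256}) {
    const uint64_t Lanes = (VLEN / 64) * 4; // lanes covered by <vscale x 4 x i8>
    std::cout << "VLEN=" << VLEN << ": " << Lanes << " lanes/iteration, "
              << TC / Lanes << " full vector iterations\n";
  }
}

At VLEN = 128 that is 3 vector iterations, matching the comment above; at VLEN = 256 a single vector iteration covers 16 of the 24 elements, leaving the rest to the scalar remainder loop.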

@artagnon (Contributor Author)

LV isn't ready for this cleanup yet, and regressions are observed with this change.

@artagnon artagnon closed this Sep 16, 2024
@artagnon artagnon deleted the lv-82626 branch September 16, 2024 15:07
Development

Successfully merging this pull request may close this issue:

  • Remove power-of-two constraint on the trip count when tail folding (#82626)