
LV: clamp VF with TC only when scalar epilogue is needed #91253

Closed

Conversation

@artagnon (Contributor) commented May 6, 2024

9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the isPowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, this condition is broken, as VPlan already picks the best VF when tail-folding, and the clamping code is only applicable in the case of a scalar epilogue.

Fixes #82626.

@llvmbot commented May 6, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip count) was the first commit to add the isPowerOf2_32() condition on the trip count to what is now getMaximizedVFForTarget(). It made sense at the time, as there was no tail-folding support. Much later, 2025e09 ([LV] Make sure VF doesn't exceed compile time known TC) came along to patch this with an extra condition on FoldTailByMasking, in order to ensure that the VF doesn't exceed the trip count. However, it didn't go far enough, and we can do better, as there is existing code to clamp the trip count and do tail-folding.

Fixes #82626.


Full diff: https://github.com/llvm/llvm-project/pull/91253.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll (+18-18)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3be0102bea3e34..2e358c19bf5c2a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4776,16 +4776,16 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
   if (MaxTripCount > 0 && requiresScalarEpilogue(true))
     MaxTripCount -= 1;
 
-  if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
-      (!FoldTailByMasking || isPowerOf2_32(MaxTripCount))) {
+  if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC) {
     // If upper bound loop trip count (TC) is known at compile time there is no
     // point in choosing VF greater than TC (as done in the loop below). Select
     // maximum power of two which doesn't exceed TC. If MaxVectorElementCount is
     // scalable, we only fall back on a fixed VF when the TC is less than or
     // equal to the known number of lanes.
-    auto ClampedUpperTripCount = llvm::bit_floor(MaxTripCount);
-    LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
-                         "exceeding the constant trip count: "
+    auto ClampedUpperTripCount = FoldTailByMasking
+                                     ? llvm::bit_ceil(MaxTripCount)
+                                     : llvm::bit_floor(MaxTripCount);
+    LLVM_DEBUG(dbgs() << "LV: Clamping the trip count to a power of two: "
                       << ClampedUpperTripCount << "\n");
     return ElementCount::get(
         ClampedUpperTripCount,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
index 5b90456a4f458b..882ab7ef8c1380 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
@@ -11,9 +11,6 @@ target triple = "aarch64"
 ; so will be masked.
 
 ; COST: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   %0 = load
-; COST: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   %0 = load
 ; COST: LV: Selecting VF: 1.
 
 define i32 @test(ptr nocapture noundef readonly %pInVec, ptr nocapture noundef readonly %pInA1, ptr nocapture noundef readonly %pInA2, ptr nocapture noundef readonly %pInA3, ptr nocapture noundef readonly %pInA4, i32 noundef %numCols) {
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
index 7ccbc98d26567c..2d1ad770e7808f 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
@@ -48,29 +48,29 @@ define void @trip3_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 16
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 4
 ; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 4
 ; CHECK-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP3]], 1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 3, [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 4
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 3)
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP7]], i64 3)
 ; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP9]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
+; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
 ; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT:    call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
+; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 4 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT:    call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; CHECK-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       middle.block:
@@ -119,29 +119,29 @@ define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 16
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 8
 ; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 8
 ; CHECK-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP3]], 1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 5, [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 5)
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[TMP7]], i64 5)
 ; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i64 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP9]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT:    [[TMP10:%.*]] = shl <vscale x 8 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 8 x i8> insertelement (<vscale x 8 x i8> poison, i8 1, i64 0), <vscale x 8 x i8> poison, <vscale x 8 x i32> zeroinitializer)
 ; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
-; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT:    call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 8 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT:    call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP6]]
 ; CHECK-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:       middle.block:

@artagnon (Contributor Author)

Gentle ping.

@artagnon artagnon changed the title LoopVectorize: optimize VF for low TC, when tail-folding [LV] optimize VF for low TC, when tail-folding May 14, 2024
@artagnon artagnon removed the request for review from MaskRay May 20, 2024 09:48
@artagnon (Contributor Author)

Gentle ping. The logic of this patch and the corresponding test changes should be straightforward.

@sjoerdmeijer (Collaborator)

Just checking, did you do some performance runs with this? Can you share some results?

@artagnon (Contributor Author)

Just checking, did you do some performance runs with this? Can you share some results?

I thought the change was a logical fix, and that the test updates made it clear what exactly the change was doing. I didn't think it was necessary to do performance benchmarking for this patch; besides, I don't have access to the necessary infrastructure at the moment.

@davemgreen (Collaborator)

Hi. One quick question: if there is a target that can vectorize naturally at, say, v16i8, but has more trouble vectorizing at v8i8 or v4i8, then why should the vectorizer not be considering larger vector factors? So long as it's a single iteration, could the higher factor not be lower cost?

I doubt it should matter a lot, but MVE has 128-bit vector lengths and smaller factors can be less natural for it. Does the cost model not already handle picking the best factor?

@artagnon (Contributor Author)

Hi. One quick question - If there is a target that can vectorize naturally at, say v16i8, but has more trouble vectorizing at v8i8 or v4i8, then why should the vectorizer not be considering larger vector factors? So long as it's a single iteration, could the higher factor not be lower cost?

I doubt it should matter a lot, but MVE has 128bit vector lengths and smaller factors can be less natural for it. Does the cost models not already handle picking the best factor?

VPlan already picks the best VF based on the CostModel. As a rule of thumb, a larger VF is more expensive than a smaller one, but there are several reasons why a larger VF could be cheaper: it depends on the target and the instruction itself. Here's an example from the tree:

; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i64 = call <4 x i64> @llvm.bswap.v4i64(<4 x i64> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v3i32 = call <3 x i32> @llvm.bswap.v3i32(<3 x i32> undef)

Now, the patch itself touches getMaximizedVFForTarget, which is called by computeFeasibleMaxVF, which in turn gets called in two cases:

  1. We need to generate a scalar epilogue.
  2. We need to fold the tail by masking.

The function of getMaximizedVFForTarget is to pick a VF that is no larger than necessary to fit the elements. At the end of the function, all cost modeling decisions are invalidated. When the trip count is known, and is less than or equal to the size of the widest register with the minimum element count, returning a VF that doesn't take the trip count into account is wasteful, bordering on a logical error.

In the case of a scalar epilogue, we correctly take the bit_floor of the TC. However, due to different people fixing different things in the code historically, we don't take the TC into account when we're folding the tail by masking. This patch fixes that.
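
For illustration, here is a minimal standalone sketch of the two clamping behaviours described above, using C++20's std::bit_floor/std::bit_ceil in place of LLVM's helpers; the function name and values are hypothetical, not the actual LV code:

#include <bit>
#include <cstdint>
#include <iostream>

// With a scalar epilogue, round the known trip count *down* to a power of two:
// the leftover iterations run in the scalar epilogue. When folding the tail by
// masking, rounding *up* still covers every element in a single masked pass.
uint32_t clampVFByTripCount(uint32_t MaxTripCount, bool FoldTailByMasking) {
  return FoldTailByMasking ? std::bit_ceil(MaxTripCount)
                           : std::bit_floor(MaxTripCount);
}

int main() {
  std::cout << clampVFByTripCount(5, /*FoldTailByMasking=*/false) << '\n'; // 4
  std::cout << clampVFByTripCount(5, /*FoldTailByMasking=*/true) << '\n';  // 8
}

This mirrors the bit_floor/bit_ceil distinction in the diff above.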

@davemgreen (Collaborator)

Sorry, I didn't really want to be a blocker; I've been doing that too much lately.

But there shouldn't be a problem in using 3 out of 8 vector lanes in a predicated v8i16, if it still uses a single vector iteration. It is still the same number of operations thanks to the predication. Why limit it to v4i16 if the target, like MVE, is better at 128-bit vectorization than 64-bit?
On paper I'm not sure there should be an advantage to limiting the VF for predicated loops, unless it was previously picking VFs that involved more instructions (more than the natural vector width), which then feels like it should be a cost-modelling issue.

@artagnon (Contributor Author) commented May 23, 2024

See the following example from the tree for masked loads:

; CHECK-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)

VPlan makes the cost-modeling decisions, which could return CM_ScalarEpilogueAllowed for a scalar epilogue, or CM_ScalarEpilogueNotNeededUsePredicate for tail-folding. Unless I'm very much mistaken, getMaximizedVFForTarget is a rule-of-thumb approximation that does not query the CostModel. I think VPlan would have already determined that MVE with a low TC is unprofitable to vectorize, and this code will never be called. If you have a suitable test case, that would be good. Let me just craft a test for MVE; I will post it soon.

@artagnon (Contributor Author)

Hi. I've added a low-TC MVE test under #93181, and verified that there are no changes due to this patch.

@davemgreen (Collaborator)

See the following example from the tree for masked loads:

; CHECK-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)

VPlan makes the cost-modeling decisions which could return CM_ScalarEpilogueAllowed for scalar-epilogue, or CM_ScalarEpilogueNotNeededUsePredicate for tail-folding. Unless I'm very much mistaken, getMaximizedVFForTarget is a rule-of-thumb approximation, that does not query the CostModel. I think VPlan would have already determined that MVE with low TC is unprofitable to vectorize, and this code will never be called. If you have a suitable test case, that would be good. Let me just craft a test for MVE: will post it soon.

I'll try and add a test case. There are some examples in https://godbolt.org/z/3aa81Yaqj of things that I thought might be more expensive for smaller vector lengths, even if the costs for X86 don't always show it (to be fair, I think that might be more about the loads/stores than the instructions in between in that case). Those are unpredicated, and some of the codegen could definitely be better, but if you imagine predicated loads/stores too, a small vector could be difficult to codegen.
MVE certainly has some peculiarities with it being low-power, but I don't think it is especially different other than its heavy use of (tail-)predicated vectorization. Short vectors in general do not always get looked at as much as longer ones, as they usually come up less.

My understanding is that the code in

return CM.foldTailByMasking() ? VectorCost * divideCeil(MaxTripCount, VF)
handles the costs, and will account for the trip count so long as it is known (and the vectors are fixed-width). Perhaps that could be extended to scalable vectors in some way too?
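
As a rough, self-contained illustration of that comparison (an assumed shape, not the actual LV code): scaling the per-iteration vector cost by ceil(TC / VF) is what lets the cost model account for a known low trip count. The per-iteration costs below reuse the masked-load cost figures quoted earlier in the thread, purely as example numbers:

#include <cstdint>
#include <iostream>

// Hypothetical sketch: total tail-folded cost = per-iteration vector cost
// times the number of masked vector iterations needed to cover the trip count.
uint64_t divideCeil(uint64_t Numerator, uint64_t Denominator) {
  return (Numerator + Denominator - 1) / Denominator;
}

uint64_t tailFoldedCost(uint64_t VectorCost, uint64_t MaxTripCount, uint64_t VF) {
  return VectorCost * divideCeil(MaxTripCount, VF);
}

int main() {
  const uint64_t TC = 5; // a low trip count, as in the trip5_i8 test above
  std::cout << tailFoldedCost(9, TC, 2) << '\n';  // 9 * 3 = 27
  std::cout << tailFoldedCost(19, TC, 4) << '\n'; // 19 * 2 = 38
  std::cout << tailFoldedCost(39, TC, 8) << '\n'; // 39 * 1 = 39
}

With these particular costs the narrowest VF still wins, which illustrates the point: the comparison, rather than an unconditional clamp, gets to decide.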

@artagnon (Contributor Author)

I'll try and add a test case. There are some examples in https://godbolt.org/z/3aa81Yaqj of things that I thought might be more expensive for smaller vector lengths, even if the costs for X86 don't always show it (to be fair I think that might be more about the loads/stores than the instructions between in that case). Those are unpredicated, and some of the codegen could definitely be better, but if you imagine with predicated loads/stores too small vector could be difficult to codegen. MVE certainly has some peculiarities with it being low-power, but I don't think it is especially different other than it is a heavy use of (tail) predicated vectorization. Short vectors in general do not always get looked at as much as longer ones as they usually come up less.

Okay, I'll wait for your test case then.

My understanding is that the code in

return CM.foldTailByMasking() ? VectorCost * divideCeil(MaxTripCount, VF)

handles the costs, and will account for the tripcount so long it is known (and the vectors are fixed-width). Perhaps that could be extended to scalable vectors in some way too?

Thanks for the suggestion! See #93300.

@davemgreen (Collaborator)

Sorry for the delay - I added a quick test case in 46541a3.

@artagnon (Contributor Author)

Sorry for the delay - I added a quick test case in 46541a3.

I can indeed confirm that your concerns were valid, and that this test breaks (it selects a VF of 1) with the patch. I'll have to think about this some more.

@artagnon artagnon marked this pull request as draft May 28, 2024 10:10
@artagnon artagnon changed the title [LV] optimize VF for low TC, when tail-folding LV: clamp VF with TC only when scalar epilogue is needed May 28, 2024
@artagnon artagnon marked this pull request as ready for review May 28, 2024 16:37
@artagnon (Contributor Author)

I've thought through the problem some more, experimented with different changes to see which tests fail, and have a proper fix now. I can confirm that the changed RISC-V tests have the same cost as before. This patch is now more of a cleanup, as #93300 is the main patch.

@artagnon artagnon force-pushed the lv-82626 branch 2 times, most recently from b93a9c3 to 434f865 Compare June 4, 2024 10:54
@artagnon (Contributor Author) commented Jun 4, 2024

Rebase and ping.

@artagnon (Contributor Author)

Gentle ping. Previously, the RISC-V test file was non-uniform for no good reason.

@artagnon (Contributor Author)

Gentle ping.

@davemgreen (Collaborator) left a comment

I don't believe this will have much effect on MVE any more. @fhahn (or others), any opinions on the remaining code?

9a087a3 (LoopVectorize: MaxVF should not be larger than the loop trip
count) was the first commit to add the isPowerOf2_32() condition on the
trip count to what is now getMaximizedVFForTarget(). It made sense at
the time, as there was no tail-folding support. Much later, 2025e09
([LV] Make sure VF doesn't exceed compile time known TC) came along to
patch this with an extra condition on FoldTailByMasking, in order to
ensure that the VF doesn't exceed the trip count. However, it didn't go
far enough, and we can do better, as there is existing code to clamp
the trip count and do tail-folding.

Fixes llvm#82626.
@artagnon (Contributor Author)

Rebase and ping. There is an additional positive test change after the rebase.

@artagnon (Contributor Author)

Gentle ping.

@artagnon (Contributor Author) commented Sep 4, 2024

Gentle ping.

@artagnon (Contributor Author)

@fhahn Could you kindly review this PR that has been open for a long time?

; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 8 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 8 x i8> [[TMP9]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <vscale x 8 x i8> [[TMP12]], ptr [[TMP11]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
Contributor

Are any of the test changes profitable? I'd assume that, at least for this one, vectorizing with VF 16 as done previously is more profitable, as we know that we execute exactly one vector iteration?

Contributor Author

Isn't the number of iterations the same, as fixed-width 16 has now been changed to vscale x 8, where vscale is computed as 2? On targets supporting scalable vectors, I think scalable-vector vectorization is slightly more profitable than fixed-width (see the profitability function in LV, which slightly prefers scalable over fixed).

Also, consider that this patch has removed an ugly and unnecessary special-case: this test output is now in line with the rest of the test outputs in this file.
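
For concreteness, a tiny standalone check of that lane-count claim, under the assumption that RISC-V computes vscale as VLEN / 64 (so vscale = 2 at the minimum VLEN of 128):

#include <cstdint>
#include <iostream>

int main() {
  const uint64_t VLEN = 128;         // assumed minimum vector register size
  const uint64_t vscale = VLEN / 64; // assumed RISC-V relation
  std::cout << "fixed VF 16:       16 lanes per iteration\n";
  std::cout << "<vscale x 8 x i8>: " << vscale * 8
            << " lanes per iteration\n"; // also 16 when vscale = 2
}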

Contributor

But we need an extra minimum-iterations check and code to compute the runtime VF. If 16 x i8 fits in a vector register, I'd assume that there's no benefit from using scalable vectors here.

It's not my area of expertise; maybe @preames or @nikolaypanchenko could chime in on whether this is desirable.

Contributor

On RISC-V this minimum-iterations check is going to be expensive, since the register size is unknown. For instance, this is what is generated before the change:

trip16_i8:                              # @trip16_i8
# %bb.0:                                # %entry
  vsetivli  zero, 16, e8, m1, ta, ma
  vle8.v  v8, (a1)
  vle8.v  v9, (a0)
  vadd.vv v8, v8, v8
  vadd.vv v8, v8, v9
  vse8.v  v8, (a0)
  ret      

And after the change:

  .type trip16_i8,@function 
trip16_i8:# @trip16_i8
# %bb.0:  # %entry 
  addi  sp, sp, -32
  sd  ra, 24(sp) # 8-byte Folded Spill  
  sd  s0, 16(sp) # 8-byte Folded Spill  
  sd  s1, 8(sp)  # 8-byte Folded Spill  
  csrr  a2, vlenb  
  srli  a2, a2, 3  
  li  a3, 2  
  mv  s1, a1 
  mv  s0, a0 
  bgeu  a3, a2, .LBB4_2  
# %bb.1:  
  li  a1, 0  
  j .LBB4_3  
.LBB4_2:  # %vector.ph
  csrr  a0, vlenb  
  li  a1, 3  
  call  __muldi3
  vl1r.v  v8, (s1) 
  vl1r.v  v9, (s0) 
  andi  a1, a0, 16 
  vsetvli a0, zero, e8, m1, ta, ma
  vadd.vv v8, v8, v8  
  vadd.vv v8, v8, v9  
  vs1r.v  v8, (s0) 
  bnez  a1, .LBB4_5
.LBB4_3:  # %for.body.preheader
  add a0, s0, a1
  add s1, s1, a1
  addi  s0, s0, 16
.LBB4_4:  # %for.body
 # =>This Inner Loop Header: Depth=1
  lbu a1, 0(s1)
  lbu a2, 0(a0)
  slli  a1, a1, 1
  add a1, a1, a2
  sb  a1, 0(a0)
  addi  a0, a0, 1
  addi  s1, s1, 1
  bne a0, s0, .LBB4_4
.LBB4_5:  # %for.end
  ld  ra, 24(sp) # 8-byte Folded Reload
  ld  s0, 16(sp) # 8-byte Folded Reload
  ld  s1, 8(sp)  # 8-byte Folded Reload
  addi  sp, sp, 32
  ret

However, the long-term plan is to enable EVL vectorization for RISC-V. As it doesn't require this extra check, the generated code will be good:

trip16_i8:                              # @trip16_i8 
# %bb.0:                                # %entry
  vsetivli  zero, 16, e8, m1, ta, ma
  vle8.v  v8, (a1)
  vle8.v  v9, (a0)
  vadd.vv v8, v8, v8
  vadd.vv v8, v8, v9
  vse8.v  v8, (a0)
  ret 

Contributor Author

Thanks for the explanation! The current code with isPowerOf2_32 seemed wrong to me, purely from a logical standpoint: it is technically the job of isMoreProfitableThan to determine the best VPlan with the best VF, with access to the CostModel, and the code I removed seemed to be doing an ad-hoc override of the VF, whose existence seems to be an accident of history (see the commit message). However, as EVL isn't enabled for RISC-V today, this change is a regression, and LV isn't ready for this cleanup yet. I will close this PR.

Comment on lines -381 to +422
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i8>, ptr [[TMP2]], align 1
; CHECK-NEXT: [[TMP3:%.*]] = shl <8 x i8> [[WIDE_LOAD]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i8>, ptr [[TMP5]], align 1
; CHECK-NEXT: [[TMP6:%.*]] = add <8 x i8> [[TMP3]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <8 x i8> [[TMP6]], ptr [[TMP5]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 24
; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i8>, ptr [[TMP8]], align 1
; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 4 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x i8>, ptr [[TMP11]], align 1
; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 4 x i8> [[TMP9]], [[WIDE_LOAD1]]
; CHECK-NEXT: store <vscale x 4 x i8> [[TMP12]], ptr [[TMP11]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
Contributor Author

So, for instance, this test change is profitable: we changed a fixed-width VF of 8 (requiring three iterations) to a scalable vscale x 4 (with vscale computed as 6, hence requiring one iteration).

Contributor

RISC-V uses vscale x N to encode LMUL. In this case <vscale x 4 x i8> represents half of a vector register of i8, i.e. with the default minimum vector register size (VLEN) = 128, that loop is going to have 3 vector iterations.
However, if the target or an option specifies a bigger VLEN, the number of vector iterations will decrease.

That said, this specific case is not clear:

  1. when VLEN = 128, this will perform worse than before (see the other comment)
  2. when VLEN >= 256, it might perform better; 256 might be borderline, though.

I'm speaking from the hardware we have, but on some targets it might be even better to choose a full vector + scalar.
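
A small worked sketch of those iteration counts, again assuming vscale = VLEN / 64 and using the trip count of 24 from the test quoted above (a hypothetical helper, not LV code):

#include <cstdint>
#include <iostream>

int main() {
  const uint64_t TC = 24; // trip count of the test quoted above
  for (uint64_t VLEN : {128, 256}) {
    const uint64_t Lanes = (VLEN / 64) * 4; // lanes covered by <vscale x 4 x i8>
    std::cout << "VLEN=" << VLEN << ": " << Lanes << " lanes/iteration, "
              << TC / Lanes << " full vector iterations\n";
  }
}

At VLEN = 128 that is 3 vector iterations, matching the comment above; at VLEN = 256 a single vector iteration covers 16 of the 24 elements, leaving the rest to the scalar remainder loop.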

@artagnon (Contributor Author)

LV isn't ready for this cleanup yet, and regressions are observed with this change.

@artagnon artagnon closed this Sep 16, 2024
@artagnon artagnon deleted the lv-82626 branch September 16, 2024 15:07
Development

Successfully merging this pull request may close this issue:

  • Remove power-of-two constraint on the trip count when tail folding (#82626)