[LV] Update interleaving count computation when scalar epilogue loop needs to run at least once #79651

nilanjana87 · 2024-01-26T21:24:12Z

This patch modifies Interleaving count computation to take into account loops that mandatorily require to run at least one scalar iteration in the epilogue loop. For this case, the available trip count for the vector loop is one less. requiresScalarEpilogue function checks if this case is true.

Initial test cases are in #79640.

llvmbot · 2024-01-26T21:24:40Z

@llvm/pr-subscribers-llvm-transforms

Author: Nilanjana Basu (nilanjana87)

Changes

This patch modifies Interleaving count computation to take into account loops that mandatorily require to run at least one scalar iteration in the epilogue loop. For this case, the available trip count for the vector loop is one less. requiresScalarEpilogue function checks if this case is true.

Full diff: https://github.com/llvm/llvm-project/pull/79651.diff

4 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+15-7)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_estimated_tc.ll (+48)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_known_tc.ll (+48)
(added) llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_loops_needing_scalar_epilogue.ll (+132)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index aa5d1bfa57d5353..c2fad626f2ee500 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5393,7 +5393,11 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   assert(EstimatedVF >= 1 && "Estimated VF shouldn't be less than 1");
 
   unsigned KnownTC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-  if (KnownTC) {
+  if (KnownTC > 0) {
+    // At least one iteration must be scalar when this constraint holds. So the
+    // maximum available iterations for interleaving is one less.
+    unsigned availableTC = (requiresScalarEpilogue(VF.isVector())) ? KnownTC - 1 : KnownTC;
+
     // If trip count is known we select between two prospective ICs, where
     // 1) the aggressive IC is capped by the trip count divided by VF
     // 2) the conservative IC is capped by the trip count divided by (VF * 2)
@@ -5403,27 +5407,31 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
     // we run the vector loop at least twice.
 
     unsigned InterleaveCountUB = bit_floor(
-        std::max(1u, std::min(KnownTC / EstimatedVF, MaxInterleaveCount)));
+        std::max(1u, std::min(availableTC / EstimatedVF, MaxInterleaveCount)));
     unsigned InterleaveCountLB = bit_floor(std::max(
-        1u, std::min(KnownTC / (EstimatedVF * 2), MaxInterleaveCount)));
+        1u, std::min(availableTC / (EstimatedVF * 2), MaxInterleaveCount)));
     MaxInterleaveCount = InterleaveCountLB;
 
     if (InterleaveCountUB != InterleaveCountLB) {
-      unsigned TailTripCountUB = (KnownTC % (EstimatedVF * InterleaveCountUB));
-      unsigned TailTripCountLB = (KnownTC % (EstimatedVF * InterleaveCountLB));
+      unsigned TailTripCountUB = (availableTC % (EstimatedVF * InterleaveCountUB));
+      unsigned TailTripCountLB = (availableTC % (EstimatedVF * InterleaveCountLB));
       // If both produce same scalar tail, maximize the IC to do the same work
       // in fewer vector loop iterations
       if (TailTripCountUB == TailTripCountLB)
         MaxInterleaveCount = InterleaveCountUB;
     }
-  } else if (BestKnownTC) {
+  } else if (BestKnownTC > 0) {
+    // At least one iteration must be scalar when this constraint holds. So the
+    // maximum available iterations for interleaving is one less.
+    unsigned availableTC = (requiresScalarEpilogue(VF.isVector())) ? (*BestKnownTC) - 1 : *BestKnownTC;
+
     // If trip count is an estimated compile time constant, limit the
     // IC to be capped by the trip count divided by VF * 2, such that the vector
     // loop runs at least twice to make interleaving seem profitable when there
     // is an epilogue loop present. Since exact Trip count is not known we
     // choose to be conservative in our IC estimate.
     MaxInterleaveCount = bit_floor(std::max(
-        1u, std::min(*BestKnownTC / (EstimatedVF * 2), MaxInterleaveCount)));
+        1u, std::min(availableTC / (EstimatedVF * 2), MaxInterleaveCount)));
   }
 
   assert(MaxInterleaveCount > 0 &&
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_estimated_tc.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_estimated_tc.ll
index 5552f9dd70c954e..2c046327f7c815e 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_estimated_tc.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_estimated_tc.ll
@@ -125,6 +125,30 @@ for.end:
   ret void
 }
 
+; This has the same trip count as loop_with_profile_tc_64 but since the resulting interleaved group 
+; in this case may access memory out-of-bounds, it requires a scalar epilogue iteration for 
+; correctness, making at most 63 iterations available for interleaving.
+; When the auto-vectorizer chooses VF 16, it should choose IC 1 to leave a smaller scalar remainder
+; than IC 2
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
+define void @loop_with_profile_tc_64_scalar_epilogue_reqd(ptr noalias %p, ptr noalias %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr inbounds [3 x i8], ptr %p, i64 %i, i64 0
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr inbounds i8, ptr %q, i64 %i
+  store i8 %l, ptr %gep.dst, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !4
+
+for.end:
+  ret void
+}
+
 ; For a loop with a profile-guided estimated TC of 100, when the auto-vectorizer chooses VF 16, 
 ; it should choose conservatively IC 2 so that the vector loop runs twice at least
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
@@ -173,6 +197,30 @@ for.end:
   ret void
 }
 
+; This has the same trip count as loop_with_profile_tc_128 but since the resulting interleaved group 
+; in this case may access memory out-of-bounds, it requires a scalar epilogue iteration for 
+; correctness, making at most 127 iterations available for interleaving.
+; When the auto-vectorizer chooses VF 16, it should choose IC 2 to leave a smaller scalar remainder
+; than IC 4
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_profile_tc_128_scalar_epilogue_reqd(ptr noalias %p, ptr noalias %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr inbounds [3 x i8], ptr %p, i64 %i, i64 0
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr inbounds i8, ptr %q, i64 %i
+  store i8 %l, ptr %gep.dst, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !6
+
+for.end:
+  ret void
+}
+
 ; For a loop with a profile-guided estimated TC of 129, when the auto-vectorizer chooses VF 16, 
 ; it should choose conservatively IC 4 so that the vector loop runs twice at least
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_known_tc.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_known_tc.ll
index 0569bfb2ae4e027..9209df400afe4a9 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_known_tc.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_known_tc.ll
@@ -29,6 +29,30 @@ for.end:
   ret void
 }
 
+; This has the same trip count as loop_with_tc_32 but since the resulting interleaved group 
+; in this case may access memory out-of-bounds, it requires a scalar epilogue iteration for 
+; correctness, making at most 31 iterations available for interleaving.
+; When the auto-vectorizer chooses VF 16, it should choose IC 1 to leave a smaller scalar remainder
+; than IC 2
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
+define void @loop_with_tc_32_scalar_epilogue_reqd(ptr noalias %p, ptr noalias %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr inbounds [3 x i8], ptr %p, i64 %i, i64 0
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr inbounds i8, ptr %q, i64 %i
+  store i8 %l, ptr %gep.dst, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 32
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
 ; For this loop with known TC of 33, when the auto-vectorizer chooses VF 16, it should choose
 ; IC 2 since there is a small remainder loop TC that needs to run after the vector loop.
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
@@ -197,6 +221,30 @@ for.end:
   ret void
 }
 
+; This has the same trip count as loop_with_tc_128 but since the resulting interleaved group 
+; in this case may access memory out-of-bounds, it requires a scalar epilogue iteration for 
+; correctness, making at most 127 iterations available for interleaving.
+; When the auto-vectorizer chooses VF 16, it should choose IC 2 to leave a smaller scalar remainder
+; than IC 4
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_tc_128_scalar_epilogue_reqd(ptr noalias %p, ptr noalias %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr inbounds [3 x i8], ptr %p, i64 %i, i64 0
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr inbounds i8, ptr %q, i64 %i
+  store i8 %l, ptr %gep.dst, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 128
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
 ; For this loop with known TC of 129, when the auto-vectorizer chooses VF 16, it should choose
 ; IC 8 since there is a small remainder loop that needs to run after the vector loop
 ; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_loops_needing_scalar_epilogue.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_loops_needing_scalar_epilogue.ll
new file mode 100644
index 000000000000000..7a09327b4e1a37d
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count_for_loops_needing_scalar_epilogue.ll
@@ -0,0 +1,132 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt < %s -force-target-max-vector-interleave=8 -p loop-vectorize -S 2>&1 | FileCheck %s
+; RUN: opt < %s -force-target-max-vector-interleave=8 -p loop-vectorize -pass-remarks=loop-vectorize -disable-output -S 2>&1 | FileCheck %s -check-prefix=CHECK-REMARKS
+
+target triple = "aarch64-linux-gnu"
+
+%pair = type { i8, i8 }
+
+; For this loop with known TC of 128, when the auto-vectorizer chooses VF 16, it should choose
+; IC 8 since there is no remainder loop run needed after the vector loop runs
+; CHECK-REMARKS: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 8)
+define void @loop_with_tc_128(ptr noalias %p, ptr noalias %q) {
+; CHECK-LABEL: define void @loop_with_tc_128(
+; CHECK-SAME: ptr noalias [[P:%.*]], ptr noalias [[Q:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %load.src = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %load.dst = load i8, ptr %gep.dst, align 1
+  %add = add i8 %load.src, %load.dst
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 128
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; This function has the same trip count as loop_with_tc_128 but since the resulting interleaved group
+; in this case may access memory out-of-bounds, it requires a scalar epilogue iteration for
+; correctness, making at most 127 iterations available for interleaving. 
+; The entry block should branch into the vector loop, instead of the scalar epilogue.
+; When the auto-vectorizer chooses VF 16, it should choose IC 2, to have a smaller scalar remainder
+; than when using IC 4. 
+; CHECK-REMARKS: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_tc_128_scalar_epilogue_reqd(ptr noalias %p, ptr noalias %q) {
+; CHECK-LABEL: define void @loop_with_tc_128_scalar_epilogue_reqd(
+; CHECK-SAME: ptr noalias [[P:%.*]], ptr noalias [[Q:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr inbounds [3 x i8], ptr %p, i64 %i, i64 0
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr inbounds i8, ptr %q, i64 %i
+  store i8 %l, ptr %gep.dst, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 128
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; For a loop with a profile-guided estimated TC of 128, when the auto-vectorizer chooses VF 16,
+; it should choose conservatively IC 4 so that the vector loop runs twice at least
+; CHECK-REMARKS: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 4)
+define void @loop_with_profile_tc_128(ptr noalias %p, ptr noalias %q, i64 %n) {
+; CHECK-LABEL: define void @loop_with_profile_tc_128(
+; CHECK-SAME: ptr noalias [[P:%.*]], ptr noalias [[Q:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]], !prof [[PROF6:![0-9]+]]
+; CHECK:       vector.main.loop.iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[N]], 64
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]], !prof [[PROF6]]
+;
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %load.src = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %load.dst = load i8, ptr %gep.dst, align 1
+  %add = add i8 %load.src, %load.dst
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !0
+
+for.end:
+  ret void
+}
+
+; This function has the same trip count as loop_with_profile_tc_128 but since the resulting interleaved group
+; in this case may access memory out-of-bounds, it requires a scalar epilogue iteration for
+; correctness, making at most 127 iterations available for interleaving.
+; When the auto-vectorizer chooses VF 16, it should choose IC 2, to have a smaller scalar remainder
+; than IC 4. 
+; CHECK-REMARKS: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+define void @loop_with_profile_tc_128_scalar_epilogue_reqd(ptr noalias %p, ptr noalias %q, i64 %n) {
+; CHECK-LABEL: define void @loop_with_profile_tc_128_scalar_epilogue_reqd(
+; CHECK-SAME: ptr noalias [[P:%.*]], ptr noalias [[Q:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ule i64 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]], !prof [[PROF6]]
+; CHECK:       vector.main.loop.iter.check:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK1:%.*]] = icmp ule i64 [[N]], 32
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]], !prof [[PROF6]]
+;
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %gep.src = getelementptr inbounds [3 x i8], ptr %p, i64 %i, i64 0
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr inbounds i8, ptr %q, i64 %i
+  store i8 %l, ptr %gep.dst, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !0
+
+for.end:
+  ret void
+}
+
+!0 = !{!"branch_weights", i32 1, i32 127}

github-actions · 2024-01-26T21:26:51Z

✅ With the latest revision this PR passed the C/C++ code formatter.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

fhahn

LGTM, thanks!

…needs to run at least once

…n limit-vf-by-tripcount.ll that got fixed by patch llvm#79651

…needs to run at least once (llvm#79651) Update loop interleaving count computation to address loops that require at least one scalar iteration in the epilogue loop. For this case, the available trip count for interleaving the loop is one less.

llvmbot added vectorization llvm:transforms labels Jan 26, 2024

fhahn reviewed Jan 26, 2024

View reviewed changes

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

nilanjana87 mentioned this pull request Jan 26, 2024

[Tests][LV][AArch64] Pre-commit tests for changing loop interleaving count computation for loops that need to run scalar iterations #79640

Merged

nilanjana87 requested review from ayalz, david-arm and davemgreen January 27, 2024 00:08

nilanjana87 mentioned this pull request Jan 27, 2024

[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

Merged

fhahn approved these changes Jan 27, 2024

View reviewed changes

[LV] Update interleaving count computation when scalar epilogue loop …

884aef2

…needs to run at least once

nilanjana87 force-pushed the loop_ic_for_scalar_epilogue branch from 45cded4 to 884aef2 Compare January 29, 2024 19:42

nilanjana87 merged commit c492eb6 into llvm:main Jan 29, 2024
3 of 4 checks passed

nilanjana87 added a commit to nilanjana87/llvm-project that referenced this pull request Jan 29, 2024

Updated limit_main_loop_vf_to_avoid_dead_main_vector_loop test case i…

ab5c249

…n limit-vf-by-tripcount.ll that got fixed by patch llvm#79651

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LV] Update interleaving count computation when scalar epilogue loop needs to run at least once #79651

[LV] Update interleaving count computation when scalar epilogue loop needs to run at least once #79651

nilanjana87 commented Jan 26, 2024 •

edited

llvmbot commented Jan 26, 2024

github-actions bot commented Jan 26, 2024 •

edited

fhahn left a comment

[LV] Update interleaving count computation when scalar epilogue loop needs to run at least once #79651

[LV] Update interleaving count computation when scalar epilogue loop needs to run at least once #79651

Conversation

nilanjana87 commented Jan 26, 2024 • edited

llvmbot commented Jan 26, 2024

github-actions bot commented Jan 26, 2024 • edited

fhahn left a comment

Choose a reason for hiding this comment

nilanjana87 commented Jan 26, 2024 •

edited

github-actions bot commented Jan 26, 2024 •

edited