[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

nilanjana87 · 2023-09-28T19:05:35Z

A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial once the interleaved part of the loop runs at least twice. The interleaved loop performance improves as trip count increases and is best for trip counts that don't have to run the epilogue loop.

The current trip count threshold to allow loop interleaving is 128 which seems arbitrarily high & uncorrelated with factors like VW, IC, register pressure etc.

We have also found example in an application benchmark that was compiled with PGO where a hot loop with trip count less than 24 shows a 40% regression since it didn't get interleaved because the profile-driven trip count was less than 128. Therefore, it seems reasonable to propose getting rid of this threshold and use the trip count for computing interleaving count instead.

Update: This PR has been broken into the following parts:

Precommit tests in [LV] Pre-committing tests for changing loop interleaving count computation #70272 & [LV] Adding/modifying pre-commit tests for changing loop interleaving count computation #74689
Part 1a for changing interleaving count computation in [LV] Change loops' interleave count computation #73766
Part 1b for changing interleaving count computation for loops that need to run at least one scalar iteration: [LV] Update interleaving count computation when scalar epilogue loop needs to run at least once #79651
Part 2 for removing trip count threshold for interleaving in this PR

llvmbot · 2023-09-28T19:06:48Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Changes

A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial once the interleaved part of the loop runs at least twice. The interleaved loop performance improves as trip count increases and is best for trip counts that don't have to run the epilogue loop. For example, if VW is the vectorization width & IC is the interleaving count, loops with trip count TC > VW * IC * 2 performs better with interleaving and the performance is best with loop interleaving when TC is a multiple of VW * IC.

The current trip count threshold to allow loop interleaving is 128 which seems arbitrarily high & uncorrelated with factors like VW, IC, register pressure etc.

We have also found example in an application benchmark that was compiled with PGO where a hot loop with trip count less than 24 shows a 40% regression since it didn't get interleaved because the profile-driven trip count was less than 128. Therefore, it seems reasonable to propose getting rid of this threshold and use the trip count for computing interleaving count instead.

Previously, https://reviews.llvm.org/D81416 addressed this issue with a configuration flag InterleaveSmallLoopScalarReductionTrue for a specific type of loop (small scalar loop with reduction). We left the IC computation as it is, when this flag is set to true.

Patch is 312.74 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67725.diff

16 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+11-17)
(modified) llvm/test/Transforms/LoopDistribute/basic-with-memchecks.ll (+2)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll (+132-29)
(modified) llvm/test/Transforms/LoopVectorize/SystemZ/zero_unroll.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll (+68-28)
(modified) llvm/test/Transforms/LoopVectorize/X86/interleave_short_tc.ll (+6-9)
(modified) llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll (+48-33)
(modified) llvm/test/Transforms/LoopVectorize/X86/load-deref-pred.ll (+530-161)
(modified) llvm/test/Transforms/LoopVectorize/X86/metadata-enable.ll (+837-821)
(modified) llvm/test/Transforms/LoopVectorize/X86/pr42674.ll (+13-5)
(modified) llvm/test/Transforms/LoopVectorize/X86/strided_load_cost.ll (+161-41)
(modified) llvm/test/Transforms/LoopVectorize/X86/unroll-small-loops.ll (+157-3)
(modified) llvm/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll (+25-16)
(modified) llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks-loopid-dbg.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/X86/vectorization-remarks.ll (+1-1)
(modified) llvm/test/Transforms/PhaseOrdering/X86/excessive-unrolling.ll (+89-89)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cc17d91d4f43727..f208f77080027ad 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -264,11 +264,6 @@ static cl::opt<bool> EnableMaskedInterleavedMemAccesses(
     "enable-masked-interleaved-mem-accesses", cl::init(false), cl::Hidden,
     cl::desc("Enable vectorization on masked interleaved memory accesses in a loop"));
 
-static cl::opt<unsigned> TinyTripCountInterleaveThreshold(
-    "tiny-trip-count-interleave-threshold", cl::init(128), cl::Hidden,
-    cl::desc("We don't interleave loops with a estimated constant trip count "
-             "below this number"));
-
 static cl::opt<unsigned> ForceTargetNumScalarRegs(
     "force-target-num-scalar-regs", cl::init(0), cl::Hidden,
     cl::desc("A flag that overrides the target's number of scalar registers."));
@@ -5664,14 +5659,6 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
 
   auto BestKnownTC = getSmallBestKnownTC(*PSE.getSE(), TheLoop);
   const bool HasReductions = !Legal->getReductionVars().empty();
-  // Do not interleave loops with a relatively small known or estimated trip
-  // count. But we will interleave when InterleaveSmallLoopScalarReduction is
-  // enabled, and the code has scalar reductions(HasReductions && VF = 1),
-  // because with the above conditions interleaving can expose ILP and break
-  // cross iteration dependences for reductions.
-  if (BestKnownTC && (*BestKnownTC < TinyTripCountInterleaveThreshold) &&
-      !(InterleaveSmallLoopScalarReduction && HasReductions && VF.isScalar()))
-    return 1;
 
   // If we did not calculate the cost for VF (because the user selected the VF)
   // then we calculate the cost of VF here.
@@ -5745,8 +5732,11 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   }
 
   // If trip count is known or estimated compile time constant, limit the
-  // interleave count to be less than the trip count divided by VF, provided it
-  // is at least 1.
+  // interleave count to be less than the trip count divided by VF * 2,
+  // provided VF is at least 1, such that the vector loop runs at least twice
+  // to make interleaving seem profitable. When
+  // InterleaveSmallLoopScalarReduction is true, we allow interleaving even when
+  // the vector loop runs once.
   //
   // For scalable vectors we can't know if interleaving is beneficial. It may
   // not be beneficial for small loops if none of the lanes in the second vector
@@ -5755,8 +5745,12 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   // the InterleaveCount as if vscale is '1', although if some information about
   // the vector is known (e.g. min vector size), we can make a better decision.
   if (BestKnownTC) {
-    MaxInterleaveCount =
-        std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
+    if (InterleaveSmallLoopScalarReduction)
+      MaxInterleaveCount =
+          std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
+    else
+      MaxInterleaveCount = std::min(*BestKnownTC / (VF.getKnownMinValue() * 2),
+                                    MaxInterleaveCount);
     // Make sure MaxInterleaveCount is greater than 0.
     MaxInterleaveCount = std::max(1u, MaxInterleaveCount);
   }
diff --git a/llvm/test/Transforms/LoopDistribute/basic-with-memchecks.ll b/llvm/test/Transforms/LoopDistribute/basic-with-memchecks.ll
index d1cff14c3f4b752..27ca1a7541db662 100644
--- a/llvm/test/Transforms/LoopDistribute/basic-with-memchecks.ll
+++ b/llvm/test/Transforms/LoopDistribute/basic-with-memchecks.ll
@@ -79,6 +79,8 @@ entry:
 
 
 ; VECTORIZE: mul <4 x i32>
+; VECTORIZE: mul <4 x i32>
+; VECTORIZE-NOT: mul <4 x i32>
 
 for.body:                                         ; preds = %for.body, %entry
   %ind = phi i64 [ 0, %entry ], [ %add, %for.body ]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
index 01657493e9d4a66..ccecc2fd8f9bc59 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
 ; REQUIRES: asserts
 ; RUN: opt -passes=loop-vectorize -S < %s -debug -prefer-predicate-over-epilogue=scalar-epilogue 2>%t | FileCheck %s
 ; RUN: cat %t | FileCheck %s --check-prefix=DEBUG
@@ -9,26 +10,78 @@ target triple = "aarch64-unknown-linux-gnu"
 ; DEBUG: Found an estimated cost of Invalid for VF vscale x 1 For instruction:   %indvars.iv.next1295 = add i7 %indvars.iv1294, 1
 
 define void @induction_i7(ptr %dst) #0 {
-; CHECK-LABEL: @induction_i7(
+; CHECK-LABEL: define void @induction_i7(
+; CHECK-SAME: ptr [[DST:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 64, [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
-; CHECK:         [[TMP4:%.*]] = call <vscale x 2 x i8> @llvm.experimental.stepvector.nxv2i8()
-; CHECK:         [[TMP5:%.*]] = trunc <vscale x 2 x i8> [[TMP4]] to <vscale x 2 x i7>
+; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 64, [[TMP3]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 64, [[N_MOD_VF]]
+; CHECK-NEXT:    [[IND_END:%.*]] = trunc i64 [[N_VEC]] to i7
+; CHECK-NEXT:    [[TMP4:%.*]] = call <vscale x 2 x i8> @llvm.experimental.stepvector.nxv2i8()
+; CHECK-NEXT:    [[TMP5:%.*]] = trunc <vscale x 2 x i8> [[TMP4]] to <vscale x 2 x i7>
 ; CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 2 x i7> [[TMP5]], zeroinitializer
 ; CHECK-NEXT:    [[TMP7:%.*]] = mul <vscale x 2 x i7> [[TMP6]], shufflevector (<vscale x 2 x i7> insertelement (<vscale x 2 x i7> poison, i7 1, i64 0), <vscale x 2 x i7> poison, <vscale x 2 x i32> zeroinitializer)
 ; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 2 x i7> zeroinitializer, [[TMP7]]
+; CHECK-NEXT:    [[TMP8:%.*]] = call i7 @llvm.vscale.i7()
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i7 [[TMP8]], 2
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i7 1, [[TMP9]]
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i7> poison, i7 [[TMP10]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i7> [[DOTSPLATINSERT]], <vscale x 2 x i7> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i7> [ [[INDUCTION]], %vector.ph ], [ [[VEC_IND_NEXT:%.*]], %vector.body ]
-; CHECK-NEXT:    [[TMP10:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[TMP11:%.*]] = add <vscale x 2 x i7> [[VEC_IND]], zeroinitializer
-; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[DST:%.*]], i64 [[TMP10]]
-; CHECK-NEXT:    [[EXT:%.+]]  = zext <vscale x 2 x i7> [[TMP11]] to <vscale x 2 x i64>
-; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i64, ptr [[TMP12]], i32 0
-; CHECK-NEXT:    store <vscale x 2 x i64> [[EXT]], ptr [[TMP13]], align 8
-; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 2
-; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP16]]
-; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 2 x i7> [[VEC_IND]],
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i7> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <vscale x 2 x i7> [[VEC_IND]], [[DOTSPLAT]]
+; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP12:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP13:%.*]] = mul i64 [[TMP12]], 2
+; CHECK-NEXT:    [[TMP14:%.*]] = add i64 [[TMP13]], 0
+; CHECK-NEXT:    [[TMP15:%.*]] = mul i64 [[TMP14]], 1
+; CHECK-NEXT:    [[TMP16:%.*]] = add i64 [[INDEX]], [[TMP15]]
+; CHECK-NEXT:    [[TMP17:%.*]] = add <vscale x 2 x i7> [[VEC_IND]], zeroinitializer
+; CHECK-NEXT:    [[TMP18:%.*]] = add <vscale x 2 x i7> [[STEP_ADD]], zeroinitializer
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i64, ptr [[DST]], i64 [[TMP11]]
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr inbounds i64, ptr [[DST]], i64 [[TMP16]]
+; CHECK-NEXT:    [[TMP21:%.*]] = zext <vscale x 2 x i7> [[TMP17]] to <vscale x 2 x i64>
+; CHECK-NEXT:    [[TMP22:%.*]] = zext <vscale x 2 x i7> [[TMP18]] to <vscale x 2 x i64>
+; CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds i64, ptr [[TMP19]], i32 0
+; CHECK-NEXT:    store <vscale x 2 x i64> [[TMP21]], ptr [[TMP23]], align 8
+; CHECK-NEXT:    [[TMP24:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP25:%.*]] = mul i64 [[TMP24]], 2
+; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i64, ptr [[TMP19]], i64 [[TMP25]]
+; CHECK-NEXT:    store <vscale x 2 x i64> [[TMP22]], ptr [[TMP26]], align 8
+; CHECK-NEXT:    [[TMP27:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP28:%.*]] = mul i64 [[TMP27]], 4
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP28]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 2 x i7> [[STEP_ADD]], [[DOTSPLAT]]
+; CHECK-NEXT:    [[TMP29:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP29]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 64, [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i7 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL1:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[INDVARS_IV1294:%.*]] = phi i7 [ [[INDVARS_IV_NEXT1295:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    [[INDVARS_IV1286:%.*]] = phi i64 [ [[INDVARS_IV_NEXT1287:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    [[ADDI7:%.*]] = add i7 [[INDVARS_IV1294]], 0
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[DST]], i64 [[INDVARS_IV1286]]
+; CHECK-NEXT:    [[EXT:%.*]] = zext i7 [[ADDI7]] to i64
+; CHECK-NEXT:    store i64 [[EXT]], ptr [[ARRAYIDX]], align 8
+; CHECK-NEXT:    [[INDVARS_IV_NEXT1287]] = add nuw nsw i64 [[INDVARS_IV1286]], 1
+; CHECK-NEXT:    [[INDVARS_IV_NEXT1295]] = add i7 [[INDVARS_IV1294]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT1287]], 64
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       for.end:
+; CHECK-NEXT:    ret void
 ;
 entry:
   br label %for.body
@@ -55,25 +108,75 @@ for.end:                                          ; preds = %for.body
 ; DEBUG: Found an estimated cost of Invalid for VF vscale x 1 For instruction:   %indvars.iv.next1295 = add i3 %indvars.iv1294, 1
 
 define void @induction_i3_zext(ptr %dst) #0 {
-; CHECK-LABEL: @induction_i3_zext(
+; CHECK-LABEL: define void @induction_i3_zext(
+; CHECK-SAME: ptr [[DST:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 64, [[TMP1]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
-; CHECK:         [[TMP4:%.*]] = call <vscale x 2 x i8> @llvm.experimental.stepvector.nxv2i8()
-; CHECK:         [[TMP5:%.*]] = trunc <vscale x 2 x i8> [[TMP4]] to <vscale x 2 x i3>
+; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 64, [[TMP3]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 64, [[N_MOD_VF]]
+; CHECK-NEXT:    [[IND_END:%.*]] = trunc i64 [[N_VEC]] to i3
+; CHECK-NEXT:    [[TMP4:%.*]] = call <vscale x 2 x i8> @llvm.experimental.stepvector.nxv2i8()
+; CHECK-NEXT:    [[TMP5:%.*]] = trunc <vscale x 2 x i8> [[TMP4]] to <vscale x 2 x i3>
 ; CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 2 x i3> [[TMP5]], zeroinitializer
 ; CHECK-NEXT:    [[TMP7:%.*]] = mul <vscale x 2 x i3> [[TMP6]], shufflevector (<vscale x 2 x i3> insertelement (<vscale x 2 x i3> poison, i3 1, i64 0), <vscale x 2 x i3> poison, <vscale x 2 x i32> zeroinitializer)
 ; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 2 x i3> zeroinitializer, [[TMP7]]
+; CHECK-NEXT:    [[TMP8:%.*]] = call i3 @llvm.vscale.i3()
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i3 [[TMP8]], 2
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i3 1, [[TMP9]]
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i3> poison, i3 [[TMP10]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i3> [[DOTSPLATINSERT]], <vscale x 2 x i3> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i3> [ [[INDUCTION]], %vector.ph ], [ [[VEC_IND_NEXT:%.*]], %vector.body ]
-; CHECK-NEXT:    [[TMP9:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[TMP10:%.*]] = zext <vscale x 2 x i3> [[VEC_IND]] to <vscale x 2 x i64>
-; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[DST:%.*]], i64 [[TMP9]]
-; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i64, ptr [[TMP12]], i32 0
-; CHECK-NEXT:    store <vscale x 2 x i64> [[TMP10]], ptr [[TMP13]], align 8
-; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 2
-; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP16]]
-; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 2 x i3> [[VEC_IND]],
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i3> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <vscale x 2 x i3> [[VEC_IND]], [[DOTSPLAT]]
+; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP12:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP13:%.*]] = mul i64 [[TMP12]], 2
+; CHECK-NEXT:    [[TMP14:%.*]] = add i64 [[TMP13]], 0
+; CHECK-NEXT:    [[TMP15:%.*]] = mul i64 [[TMP14]], 1
+; CHECK-NEXT:    [[TMP16:%.*]] = add i64 [[INDEX]], [[TMP15]]
+; CHECK-NEXT:    [[TMP17:%.*]] = zext <vscale x 2 x i3> [[VEC_IND]] to <vscale x 2 x i64>
+; CHECK-NEXT:    [[TMP18:%.*]] = zext <vscale x 2 x i3> [[STEP_ADD]] to <vscale x 2 x i64>
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i64, ptr [[DST]], i64 [[TMP11]]
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr inbounds i64, ptr [[DST]], i64 [[TMP16]]
+; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i64, ptr [[TMP19]], i32 0
+; CHECK-NEXT:    store <vscale x 2 x i64> [[TMP17]], ptr [[TMP21]], align 8
+; CHECK-NEXT:    [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP23:%.*]] = mul i64 [[TMP22]], 2
+; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i64, ptr [[TMP19]], i64 [[TMP23]]
+; CHECK-NEXT:    store <vscale x 2 x i64> [[TMP18]], ptr [[TMP24]], align 8
+; CHECK-NEXT:    [[TMP25:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP26:%.*]] = mul i64 [[TMP25]], 4
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP26]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 2 x i3> [[STEP_ADD]], [[DOTSPLAT]]
+; CHECK-NEXT:    [[TMP27:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP27]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 64, [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i3 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL1:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[INDVARS_IV1294:%.*]] = phi i3 [ [[INDVARS_IV_NEXT1295:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    [[INDVARS_IV1286:%.*]] = phi i64 [ [[INDVARS_IV_NEXT1287:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    [[ZEXTI3:%.*]] = zext i3 [[INDVARS_IV1294]] to i64
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[DST]], i64 [[INDVARS_IV1286]]
+; CHECK-NEXT:    store i64 [[ZEXTI3]], ptr [[ARRAYIDX]], align 8
+; CHECK-NEXT:    [[INDVARS_IV_NEXT1287]] = add nuw nsw i64 [[INDVARS_IV1286]], 1
+; CHECK-NEXT:    [[INDVARS_IV_NEXT1295]] = add i3 [[INDVARS_IV1294]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT1287]], 64
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK:       for.end:
+; CHECK-NEXT:    ret void
 ;
 entry:
   br label %for.body
diff --git a/llvm/test/Transforms/LoopVectorize/SystemZ/zero_unroll.ll b/llvm/test/Transforms/LoopVectorize/SystemZ/zero_unroll.ll
index d469178fec0e7b1..d8535d510ca03ab 100644
--- a/llvm/test/Transforms/LoopVectorize/SystemZ/zero_unroll.ll
+++ b/llvm/test/Transforms/LoopVectorize/SystemZ/zero_unroll.ll
@@ -1,4 +1,4 @@
-; RUN: opt -S -passes=loop-vectorize -mtriple=s390x-linux-gnu -tiny-trip-count-interleave-threshold=4 -vectorizer-min-trip-count=8 < %s | FileCheck %s
+; RUN: opt -S -passes=loop-vectorize -mtriple=s390x-linux-gnu -vectorizer-min-trip-count=8 < %s | FileCheck %s
 
 define i32 @main(i32 %arg, ptr nocapture readnone %arg1) #0 {
 ;CHECK: vector.body:
diff --git a/llvm/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll b/llvm/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll
index 8c33e7b8a59a58c..eaabfeb9cbb4f75 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll
@@ -73,23 +73,33 @@ define double @sumIfVector(ptr nocapture readonly %arr) {
 ; SSE:       vector.body:
 ; SSE-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; SSE-NEXT:    [[VEC_PHI:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PREDPHI:%.*]], [[VECTOR_BODY]] ]
+; SSE-NEXT:    [[VEC_PHI1:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PREDPHI3:%.*]], [[VECTOR_BODY]] ]
 ; SSE-NEXT:    [[TMP0:%.*]] = add i32 [[INDEX]], 0
-; SSE-NEXT:    [[TMP1:%.*]] = getelementptr double, ptr [[ARR:%.*]], i32 [[TMP0]]
-; SSE-NEXT:    [[TMP2:%.*]] = getelementptr double, ptr [[TMP1]], i32 0
-; SSE-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x double>, ptr [[TMP2]], align 8
-; SSE-NEXT:    [[TMP3:%.*]] = fcmp fast une <2 x double> [[WIDE_LOAD]], <double 4.200000e+01, double 4.200000e+01>
-; SSE-NEXT:    [[TMP4:%.*]] = fadd fast <2 x double> [[VEC_PHI]], [[WIDE_LOAD]]
-; SSE-NEXT:    [[TMP5:%.*]] = xor <2 x i1> [[TMP3]], <i1 true, i1 true>
-; SSE-NEXT:    [[PREDPHI]] = select <2 x i1> [[TMP3]], <2 x double> [[TMP4]], <2 x double> [[VEC_PHI]]
-; SSE-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
-; SSE-NEXT:    [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], 32
-; SSE-NEXT:    br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; SSE-NEXT:    [[TMP1:%.*]] = add i32 [[INDEX]], 2
+; SSE-NEXT:    [[TMP2:%.*]] = getelementptr double, ptr [[ARR:%.*]], i32 [[TMP0]]
+; SSE-NEXT:    [[TMP3:%.*]] = getelementptr double, ptr [[ARR]], i32 [[TMP1]]
+; SSE-NEXT:    [[TMP4:%.*]] = getelementptr double, ptr [[TMP2]], i32 ...
[truncated]

llvm/test/Transforms/LoopVectorize/X86/pr42674.ll

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll

llvm/test/Transforms/LoopVectorize/X86/unroll-small-loops.ll

…terleaving Count with varying loop iterations (#26) * [MicroBenchmarks,LoopInterleaving] Check performance impact of Loop Interleaving Count with varying loop iterations. This microbenchmark attempts to find the impact of loop interleaving count for different types of loops (big or small, with or without reductions inside them) over different vectorization factors for varying loop trip counts. Note: Interleaving count of 1 means interleaving is disabled. These microbenchmarks are to help guide changes in loop interleaving count computation and removal of trip count threshold for interleaving loops in llvm/llvm-project#67725 & related patches.

…terleaving Count with varying loop iterations (llvm#26) * [MicroBenchmarks,LoopInterleaving] Check performance impact of Loop Interleaving Count with varying loop iterations. This microbenchmark attempts to find the impact of loop interleaving count for different types of loops (big or small, with or without reductions inside them) over different vectorization factors for varying loop trip counts. Note: Interleaving count of 1 means interleaving is disabled. These microbenchmarks are to help guide changes in loop interleaving count computation and removal of trip count threshold for interleaving loops in llvm/llvm-project#67725 & related patches.

llvm/test/Transforms/LoopVectorize/X86/interleave_short_tc.ll

nilanjana87 · 2024-01-22T18:48:33Z

ping!

llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

nilanjana87 · 2024-01-27T02:33:16Z

Resolved conflicts & rebased on latest code. Created separate patch #79651 for addressing the issue with loop interleaving count computation for loops that have to run at least one scalar iteration in the epilogue, as seen in the limit-vf-by-tripcount.ll test case, which is otherwise unrelated to the changes in this patch.

nilanjana87 · 2024-01-30T20:32:21Z

Resolved conflicts & rebased on latest code. Created separate patch #79651 for addressing the issue with loop interleaving count computation for loops that have to run at least one scalar iteration in the epilogue, as seen in the limit-vf-by-tripcount.ll test case, which is otherwise unrelated to the changes in this patch.

Submitted #79651 and rebased this patch on it to fix the regression identified in limit_main_loop_vf_to_avoid_dead_main_vector_loop test case in limit-vf-by-tripcount.ll.

fhahn

LGTM, thanks!

Might be good to wait a day or so with committing in case there are additional thoughts.

…the loop The current loop trip count threshold to allow loop interleaving is 128 which seems arbitrarily high & uncorrelated with factors like VW, IC, register pressure etc. A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on a AArch64 platform, shows that loop interleaving is beneficial even for loops with low trip counts. We have also found similar evidence in an application benchmark that when compiled with PGO shows a 40% regression when it's hot loop with profile-guided trip count of 24 doesn't get interleaved because of this threshold. Therefore, it seems reasonable to eliminate this threshold and use the trip count for computing interleaving count instead (llvm#73766).

dyung · 2024-02-06T07:25:28Z

Hi @nilanjana87, one of the tests you modified in this change, test/Transforms/LoopDistribute/basic-with-memchecks.ll is failing on AArch64 bots. Can you take a look and revert if you need to investigate?

https://lab.llvm.org/buildbot/#/builders/188/builds/41465
https://lab.llvm.org/staging/#/builders/189/builds/13

rorth · 2024-02-06T07:39:47Z

It also FAILs on Solaris/sparcv9.

nilanjana87 · 2024-02-06T08:04:09Z

Hi @dyung, I am looking into it. The reason why I didn't see this in my local tests is because I configured for the build to include X86 targets as well, in which case this test succeeds perhaps because it's using the X86 based target-triple in the file. I'll try to fix this soon.

dyung · 2024-02-06T08:22:09Z

Hi @dyung, I am looking into it. The reason why I didn't see this in my local tests is because I configured for the build to include X86 targets as well, in which case this test succeeds perhaps because it's using the X86 based target-triple in the file. I'll try to fix this soon.

Thank you. And in the future, it is preferred not to leave failing tests on the build bots failing for so long because the failures could mask other issues.

Removed target-triple in target-independent test case to fix failing test caused by #67725.

fhahn · 2024-02-06T09:47:11Z

In the future, would be good to check the precommit test status as reported to the PR (as buildkite/github-pull-requests in the checks section)

nilanjana87 · 2024-02-06T20:52:41Z

In the future, would be good to check the precommit test status as reported to the PR (as buildkite/github-pull-requests in the checks section)

Noted. Though for this particular failed test case, it wasn't caught in the pre-commit test because the test build included the X86 target as well. This failure was only triggered when LLVM was built without an X86 target.

Recent set of changes (PR llvm#67725) in loop interleaving algorithm removed the loop trip count threshold for allowing interleaving. Therefore configuration option interleave-small-loop-scalar-reduction option is no longer needed.

Recent set of changes (PR #67725) in loop interleaving algorithm caused removal of the loop trip count threshold for allowing interleaving. Therefore configuration option interleave-small-loop-scalar-reduction is no longer needed.

…ave a loop (llvm#67725) A set of microbenchmarks (llvm/llvm-test-suite#26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.

Removed target-triple in target-independent test case to fix failing test caused by llvm#67725.

Recent set of changes (PR llvm#67725) in loop interleaving algorithm caused removal of the loop trip count threshold for allowing interleaving. Therefore configuration option interleave-small-loop-scalar-reduction is no longer needed.

…herry_picked [LV] Allow loop interleaving for loops with low trip count. (llvm#67725)

nilanjana87 requested review from nikic, fhahn, davemgreen, topperc, fpetrogalli, ayalz, ebrevnov, david-arm and AaronHLiu September 28, 2023 19:05

llvmbot added vectorization vectorizers llvm:transforms labels Sep 28, 2023

Endilll removed the vectorizers label Sep 29, 2023

david-arm reviewed Oct 5, 2023

View reviewed changes

llvm/test/Transforms/LoopVectorize/X86/pr42674.ll Outdated Show resolved Hide resolved

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

fhahn reviewed Oct 10, 2023

View reviewed changes

llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll Outdated Show resolved Hide resolved

llvm/test/Transforms/LoopVectorize/X86/unroll-small-loops.ll Outdated Show resolved Hide resolved

nilanjana87 force-pushed the loop_interleaving_threshold branch from fd354a7 to b8b6ebb Compare January 4, 2024 10:55

nilanjana87 commented Jan 4, 2024

View reviewed changes

llvm/test/Transforms/LoopVectorize/X86/interleave_short_tc.ll Show resolved Hide resolved

nilanjana87 force-pushed the loop_interleaving_threshold branch from 27eab7d to 5ac7eae Compare January 16, 2024 23:27

david-arm reviewed Jan 23, 2024

View reviewed changes

llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll Outdated Show resolved Hide resolved

fhahn reviewed Jan 25, 2024

View reviewed changes

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved

nilanjana87 force-pushed the loop_interleaving_threshold branch from d987bf5 to 0328528 Compare January 27, 2024 02:26

nilanjana87 force-pushed the loop_interleaving_threshold branch from 0328528 to ab5c249 Compare January 29, 2024 23:35

fhahn approved these changes Feb 1, 2024

View reviewed changes

nilanjana87 force-pushed the loop_interleaving_threshold branch from ab5c249 to ddfa53c Compare February 6, 2024 00:07

nilanjana87 force-pushed the loop_interleaving_threshold branch from ddfa53c to d1b4654 Compare February 6, 2024 00:19

nilanjana87 merged commit c1c5b85 into llvm:main Feb 6, 2024
3 of 4 checks passed

nilanjana87 added a commit to nilanjana87/llvm-project that referenced this pull request Feb 6, 2024

[Tests][LoopDistribute] Fixes failing unit test caused by llvm#67725

b403ffd

nilanjana87 mentioned this pull request Feb 6, 2024

[Tests][LoopDistribute] Fixes failing unit test #80809

Merged

nilanjana87 added a commit that referenced this pull request Feb 6, 2024

[Tests][LoopDistribute] Fixes failing unit test (#80809)

168002e

Removed target-triple in target-independent test case to fix failing test caused by #67725.

nilanjana87 mentioned this pull request Feb 26, 2024

[LV] Remove unused configuration option #82955

Merged

nilanjana87 mentioned this pull request Mar 2, 2024

[LV] Cherry-picked changes for loop interleaving algorithm apple/llvm-project#8320

Merged

nilanjana87 added a commit to apple/llvm-project that referenced this pull request Mar 5, 2024

[Tests][LoopDistribute] Fixes failing unit test (llvm#80809)

dcd3b01

Removed target-triple in target-independent test case to fix failing test caused by llvm#67725.

nilanjana87 added a commit to apple/llvm-project that referenced this pull request Mar 5, 2024

Merge pull request #8320 from nilanjana87/loop_interleaving_changes_c…

c385bf3

…herry_picked [LV] Allow loop interleaving for loops with low trip count. (llvm#67725)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

nilanjana87 commented Sep 28, 2023 •

edited

llvmbot commented Sep 28, 2023 •

edited

nilanjana87 commented Jan 22, 2024

nilanjana87 commented Jan 27, 2024

nilanjana87 commented Jan 30, 2024

fhahn left a comment

dyung commented Feb 6, 2024

rorth commented Feb 6, 2024

nilanjana87 commented Feb 6, 2024

dyung commented Feb 6, 2024

fhahn commented Feb 6, 2024

nilanjana87 commented Feb 6, 2024

[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

[LV] Relax high loop trip count threshold for deciding to interleave a loop #67725

Conversation

nilanjana87 commented Sep 28, 2023 • edited

llvmbot commented Sep 28, 2023 • edited

nilanjana87 commented Jan 22, 2024

nilanjana87 commented Jan 27, 2024

nilanjana87 commented Jan 30, 2024

fhahn left a comment

Choose a reason for hiding this comment

dyung commented Feb 6, 2024

rorth commented Feb 6, 2024

nilanjana87 commented Feb 6, 2024

dyung commented Feb 6, 2024

fhahn commented Feb 6, 2024

nilanjana87 commented Feb 6, 2024

nilanjana87 commented Sep 28, 2023 •

edited

llvmbot commented Sep 28, 2023 •

edited