
Conversation

artagnon
Contributor

@artagnon artagnon commented May 24, 2024

413a66f ([LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL.) uses TailFoldingStyle::DataWithoutLaneMask when EVL is determined to be illegal, but this is unnecessary, as @llvm.get.active.lane.mask is unrelated to EVL. Moreover, SelectionDAG automatically detects if a target supports the @llvm.get.active.lane.mask intrinsic, and lowers it to equivalent instructions on targets where it is not preferred, since 243a532 ([SelectionDAG] Lower @llvm.get.active.lane.mask to setcc). Hence, change this usage to TailFoldingStyle::Data.
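For a rough picture of what the fallback now produces, here is a minimal sketch of the mask TailFoldingStyle::Data emits per vector iteration (hypothetical fixed VF of 4 and made-up names, not taken from the patch); since 243a532, SelectionDAG expands the intrinsic into an equivalent compare-based sequence on targets without a native lowering:

  declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64, i64)

  ; Lane i is active iff %index + i < %n (unsigned, no wrap).
  define <4 x i1> @data_style_mask(i64 %index, i64 %n) {
    %mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %index, i64 %n)
    ret <4 x i1> %mask
  }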

@llvmbot
Member

llvmbot commented May 24, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-powerpc

Author: Ramkumar Ramachandra (artagnon)

Changes

There is just one usage of TailFoldingStyle::DataWithoutLaneMask in LoopVectorize, introduced by 413a66f ([LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL.), but this usage is completely unnecessary, as @llvm.get.active.lane.mask is unrelated to EVL. Moreover, SelectionDAG automatically detects if a target supports the @llvm.get.active.lane.mask intrinsic, and lowers it to equivalent instructions on targets where it is not preferred, since 243a532 ([SelectionDAG] Lower @llvm.get.active.lane.mask to setcc).


Patch is 75.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/93303.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+3-8)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll (-56)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll (+320-64)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll (+15-23)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-interleave.ll (+49-52)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/vectorize-force-tail-with-evl.ll (+16-22)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/vplan-vp-intrinsics.ll (+243-50)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 52cb8c9f88f94..44905bee96519 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -243,9 +243,6 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
         clEnumValN(
             TailFoldingStyle::Data, "data",
             "Create lane mask for data only, using active.lane.mask intrinsic"),
-        clEnumValN(TailFoldingStyle::DataWithoutLaneMask,
-                   "data-without-lane-mask",
-                   "Create lane mask with compare/stepvector"),
         clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                    "Create lane mask using active.lane.mask intrinsic, and use "
                    "it for both data and control flow"),
@@ -1538,12 +1535,10 @@ class LoopVectorizationCostModel {
         // FIXME: remove this once reductions are supported.
         Legal->getReductionVars().empty();
     if (!EVLIsLegal) {
-      // If for some reason EVL mode is unsupported, fallback to
-      // DataWithoutLaneMask to try to vectorize the loop with folded tail
-      // in a generic way.
+      // If for some reason EVL mode is unsupported, fallback to Data to try to
+      // vectorize the loop with folded tail in a generic way.
       ChosenTailFoldingStyle =
-          std::make_pair(TailFoldingStyle::DataWithoutLaneMask,
-                         TailFoldingStyle::DataWithoutLaneMask);
+          std::make_pair(TailFoldingStyle::Data, TailFoldingStyle::Data);
       LLVM_DEBUG(
           dbgs()
           << "LV: Preference for VP intrinsics indicated. Will "
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll b/llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll
index 13fc0eaafb808..666c03565dc60 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll
@@ -1,7 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=none < %s | FileCheck %s --check-prefix=NONE
 ; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data < %s | FileCheck %s --check-prefix=DATA
-; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-without-lane-mask < %s | FileCheck %s --check-prefix=DATA_NO_LANEMASK
 ; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-and-control < %s | FileCheck %s --check-prefix=DATA_AND_CONTROL
 ; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-and-control-without-rt-check < %s | FileCheck %s --check-prefix=DATA_AND_CONTROL_NO_RT_CHECK
 
@@ -99,61 +98,6 @@ define void @simple_memset_tailfold(i32 %val, ptr %ptr, i64 %n) "target-features
 ; DATA:       while.end.loopexit:
 ; DATA-NEXT:    ret void
 ;
-; DATA_NO_LANEMASK-LABEL: @simple_memset_tailfold(
-; DATA_NO_LANEMASK-NEXT:  entry:
-; DATA_NO_LANEMASK-NEXT:    [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
-; DATA_NO_LANEMASK-NEXT:    [[TMP0:%.*]] = sub i64 -1, [[UMAX]]
-; DATA_NO_LANEMASK-NEXT:    [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; DATA_NO_LANEMASK-NEXT:    [[TMP2:%.*]] = mul i64 [[TMP1]], 4
-; DATA_NO_LANEMASK-NEXT:    [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
-; DATA_NO_LANEMASK-NEXT:    br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; DATA_NO_LANEMASK:       vector.ph:
-; DATA_NO_LANEMASK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
-; DATA_NO_LANEMASK-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 4
-; DATA_NO_LANEMASK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
-; DATA_NO_LANEMASK-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 4
-; DATA_NO_LANEMASK-NEXT:    [[TMP8:%.*]] = sub i64 [[TMP7]], 1
-; DATA_NO_LANEMASK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
-; DATA_NO_LANEMASK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
-; DATA_NO_LANEMASK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
-; DATA_NO_LANEMASK-NEXT:    [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
-; DATA_NO_LANEMASK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
-; DATA_NO_LANEMASK-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 4
-; DATA_NO_LANEMASK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
-; DATA_NO_LANEMASK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; DATA_NO_LANEMASK-NEXT:    [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i64 0
-; DATA_NO_LANEMASK-NEXT:    [[BROADCAST_SPLAT5:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT4]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
-; DATA_NO_LANEMASK-NEXT:    br label [[VECTOR_BODY:%.*]]
-; DATA_NO_LANEMASK:       vector.body:
-; DATA_NO_LANEMASK-NEXT:    [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT6:%.*]], [[VECTOR_BODY]] ]
-; DATA_NO_LANEMASK-NEXT:    [[TMP9:%.*]] = add i64 [[INDEX1]], 0
-; DATA_NO_LANEMASK-NEXT:    [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i64 0
-; DATA_NO_LANEMASK-NEXT:    [[BROADCAST_SPLAT3:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT2]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; DATA_NO_LANEMASK-NEXT:    [[TMP10:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
-; DATA_NO_LANEMASK-NEXT:    [[TMP11:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP10]]
-; DATA_NO_LANEMASK-NEXT:    [[VEC_IV:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT3]], [[TMP11]]
-; DATA_NO_LANEMASK-NEXT:    [[TMP12:%.*]] = icmp ule <vscale x 4 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]
-; DATA_NO_LANEMASK-NEXT:    [[TMP13:%.*]] = getelementptr i32, ptr [[PTR:%.*]], i64 [[TMP9]]
-; DATA_NO_LANEMASK-NEXT:    [[TMP14:%.*]] = getelementptr i32, ptr [[TMP13]], i32 0
-; DATA_NO_LANEMASK-NEXT:    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[BROADCAST_SPLAT5]], ptr [[TMP14]], i32 4, <vscale x 4 x i1> [[TMP12]])
-; DATA_NO_LANEMASK-NEXT:    [[INDEX_NEXT6]] = add i64 [[INDEX1]], [[TMP16]]
-; DATA_NO_LANEMASK-NEXT:    [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT6]], [[N_VEC]]
-; DATA_NO_LANEMASK-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
-; DATA_NO_LANEMASK:       middle.block:
-; DATA_NO_LANEMASK-NEXT:    br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
-; DATA_NO_LANEMASK:       scalar.ph:
-; DATA_NO_LANEMASK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
-; DATA_NO_LANEMASK-NEXT:    br label [[WHILE_BODY:%.*]]
-; DATA_NO_LANEMASK:       while.body:
-; DATA_NO_LANEMASK-NEXT:    [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[WHILE_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
-; DATA_NO_LANEMASK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[PTR]], i64 [[INDEX]]
-; DATA_NO_LANEMASK-NEXT:    store i32 [[VAL]], ptr [[GEP]], align 4
-; DATA_NO_LANEMASK-NEXT:    [[INDEX_NEXT]] = add nsw i64 [[INDEX]], 1
-; DATA_NO_LANEMASK-NEXT:    [[CMP10:%.*]] = icmp ult i64 [[INDEX_NEXT]], [[N]]
-; DATA_NO_LANEMASK-NEXT:    br i1 [[CMP10]], label [[WHILE_BODY]], label [[WHILE_END_LOOPEXIT]], !llvm.loop [[LOOP3:![0-9]+]]
-; DATA_NO_LANEMASK:       while.end.loopexit:
-; DATA_NO_LANEMASK-NEXT:    ret void
-;
 ; DATA_AND_CONTROL-LABEL: @simple_memset_tailfold(
 ; DATA_AND_CONTROL-NEXT:  entry:
 ; DATA_AND_CONTROL-NEXT:    [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
diff --git a/llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll b/llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll
index bd52c2a8f0645..2b50b44a3d289 100644
--- a/llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll
+++ b/llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
 ; REQUIRES: asserts
 
 ; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize \
@@ -7,50 +8,183 @@
 ; RUN: -mcpu=pwr10 -disable-output < %s 2>&1 | FileCheck %s
 
 define void @foo(ptr noalias %a, ptr noalias %b, ptr noalias %c, i64 %N) {
-; CHECK-LABEL: VPlan 'Initial VPlan for VF={2,4},UF>=1' {
-; CHECK-NEXT: Live-in vp<%0> = VF * UF
-; CHECK-NEXT: Live-in vp<%1> = vector-trip-count
-; CHECK-NEXT: Live-in vp<%2> = backedge-taken count
-; CHECK-NEXT: Live-in ir<%N> = original trip-count
-; CHECK-EMPTY:
-; CHECK-NEXT: vector.ph:
-; CHECK-NEXT: Successor(s): vector loop
-; CHECK-EMPTY:
-; CHECK-NEXT: <x1> vector loop: {
-; CHECK-NEXT:   vector.body:
-; CHECK-NEXT:     EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%8>
-; CHECK-NEXT:     WIDEN-INDUCTION %iv = phi 0, %iv.next, ir<1>
-; CHECK-NEXT:     EMIT vp<%4> = icmp ule ir<%iv>, vp<%2>
-; CHECK-NEXT:   Successor(s): pred.store
-; CHECK-EMPTY:
-; CHECK-NEXT:  <xVFxUF> pred.store: {
-; CHECK-NEXT:    pred.store.entry:
-; CHECK-NEXT:      BRANCH-ON-MASK vp<%4>
-; CHECK-NEXT:    Successor(s): pred.store.if, pred.store.continue
-; CHECK-EMPTY:
-; CHECK-NEXT:    pred.store.if:
-; CHECK-NEXT:      vp<%5> = SCALAR-STEPS vp<%3>, ir<1>
-; CHECK-NEXT:      REPLICATE ir<%arrayidx> = getelementptr inbounds ir<%b>, vp<%5>
-; CHECK-NEXT:      REPLICATE ir<%0> = load ir<%arrayidx>
-; CHECK-NEXT:      REPLICATE ir<%arrayidx2> = getelementptr inbounds ir<%c>, vp<%5>
-; CHECK-NEXT:      REPLICATE ir<%1> = load ir<%arrayidx2>
-; CHECK-NEXT:      REPLICATE ir<%arrayidx4> = getelementptr inbounds ir<%a>, vp<%5>
-; CHECK-NEXT:      REPLICATE ir<%add> = add nsw ir<%1>, ir<%0>
-; CHECK-NEXT:      REPLICATE store ir<%add>, ir<%arrayidx4>
-; CHECK-NEXT:    Successor(s): pred.store.continue
-; CHECK-EMPTY:
-; CHECK-NEXT:    pred.store.continue:
-; CHECK-NEXT:      PHI-PREDICATED-INSTRUCTION vp<%6> = ir<%0>
-; CHECK-NEXT:      PHI-PREDICATED-INSTRUCTION vp<%7> = ir<%1>
+; CHECK-LABEL: 'foo'
+; CHECK-NEXT:  LV: Loop hints: force=? width=0 interleave=0
+; CHECK-NEXT:  LV: Found a loop: for.body
+; CHECK-NEXT:  LV: Found an induction variable.
+; CHECK-NEXT:  LV: We can vectorize this loop!
+; CHECK-NEXT:  LV: Found trip count: 0
+; CHECK-NEXT:  LV: vector predicate hint/switch found.
+; CHECK-NEXT:  LV: Not allowing scalar epilogue, creating predicated vector loop.
+; CHECK-NEXT:  LV: The max safe fixed VF is: 67108864.
+; CHECK-NEXT:  LV: The max safe scalable VF is: vscale x 0.
+; CHECK-NEXT:  LV: The Smallest and Widest types: 32 / 32 bits.
+; CHECK-NEXT:  LV: The Widest register safe to use is: 128 bits.
+; CHECK-NEXT:  LV: The Widest register safe to use is: vscale x 0 bits.
+; CHECK-NEXT:  LV: The target has no scalable vector registers.
+; CHECK-NEXT:  LV: checking if tail can be folded by masking.
+; CHECK-NEXT:  LV: can fold tail by masking.
+; CHECK-NEXT:  LV: Preference for VP intrinsics indicated. Will not try to generate VP Intrinsics due to non-interleaving reasons.
+; CHECK-NEXT:  LV: Invalidate all interleaved groups due to fold-tail by masking which requires masked-interleaved support.
+; CHECK-NEXT:  LV: Found uniform instruction: %exitcond.not = icmp eq i64 %iv.next, %N
+; CHECK-NEXT:  LV: Found scalar instruction: %arrayidx = getelementptr inbounds i32, ptr %b, i64 %iv
+; CHECK-NEXT:  LV: Found scalar instruction: %arrayidx2 = getelementptr inbounds i32, ptr %c, i64 %iv
+; CHECK-NEXT:  LV: Found scalar instruction: %arrayidx4 = getelementptr inbounds i32, ptr %a, i64 %iv
+; CHECK-NEXT:  LV: Found uniform instruction: %exitcond.not = icmp eq i64 %iv.next, %N
+; CHECK-NEXT:  LV: Found scalar instruction: %arrayidx = getelementptr inbounds i32, ptr %b, i64 %iv
+; CHECK-NEXT:  LV: Found scalar instruction: %arrayidx2 = getelementptr inbounds i32, ptr %c, i64 %iv
+; CHECK-NEXT:  LV: Found scalar instruction: %arrayidx4 = getelementptr inbounds i32, ptr %a, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing: %arrayidx = getelementptr inbounds i32, ptr %b, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing and predicating: %0 = load i32, ptr %arrayidx, align 4
+; CHECK-NEXT:  LV: Scalarizing: %arrayidx2 = getelementptr inbounds i32, ptr %c, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing and predicating: %1 = load i32, ptr %arrayidx2, align 4
+; CHECK-NEXT:  LV: Scalarizing: %add = add nsw i32 %1, %0
+; CHECK-NEXT:  LV: Scalarizing: %arrayidx4 = getelementptr inbounds i32, ptr %a, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing and predicating: store i32 %add, ptr %arrayidx4, align 4
+; CHECK-NEXT:  LV: Scalarizing: %iv.next = add nuw nsw i64 %iv, 1
+; CHECK-NEXT:  LV: Scalarizing: %exitcond.not = icmp eq i64 %iv.next, %N
+; CHECK-NEXT:  LV: Scalarizing: %arrayidx = getelementptr inbounds i32, ptr %b, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing and predicating: %0 = load i32, ptr %arrayidx, align 4
+; CHECK-NEXT:  LV: Scalarizing: %arrayidx2 = getelementptr inbounds i32, ptr %c, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing and predicating: %1 = load i32, ptr %arrayidx2, align 4
+; CHECK-NEXT:  LV: Scalarizing: %add = add nsw i32 %1, %0
+; CHECK-NEXT:  LV: Scalarizing: %arrayidx4 = getelementptr inbounds i32, ptr %a, i64 %iv
+; CHECK-NEXT:  LV: Scalarizing and predicating: store i32 %add, ptr %arrayidx4, align 4
+; CHECK-NEXT:  LV: Scalarizing: %exitcond.not = icmp eq i64 %iv.next, %N
+; CHECK-NEXT:  VPlan 'Initial VPlan for VF={1},UF>=1' {
+; CHECK-NEXT:  Live-in vp<%0> = VF * UF
+; CHECK-NEXT:  Live-in vp<%1> = vector-trip-count
+; CHECK-NEXT:  Live-in ir<%N> = original trip-count
+; CHECK-EMPTY:
+; CHECK-NEXT:  vector.ph:
+; CHECK-NEXT:  Successor(s): vector loop
+; CHECK-EMPTY:
+; CHECK-NEXT:  <x1> vector loop: {
+; CHECK-NEXT:    vector.body:
+; CHECK-NEXT:      EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%8>
+; CHECK-NEXT:      vp<%4> = SCALAR-STEPS vp<%3>, ir<1>
+; CHECK-NEXT:      EMIT vp<%5> = active lane mask vp<%4>, ir<%N>
+; CHECK-NEXT:    Successor(s): pred.store
+; CHECK-EMPTY:
+; CHECK-NEXT:    <xVFxUF> pred.store: {
+; CHECK-NEXT:      pred.store.entry:
+; CHECK-NEXT:        BRANCH-ON-MASK vp<%5>
+; CHECK-NEXT:      Successor(s): pred.store.if, pred.store.continue
+; CHECK-EMPTY:
+; CHECK-NEXT:      pred.store.if:
+; CHECK-NEXT:        CLONE ir<%arrayidx> = getelementptr inbounds ir<%b>, vp<%4>
+; CHECK-NEXT:        CLONE ir<%0> = load ir<%arrayidx>
+; CHECK-NEXT:        CLONE ir<%arrayidx2> = getelementptr inbounds ir<%c>, vp<%4>
+; CHECK-NEXT:        CLONE ir<%1> = load ir<%arrayidx2>
+; CHECK-NEXT:        CLONE ir<%arrayidx4> = getelementptr inbounds ir<%a>, vp<%4>
+; CHECK-NEXT:        CLONE ir<%add> = add nsw ir<%1>, ir<%0>
+; CHECK-NEXT:        CLONE store ir<%add>, ir<%arrayidx4>
+; CHECK-NEXT:      Successor(s): pred.store.continue
+; CHECK-EMPTY:
+; CHECK-NEXT:      pred.store.continue:
+; CHECK-NEXT:        PHI-PREDICATED-INSTRUCTION vp<%6> = ir<%0>
+; CHECK-NEXT:        PHI-PREDICATED-INSTRUCTION vp<%7> = ir<%1>
+; CHECK-NEXT:      No successors
+; CHECK-NEXT:    }
+; CHECK-NEXT:    Successor(s): for.body.2
+; CHECK-EMPTY:
+; CHECK-NEXT:    for.body.2:
+; CHECK-NEXT:      EMIT vp<%8> = add vp<%3>, vp<%0>
+; CHECK-NEXT:      EMIT branch-on-count vp<%8>, vp<%1>
+; CHECK-NEXT:    No successors
+; CHECK-NEXT:  }
+; CHECK-NEXT:  Successor(s): middle.block
+; CHECK-EMPTY:
+; CHECK-NEXT:  middle.block:
+; CHECK-NEXT:  No successors
+; CHECK-NEXT:  }
+; CHECK-NEXT:  VPlan 'Initial VPlan for VF={2,4},UF>=1' {
+; CHECK-NEXT:  Live-in vp<%0> = VF * UF
+; CHECK-NEXT:  Live-in vp<%1> = vector-trip-count
+; CHECK-NEXT:  Live-in ir<%N> = original trip-count
+; CHECK-EMPTY:
+; CHECK-NEXT:  vector.ph:
+; CHECK-NEXT:  Successor(s): vector loop
+; CHECK-EMPTY:
+; CHECK-NEXT:  <x1> vector loop: {
+; CHECK-NEXT:    vector.body:
+; CHECK-NEXT:      EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%8>
+; CHECK-NEXT:      vp<%4> = SCALAR-STEPS vp<%3>, ir<1>
+; CHECK-NEXT:      EMIT vp<%5> = active lane mask vp<%4>, ir<%N>
+; CHECK-NEXT:    Successor(s): pred.store
+; CHECK-EMPTY:
+; CHECK-NEXT:    <xVFxUF> pred.store: {
+; CHECK-NEXT:      pred.store.entry:
+; CHECK-NEXT:        BRANCH-ON-MASK vp<%5>
+; CHECK-NEXT:      Successor(s): pred.store.if, pred.store.continue
+; CHECK-EMPTY:
+; CHECK-NEXT:      pred.store.if:
+; CHECK-NEXT:        REPLICATE ir<%arrayidx> = getelementptr inbounds ir<%b>, vp<%4>
+; CHECK-NEXT:        REPLICATE ir<%0> = load ir<%arrayidx>
+; CHECK-NEXT:        REPLICATE ir<%arrayidx2> = getelementptr inbounds ir<%c>, vp<%4>
+; CHECK-NEXT:        REPLICATE ir<%1> = load ir<%arrayidx2>
+; CHECK-NEXT:        REPLICATE ir<%arrayidx4> = getelementptr inbounds ir<%a>, vp<%4>
+; CHECK-NEXT:        REPLICATE ir<%add> = add nsw ir<%1>, ir<%0>
+; CHECK-NEXT:        REPLICATE store ir<%add>, ir<%arrayidx4>
+; CHECK-NEXT:      Successor(s): pred.store.continue
+; CHECK-EMPTY:
+; CHECK-NEXT:      pred.store.continue:
+; CHECK-NEXT:        PHI-PREDICATED-INSTRUCTION vp<%6> = ir<%0>
+; CHECK-NEXT:        PHI-PREDICATED-INSTRUCTION vp<%7> = ir<%1>
+; CHECK-NEXT:      No successors
+; CHECK-NEXT:    }
+; CHECK-NEXT:    Successor(s): for.body.2
+; CHECK-EMPTY:
+; CHECK-NEXT:    for.body.2:
+; CHECK-NEXT:      EMIT vp<%8> = add vp<%3>, vp<%0>
+; CHECK-NEXT:      EMIT branch-on-count vp<%8>, vp<%1>
 ; CHECK-NEXT:    No successors
 ; CHECK-NEXT:  }
-; CHECK-NEXT:  Successor(s): for.body.2
+; CHECK-NEXT:  Successor(s): middle.block
+; CHECK-EMPTY:
+; CHECK-NEXT:  middle.block:
+; CHECK-NEXT:  No successors
+; CHECK-NEXT:  }
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds i32, ptr %b, i64 %iv
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 1 For instruction: %0 = load i32, ptr %arrayidx, align 4
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx2 = getelementptr inbounds i32, ptr %c, i64 %iv
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 1 For instruction: %1 = load i32, ptr %arrayidx2, align 4
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 1 For instruction: %add = add nsw i32 %1, %0
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx4 = getelementptr inbounds i32, ptr %a, i64 %iv
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %add, ptr %arrayidx4, align 4
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add nuw nsw i64 %iv, 1
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond.not = icmp eq i64 %iv.next, %N
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+; CHECK-NEXT:  LV: Scalar loop costs: 6.
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 2 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds i32, ptr %b, i64 %iv
+; CHECK-NEXT:  LV: Found an estimated cost of 3000000 for VF 2 For instruction: %0 = load i32, ptr %arrayidx, align 4
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx2 = getelementptr inbounds i32, ptr %c, i64 %iv
+; CHECK-NEXT:  LV: Found an estimated cost of 3000000 for VF 2 For instruction: %1 = load i32, ptr %arrayidx2, align 4
+; CHECK-NEXT:  LV: Found an estimated cost of 3 for VF 2 For instruction: %add = add nsw i32 %1, %0
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx4 = getelementptr inbounds i32, ptr %a, i64 %iv
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 2 For instruction: store i32 %add, ptr %arrayidx4, align 4
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 2 For instruction: %iv.next = add nuw nsw i64 %iv, 1
+; CHECK-NEXT:  LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond.not = icmp eq i64 %iv.next, %N
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+; CHECK-NEXT:  LV: Vector loop of width 2 costs: 3000003.
+; CHECK-NEXT:  LV: Found an estimated cost of 0 for ...
[truncated]

@artagnon artagnon changed the title from "[LV] strip TailFoldingStyle::DataWithoutLaneMask" to "LV/EVL: strip TailFoldingStyle::DataWithoutLaneMask" on May 25, 2024
@artagnon
Contributor Author

artagnon commented Jun 4, 2024

Gentle ping.

@artagnon artagnon force-pushed the lv-lanemask-legal branch from 1592bab to ead0894 on June 4, 2024 11:21
@sjoerdmeijer
Collaborator

This wasn't entirely clear to me from a quick look:

TailFoldingStyle::DataWithoutLaneMask in LoopVectorize, introduced by 413a66f

I.e., it doesn't seem to have been introduced there, but perhaps it is not that important where exactly this happened.

This is right:

SelectionDAG automatically detects if a target supports the @llvm.get.active.lane.mask intrinsic

And this has been in place almost from the start, when get.active.lane.mask was introduced, which is why I don't know exactly what this comment is about:

  /// FIXME: Can this kind be removed now that SelectionDAGBuilder expands the
  /// active.lane.mask intrinsic when it is not natively supported?
  DataWithoutLaneMask,

This tail folding style calculates the mask in a particular way:

  /// Same as Data, but avoids using the get.active.lane.mask intrinsic to
  /// calculate the mask and instead implements this with a
  /// splat/stepvector/cmp.

And that is not exactly the same as the lowering of get.active.lane.mask when it is not supported?
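For concreteness, a rough sketch of what the splat/stepvector/cmp form computes (hypothetical fixed VF of 4 and made-up names; the scalable-vector equivalent is what the DATA_NO_LANEMASK check lines removed by this patch show):

  ; Broadcast the scalar IV, add a step vector, and compare against the
  ; broadcast backedge-taken count.
  define <4 x i1> @data_without_lanemask_mask(i64 %index, i64 %btc) {
    %iv.ins    = insertelement <4 x i64> poison, i64 %index, i64 0
    %iv.splat  = shufflevector <4 x i64> %iv.ins, <4 x i64> poison, <4 x i32> zeroinitializer
    %vec.iv    = add <4 x i64> %iv.splat, <i64 0, i64 1, i64 2, i64 3>
    %btc.ins   = insertelement <4 x i64> poison, i64 %btc, i64 0
    %btc.splat = shufflevector <4 x i64> %btc.ins, <4 x i64> poison, <4 x i32> zeroinitializer
    %mask      = icmp ule <4 x i64> %vec.iv, %btc.splat
    ret <4 x i1> %mask
  }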

So what I am trying to say is that DataWithoutLaneMask seems like a valid use case, so why do we need to remove it?

@david-arm
Contributor

I wonder if the DataWithoutLaneMask style is even functionally correct in all cases, given that get.active.lane.mask correctly deals with any potential overflow of the induction variable mid-vector.
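To make the concern concrete, a contrived sketch (hypothetical values, an i8 IV chosen only so the wrap is visible within four lanes; not taken from the patch):

  declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i8(i8, i8)

  ; With %index = 254 and a backedge-taken count of 254 (trip count 255), the
  ; open-coded IV vector wraps to <254, 255, 0, 1>, so the ule compare yields
  ; <1, 0, 1, 1> and re-enables lanes past the end of the loop.
  define <4 x i1> @wrapping_mask() {
    %vec.iv = add <4 x i8> <i8 254, i8 254, i8 254, i8 254>, <i8 0, i8 1, i8 2, i8 3>
    %mask   = icmp ule <4 x i8> %vec.iv, <i8 254, i8 254, i8 254, i8 254>
    ret <4 x i1> %mask
  }

  ; get.active.lane.mask is specified so that the %base + lane-index addition
  ; does not wrap, so the equivalent query yields <1, 0, 0, 0>.
  define <4 x i1> @non_wrapping_mask() {
    %mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i8(i8 254, i8 255)
    ret <4 x i1> %mask
  }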

@artagnon artagnon force-pushed the lv-lanemask-legal branch from ead0894 to 13bdb26 on June 24, 2024 12:59
@artagnon
Contributor Author

I've rebased and updated the commit message. Could you kindly confirm that DataWithoutLaneMask is okay to strip?

@artagnon artagnon force-pushed the lv-lanemask-legal branch from 13bdb26 to 9207d63 on July 22, 2024 14:38
@artagnon
Contributor Author

Rebase and ping.

@artagnon
Contributor Author

Gentle ping.

@npanchen
Contributor

From my understanding, DataWithoutLaneMask was just a rename of PredicationStyle::None done by https://reviews.llvm.org/D142887. If I got it right, there were no tests for it before or after that patch, until @alexey-bataev's patch.
As for the style itself, it seems to be needed to guard against IV overflow in its computation and uses: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L2482

As for EVL vectorization, that check should not be needed, since the IV is incremented by the result of get.vector.length.
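For reference, a rough sketch of the EVL-style IV step (made-up names; VF of vscale x 4):

  declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)

  ; %avl is the number of elements left to process; the returned EVL is at most
  ; min(%avl, vscale * 4), so adding it to the IV cannot step past the trip
  ; count, and no separate IV-overflow runtime check should be needed.
  define i64 @next_iv(i64 %iv, i64 %avl) {
    %evl      = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
    %evl.zext = zext i32 %evl to i64
    %iv.next  = add i64 %iv, %evl.zext
    ret i64 %iv.next
  }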

@artagnon
Contributor Author

From my understanding, DataWithoutLaneMask was just a rename of PredicationStyle::None done by https://reviews.llvm.org/D142887. If I got it right, there were no tests for it before or after that patch, until @alexey-bataev's patch.

I see. Thanks for the context.

As for the style itself, it seems to be needed to guard against IV overflow in its computation and uses: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L2482

Not sure I follow: isn't the code you pointed to emitting a runtime check when indvar-overflow-check is needed, and when the style is not DataAndControlFlowWithoutRuntimeCheck? What does it have to do with DataWithoutLaneMask?

As for EVL vectorization, that check should not be needed, since the IV is incremented by the result of get.vector.length.

When the IV update may overflow, the code you pointed to creates a runtime check (except when the tail-folding style is DataAndControlFlowWithoutRuntimeCheck), so I think the commit message is incorrect; I will update it. When EVL isn't legal, shouldn't that runtime check be created whether we choose DataWithoutLaneMask or Data?

Okay, so there are really two changes in this PR:

  1. When EVL is not legal, choose Data as the tail folding style. This part should be fine, right? Illegality of EVL should not imply non-availability of active.lane.mask, yes?
  2. Stripping the DataWithoutLaneMask tail-folding style. Since there is no provable correctness issue, would you prefer that we keep this style?

@artagnon artagnon changed the title from "LV/EVL: strip TailFoldingStyle::DataWithoutLaneMask" to "LV/EVL: use TailFoldingStyle::Data when EVL illegal" on Sep 17, 2024
There is just one usage of TailFoldingStyle::DataWithoutLaneMask in
LoopVectorize, introduced by 413a66f ([LV, VP]VP intrinsics support for
the Loop Vectorizer + adding new tail-folding mode using EVL.), but this
usage is completely unnecessary, as @llvm.get.active.lane.mask is
unrelated to EVL. Moreover, SelectionDAG automatically detects if a
target supports the @llvm.get.active.lane.mask intrinsic, and lowers it
to equivalent instructions on targets where it is not preferred, since
243a532 ([SelectionDAG] Lower @llvm.get.active.lane.mask to setcc).
@artagnon
Contributor Author

Gentle ping.

@artagnon artagnon requested a review from Mel-Chen January 17, 2025 11:33
; IF-EVL-NEXT: [[TMP16:%.*]] = add i64 [[TMP15]], 0
; IF-EVL-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 1
; IF-EVL-NEXT: [[TMP18:%.*]] = add i64 [[INDEX]], [[TMP17]]
; IF-EVL-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP13]], i64 [[N]])
Member

The question is: can the lowering of llvm.get.active.lane.mask move invariant values to a preheader, as in this test?

Contributor Author

Interesting. Seems strange that invariant values weren't moved to the preheader. I thought they always were, so I'm not sure what's different about this particular situation.

@artagnon artagnon closed this Apr 3, 2025
@artagnon artagnon deleted the lv-lanemask-legal branch April 3, 2025 10:31