[LV] Optimize partial reduction extends before handling inloop subs#199665
Open
MacDue wants to merge 2 commits into
The crash avoided in #194660 was caused by the extend optimizations failing to match due to the extra sub/negation added to the "ExtendedOp".
A similar crash exists for [us]abs partial reductions (see https://godbolt.org/z/MerMon5rE); this patch fixes it as well.
This patch solves the underlying issue by running the extend optimizations before any inloop sub/fsub handling.
Fixes #194000
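To illustrate why the ordering matters, here is a minimal toy sketch. It is not LLVM code: `Node`, `optimizeExtends`, and the two `matches...` helpers are hypothetical stand-ins for the VPlan recipes and for `optimizeExtendsForPartialReduction`, and the sketch assumes the matcher only recognises the pattern when the extend is the root of the operand.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Toy stand-in for a recipe tree. These types are hypothetical and are NOT
// LLVM's VPlan classes.
enum class Kind { Leaf, Ext, Mul, Neg };

struct Node {
  Kind K;
  std::shared_ptr<Node> LHS, RHS;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr leaf() { return std::make_shared<Node>(Node{Kind::Leaf, nullptr, nullptr}); }
NodePtr ext(NodePtr Op) { return std::make_shared<Node>(Node{Kind::Ext, std::move(Op), nullptr}); }
NodePtr neg(NodePtr Op) { return std::make_shared<Node>(Node{Kind::Neg, std::move(Op), nullptr}); }
NodePtr mul(NodePtr A, NodePtr B) {
  return std::make_shared<Node>(Node{Kind::Mul, std::move(A), std::move(B)});
}

// Toy "extend optimization": matches ext(mul(ext(a), ext(b))) rooted exactly
// at N and returns the inner mul (modelling the inner extends being widened).
// Returns N unchanged and sets Matched to false when the root is anything else.
NodePtr optimizeExtends(const NodePtr &N, bool &Matched) {
  Matched = false;
  if (N->K != Kind::Ext)
    return N;
  const NodePtr &M = N->LHS;
  if (M->K != Kind::Mul || M->LHS->K != Kind::Ext || M->RHS->K != Kind::Ext)
    return N;
  Matched = true;
  return M;
}

// Old order: wrap the operand in a negation first, then try to optimize.
// The matcher now sees Neg at the root and gives up.
bool matchesWhenNegatedFirst(const NodePtr &ExtendedOp) {
  bool Matched = false;
  optimizeExtends(neg(ExtendedOp), Matched);
  return Matched;
}

// New order: optimize the extends first, then add the negation on top.
bool matchesWhenOptimizedFirst(const NodePtr &ExtendedOp) {
  bool Matched = false;
  NodePtr Optimized = optimizeExtends(ExtendedOp, Matched);
  NodePtr Final = neg(Optimized);
  return Matched && Final->K == Kind::Neg;
}
```

With an operand built as `ext(mul(ext(leaf()), ext(leaf())))`, the negate-first order fails to match while the optimize-first order succeeds, which mirrors the reordering described above.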
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Benjamin Maxwell (MacDue)

Changes

The crash avoided in #194660 was caused by the extend optimizations failing to match due to the extra sub/negation added to the "ExtendedOp".
A similar crash exists for [us]abs partial reductions (see https://godbolt.org/z/MerMon5rE); this patch fixes it as well.
This patch solves the underlying issue by running the extend optimizations before any inloop sub/fsub handling.
Fixes #194000

Patch is 22.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199665.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 4e406ad542e83..f3c65ab712cb1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -6121,6 +6121,9 @@ static void transformToPartialReduction(const VPPartialReductionChain &Chain,
auto *ExtendedOp = cast<VPSingleDefRecipe>(
WidenRecipe->getOperand(1 - Chain.AccumulatorOpIdx));
+ // FIXME: Do these transforms before invoking the cost-model.
+ ExtendedOp = optimizeExtendsForPartialReduction(ExtendedOp, TypeInfo);
+
// Sub-reductions can be implemented in two ways:
// (1) negate the operand in the vector loop (the default way).
// (2) subtract the reduced value from the init value in the middle block.
@@ -6156,9 +6159,6 @@ static void transformToPartialReduction(const VPPartialReductionChain &Chain,
ExtendedOp = NegRecipe;
}
- // FIXME: Do these transforms before invoking the cost-model.
- ExtendedOp = optimizeExtendsForPartialReduction(ExtendedOp, TypeInfo);
-
// Check if WidenRecipe is the final result of the reduction. If so look
// through selects for predicated reductions.
VPValue *Cond = nullptr;
@@ -6324,9 +6324,6 @@ matchExtendedReductionOperand(VPWidenRecipe *UpdateR, VPValue *Op,
// by widening the inner extends to match it. See
// optimizeExtendsForPartialReduction.
Op = CastSource;
- // FIXME: createPartialReductionExpression can't handle sub(ext(mul(...)))
- if (UpdateR->getOpcode() == Instruction::Sub)
- return std::nullopt;
} else {
return ExtendedReductionOperand{
UpdateR,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
index 5cfa3961fb180..435c8cd6f6f5e 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
@@ -1378,8 +1378,8 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
; CHECK-NEON-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK-NEON: vector.body:
; CHECK-NEON-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEON-NEXT: [[VEC_PHI:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEON-NEXT: [[VEC_PHI1:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEON-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEON-NEXT: [[VEC_PHI1:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE9:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEON-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 1
; CHECK-NEON-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP0]]
; CHECK-NEON-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP0]]
@@ -1393,26 +1393,26 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
; CHECK-NEON-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[TMP3]], i64 16
; CHECK-NEON-NEXT: [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
; CHECK-NEON-NEXT: [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP6]], align 1
-; CHECK-NEON-NEXT: [[TMP7:%.*]] = sext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
-; CHECK-NEON-NEXT: [[TMP8:%.*]] = sext <16 x i8> [[WIDE_LOAD2]] to <16 x i16>
-; CHECK-NEON-NEXT: [[TMP9:%.*]] = sext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>
-; CHECK-NEON-NEXT: [[TMP10:%.*]] = sext <16 x i8> [[WIDE_LOAD4]] to <16 x i16>
-; CHECK-NEON-NEXT: [[TMP11:%.*]] = sext <16 x i8> [[WIDE_LOAD5]] to <16 x i32>
-; CHECK-NEON-NEXT: [[TMP12:%.*]] = sext <16 x i8> [[WIDE_LOAD6]] to <16 x i32>
-; CHECK-NEON-NEXT: [[TMP13:%.*]] = mul nsw <16 x i16> [[TMP7]], [[TMP9]]
-; CHECK-NEON-NEXT: [[TMP14:%.*]] = mul nsw <16 x i16> [[TMP8]], [[TMP10]]
-; CHECK-NEON-NEXT: [[TMP15:%.*]] = sext <16 x i16> [[TMP13]] to <16 x i32>
-; CHECK-NEON-NEXT: [[TMP16:%.*]] = sext <16 x i16> [[TMP14]] to <16 x i32>
-; CHECK-NEON-NEXT: [[TMP17:%.*]] = sub <16 x i32> [[VEC_PHI]], [[TMP15]]
-; CHECK-NEON-NEXT: [[TMP18:%.*]] = sub <16 x i32> [[VEC_PHI1]], [[TMP16]]
-; CHECK-NEON-NEXT: [[TMP19]] = add <16 x i32> [[TMP17]], [[TMP11]]
-; CHECK-NEON-NEXT: [[TMP20]] = add <16 x i32> [[TMP18]], [[TMP12]]
+; CHECK-NEON-NEXT: [[TMP7:%.*]] = sext <16 x i8> [[WIDE_LOAD]] to <16 x i32>
+; CHECK-NEON-NEXT: [[TMP8:%.*]] = sext <16 x i8> [[WIDE_LOAD3]] to <16 x i32>
+; CHECK-NEON-NEXT: [[TMP15:%.*]] = mul nsw <16 x i32> [[TMP7]], [[TMP8]]
+; CHECK-NEON-NEXT: [[TMP19:%.*]] = sub <16 x i32> zeroinitializer, [[TMP15]]
+; CHECK-NEON-NEXT: [[PARTIAL_REDUCE:%.*]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[VEC_PHI]], <16 x i32> [[TMP19]])
+; CHECK-NEON-NEXT: [[TMP11:%.*]] = sext <16 x i8> [[WIDE_LOAD2]] to <16 x i32>
+; CHECK-NEON-NEXT: [[TMP12:%.*]] = sext <16 x i8> [[WIDE_LOAD4]] to <16 x i32>
+; CHECK-NEON-NEXT: [[TMP16:%.*]] = mul nsw <16 x i32> [[TMP11]], [[TMP12]]
+; CHECK-NEON-NEXT: [[TMP22:%.*]] = sub <16 x i32> zeroinitializer, [[TMP16]]
+; CHECK-NEON-NEXT: [[PARTIAL_REDUCE7:%.*]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[VEC_PHI1]], <16 x i32> [[TMP22]])
+; CHECK-NEON-NEXT: [[TMP17:%.*]] = sext <16 x i8> [[WIDE_LOAD5]] to <16 x i32>
+; CHECK-NEON-NEXT: [[PARTIAL_REDUCE8]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[PARTIAL_REDUCE]], <16 x i32> [[TMP17]])
+; CHECK-NEON-NEXT: [[TMP18:%.*]] = sext <16 x i8> [[WIDE_LOAD6]] to <16 x i32>
+; CHECK-NEON-NEXT: [[PARTIAL_REDUCE9]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[PARTIAL_REDUCE7]], <16 x i32> [[TMP18]])
; CHECK-NEON-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
; CHECK-NEON-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEON-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
; CHECK-NEON: middle.block:
-; CHECK-NEON-NEXT: [[BIN_RDX:%.*]] = add <16 x i32> [[TMP20]], [[TMP19]]
-; CHECK-NEON-NEXT: [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[BIN_RDX]])
+; CHECK-NEON-NEXT: [[BIN_RDX:%.*]] = add <4 x i32> [[PARTIAL_REDUCE9]], [[PARTIAL_REDUCE8]]
+; CHECK-NEON-NEXT: [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX]])
; CHECK-NEON-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
; CHECK-NEON-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK-NEON: scalar.ph:
@@ -1420,49 +1420,53 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
; CHECK-SVE-LABEL: define i32 @chained_partial_reduce_add_sub_ext_mul(
; CHECK-SVE-SAME: ptr [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
; CHECK-SVE-NEXT: entry:
-; CHECK-SVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 32
+; CHECK-SVE-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-SVE-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[TMP5]], 5
+; CHECK-SVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP6]]
; CHECK-SVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK-SVE: vector.ph:
-; CHECK-SVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], 32
+; CHECK-SVE-NEXT: [[TMP8:%.*]] = shl nuw i64 [[TMP5]], 4
+; CHECK-SVE-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP8]], 1
+; CHECK-SVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP4]]
; CHECK-SVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
; CHECK-SVE-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK-SVE: vector.body:
; CHECK-SVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-NEXT: [[VEC_PHI:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-NEXT: [[VEC_PHI1:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE9:%.*]], [[VECTOR_BODY]] ]
; CHECK-SVE-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 1
; CHECK-SVE-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP0]]
; CHECK-SVE-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP0]]
; CHECK-SVE-NEXT: [[TMP3:%.*]] = getelementptr i8, ptr [[C]], i64 [[TMP0]]
-; CHECK-SVE-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[TMP1]], i64 16
-; CHECK-SVE-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP1]], align 1
-; CHECK-SVE-NEXT: [[WIDE_LOAD2:%.*]] = load <16 x i8>, ptr [[TMP4]], align 1
-; CHECK-SVE-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[TMP2]], i64 16
-; CHECK-SVE-NEXT: [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP2]], align 1
-; CHECK-SVE-NEXT: [[WIDE_LOAD4:%.*]] = load <16 x i8>, ptr [[TMP5]], align 1
-; CHECK-SVE-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[TMP3]], i64 16
-; CHECK-SVE-NEXT: [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
-; CHECK-SVE-NEXT: [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP6]], align 1
-; CHECK-SVE-NEXT: [[TMP7:%.*]] = sext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
-; CHECK-SVE-NEXT: [[TMP8:%.*]] = sext <16 x i8> [[WIDE_LOAD2]] to <16 x i16>
-; CHECK-SVE-NEXT: [[TMP9:%.*]] = sext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>
-; CHECK-SVE-NEXT: [[TMP10:%.*]] = sext <16 x i8> [[WIDE_LOAD4]] to <16 x i16>
-; CHECK-SVE-NEXT: [[TMP11:%.*]] = sext <16 x i8> [[WIDE_LOAD5]] to <16 x i32>
-; CHECK-SVE-NEXT: [[TMP12:%.*]] = sext <16 x i8> [[WIDE_LOAD6]] to <16 x i32>
-; CHECK-SVE-NEXT: [[TMP13:%.*]] = mul nsw <16 x i16> [[TMP7]], [[TMP9]]
-; CHECK-SVE-NEXT: [[TMP14:%.*]] = mul nsw <16 x i16> [[TMP8]], [[TMP10]]
-; CHECK-SVE-NEXT: [[TMP15:%.*]] = sext <16 x i16> [[TMP13]] to <16 x i32>
-; CHECK-SVE-NEXT: [[TMP16:%.*]] = sext <16 x i16> [[TMP14]] to <16 x i32>
-; CHECK-SVE-NEXT: [[TMP17:%.*]] = sub <16 x i32> [[VEC_PHI]], [[TMP15]]
-; CHECK-SVE-NEXT: [[TMP18:%.*]] = sub <16 x i32> [[VEC_PHI1]], [[TMP16]]
-; CHECK-SVE-NEXT: [[TMP19]] = add <16 x i32> [[TMP17]], [[TMP11]]
-; CHECK-SVE-NEXT: [[TMP20]] = add <16 x i32> [[TMP18]], [[TMP12]]
-; CHECK-SVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
+; CHECK-SVE-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[TMP1]], i64 [[TMP8]]
+; CHECK-SVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 16 x i8>, ptr [[TMP1]], align 1
+; CHECK-SVE-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 16 x i8>, ptr [[TMP9]], align 1
+; CHECK-SVE-NEXT: [[TMP10:%.*]] = getelementptr i8, ptr [[TMP2]], i64 [[TMP8]]
+; CHECK-SVE-NEXT: [[WIDE_LOAD3:%.*]] = load <vscale x 16 x i8>, ptr [[TMP2]], align 1
+; CHECK-SVE-NEXT: [[WIDE_LOAD4:%.*]] = load <vscale x 16 x i8>, ptr [[TMP10]], align 1
+; CHECK-SVE-NEXT: [[TMP11:%.*]] = getelementptr i8, ptr [[TMP3]], i64 [[TMP8]]
+; CHECK-SVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 16 x i8>, ptr [[TMP3]], align 1
+; CHECK-SVE-NEXT: [[WIDE_LOAD6:%.*]] = load <vscale x 16 x i8>, ptr [[TMP11]], align 1
+; CHECK-SVE-NEXT: [[TMP12:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT: [[TMP13:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD3]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT: [[TMP14:%.*]] = mul nsw <vscale x 16 x i32> [[TMP12]], [[TMP13]]
+; CHECK-SVE-NEXT: [[TMP15:%.*]] = sub <vscale x 16 x i32> zeroinitializer, [[TMP14]]
+; CHECK-SVE-NEXT: [[PARTIAL_REDUCE:%.*]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[VEC_PHI]], <vscale x 16 x i32> [[TMP15]])
+; CHECK-SVE-NEXT: [[TMP16:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD2]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT: [[TMP17:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD4]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT: [[TMP18:%.*]] = mul nsw <vscale x 16 x i32> [[TMP16]], [[TMP17]]
+; CHECK-SVE-NEXT: [[TMP19:%.*]] = sub <vscale x 16 x i32> zeroinitializer, [[TMP18]]
+; CHECK-SVE-NEXT: [[PARTIAL_REDUCE7:%.*]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[VEC_PHI1]], <vscale x 16 x i32> [[TMP19]])
+; CHECK-SVE-NEXT: [[TMP20:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD5]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT: [[PARTIAL_REDUCE8]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[PARTIAL_REDUCE]], <vscale x 16 x i32> [[TMP20]])
+; CHECK-SVE-NEXT: [[TMP22:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD6]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT: [[PARTIAL_REDUCE9]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[PARTIAL_REDUCE7]], <vscale x 16 x i32> [[TMP22]])
+; CHECK-SVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
; CHECK-SVE-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-SVE-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
; CHECK-SVE: middle.block:
-; CHECK-SVE-NEXT: [[BIN_RDX:%.*]] = add <16 x i32> [[TMP20]], [[TMP19]]
-; CHECK-SVE-NEXT: [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[BIN_RDX]])
+; CHECK-SVE-NEXT: [[BIN_RDX:%.*]] = add <vscale x 4 x i32> [[PARTIAL_REDUCE9]], [[PARTIAL_REDUCE8]]
+; CHECK-SVE-NEXT: [[TMP23:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[BIN_RDX]])
; CHECK-SVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
; CHECK-SVE-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK-SVE: scalar.ph:
@@ -1478,8 +1482,8 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
; CHECK-SVE-MAXBW-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK-SVE-MAXBW: vector.body:
; CHECK-SVE-MAXBW-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-MAXBW-NEXT: [[VEC_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-MAXBW-NEXT: [[VEC_PHI1:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-MAXBW-NEXT: [[VEC_PHI:%.*]] = phi <2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-MAXBW-NEXT: [[VEC_PHI1:%.*]] = phi <2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE9:%.*]], [[VECTOR_BODY]] ]
; CHECK-SVE-MAXBW-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 1
; CHECK-SVE-MAXBW-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP0]]
; CHECK-SVE-MAXBW-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP0]]
@@ -1493,26 +1497,26 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
; CHECK-SVE-MAXBW-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[TMP3]], i64 8
; CHECK-SVE-MAXBW-NEXT: [[WIDE_LOAD5:%.*]] = load <8 x i8>, ptr [[TMP3]], align 1
; CHECK-SVE-MAXBW-NEXT: [[WIDE_LOAD6:%.*]] = load <8 x i8>, ptr [[TMP6]], align 1
-; CHECK-SVE-MAXBW-NEXT: [[TMP7:%.*]] = sext <8 x i8> [[WIDE_LOAD]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT: [[TMP8:%.*]] = sext <8 x i8> [[WIDE_LOAD2]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT: [[TMP9:%.*]] = sext <8 x i8> [[WIDE_LOAD3]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT: [[TMP10:%.*]] = sext <8 x i8> [[WIDE_LOAD4]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT: [[TMP11:%.*]] = sext <8 x i8> [[WIDE_LOAD5]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT: [[TMP12:%.*]] = sext <8 x i8> [[WIDE_LOAD6]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT: [[TMP13:%.*]] = mul nsw <8 x i16> [[TMP7]], [[TMP9]]
-; CHECK-SVE-MAXBW-NEXT: [[TMP14:%.*]] = mul nsw <8 x i16> [[TMP8]], [[TMP10]]
-; CHECK-SVE-MAXBW-NEXT: [[TMP15:%.*]] = sext <8 x i16> [[TMP13]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT: [[TMP16:%.*]] = sext <8 x i16> [[TMP14]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT: [[TMP17:%.*]] = sub <8 x i32> [[VEC_PHI]], [[TMP15]]
-; CHECK-SVE-MAXBW-NEXT: [[TMP18:%.*]] = sub <8 x i32> [[VEC_PHI1]], [[TMP16]]
-; CHECK-SVE-MAXBW-NEXT: [[TMP19]] = add <8 x i32> [[TMP17]], [[TMP11]]
-; CHECK-SVE-MAXBW-NEXT: [[TMP20]] = add <8 x i32> [[TMP18]], [[TMP12]]
+; CHECK-SVE-MAXBW-NEXT: [[TMP7:%.*]] = sext <8 x i8> [[WIDE_LOAD]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT: [[TMP8:%.*]] = sext <8 x i8> [[WIDE_LOAD3]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT: [[TMP15:%.*]] = mul nsw <8 x i32> [[TMP7]], [[TMP8]]
+; CHECK-SVE-MAXBW-NEXT: [[TMP19:%.*]] = sub <8 x i32> zeroinitializer, [[TMP15]]
+; CHECK-SVE-MAXBW-NEXT: [[PARTIAL_REDUCE:%.*]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[VEC_PHI]], <8 x i32> [[TMP19]])
+; CHECK-SVE-MAXBW-NEXT: [[TMP11:%.*]] = sext <8 x i8> [[WIDE_LOAD2]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT: [[TMP12:%.*]] = sext <8 x i8> [[WIDE_LOAD4]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT: [[TMP16:%.*]] = mul nsw <8 x i32> [[TMP11]], [[TMP12]]
+; CHECK-SVE-MAXBW-NEXT: [[TMP22:%.*]] = sub <8 x i32> zeroinitializer, [[TMP16]]
+; CHECK-SVE-MAXBW-NEXT: [[PARTIAL_REDUCE7:%.*]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[VEC_PHI1]], <8 x i32> [[TMP22]])
+; CHECK-SVE-MAXBW-NEXT: [[TMP17:%.*]] = sext <8 x i8> [[WIDE_LOAD5]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT: [[PARTIAL_REDUCE8]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[PARTIAL_REDUCE]], <8 x i32> [[TMP17]])
+; CHECK-SVE-MAXBW-NEXT: [[TMP18:%.*]] = sext <8 x i8> [[WIDE_LOAD6]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT: [[PARTIAL_REDUCE9]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[PARTIAL_REDUCE7]], <8 x i32> [[TMP18]])
; CHECK-SVE-MAXBW-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; CHECK-SVE-MAXBW-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-SVE-MAXBW-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
; CHECK-SVE-MAXBW: middle.block:
-; CHECK-SVE-MAXBW-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[TMP20]], [[TMP19]]
-; CHECK-SVE-MAXBW-NEXT: [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[BIN_RDX]])
+; CHECK-SVE-MAXBW-NEXT: [[BIN_RDX:%.*]] = add <2 x i32> [[PARTIAL_REDUCE9]], [[PARTIAL_REDUCE8]]
+; CHECK-SVE-MAXBW-NEXT: [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> [[BIN_RDX]])
; CHECK-SVE-MAXBW-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
; CHECK-SVE-MAXBW-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK-SVE-MAXBW: scalar.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll
index a4bfa47fa9850..0fcac876e8788 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll
@@ -380,3 +380,63 @@ for.body:
exit:
ret i32 %sum.1
}
+
+; Tests an abs diff extended operand in a AddChainWithSubs reduction, where the
+; abs diff is negated (the operand of the sub). Previously, we'd crash handling
+; this as we would fail to match the negated abs diff.
+define i32 @sub_add_chain_unsigned_absolute_difference(ptr noalias %x, ptr noalias %y) {
+; CHECK-LABEL: define i32 @sub_add_chain_unsigned_absolute_difference(
+; CHECK-SAME: ptr noalias [[X:%.*]], ptr noalias [[Y:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[PARTIAL_REDUCE2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds nuw i8, ptr [[X]], i64 [[INDEX]]
+; CHECK-NEXT: ...
[truncated]
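The comment in `transformToPartialReduction` names two ways to implement a sub-reduction: negate the operand inside the loop, or subtract the reduced value from the init value in the middle block. The scalar sketch below is only an illustration of why the two agree. The function names are hypothetical, the inputs are assumed to be `i8` values multiplied after extension to `i32` (as in the updated CHECK lines), and no overflow is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// (1) Negate the operand in the loop: the accumulator starts at Init and
// each iteration adds the negated extended product.
int32_t reduceNegateInLoop(int32_t Init, const std::vector<int8_t> &A,
                           const std::vector<int8_t> &B) {
  int32_t Acc = Init;
  for (std::size_t I = 0; I < A.size(); ++I)
    Acc += -(static_cast<int32_t>(A[I]) * static_cast<int32_t>(B[I]));
  return Acc;
}

// (2) Accumulate the positive products from zero, then subtract the reduced
// value from Init once after the loop (the "middle block" in this model).
int32_t reduceSubInMiddle(int32_t Init, const std::vector<int8_t> &A,
                          const std::vector<int8_t> &B) {
  int32_t Sum = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    Sum += static_cast<int32_t>(A[I]) * static_cast<int32_t>(B[I]);
  return Init - Sum;
}
```

Both functions return the same value for any inputs that do not overflow `int32_t`; the choice between them in the vectorizer is about where the extra work is placed, not about the result.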