[LV] Optimize partial reduction extends before handling inloop subs #199665

Open
MacDue wants to merge 2 commits into users/MacDue/no_mul_sub_pr from users/MacDue/fix_inloop_sub_pr

@MacDue MacDue commented May 26, 2026

The crash avoided in #194660 was caused by the extend optimizations failing to match due to the extra sub/negation added to the "ExtendedOp".

A similar crash exists for [us]abs partial reductions (see https://godbolt.org/z/MerMon5rE), which is fixed with this patch.

This patch solves the underlying issue by running the extend optimizations before any inloop sub/fsub handling.

Fixes #194000
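
To illustrate why the ordering matters, here is a minimal toy model in C++. It is not LLVM's VPlan API: `Node`, `make`, `optimizeExtends`, `negate`, and `buildExtendedOp` are hypothetical stand-ins, and the matcher recognising only `ext(mul(ext, ext))` at the root is an assumption about the kind of pattern involved. Wrapping the operand in a negation first hides the pattern from the matcher; optimizing first does not.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node. A hypothetical stand-in, not a VPlan recipe.
struct Node {
  std::string Op; // "load", "ext", "mul", or "neg"
  std::vector<std::shared_ptr<Node>> Ops;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr make(std::string Op, std::vector<NodePtr> Ops = {}) {
  return std::make_shared<Node>(Node{std::move(Op), std::move(Ops)});
}

// Hypothetical extend optimization: matches ext(mul(ext(a), ext(b))) at the
// root only and folds the outer extend away. Sets Matched accordingly.
NodePtr optimizeExtends(const NodePtr &N, bool &Matched) {
  assert(N && "expected a non-null node");
  Matched = false;
  if (N->Op != "ext" || N->Ops.size() != 1)
    return N;
  const NodePtr &Mul = N->Ops[0];
  if (Mul->Op != "mul" || Mul->Ops.size() != 2)
    return N;
  if (Mul->Ops[0]->Op != "ext" || Mul->Ops[1]->Op != "ext")
    return N;
  Matched = true;
  return make("mul", {Mul->Ops[0], Mul->Ops[1]});
}

NodePtr negate(const NodePtr &N) { return make("neg", {N}); }

// Builds ext(mul(ext(load), ext(load))), the shape the toy matcher expects.
NodePtr buildExtendedOp() {
  NodePtr A = make("ext", {make("load")});
  NodePtr B = make("ext", {make("load")});
  return make("ext", {make("mul", {A, B})});
}

// Old order: negate first, then optimize. The root is now "neg".
bool matchesWhenNegatedFirst() {
  bool Matched = false;
  optimizeExtends(negate(buildExtendedOp()), Matched);
  return Matched;
}

// New order: optimize first, then negate the optimized operand.
bool matchesWhenOptimizedFirst() {
  bool Matched = false;
  NodePtr Result = negate(optimizeExtends(buildExtendedOp(), Matched));
  return Matched && Result->Op == "neg" && Result->Ops[0]->Op == "mul";
}
```

In this sketch `matchesWhenNegatedFirst()` returns false and `matchesWhenOptimizedFirst()` returns true, which mirrors the reordering in this patch. The real transform operates on VPlan recipes and handles more patterns than this toy does.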


@MacDue MacDue marked this pull request as ready for review May 26, 2026 11:51

@sdesmalen-arm sdesmalen-arm left a comment

LGTM

Comment thread llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll

llvmorg-github-actions Bot commented May 27, 2026

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Benjamin Maxwell (MacDue)

Changes

The crash avoided in #194660 was caused by the extend optimizations failing to match due to the extra sub/negation added to the "ExtendedOp".

A similar crash exists for [us]abs partial reductions (see https://godbolt.org/z/MerMon5rE), which is fixed with this patch.

This patch solves the underlying issue by running the extend optimizations before any inloop sub/fsub handling.

Fixes #194000


Patch is 22.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199665.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+3-6)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll (+70-66)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll (+60)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 4e406ad542e83..f3c65ab712cb1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -6121,6 +6121,9 @@ static void transformToPartialReduction(const VPPartialReductionChain &Chain,
   auto *ExtendedOp = cast<VPSingleDefRecipe>(
       WidenRecipe->getOperand(1 - Chain.AccumulatorOpIdx));
 
+  // FIXME: Do these transforms before invoking the cost-model.
+  ExtendedOp = optimizeExtendsForPartialReduction(ExtendedOp, TypeInfo);
+
   // Sub-reductions can be implemented in two ways:
   // (1) negate the operand in the vector loop (the default way).
   // (2) subtract the reduced value from the init value in the middle block.
@@ -6156,9 +6159,6 @@ static void transformToPartialReduction(const VPPartialReductionChain &Chain,
     ExtendedOp = NegRecipe;
   }
 
-  // FIXME: Do these transforms before invoking the cost-model.
-  ExtendedOp = optimizeExtendsForPartialReduction(ExtendedOp, TypeInfo);
-
   // Check if WidenRecipe is the final result of the reduction. If so look
   // through selects for predicated reductions.
   VPValue *Cond = nullptr;
@@ -6324,9 +6324,6 @@ matchExtendedReductionOperand(VPWidenRecipe *UpdateR, VPValue *Op,
       // by widening the inner extends to match it. See
       // optimizeExtendsForPartialReduction.
       Op = CastSource;
-      // FIXME: createPartialReductionExpression can't handle sub(ext(mul(...)))
-      if (UpdateR->getOpcode() == Instruction::Sub)
-        return std::nullopt;
     } else {
       return ExtendedReductionOperand{
           UpdateR,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
index 5cfa3961fb180..435c8cd6f6f5e 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-chained.ll
@@ -1378,8 +1378,8 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
 ; CHECK-NEON-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-NEON:       vector.body:
 ; CHECK-NEON-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEON-NEXT:    [[VEC_PHI:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEON-NEXT:    [[VEC_PHI1:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEON-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEON-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE9:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEON-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 1
 ; CHECK-NEON-NEXT:    [[TMP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP0]]
 ; CHECK-NEON-NEXT:    [[TMP2:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP0]]
@@ -1393,26 +1393,26 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
 ; CHECK-NEON-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[TMP3]], i64 16
 ; CHECK-NEON-NEXT:    [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
 ; CHECK-NEON-NEXT:    [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP6]], align 1
-; CHECK-NEON-NEXT:    [[TMP7:%.*]] = sext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
-; CHECK-NEON-NEXT:    [[TMP8:%.*]] = sext <16 x i8> [[WIDE_LOAD2]] to <16 x i16>
-; CHECK-NEON-NEXT:    [[TMP9:%.*]] = sext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>
-; CHECK-NEON-NEXT:    [[TMP10:%.*]] = sext <16 x i8> [[WIDE_LOAD4]] to <16 x i16>
-; CHECK-NEON-NEXT:    [[TMP11:%.*]] = sext <16 x i8> [[WIDE_LOAD5]] to <16 x i32>
-; CHECK-NEON-NEXT:    [[TMP12:%.*]] = sext <16 x i8> [[WIDE_LOAD6]] to <16 x i32>
-; CHECK-NEON-NEXT:    [[TMP13:%.*]] = mul nsw <16 x i16> [[TMP7]], [[TMP9]]
-; CHECK-NEON-NEXT:    [[TMP14:%.*]] = mul nsw <16 x i16> [[TMP8]], [[TMP10]]
-; CHECK-NEON-NEXT:    [[TMP15:%.*]] = sext <16 x i16> [[TMP13]] to <16 x i32>
-; CHECK-NEON-NEXT:    [[TMP16:%.*]] = sext <16 x i16> [[TMP14]] to <16 x i32>
-; CHECK-NEON-NEXT:    [[TMP17:%.*]] = sub <16 x i32> [[VEC_PHI]], [[TMP15]]
-; CHECK-NEON-NEXT:    [[TMP18:%.*]] = sub <16 x i32> [[VEC_PHI1]], [[TMP16]]
-; CHECK-NEON-NEXT:    [[TMP19]] = add <16 x i32> [[TMP17]], [[TMP11]]
-; CHECK-NEON-NEXT:    [[TMP20]] = add <16 x i32> [[TMP18]], [[TMP12]]
+; CHECK-NEON-NEXT:    [[TMP7:%.*]] = sext <16 x i8> [[WIDE_LOAD]] to <16 x i32>
+; CHECK-NEON-NEXT:    [[TMP8:%.*]] = sext <16 x i8> [[WIDE_LOAD3]] to <16 x i32>
+; CHECK-NEON-NEXT:    [[TMP15:%.*]] = mul nsw <16 x i32> [[TMP7]], [[TMP8]]
+; CHECK-NEON-NEXT:    [[TMP19:%.*]] = sub <16 x i32> zeroinitializer, [[TMP15]]
+; CHECK-NEON-NEXT:    [[PARTIAL_REDUCE:%.*]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[VEC_PHI]], <16 x i32> [[TMP19]])
+; CHECK-NEON-NEXT:    [[TMP11:%.*]] = sext <16 x i8> [[WIDE_LOAD2]] to <16 x i32>
+; CHECK-NEON-NEXT:    [[TMP12:%.*]] = sext <16 x i8> [[WIDE_LOAD4]] to <16 x i32>
+; CHECK-NEON-NEXT:    [[TMP16:%.*]] = mul nsw <16 x i32> [[TMP11]], [[TMP12]]
+; CHECK-NEON-NEXT:    [[TMP22:%.*]] = sub <16 x i32> zeroinitializer, [[TMP16]]
+; CHECK-NEON-NEXT:    [[PARTIAL_REDUCE7:%.*]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[VEC_PHI1]], <16 x i32> [[TMP22]])
+; CHECK-NEON-NEXT:    [[TMP17:%.*]] = sext <16 x i8> [[WIDE_LOAD5]] to <16 x i32>
+; CHECK-NEON-NEXT:    [[PARTIAL_REDUCE8]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[PARTIAL_REDUCE]], <16 x i32> [[TMP17]])
+; CHECK-NEON-NEXT:    [[TMP18:%.*]] = sext <16 x i8> [[WIDE_LOAD6]] to <16 x i32>
+; CHECK-NEON-NEXT:    [[PARTIAL_REDUCE9]] = call <4 x i32> @llvm.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> [[PARTIAL_REDUCE7]], <16 x i32> [[TMP18]])
 ; CHECK-NEON-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
 ; CHECK-NEON-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEON-NEXT:    br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
 ; CHECK-NEON:       middle.block:
-; CHECK-NEON-NEXT:    [[BIN_RDX:%.*]] = add <16 x i32> [[TMP20]], [[TMP19]]
-; CHECK-NEON-NEXT:    [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[BIN_RDX]])
+; CHECK-NEON-NEXT:    [[BIN_RDX:%.*]] = add <4 x i32> [[PARTIAL_REDUCE9]], [[PARTIAL_REDUCE8]]
+; CHECK-NEON-NEXT:    [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX]])
 ; CHECK-NEON-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
 ; CHECK-NEON-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK-NEON:       scalar.ph:
@@ -1420,49 +1420,53 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
 ; CHECK-SVE-LABEL: define i32 @chained_partial_reduce_add_sub_ext_mul(
 ; CHECK-SVE-SAME: ptr [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
 ; CHECK-SVE-NEXT:  entry:
-; CHECK-SVE-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 32
+; CHECK-SVE-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-SVE-NEXT:    [[TMP6:%.*]] = shl nuw nsw i64 [[TMP5]], 5
+; CHECK-SVE-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP6]]
 ; CHECK-SVE-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK-SVE:       vector.ph:
-; CHECK-SVE-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 32
+; CHECK-SVE-NEXT:    [[TMP8:%.*]] = shl nuw i64 [[TMP5]], 4
+; CHECK-SVE-NEXT:    [[TMP4:%.*]] = shl nuw i64 [[TMP8]], 1
+; CHECK-SVE-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP4]]
 ; CHECK-SVE-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
 ; CHECK-SVE-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-SVE:       vector.body:
 ; CHECK-SVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-NEXT:    [[VEC_PHI:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-NEXT:    [[VEC_PHI1:%.*]] = phi <16 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-NEXT:    [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-NEXT:    [[VEC_PHI1:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE9:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-SVE-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 1
 ; CHECK-SVE-NEXT:    [[TMP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP0]]
 ; CHECK-SVE-NEXT:    [[TMP2:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP0]]
 ; CHECK-SVE-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[C]], i64 [[TMP0]]
-; CHECK-SVE-NEXT:    [[TMP4:%.*]] = getelementptr i8, ptr [[TMP1]], i64 16
-; CHECK-SVE-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP1]], align 1
-; CHECK-SVE-NEXT:    [[WIDE_LOAD2:%.*]] = load <16 x i8>, ptr [[TMP4]], align 1
-; CHECK-SVE-NEXT:    [[TMP5:%.*]] = getelementptr i8, ptr [[TMP2]], i64 16
-; CHECK-SVE-NEXT:    [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP2]], align 1
-; CHECK-SVE-NEXT:    [[WIDE_LOAD4:%.*]] = load <16 x i8>, ptr [[TMP5]], align 1
-; CHECK-SVE-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[TMP3]], i64 16
-; CHECK-SVE-NEXT:    [[WIDE_LOAD5:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
-; CHECK-SVE-NEXT:    [[WIDE_LOAD6:%.*]] = load <16 x i8>, ptr [[TMP6]], align 1
-; CHECK-SVE-NEXT:    [[TMP7:%.*]] = sext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
-; CHECK-SVE-NEXT:    [[TMP8:%.*]] = sext <16 x i8> [[WIDE_LOAD2]] to <16 x i16>
-; CHECK-SVE-NEXT:    [[TMP9:%.*]] = sext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>
-; CHECK-SVE-NEXT:    [[TMP10:%.*]] = sext <16 x i8> [[WIDE_LOAD4]] to <16 x i16>
-; CHECK-SVE-NEXT:    [[TMP11:%.*]] = sext <16 x i8> [[WIDE_LOAD5]] to <16 x i32>
-; CHECK-SVE-NEXT:    [[TMP12:%.*]] = sext <16 x i8> [[WIDE_LOAD6]] to <16 x i32>
-; CHECK-SVE-NEXT:    [[TMP13:%.*]] = mul nsw <16 x i16> [[TMP7]], [[TMP9]]
-; CHECK-SVE-NEXT:    [[TMP14:%.*]] = mul nsw <16 x i16> [[TMP8]], [[TMP10]]
-; CHECK-SVE-NEXT:    [[TMP15:%.*]] = sext <16 x i16> [[TMP13]] to <16 x i32>
-; CHECK-SVE-NEXT:    [[TMP16:%.*]] = sext <16 x i16> [[TMP14]] to <16 x i32>
-; CHECK-SVE-NEXT:    [[TMP17:%.*]] = sub <16 x i32> [[VEC_PHI]], [[TMP15]]
-; CHECK-SVE-NEXT:    [[TMP18:%.*]] = sub <16 x i32> [[VEC_PHI1]], [[TMP16]]
-; CHECK-SVE-NEXT:    [[TMP19]] = add <16 x i32> [[TMP17]], [[TMP11]]
-; CHECK-SVE-NEXT:    [[TMP20]] = add <16 x i32> [[TMP18]], [[TMP12]]
-; CHECK-SVE-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
+; CHECK-SVE-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[TMP1]], i64 [[TMP8]]
+; CHECK-SVE-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 16 x i8>, ptr [[TMP1]], align 1
+; CHECK-SVE-NEXT:    [[WIDE_LOAD2:%.*]] = load <vscale x 16 x i8>, ptr [[TMP9]], align 1
+; CHECK-SVE-NEXT:    [[TMP10:%.*]] = getelementptr i8, ptr [[TMP2]], i64 [[TMP8]]
+; CHECK-SVE-NEXT:    [[WIDE_LOAD3:%.*]] = load <vscale x 16 x i8>, ptr [[TMP2]], align 1
+; CHECK-SVE-NEXT:    [[WIDE_LOAD4:%.*]] = load <vscale x 16 x i8>, ptr [[TMP10]], align 1
+; CHECK-SVE-NEXT:    [[TMP11:%.*]] = getelementptr i8, ptr [[TMP3]], i64 [[TMP8]]
+; CHECK-SVE-NEXT:    [[WIDE_LOAD5:%.*]] = load <vscale x 16 x i8>, ptr [[TMP3]], align 1
+; CHECK-SVE-NEXT:    [[WIDE_LOAD6:%.*]] = load <vscale x 16 x i8>, ptr [[TMP11]], align 1
+; CHECK-SVE-NEXT:    [[TMP12:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT:    [[TMP13:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD3]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT:    [[TMP14:%.*]] = mul nsw <vscale x 16 x i32> [[TMP12]], [[TMP13]]
+; CHECK-SVE-NEXT:    [[TMP15:%.*]] = sub <vscale x 16 x i32> zeroinitializer, [[TMP14]]
+; CHECK-SVE-NEXT:    [[PARTIAL_REDUCE:%.*]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[VEC_PHI]], <vscale x 16 x i32> [[TMP15]])
+; CHECK-SVE-NEXT:    [[TMP16:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD2]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT:    [[TMP17:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD4]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT:    [[TMP18:%.*]] = mul nsw <vscale x 16 x i32> [[TMP16]], [[TMP17]]
+; CHECK-SVE-NEXT:    [[TMP19:%.*]] = sub <vscale x 16 x i32> zeroinitializer, [[TMP18]]
+; CHECK-SVE-NEXT:    [[PARTIAL_REDUCE7:%.*]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[VEC_PHI1]], <vscale x 16 x i32> [[TMP19]])
+; CHECK-SVE-NEXT:    [[TMP20:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD5]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT:    [[PARTIAL_REDUCE8]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[PARTIAL_REDUCE]], <vscale x 16 x i32> [[TMP20]])
+; CHECK-SVE-NEXT:    [[TMP22:%.*]] = sext <vscale x 16 x i8> [[WIDE_LOAD6]] to <vscale x 16 x i32>
+; CHECK-SVE-NEXT:    [[PARTIAL_REDUCE9]] = call <vscale x 4 x i32> @llvm.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> [[PARTIAL_REDUCE7]], <vscale x 16 x i32> [[TMP22]])
+; CHECK-SVE-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
 ; CHECK-SVE-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-SVE-NEXT:    br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
 ; CHECK-SVE:       middle.block:
-; CHECK-SVE-NEXT:    [[BIN_RDX:%.*]] = add <16 x i32> [[TMP20]], [[TMP19]]
-; CHECK-SVE-NEXT:    [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[BIN_RDX]])
+; CHECK-SVE-NEXT:    [[BIN_RDX:%.*]] = add <vscale x 4 x i32> [[PARTIAL_REDUCE9]], [[PARTIAL_REDUCE8]]
+; CHECK-SVE-NEXT:    [[TMP23:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[BIN_RDX]])
 ; CHECK-SVE-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
 ; CHECK-SVE-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK-SVE:       scalar.ph:
@@ -1478,8 +1482,8 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
 ; CHECK-SVE-MAXBW-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-SVE-MAXBW:       vector.body:
 ; CHECK-SVE-MAXBW-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-MAXBW-NEXT:    [[VEC_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
-; CHECK-SVE-MAXBW-NEXT:    [[VEC_PHI1:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-MAXBW-NEXT:    [[VEC_PHI:%.*]] = phi <2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE8:%.*]], [[VECTOR_BODY]] ]
+; CHECK-SVE-MAXBW-NEXT:    [[VEC_PHI1:%.*]] = phi <2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PARTIAL_REDUCE9:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-SVE-MAXBW-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 1
 ; CHECK-SVE-MAXBW-NEXT:    [[TMP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP0]]
 ; CHECK-SVE-MAXBW-NEXT:    [[TMP2:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP0]]
@@ -1493,26 +1497,26 @@ define i32 @chained_partial_reduce_add_sub_ext_mul(ptr %a, ptr %b, ptr %c, i64 %
 ; CHECK-SVE-MAXBW-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[TMP3]], i64 8
 ; CHECK-SVE-MAXBW-NEXT:    [[WIDE_LOAD5:%.*]] = load <8 x i8>, ptr [[TMP3]], align 1
 ; CHECK-SVE-MAXBW-NEXT:    [[WIDE_LOAD6:%.*]] = load <8 x i8>, ptr [[TMP6]], align 1
-; CHECK-SVE-MAXBW-NEXT:    [[TMP7:%.*]] = sext <8 x i8> [[WIDE_LOAD]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP8:%.*]] = sext <8 x i8> [[WIDE_LOAD2]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP9:%.*]] = sext <8 x i8> [[WIDE_LOAD3]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP10:%.*]] = sext <8 x i8> [[WIDE_LOAD4]] to <8 x i16>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP11:%.*]] = sext <8 x i8> [[WIDE_LOAD5]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP12:%.*]] = sext <8 x i8> [[WIDE_LOAD6]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP13:%.*]] = mul nsw <8 x i16> [[TMP7]], [[TMP9]]
-; CHECK-SVE-MAXBW-NEXT:    [[TMP14:%.*]] = mul nsw <8 x i16> [[TMP8]], [[TMP10]]
-; CHECK-SVE-MAXBW-NEXT:    [[TMP15:%.*]] = sext <8 x i16> [[TMP13]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP16:%.*]] = sext <8 x i16> [[TMP14]] to <8 x i32>
-; CHECK-SVE-MAXBW-NEXT:    [[TMP17:%.*]] = sub <8 x i32> [[VEC_PHI]], [[TMP15]]
-; CHECK-SVE-MAXBW-NEXT:    [[TMP18:%.*]] = sub <8 x i32> [[VEC_PHI1]], [[TMP16]]
-; CHECK-SVE-MAXBW-NEXT:    [[TMP19]] = add <8 x i32> [[TMP17]], [[TMP11]]
-; CHECK-SVE-MAXBW-NEXT:    [[TMP20]] = add <8 x i32> [[TMP18]], [[TMP12]]
+; CHECK-SVE-MAXBW-NEXT:    [[TMP7:%.*]] = sext <8 x i8> [[WIDE_LOAD]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT:    [[TMP8:%.*]] = sext <8 x i8> [[WIDE_LOAD3]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT:    [[TMP15:%.*]] = mul nsw <8 x i32> [[TMP7]], [[TMP8]]
+; CHECK-SVE-MAXBW-NEXT:    [[TMP19:%.*]] = sub <8 x i32> zeroinitializer, [[TMP15]]
+; CHECK-SVE-MAXBW-NEXT:    [[PARTIAL_REDUCE:%.*]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[VEC_PHI]], <8 x i32> [[TMP19]])
+; CHECK-SVE-MAXBW-NEXT:    [[TMP11:%.*]] = sext <8 x i8> [[WIDE_LOAD2]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT:    [[TMP12:%.*]] = sext <8 x i8> [[WIDE_LOAD4]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT:    [[TMP16:%.*]] = mul nsw <8 x i32> [[TMP11]], [[TMP12]]
+; CHECK-SVE-MAXBW-NEXT:    [[TMP22:%.*]] = sub <8 x i32> zeroinitializer, [[TMP16]]
+; CHECK-SVE-MAXBW-NEXT:    [[PARTIAL_REDUCE7:%.*]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[VEC_PHI1]], <8 x i32> [[TMP22]])
+; CHECK-SVE-MAXBW-NEXT:    [[TMP17:%.*]] = sext <8 x i8> [[WIDE_LOAD5]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT:    [[PARTIAL_REDUCE8]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[PARTIAL_REDUCE]], <8 x i32> [[TMP17]])
+; CHECK-SVE-MAXBW-NEXT:    [[TMP18:%.*]] = sext <8 x i8> [[WIDE_LOAD6]] to <8 x i32>
+; CHECK-SVE-MAXBW-NEXT:    [[PARTIAL_REDUCE9]] = call <2 x i32> @llvm.vector.partial.reduce.add.v2i32.v8i32(<2 x i32> [[PARTIAL_REDUCE7]], <8 x i32> [[TMP18]])
 ; CHECK-SVE-MAXBW-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
 ; CHECK-SVE-MAXBW-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-SVE-MAXBW-NEXT:    br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
 ; CHECK-SVE-MAXBW:       middle.block:
-; CHECK-SVE-MAXBW-NEXT:    [[BIN_RDX:%.*]] = add <8 x i32> [[TMP20]], [[TMP19]]
-; CHECK-SVE-MAXBW-NEXT:    [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[BIN_RDX]])
+; CHECK-SVE-MAXBW-NEXT:    [[BIN_RDX:%.*]] = add <2 x i32> [[PARTIAL_REDUCE9]], [[PARTIAL_REDUCE8]]
+; CHECK-SVE-MAXBW-NEXT:    [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> [[BIN_RDX]])
 ; CHECK-SVE-MAXBW-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
 ; CHECK-SVE-MAXBW-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK-SVE-MAXBW:       scalar.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll
index a4bfa47fa9850..0fcac876e8788 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-usabs.ll
@@ -380,3 +380,63 @@ for.body:
 exit:
   ret i32 %sum.1
 }
+
+; Tests an abs diff extended operand in a AddChainWithSubs reduction, where the
+; abs diff is negated (the operand of the sub). Previously, we'd crash handling
+; this as we would fail to match the negated abs diff.
+define i32 @sub_add_chain_unsigned_absolute_difference(ptr noalias %x, ptr noalias %y) {
+; CHECK-LABEL: define i32 @sub_add_chain_unsigned_absolute_difference(
+; CHECK-SAME: ptr noalias [[X:%.*]], ptr noalias [[Y:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[PARTIAL_REDUCE2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw i8, ptr [[X]], i64 [[INDEX]]
+; CHECK-NEXT:    ...
[truncated]
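
The comment in `transformToPartialReduction` lists two ways to implement a sub-reduction: negating the operand in the loop, or subtracting the reduced value from the init value afterwards. A scalar sketch in C++ of why the two agree, assuming no signed overflow. The function names are hypothetical, and this models neither vectorization nor the partial-reduce intrinsic:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference: the loop as written in the source, acc -= x each iteration.
int32_t reduceSub(int32_t Init, const std::vector<int32_t> &Xs) {
  int32_t Acc = Init;
  for (int32_t X : Xs)
    Acc -= X;
  return Acc;
}

// (1) Negate the operand inside the loop and accumulate with an add.
int32_t reduceNegateInLoop(int32_t Init, const std::vector<int32_t> &Xs) {
  int32_t Acc = Init;
  for (int32_t X : Xs)
    Acc += -X;
  return Acc;
}

// (2) Accumulate a plain sum from zero, then subtract it from the init
// value once the loop is done (the "middle block" in the vectorizer).
int32_t reduceSubtractAfterLoop(int32_t Init, const std::vector<int32_t> &Xs) {
  int32_t Sum = 0;
  for (int32_t X : Xs)
    Sum += X;
  return Init - Sum;
}

// Returns true when both forms agree with the reference loop.
bool subReductionFormsAgree(int32_t Init, const std::vector<int32_t> &Xs) {
  int32_t Expected = reduceSub(Init, Xs);
  return reduceNegateInLoop(Init, Xs) == Expected &&
         reduceSubtractAfterLoop(Init, Xs) == Expected;
}
```

Form (1) corresponds to the `sub ... zeroinitializer, ...` feeding `partial.reduce.add` in the updated CHECK lines above.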
