LoopVectorize: fix phi cost when it is scalar after vectorization #74456

Closed
wants to merge 2 commits

Conversation

@artagnon (Contributor) commented Dec 5, 2023

In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, attempting to get the shuffle cost will result in a crash from a cast. Fix this.

Fixes #72969.

-- 8< --
Based on #74111.
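
For context, a simplified sketch of the pre-patch path in LoopVectorizationCostModel::getInstructionCost; names follow LoopVectorize.cpp, but the surrounding code is abbreviated and slightly paraphrased. When the phi is scalar after vectorization, the result type stays scalar, so the cast<VectorType> in the shuffle-cost call fires the reported assertion:

    // Simplified sketch of the pre-patch logic (abbreviated).
    Type *RetTy = I->getType();
    // Instructions that remain scalar keep their scalar type; only widened
    // instructions get a vector type.
    Type *VectorTy =
        isScalarAfterVectorization(I, VF) ? RetTy : ToVectorTy(RetTy, VF);

    switch (I->getOpcode()) {
    case Instruction::PHI: {
      auto *Phi = cast<PHINode>(I);
      if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi)) {
        SmallVector<int> Mask(VF.getKnownMinValue());
        std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
        // When the phi is scalar after vectorization, VectorTy is not a
        // VectorType, and this cast asserts:
        //   cast<Ty>() argument of incompatible type!
        return TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
                                  cast<VectorType>(VectorTy), Mask, CostKind,
                                  VF.getKnownMinValue() - 1);
      }
      // ... other phi costs ...
    }
    // ... other opcodes ...
    }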

@llvmbot (Collaborator) commented Dec 5, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, this cost is incorrect, as it won't be converted into a shuffle. Fix the cost.

Fixes #72969.

-- 8< --
Based on #74111.


Full diff: https://github.com/llvm/llvm-project/pull/74456.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+4-1)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll (+99)
  • (added) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+44)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 09a6e01226ab6..1166a39e3ff5e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7176,7 +7176,10 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
     auto *Phi = cast<PHINode>(I);
 
     // First-order recurrences are replaced by vector shuffles inside the loop.
-    if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi)) {
+    // However, if the Phi is a scalar after vectorization, it won't be
+    // converted into a shuffle.
+    if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi) &&
+        !isScalarAfterVectorization(Phi, VF)) {
       SmallVector<int> Mask(VF.getKnownMinValue());
       std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
       return TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll
new file mode 100644
index 0000000000000..832c55ba9a1f9
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll
@@ -0,0 +1,99 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -mtriple=aarch64 -mattr=-sve,+neon -passes=loop-vectorize -S < %s | FileCheck %s
+
+@h = global i32 0, align 4
+
+define void @test(ptr %p) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P1:%.*]] = ptrtoint ptr [[P]] to i64
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[P1]], 8
+; CHECK-NEXT:    [[UMAX2:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP0]], i64 add (i64 ptrtoint (ptr @h to i64), i64 1))
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[UMAX2]], -5
+; CHECK-NEXT:    [[TMP2:%.*]] = sub i64 [[TMP1]], [[P1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = lshr i64 [[TMP2]], 2
+; CHECK-NEXT:    [[TMP4:%.*]] = add nuw nsw i64 [[TMP3]], 1
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP4]], 2
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
+; CHECK:       vector.scevcheck:
+; CHECK-NEXT:    [[TMP5:%.*]] = add i64 [[P1]], 8
+; CHECK-NEXT:    [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP5]], i64 add (i64 ptrtoint (ptr @h to i64), i64 1))
+; CHECK-NEXT:    [[TMP6:%.*]] = add i64 [[UMAX]], -5
+; CHECK-NEXT:    [[TMP7:%.*]] = sub i64 [[TMP6]], [[P1]]
+; CHECK-NEXT:    [[TMP8:%.*]] = lshr i64 [[TMP7]], 2
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp ugt i64 [[TMP8]], 65535
+; CHECK-NEXT:    [[TMP10:%.*]] = trunc i64 [[TMP8]] to i16
+; CHECK-NEXT:    [[TMP11:%.*]] = add i16 2, [[TMP10]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp ult i16 [[TMP11]], 2
+; CHECK-NEXT:    [[TMP13:%.*]] = icmp ugt i64 [[TMP8]], 65535
+; CHECK-NEXT:    [[TMP14:%.*]] = or i1 [[TMP12]], [[TMP13]]
+; CHECK-NEXT:    br i1 [[TMP14]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP4]], 2
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP4]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[DOTCAST:%.*]] = trunc i64 [[N_VEC]] to i16
+; CHECK-NEXT:    [[IND_END:%.*]] = add i16 1, [[DOTCAST]]
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VECTOR_RECUR:%.*]] = phi i64 [ 1, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[DOTCAST4:%.*]] = trunc i64 [[INDEX]] to i16
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i16 1, [[DOTCAST4]]
+; CHECK-NEXT:    [[TMP15:%.*]] = add i16 [[OFFSET_IDX]], 0
+; CHECK-NEXT:    [[TMP16:%.*]] = add i16 [[OFFSET_IDX]], 1
+; CHECK-NEXT:    [[TMP17:%.*]] = add i16 [[TMP15]], 1
+; CHECK-NEXT:    [[TMP18:%.*]] = add i16 [[TMP16]], 1
+; CHECK-NEXT:    [[TMP19:%.*]] = zext i16 [[TMP17]] to i64
+; CHECK-NEXT:    [[TMP20]] = zext i16 [[TMP18]] to i64
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP4]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 1, [[VECTOR_SCEVCHECK]] ], [ 1, [[ENTRY:%.*]] ], [ [[TMP20]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i16 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY]] ], [ 1, [[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL3:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ], [ 0, [[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[SCALAR_RECUR:%.*]] = phi i64 [ [[SCALAR_RECUR_INIT]], [[SCALAR_PH]] ], [ [[IDX:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INC_MERGE:%.*]] = phi i16 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[IDX_MERGE:%.*]] = phi i64 [ [[BC_RESUME_VAL3]], [[SCALAR_PH]] ], [ [[SCALAR_RECUR]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[ADD:%.*]] = shl i64 [[IDX_MERGE]], 1
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr [1 x i32], ptr [[P]], i64 0, i64 [[ADD]]
+; CHECK-NEXT:    [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[INC]] = add i16 [[INC_MERGE]], 1
+; CHECK-NEXT:    [[IDX]] = zext i16 [[INC]] to i64
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[P]], i64 [[IDX]]
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt ptr [[GEP]], @h
+; CHECK-NEXT:    br i1 [[CMP]], label [[EXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %idx.ext.merge = phi i64 [ 1, %entry ], [ %idx, %for.body ]
+  %inc.merge = phi i16 [ 1, %entry ], [ %inc, %for.body ]
+  %idx.merge = phi i64 [ 0, %entry ], [ %idx.ext.merge, %for.body ]
+  %add = shl i64 %idx.merge, 1
+  %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
+  %0 = load i32, ptr %arrayidx, align 4
+  %inc = add i16 %inc.merge, 1
+  %idx = zext i16 %inc to i64
+  %gep = getelementptr i32, ptr %p, i64 %idx
+  %cmp = icmp ugt ptr %gep, @h
+  br i1 %cmp, label %exit, label %for.body
+
+exit:
+  ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
+;.
diff --git a/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll b/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
new file mode 100644
index 0000000000000..4cf0399438e08
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -mtriple=x86_64 -mattr=-avx,-avx2,-avx512f,-sse2,-sse3,+sse -passes=loop-vectorize -S < %s | FileCheck %s
+
+@h = global i32 0, align 4
+
+define void @test(ptr %p) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[IDX_EXT_MERGE:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ], [ [[IDX:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INC_MERGE:%.*]] = phi i16 [ 1, [[ENTRY]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[IDX_MERGE:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[IDX_EXT_MERGE]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[ADD:%.*]] = shl i64 [[IDX_MERGE]], 1
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr [1 x i32], ptr [[P]], i64 0, i64 [[ADD]]
+; CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[INC]] = add i16 [[INC_MERGE]], 1
+; CHECK-NEXT:    [[IDX]] = zext i16 [[INC]] to i64
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[P]], i64 [[IDX]]
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt ptr [[GEP]], @h
+; CHECK-NEXT:    br i1 [[CMP]], label [[EXIT:%.*]], label [[FOR_BODY]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %idx.ext.merge = phi i64 [ 1, %entry ], [ %idx, %for.body ]
+  %inc.merge = phi i16 [ 1, %entry ], [ %inc, %for.body ]
+  %idx.merge = phi i64 [ 0, %entry ], [ %idx.ext.merge, %for.body ]
+  %add = shl i64 %idx.merge, 1
+  %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
+  %0 = load i32, ptr %arrayidx, align 4
+  %inc = add i16 %inc.merge, 1
+  %idx = zext i16 %inc to i64
+  %gep = getelementptr i32, ptr %p, i64 %idx
+  %cmp = icmp ugt ptr %gep, @h
+  br i1 %cmp, label %exit, label %for.body
+
+exit:
+  ret void
+}

@david-arm (Contributor) left a comment

LGTM! Thanks for the fix.

@fhahn (Contributor) commented Dec 12, 2023

> However, if it also satisfies the condition isScalarAfterVectorization, this cost is incorrect, as it won't be converted into a shuffle. Fix the cost.

Hmm, is this correct? It looks like in the VPlan, there's

FIRST-ORDER-RECURRENCE-PHI ir<%idx.ext.merge> = phi ir<1>, ir<%idx>
EMIT vp<%8> = first-order splice ir<%idx.ext.merge>, ir<%idx>

so a vector phi and shuffle should be generated. It may be good to also add a use of the recurrence to the test, the load and GEP will get DCE'd.
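
(Aside: a VPlan dump like the one above can be printed from an assertions-enabled build; something like the following invocation should reproduce it, though the exact flags here are an assumption, not taken from the thread:)

    opt -mtriple=x86_64 -passes=loop-vectorize -debug-only=loop-vectorize \
        -disable-output pr72969.ll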

@artagnon (Contributor, Author) commented

Fixed commit message.

> It may be good to also add a use of the recurrence to the test, the load and GEP will get DCE'd.

I don't think this would apply, since nothing gets vectorized in the minimal X86 example, and I don't think we need an example showing vectorization to fix the bug.

@artagnon (Contributor, Author) commented Jan 3, 2024

@fhahn Ping. Are you happy with the changes?

@fhahn (Contributor) commented Jan 3, 2024

> @fhahn Ping. Are you happy with the changes?

What are your thoughts about the below? For that, we would need a case that gets vectorized with the change.

> Hmm, is this correct? It looks like in the VPlan, there's
>
>  FIRST-ORDER-RECURRENCE-PHI ir<%idx.ext.merge> = phi ir<1>, ir<%idx>
>  EMIT vp<%8> = first-order splice ir<%idx.ext.merge>, ir<%idx>
>
> so a vector phi and shuffle should be generated.
> It may be good to also add a use of the recurrence to the test, the load and GEP will get DCE'd.
>
> I don't think this would apply, since nothing gets vectorized in the minimal X86 example, and I don't think we need an example showing vectorization to fix the bug.

Ideally we would have a test for both cases that are impacted by the patch: vectorizing with the new cost and also not vectorizing (and not crashing).

@artagnon (Contributor, Author) commented Jan 3, 2024

> Ideally we would have a test for both cases that are impacted by the patch: vectorizing with the new cost and also not vectorizing (and not crashing).

Okay, I struggled a lot with your requests, because the bug is very subtle:

  1. It's impossible to get a test under X86 to first crash, and then be vectorized after the fix, since vectorization depends on sse4.2, and the bug isn't triggered when sse4.2 is enabled.
  2. I tried everything, but I think it's impossible to remove the SCEV checks.
  3. I tried to use the result of the recurrence in a lot of ways, but none of them get the test to crash first, and then be vectorized after the fix.

I think we have no choice but to revert to my original version of the X86 and AArch64 tests.

@artagnon (Contributor, Author) commented Jan 8, 2024

@fhahn This is the best I could do (see previous comment). Do you think it's okay to merge?

@fhahn (Contributor) commented Jan 8, 2024

> It's impossible to get a test under X86 to first crash, and then be vectorized after the fix, since vectorization depends on sse4.2, and the bug isn't triggered when sse4.2 is enabled.

I think if you pass -force-vector-width=4 to the X86 test case you can force it to be vectorized. Replacing the load in the test case with a store (like store i64 0, ptr %arrayidx, align 8) would replace a recipe that will be removed with one that cannot be removed. Does that make sense?
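
Concretely, the suggestion would reshape the X86 test along these lines (a sketch, not the test that was actually committed; the RUN line adds -force-vector-width=4 from the comment above, and the store is the literal one suggested):

    ; RUN: opt -mtriple=x86_64 -mattr=-avx,-avx2,-avx512f,-sse2,-sse3,+sse \
    ; RUN:   -passes=loop-vectorize -force-vector-width=4 -S < %s | FileCheck %s
      ...
      %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
      ; a store, unlike the dead load, cannot be DCE'd:
      store i64 0, ptr %arrayidx, align 8
      ...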

The updated commit message:

In getInstructionCost, when a phi node is identified as a fixed-order
recurrence, and the vectorization factor indicates that we're
vectorizing, we return the cost of a shuffle. However, if it also
satisfies the condition isScalarAfterVectorization, attempting to get
the shuffle cost will result in a crash from a cast. Fix this.

Fixes llvm#72969.
@artagnon (Contributor, Author) commented Jan 8, 2024

Thanks @fhahn: the patch is now ready. Your comment about replacing the load with the store makes sense, but can you clarify why you thought -force-vector-width=4 would work? Also, if you know, what is the difference between sse and sse4.2 in the context of this test?

@fhahn (Contributor) left a comment

> Thanks @fhahn: the patch is now ready. Your comment about replacing the load with the store makes sense, but can you clarify why you thought -force-vector-width=4 would work? Also, if you know, what is the difference between sse and sse4.2 in the context of this test?

The reason for not vectorizing was probably the cost-model deciding it is not profitable; -force-vector-width=4 side-steps that by forcing to vectorize with VF=4 if legally possible.

Not sure about sse4.2.

On the changed hunk in LoopVectorize.cpp:

    // However, if the Phi is a scalar after vectorization, don't get shuffle
    // cost.
    if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi) &&
        !isScalarAfterVectorization(Phi, VF)) {

IIUC we now don't crash because the phi in the test is considered scalar after vectorization; but this doesn't match the code we generate in the test case, where we have a vector phi for the recurrence and the corresponding shuffle AFAICT.

We only support codegen for vector recurrences, so it would probably be better to avoid marking the phi as scalar after vectorization.
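
In other words, the alternative is to fix the scalars analysis rather than the cost model. A rough, hypothetical sketch of that direction (the helper name and its placement are illustrative, not actual LoopVectorize.cpp code; the real decision is made during the cost model's scalars collection, e.g. collectLoopScalars):

    // Hypothetical sketch: codegen only supports fixed-order recurrences as
    // vector recurrences (vector phi + first-order splice), so never classify
    // such phis as scalar after vectorization.
    bool mayBeScalarAfterVectorization(Instruction *I, ElementCount VF) {
      if (auto *Phi = dyn_cast<PHINode>(I))
        if (Legal->isFixedOrderRecurrence(Phi))
          return false; // must remain a vector recurrence (phi + splice)
      // ... the existing scalars analysis decides the rest ...
      return true;
    }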

@artagnon (Contributor, Author) commented Apr 6, 2024

Will author a fresh patch soon.

@artagnon artagnon closed this Apr 6, 2024
@artagnon artagnon deleted the lv-crash branch April 6, 2024 13:05

Successfully merging this pull request may close these issues.

Clang crash: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.