LoopVectorize: fix phi cost when it is scalar after vectorization #74456
Conversation
@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes: In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, attempting to get the shuffle cost will result in a crash from a cast. Fix this.

Fixes #72969.

-- 8< --

Full diff: https://github.com/llvm/llvm-project/pull/74456.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 09a6e01226ab6..1166a39e3ff5e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7176,7 +7176,10 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
auto *Phi = cast<PHINode>(I);
// First-order recurrences are replaced by vector shuffles inside the loop.
- if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi)) {
+ // However, if the Phi is a scalar after vectorization, it won't be
+ // converted into a shuffle.
+ if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi) &&
+ !isScalarAfterVectorization(Phi, VF)) {
SmallVector<int> Mask(VF.getKnownMinValue());
std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
return TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll
new file mode 100644
index 0000000000000..832c55ba9a1f9
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll
@@ -0,0 +1,99 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -mtriple=aarch64 -mattr=-sve,+neon -passes=loop-vectorize -S < %s | FileCheck %s
+
+@h = global i32 0, align 4
+
+define void @test(ptr %p) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[P1:%.*]] = ptrtoint ptr [[P]] to i64
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[P1]], 8
+; CHECK-NEXT: [[UMAX2:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP0]], i64 add (i64 ptrtoint (ptr @h to i64), i64 1))
+; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[UMAX2]], -5
+; CHECK-NEXT: [[TMP2:%.*]] = sub i64 [[TMP1]], [[P1]]
+; CHECK-NEXT: [[TMP3:%.*]] = lshr i64 [[TMP2]], 2
+; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[TMP3]], 1
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP4]], 2
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
+; CHECK: vector.scevcheck:
+; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[P1]], 8
+; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP5]], i64 add (i64 ptrtoint (ptr @h to i64), i64 1))
+; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[UMAX]], -5
+; CHECK-NEXT: [[TMP7:%.*]] = sub i64 [[TMP6]], [[P1]]
+; CHECK-NEXT: [[TMP8:%.*]] = lshr i64 [[TMP7]], 2
+; CHECK-NEXT: [[TMP9:%.*]] = icmp ugt i64 [[TMP8]], 65535
+; CHECK-NEXT: [[TMP10:%.*]] = trunc i64 [[TMP8]] to i16
+; CHECK-NEXT: [[TMP11:%.*]] = add i16 2, [[TMP10]]
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ult i16 [[TMP11]], 2
+; CHECK-NEXT: [[TMP13:%.*]] = icmp ugt i64 [[TMP8]], 65535
+; CHECK-NEXT: [[TMP14:%.*]] = or i1 [[TMP12]], [[TMP13]]
+; CHECK-NEXT: br i1 [[TMP14]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP4]], 2
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP4]], [[N_MOD_VF]]
+; CHECK-NEXT: [[DOTCAST:%.*]] = trunc i64 [[N_VEC]] to i16
+; CHECK-NEXT: [[IND_END:%.*]] = add i16 1, [[DOTCAST]]
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi i64 [ 1, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[DOTCAST4:%.*]] = trunc i64 [[INDEX]] to i16
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i16 1, [[DOTCAST4]]
+; CHECK-NEXT: [[TMP15:%.*]] = add i16 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[TMP16:%.*]] = add i16 [[OFFSET_IDX]], 1
+; CHECK-NEXT: [[TMP17:%.*]] = add i16 [[TMP15]], 1
+; CHECK-NEXT: [[TMP18:%.*]] = add i16 [[TMP16]], 1
+; CHECK-NEXT: [[TMP19:%.*]] = zext i16 [[TMP17]] to i64
+; CHECK-NEXT: [[TMP20]] = zext i16 [[TMP18]] to i64
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP4]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 1, [[VECTOR_SCEVCHECK]] ], [ 1, [[ENTRY:%.*]] ], [ [[TMP20]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i16 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY]] ], [ 1, [[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL3:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ], [ 0, [[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: [[SCALAR_RECUR:%.*]] = phi i64 [ [[SCALAR_RECUR_INIT]], [[SCALAR_PH]] ], [ [[IDX:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[INC_MERGE:%.*]] = phi i16 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[IDX_MERGE:%.*]] = phi i64 [ [[BC_RESUME_VAL3]], [[SCALAR_PH]] ], [ [[SCALAR_RECUR]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[ADD:%.*]] = shl i64 [[IDX_MERGE]], 1
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr [1 x i32], ptr [[P]], i64 0, i64 [[ADD]]
+; CHECK-NEXT: [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[INC]] = add i16 [[INC_MERGE]], 1
+; CHECK-NEXT: [[IDX]] = zext i16 [[INC]] to i64
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr i32, ptr [[P]], i64 [[IDX]]
+; CHECK-NEXT: [[CMP:%.*]] = icmp ugt ptr [[GEP]], @h
+; CHECK-NEXT: br i1 [[CMP]], label [[EXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK: exit:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %for.body
+
+for.body:
+ %idx.ext.merge = phi i64 [ 1, %entry ], [ %idx, %for.body ]
+ %inc.merge = phi i16 [ 1, %entry ], [ %inc, %for.body ]
+ %idx.merge = phi i64 [ 0, %entry ], [ %idx.ext.merge, %for.body ]
+ %add = shl i64 %idx.merge, 1
+ %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
+ %0 = load i32, ptr %arrayidx, align 4
+ %inc = add i16 %inc.merge, 1
+ %idx = zext i16 %inc to i64
+ %gep = getelementptr i32, ptr %p, i64 %idx
+ %cmp = icmp ugt ptr %gep, @h
+ br i1 %cmp, label %exit, label %for.body
+
+exit:
+ ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
+;.
diff --git a/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll b/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
new file mode 100644
index 0000000000000..4cf0399438e08
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -mtriple=x86_64 -mattr=-avx,-avx2,-avx512f,-sse2,-sse3,+sse -passes=loop-vectorize -S < %s | FileCheck %s
+
+@h = global i32 0, align 4
+
+define void @test(ptr %p) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: [[IDX_EXT_MERGE:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ], [ [[IDX:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[INC_MERGE:%.*]] = phi i16 [ 1, [[ENTRY]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[IDX_MERGE:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[IDX_EXT_MERGE]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[ADD:%.*]] = shl i64 [[IDX_MERGE]], 1
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr [1 x i32], ptr [[P]], i64 0, i64 [[ADD]]
+; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[INC]] = add i16 [[INC_MERGE]], 1
+; CHECK-NEXT: [[IDX]] = zext i16 [[INC]] to i64
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr i32, ptr [[P]], i64 [[IDX]]
+; CHECK-NEXT: [[CMP:%.*]] = icmp ugt ptr [[GEP]], @h
+; CHECK-NEXT: br i1 [[CMP]], label [[EXIT:%.*]], label [[FOR_BODY]]
+; CHECK: exit:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %for.body
+
+for.body:
+ %idx.ext.merge = phi i64 [ 1, %entry ], [ %idx, %for.body ]
+ %inc.merge = phi i16 [ 1, %entry ], [ %inc, %for.body ]
+ %idx.merge = phi i64 [ 0, %entry ], [ %idx.ext.merge, %for.body ]
+ %add = shl i64 %idx.merge, 1
+ %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
+ %0 = load i32, ptr %arrayidx, align 4
+ %inc = add i16 %inc.merge, 1
+ %idx = zext i16 %inc to i64
+ %gep = getelementptr i32, ptr %p, i64 %idx
+ %cmp = icmp ugt ptr %gep, @h
+ br i1 %cmp, label %exit, label %for.body
+
+exit:
+ ret void
+}
LGTM! Thanks for the fix.
Hmm, is this correct? It looks like in the VPlan, there's
so a vector phi and shuffle should be generated. It may be good to also add a use of the recurrence to the test; otherwise the load and GEP will get DCE'd.
Fixed commit message.
I don't think this would apply, since nothing gets vectorized in the minimal X86 example, and I don't think we need an example showing vectorization to fix the bug.
@fhahn Ping. Are you happy with the changes?
What are your thoughts about the below? For that, we would need a case that gets vectorized with the change.
Ideally we would have a test for both cases that are impacted by the patch: vectorizing with the new cost and also not vectorizing (and not crashing).
Okay, I struggled a lot with your requests, because the bug is very subtle:
I think we have no choice but to revert to my original versions of the X86 and AArch64 tests.
@fhahn This is the best I could do (see previous comment). Do you think it's okay to merge?
I think if you pass |
In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, attempting to get the shuffle cost will result in a crash from a cast. Fix this. Fixes llvm#72969.
Thanks @fhahn: the patch is now ready. Your comment about replacing the load with the store makes sense, but can you clarify why you thought -force-vector-width=4 would work? Also, if you know, what is the difference between sse and sse4.2 in the context of this test?
The reason for not vectorizing was probably the cost model deciding it is not profitable; -force-vector-width=4 side-steps that by forcing vectorization with VF=4 if legally possible. Not sure about sse4.2.
// However, if the Phi is a scalar after vectorization, don't get shuffle
// cost.
if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi) &&
    !isScalarAfterVectorization(Phi, VF)) {
IIUC we now don't crash because the phi in the test is considered scalar after vectorization; but this doesn't match the code we generate in the test case, where we have a vector phi for the recurrence and the corresponding shuffle AFAICT.
We only support codegen for vector recurrences, so it would probably be better to avoid marking the phi as scalar after vectorization.
Will author a fresh patch soon.
In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, attempting to get the shuffle cost will result in a crash from a cast. Fix this.
Fixes #72969.
-- 8< --
Based on #74111.