LoopVectorize: fix phi cost when it is scalar after vectorization #74456

Closed
wants to merge 2 commits

Conversation

@artagnon (Contributor) commented Dec 5, 2023

In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, attempting to get the shuffle cost will result in a crash from a cast. Fix this.

Fixes #72969.

-- 8< --
Based on #74111.
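
For context, a simplified sketch of the pre-patch path in LoopVectorizationCostModel::getInstructionCost; names follow LoopVectorize.cpp, but the surrounding code is abbreviated and slightly paraphrased. When the phi is scalar after vectorization, the result type stays scalar, so the cast<VectorType> in the shuffle-cost call fires the reported assertion:

    // Simplified sketch of the pre-patch logic (abbreviated).
    Type *RetTy = I->getType();
    // Instructions that remain scalar keep their scalar type; only widened
    // instructions get a vector type.
    Type *VectorTy =
        isScalarAfterVectorization(I, VF) ? RetTy : ToVectorTy(RetTy, VF);

    switch (I->getOpcode()) {
    case Instruction::PHI: {
      auto *Phi = cast<PHINode>(I);
      if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi)) {
        SmallVector<int> Mask(VF.getKnownMinValue());
        std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
        // When the phi is scalar after vectorization, VectorTy is not a
        // VectorType, and this cast asserts:
        //   cast<Ty>() argument of incompatible type!
        return TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
                                  cast<VectorType>(VectorTy), Mask, CostKind,
                                  VF.getKnownMinValue() - 1);
      }
      // ... other phi costs ...
    }
    // ... other opcodes ...
    }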

@llvmbot (Collaborator) commented Dec 5, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

In getInstructionCost, when a phi node is identified as a fixed-order recurrence, and the vectorization factor indicates that we're vectorizing, we return the cost of a shuffle. However, if it also satisfies the condition isScalarAfterVectorization, this cost is incorrect, as it won't be converted into a shuffle. Fix the cost.

Fixes #72969.

-- 8< --
Based on #74111.


Full diff: https://github.com/llvm/llvm-project/pull/74456.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+4-1)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll (+99)
  • (added) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+44)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 09a6e01226ab6..1166a39e3ff5e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7176,7 +7176,10 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
     auto *Phi = cast<PHINode>(I);
 
     // First-order recurrences are replaced by vector shuffles inside the loop.
-    if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi)) {
+    // However, if the Phi is a scalar after vectorization, it won't be
+    // converted into a shuffle.
+    if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi) &&
+        !isScalarAfterVectorization(Phi, VF)) {
       SmallVector<int> Mask(VF.getKnownMinValue());
       std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
       return TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll
new file mode 100644
index 0000000000000..832c55ba9a1f9
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr72969.ll
@@ -0,0 +1,99 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -mtriple=aarch64 -mattr=-sve,+neon -passes=loop-vectorize -S < %s | FileCheck %s
+
+@h = global i32 0, align 4
+
+define void @test(ptr %p) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[P1:%.*]] = ptrtoint ptr [[P]] to i64
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[P1]], 8
+; CHECK-NEXT:    [[UMAX2:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP0]], i64 add (i64 ptrtoint (ptr @h to i64), i64 1))
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[UMAX2]], -5
+; CHECK-NEXT:    [[TMP2:%.*]] = sub i64 [[TMP1]], [[P1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = lshr i64 [[TMP2]], 2
+; CHECK-NEXT:    [[TMP4:%.*]] = add nuw nsw i64 [[TMP3]], 1
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP4]], 2
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
+; CHECK:       vector.scevcheck:
+; CHECK-NEXT:    [[TMP5:%.*]] = add i64 [[P1]], 8
+; CHECK-NEXT:    [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP5]], i64 add (i64 ptrtoint (ptr @h to i64), i64 1))
+; CHECK-NEXT:    [[TMP6:%.*]] = add i64 [[UMAX]], -5
+; CHECK-NEXT:    [[TMP7:%.*]] = sub i64 [[TMP6]], [[P1]]
+; CHECK-NEXT:    [[TMP8:%.*]] = lshr i64 [[TMP7]], 2
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp ugt i64 [[TMP8]], 65535
+; CHECK-NEXT:    [[TMP10:%.*]] = trunc i64 [[TMP8]] to i16
+; CHECK-NEXT:    [[TMP11:%.*]] = add i16 2, [[TMP10]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp ult i16 [[TMP11]], 2
+; CHECK-NEXT:    [[TMP13:%.*]] = icmp ugt i64 [[TMP8]], 65535
+; CHECK-NEXT:    [[TMP14:%.*]] = or i1 [[TMP12]], [[TMP13]]
+; CHECK-NEXT:    br i1 [[TMP14]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP4]], 2
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP4]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[DOTCAST:%.*]] = trunc i64 [[N_VEC]] to i16
+; CHECK-NEXT:    [[IND_END:%.*]] = add i16 1, [[DOTCAST]]
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VECTOR_RECUR:%.*]] = phi i64 [ 1, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[DOTCAST4:%.*]] = trunc i64 [[INDEX]] to i16
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i16 1, [[DOTCAST4]]
+; CHECK-NEXT:    [[TMP15:%.*]] = add i16 [[OFFSET_IDX]], 0
+; CHECK-NEXT:    [[TMP16:%.*]] = add i16 [[OFFSET_IDX]], 1
+; CHECK-NEXT:    [[TMP17:%.*]] = add i16 [[TMP15]], 1
+; CHECK-NEXT:    [[TMP18:%.*]] = add i16 [[TMP16]], 1
+; CHECK-NEXT:    [[TMP19:%.*]] = zext i16 [[TMP17]] to i64
+; CHECK-NEXT:    [[TMP20]] = zext i16 [[TMP18]] to i64
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP4]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 1, [[VECTOR_SCEVCHECK]] ], [ 1, [[ENTRY:%.*]] ], [ [[TMP20]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i16 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 1, [[ENTRY]] ], [ 1, [[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL3:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ], [ 0, [[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[SCALAR_RECUR:%.*]] = phi i64 [ [[SCALAR_RECUR_INIT]], [[SCALAR_PH]] ], [ [[IDX:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INC_MERGE:%.*]] = phi i16 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[IDX_MERGE:%.*]] = phi i64 [ [[BC_RESUME_VAL3]], [[SCALAR_PH]] ], [ [[SCALAR_RECUR]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[ADD:%.*]] = shl i64 [[IDX_MERGE]], 1
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr [1 x i32], ptr [[P]], i64 0, i64 [[ADD]]
+; CHECK-NEXT:    [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[INC]] = add i16 [[INC_MERGE]], 1
+; CHECK-NEXT:    [[IDX]] = zext i16 [[INC]] to i64
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[P]], i64 [[IDX]]
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt ptr [[GEP]], @h
+; CHECK-NEXT:    br i1 [[CMP]], label [[EXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %idx.ext.merge = phi i64 [ 1, %entry ], [ %idx, %for.body ]
+  %inc.merge = phi i16 [ 1, %entry ], [ %inc, %for.body ]
+  %idx.merge = phi i64 [ 0, %entry ], [ %idx.ext.merge, %for.body ]
+  %add = shl i64 %idx.merge, 1
+  %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
+  %0 = load i32, ptr %arrayidx, align 4
+  %inc = add i16 %inc.merge, 1
+  %idx = zext i16 %inc to i64
+  %gep = getelementptr i32, ptr %p, i64 %idx
+  %cmp = icmp ugt ptr %gep, @h
+  br i1 %cmp, label %exit, label %for.body
+
+exit:
+  ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
+;.
diff --git a/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll b/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
new file mode 100644
index 0000000000000..4cf0399438e08
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/X86/pr72969.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt -mtriple=x86_64 -mattr=-avx,-avx2,-avx512f,-sse2,-sse3,+sse -passes=loop-vectorize -S < %s | FileCheck %s
+
+@h = global i32 0, align 4
+
+define void @test(ptr %p) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[IDX_EXT_MERGE:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ], [ [[IDX:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INC_MERGE:%.*]] = phi i16 [ 1, [[ENTRY]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[IDX_MERGE:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[IDX_EXT_MERGE]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[ADD:%.*]] = shl i64 [[IDX_MERGE]], 1
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr [1 x i32], ptr [[P]], i64 0, i64 [[ADD]]
+; CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[INC]] = add i16 [[INC_MERGE]], 1
+; CHECK-NEXT:    [[IDX]] = zext i16 [[INC]] to i64
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[P]], i64 [[IDX]]
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt ptr [[GEP]], @h
+; CHECK-NEXT:    br i1 [[CMP]], label [[EXIT:%.*]], label [[FOR_BODY]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %idx.ext.merge = phi i64 [ 1, %entry ], [ %idx, %for.body ]
+  %inc.merge = phi i16 [ 1, %entry ], [ %inc, %for.body ]
+  %idx.merge = phi i64 [ 0, %entry ], [ %idx.ext.merge, %for.body ]
+  %add = shl i64 %idx.merge, 1
+  %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
+  %0 = load i32, ptr %arrayidx, align 4
+  %inc = add i16 %inc.merge, 1
+  %idx = zext i16 %inc to i64
+  %gep = getelementptr i32, ptr %p, i64 %idx
+  %cmp = icmp ugt ptr %gep, @h
+  br i1 %cmp, label %exit, label %for.body
+
+exit:
+  ret void
+}

@david-arm (Contributor) left a comment

LGTM! Thanks for the fix.

@fhahn (Contributor) commented Dec 12, 2023

> However, if it also satisfies the condition isScalarAfterVectorization, this cost is incorrect, as it won't be converted into a shuffle. Fix the cost.

Hmm, is this correct? It looks like in the VPlan, there's

FIRST-ORDER-RECURRENCE-PHI ir<%idx.ext.merge> = phi ir<1>, ir<%idx>
EMIT vp<%8> = first-order splice ir<%idx.ext.merge>, ir<%idx>

so a vector phi and shuffle should be generated. It may be good to also add a use of the recurrence to the test, the load and GEP will get DCE'd.
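
(Aside: a VPlan dump like the one above can be printed from an assertions-enabled build; something like the following invocation should reproduce it, though the exact flags here are an assumption, not taken from the thread:)

    opt -mtriple=x86_64 -passes=loop-vectorize -debug-only=loop-vectorize \
        -disable-output pr72969.ll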

@artagnon (Contributor, Author) commented

Fixed commit message.

> It may be good to also add a use of the recurrence to the test, the load and GEP will get DCE'd.

I don't think this would apply, since nothing gets vectorized in the minimal X86 example, and I don't think we need an example showing vectorization to fix the bug.

@artagnon (Contributor, Author) commented Jan 3, 2024

@fhahn Ping. Are you happy with the changes?

@fhahn (Contributor) commented Jan 3, 2024

> @fhahn Ping. Are you happy with the changes?

What are your thoughts about the below? For that, we would need a case that gets vectorized with the change.

> Hmm, is this correct? It looks like in the VPlan, there's
>
>  FIRST-ORDER-RECURRENCE-PHI ir<%idx.ext.merge> = phi ir<1>, ir<%idx>
>  EMIT vp<%8> = first-order splice ir<%idx.ext.merge>, ir<%idx>
>
> so a vector phi and shuffle should be generated.
> It may be good to also add a use of the recurrence to the test, the load and GEP will get DCE'd.
>
> I don't think this would apply, since nothing gets vectorized in the minimal X86 example, and I don't think we need an example showing vectorization to fix the bug.

Ideally we would have a test for both cases that are impacted by the patch: vectorizing with the new cost and also not vectorizing (and not crashing).

@artagnon (Contributor, Author) commented Jan 3, 2024

> Ideally we would have a test for both cases that are impacted by the patch: vectorizing with the new cost and also not vectorizing (and not crashing).

Okay, I struggled a lot with your requests, because the bug is very subtle:

  1. It's impossible to get a test under X86 to first crash, and then be vectorized after the fix, since vectorization depends on sse4.2, and the bug isn't triggered when sse4.2 is enabled.
  2. I tried everything, but I think it's impossible to remove the SCEV checks.
  3. I tried to use the result of the recurrence in a lot of ways, but none of them get the test to crash first, and then be vectorized after the fix.

I think we have no choice but to revert to my original version of the X86 and AArch64 tests.

@artagnon (Contributor, Author) commented Jan 8, 2024

@fhahn This is the best I could do (see previous comment). Do you think it's okay to merge?

@fhahn (Contributor) commented Jan 8, 2024

> It's impossible to get a test under X86 to first crash, and then be vectorized after the fix, since vectorization depends on sse4.2, and the bug isn't triggered when sse4.2 is enabled.

I think if you pass -force-vector-width=4 to the X86 test case you can force it to be vectorized. Replacing the load in the test case with a store (like store i64 0, ptr %arrayidx, align 8) would replace a recipe that will be removed with one that cannot be removed. Does that make sense?
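
Concretely, the suggestion would reshape the X86 test along these lines (a sketch, not the test that was actually committed; the RUN line adds -force-vector-width=4 from the comment above, and the store is the literal one suggested):

    ; RUN: opt -mtriple=x86_64 -mattr=-avx,-avx2,-avx512f,-sse2,-sse3,+sse \
    ; RUN:   -passes=loop-vectorize -force-vector-width=4 -S < %s | FileCheck %s
      ...
      %arrayidx = getelementptr [1 x i32], ptr %p, i64 0, i64 %add
      ; a store, unlike the dead load, cannot be DCE'd:
      store i64 0, ptr %arrayidx, align 8
      ...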

The updated commit message:

In getInstructionCost, when a phi node is identified as a fixed-order
recurrence, and the vectorization factor indicates that we're
vectorizing, we return the cost of a shuffle. However, if it also
satisfies the condition isScalarAfterVectorization, attempting to get
the shuffle cost will result in a crash from a cast. Fix this.

Fixes llvm#72969.
@artagnon (Contributor, Author) commented Jan 8, 2024

Thanks @fhahn: the patch is now ready. Your comment about replacing the load with the store makes sense, but can you clarify why you thought -force-vector-width=4 would work? Also, if you know, what is the difference between sse and sse4.2 in the context of this test?

@fhahn (Contributor) left a comment

> Thanks @fhahn: the patch is now ready. Your comment about replacing the load with the store makes sense, but can you clarify why you thought -force-vector-width=4 would work? Also, if you know, what is the difference between sse and sse4.2 in the context of this test?

The reason for not vectorizing was probably the cost-model deciding it is not profitable; -force-vector-width=4 side-steps that by forcing to vectorize with VF=4 if legally possible.

Not sure about sse4.2.

On the changed hunk in LoopVectorize.cpp:

    // However, if the Phi is a scalar after vectorization, don't get shuffle
    // cost.
    if (VF.isVector() && Legal->isFixedOrderRecurrence(Phi) &&
        !isScalarAfterVectorization(Phi, VF)) {

IIUC we now don't crash because the phi in the test is considered scalar after vectorization; but this doesn't match the code we generate in the test case, where we have a vector phi for the recurrence and the corresponding shuffle AFAICT.

We only support codegen for vector recurrences, so it would probably be better to avoid marking the phi as scalar after vectorization.
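
In other words, the alternative is to fix the scalars analysis rather than the cost model. A rough, hypothetical sketch of that direction (the helper name and its placement are illustrative, not actual LoopVectorize.cpp code; the real decision is made during the cost model's scalars collection, e.g. collectLoopScalars):

    // Hypothetical sketch: codegen only supports fixed-order recurrences as
    // vector recurrences (vector phi + first-order splice), so never classify
    // such phis as scalar after vectorization.
    bool mayBeScalarAfterVectorization(Instruction *I, ElementCount VF) {
      if (auto *Phi = dyn_cast<PHINode>(I))
        if (Legal->isFixedOrderRecurrence(Phi))
          return false; // must remain a vector recurrence (phi + splice)
      // ... the existing scalars analysis decides the rest ...
      return true;
    }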

@artagnon (Contributor, Author) commented Apr 6, 2024

Will author a fresh patch soon.

@artagnon artagnon closed this Apr 6, 2024
@artagnon artagnon deleted the lv-crash branch April 6, 2024 13:05

Successfully merging this pull request may close these issues.

Clang crash: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.