[VPlan] Allow truncation for lanes in VPScalarIVStepsRecipe#175268
googlewalt merged 1 commit into llvm:main
Conversation
VPScalarIVStepsRecipe relies on APInt truncation in order to vectorize blocks with a width greater than the maximum value the types of some of their (changing) operands are able to hold (e.g., an i1 input with a vector width of 4). Simply reenable implicit truncation in ConstantInt::get() to cover this case. Remove the helper function, since it is only called in one place, so that it cannot accidentally be used elsewhere, where implicit truncation is probably not wanted. This fixes another case, with the same stack trace, that we saw after acb78bd failed to fix that issue. We still want to keep lane constants as unsigned. Somewhat similar to 6d1e7d4.
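The failure mode the patch addresses can be sketched with a rough arithmetic model (this is illustrative Python, not LLVM code; `lane_constant` is a hypothetical name). With implicit truncation, building a lane constant for an N-bit induction-variable type wraps the lane number modulo 2^N, so for an i1 IV and a vector width of 4 the lane constants 0, 1, 2, 3 become 0, 1, 0, 1 instead of triggering an APInt assertion:

```python
def lane_constant(lane: int, bits: int) -> int:
    """Rough model of ConstantInt::get with ImplicitTrunc=true:
    the lane number is truncated modulo 2**bits, the bit width of
    the base induction-variable type."""
    return lane % (1 << bits)

# i1 induction variable, vector width 4: per-lane start indices wrap.
print([lane_constant(lane, 1) for lane in range(4)])  # [0, 1, 0, 1]
```

This wrap-around is exactly what the alternating `select i1 false` / `select i1 true` lines in the vectorized test output reflect.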
@llvm/pr-subscribers-llvm-transforms

Author: Aiden Grossman (boomanaiden154)

Changes: VPScalarIVStepsRecipe relies on APInt truncation in order to vectorize blocks with a width greater than the maximum value the types of some of their (changing) operands are able to hold (e.g., an i1 input with a vector width of 4). Simply reenable implicit truncation in ConstantInt::get() to cover this case. Remove the helper function, since it is only called in one place, so that it cannot accidentally be used elsewhere, where implicit truncation is probably not wanted. This fixes another case that we saw after acb78bd. Somewhat similar to 6d1e7d4.

Full diff: https://github.com/llvm/llvm-project/pull/175268.diff (2 Files Affected)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index bfa704589a6dd..2c0772320c3cf 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -2344,12 +2344,6 @@ InstructionCost VPHeaderPHIRecipe::computeCost(ElementCount VF,
return Ctx.TTI.getCFInstrCost(Instruction::PHI, Ctx.CostKind);
}
-/// A helper function that returns an integer or floating-point constant with
-/// value C.
-static Constant *getUnsignedIntOrFpConstant(Type *Ty, uint64_t C) {
- return Ty->isIntegerTy() ? ConstantInt::get(Ty, C) : ConstantFP::get(Ty, C);
-}
-
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
void VPWidenIntOrFpInductionRecipe::printRecipe(
raw_ostream &O, const Twine &Indent, VPSlotTracker &SlotTracker) const {
@@ -2451,8 +2445,14 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
StartIdx0 = Builder.CreateSIToFP(StartIdx0, BaseIVTy);
for (unsigned Lane = StartLane; Lane < EndLane; ++Lane) {
- Value *StartIdx = Builder.CreateBinOp(
- AddOp, StartIdx0, getUnsignedIntOrFpConstant(BaseIVTy, Lane));
+ // It is okay if the induction variable type cannot hold the lane number,
+ // we expect truncation in this case.
+ Constant *LaneValue =
+ BaseIVTy->isIntegerTy()
+ ? ConstantInt::get(BaseIVTy, Lane, /*IsSigned=*/false,
+ /*ImplicitTrunc=*/true)
+ : ConstantFP::get(BaseIVTy, Lane);
+ Value *StartIdx = Builder.CreateBinOp(AddOp, StartIdx0, LaneValue);
// The step returned by `createStepForVF` is a runtime-evaluated value
// when VF is scalable. Otherwise, it should be folded into a Constant.
assert((State.VF.isScalable() || isa<Constant>(StartIdx)) &&
diff --git a/llvm/test/Transforms/LoopVectorize/X86/vplan-single-bit-ind-var-width-4.ll b/llvm/test/Transforms/LoopVectorize/X86/vplan-single-bit-ind-var-width-4.ll
new file mode 100644
index 0000000000000..cdfe9c30d10af
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/X86/vplan-single-bit-ind-var-width-4.ll
@@ -0,0 +1,68 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -passes=loop-vectorize -force-vector-width=4 -S %s 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-grtev4-linux-gnu"
+
+define void @copy_bitcast_fusion(ptr noalias %foo, ptr noalias %bar) {
+; CHECK-LABEL: define void @copy_bitcast_fusion(
+; CHECK-SAME: ptr noalias [[FOO:%.*]], ptr noalias [[BAR:%.*]]) {
+; CHECK-NEXT: br label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[TMP1:%.*]] = select i1 false, i64 1, i64 0
+; CHECK-NEXT: [[TMP2:%.*]] = select i1 true, i64 1, i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = select i1 false, i64 1, i64 0
+; CHECK-NEXT: [[TMP4:%.*]] = select i1 true, i64 1, i64 0
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr { float, float }, ptr [[FOO]], i64 [[TMP1]]
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr { float, float }, ptr [[FOO]], i64 [[TMP2]]
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr { float, float }, ptr [[FOO]], i64 [[TMP3]]
+; CHECK-NEXT: [[TMP8:%.*]] = getelementptr { float, float }, ptr [[FOO]], i64 [[TMP4]]
+; CHECK-NEXT: [[TMP9:%.*]] = load float, ptr [[TMP5]], align 4
+; CHECK-NEXT: [[TMP10:%.*]] = load float, ptr [[TMP6]], align 4
+; CHECK-NEXT: [[TMP11:%.*]] = load float, ptr [[TMP7]], align 4
+; CHECK-NEXT: [[TMP12:%.*]] = load float, ptr [[TMP8]], align 4
+; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x float> poison, float [[TMP9]], i32 0
+; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x float> [[TMP13]], float [[TMP10]], i32 1
+; CHECK-NEXT: [[TMP15:%.*]] = insertelement <4 x float> [[TMP14]], float [[TMP11]], i32 2
+; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x float> [[TMP15]], float [[TMP12]], i32 3
+; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <4 x float> [[TMP16]], <4 x float> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <8 x float> [[TMP17]], <8 x float> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <16 x float> [[TMP18]], <16 x float> <float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef>, <24 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
+; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <24 x float> [[TMP19]], <24 x float> poison, <24 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 3, i32 7, i32 11, i32 15, i32 19, i32 23>
+; CHECK-NEXT: store <24 x float> [[INTERLEAVED_VEC]], ptr [[BAR]], align 4
+; CHECK-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+ br label %body
+
+body:
+ %iv = phi i64 [ 0, %0 ], [ %ptr3, %body ]
+ %iv.trunc = trunc i64 %iv to i1
+ %iv.trunc2 = select i1 %iv.trunc, i64 1, i64 0
+ %unpack.ptr = getelementptr { float, float }, ptr %foo, i64 %iv.trunc2
+ %unpack = load float, ptr %unpack.ptr, align 4
+ %idx3 = mul i64 %iv, 24
+ %bar.ptr = getelementptr i8, ptr %bar, i64 %idx3
+ store float %unpack, ptr %bar.ptr, align 4
+ %repack4 = getelementptr i8, ptr %bar.ptr, i64 4
+ store float 0.000000e+00, ptr %repack4, align 4
+ %ptr1 = getelementptr i8, ptr %bar.ptr, i64 8
+ store float 0.000000e+00, ptr %ptr1, align 4
+ %repack4.1 = getelementptr i8, ptr %bar.ptr, i64 12
+ store float 0.000000e+00, ptr %repack4.1, align 4
+ %ptr2 = getelementptr i8, ptr %bar.ptr, i64 16
+ store float 0.000000e+00, ptr %ptr2, align 4
+ %repack4.2 = getelementptr i8, ptr %bar.ptr, i64 20
+ store float 0.000000e+00, ptr %repack4.2, align 4
+ %ptr3 = add i64 %iv, 1
+ %exitcond.not = icmp eq i64 %ptr3, 4
+ br i1 %exitcond.not, label %exit, label %body
+
+exit:
+ ret void
+}
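The per-lane `getelementptr` indices in the CHECK lines above follow directly from truncating the lane number to i1 and feeding it through the `select` in the source loop. A small model (illustrative Python; `unpack_offset` and `STRUCT_SIZE` are names invented here, assuming the usual 4-byte `float` layout) computes the byte offsets into `%foo` that lanes 0 through 3 end up reading:

```python
STRUCT_SIZE = 8  # sizeof({ float, float }) with two 4-byte floats

def unpack_offset(lane: int) -> int:
    # Models: trunc i64 %iv to i1, then select i1 %bit, i64 1, i64 0.
    # Only the low bit of the lane number survives the truncation.
    bit = lane & 1
    index = 1 if bit else 0
    return index * STRUCT_SIZE

# Lanes 0..3 alternate between the first and second struct element.
print([unpack_offset(lane) for lane in range(4)])  # [0, 8, 0, 8]
```

This matches the vectorized output, where TMP1 through TMP4 alternate between struct index 0 and struct index 1.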
@llvm/pr-subscribers-vectorizers
You can test this locally with the following command:

    git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef([^a-zA-Z0-9_-]|$)|UndefValue::get)' 'HEAD~1' HEAD llvm/test/Transforms/LoopVectorize/X86/vplan-single-bit-ind-var-width-4.ll llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

The following files introduce new uses of undef:

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. In tests, avoid using undef. For example, this is considered a bad practice:

    define void @fn() {
      ...
      br i1 undef, ...
    }

Please use the following instead:

    define void @fn(i1 %cond) {
      ...
      br i1 %cond, ...
    }

Please refer to the Undefined Behavior Manual for more information.
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
You probably can drop the target triple here
This didn't reproduce for me without the triple.
VPScalarIVStepsRecipe relies on APInt truncation in order to vectorize blocks with a width greater than the maximum value the types of some of their (changing) operands are able to hold (e.g., an i1 input with a vector width of 4). Simply reenable implicit truncation in ConstantInt::get() to cover this case.
Remove the helper function, since it is only called in one place, so that it cannot accidentally be used elsewhere, where implicit truncation is probably not wanted.
This fixes another case, with the same stack trace, that we saw after acb78bd failed to fix that issue. We still want to keep lane constants as unsigned.
Somewhat similar to 6d1e7d4.
This test case comes from a tensorflow/XLA compilation from a test case in https://github.com/google-research/spherical-cnn.