[SystemZ] Provide improved cost estimates #83873
Conversation
This commit provides better cost estimates for the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all vector lengths and integer types up to i128. For integer types larger than i128, we fall back to the default cost estimate. This has the effect of lowering the estimated costs of most common instances of the intrinsic. The expected performance impact of this is minimal with a tendency to slightly improve performance of some benchmarks. This commit also provides a test to check the proper computation of the new estimates, as well as the fallback for types larger than i128.
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR in a comment. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.
@llvm/pr-subscribers-backend-systemz

Author: Dominik Steenken (dominik-steenken)

Changes: This commit provides better cost estimates for the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all vector lengths and integer types up to i128; for integer types larger than i128, we fall back to the default cost estimate. This has the effect of lowering the estimated costs of most common instances of the intrinsic. The expected performance impact is minimal, with a tendency to slightly improve performance of some benchmarks. This commit also provides a test to check the proper computation of the new estimates, as well as the fallback for types larger than i128.

Full diff: https://github.com/llvm/llvm-project/pull/83873.diff (2 files affected)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index 9370fb51a96c56..2d0c3e55198225 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -20,6 +20,8 @@
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Support/Debug.h"
+#include "llvm/Support/MathExtras.h"
+
using namespace llvm;
#define DEBUG_TYPE "systemztti"
@@ -1284,9 +1286,35 @@ InstructionCost SystemZTTIImpl::getInterleavedMemoryOpCost(
return NumVectorMemOps + NumPermutes;
}
-static int getVectorIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy) {
+static int
+getVectorIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
+ const SmallVectorImpl<Type *> &ParamTys) {
if (RetTy->isVectorTy() && ID == Intrinsic::bswap)
return getNumVectorRegs(RetTy); // VPERM
+
+ if (ID == Intrinsic::vector_reduce_add) {
+ // Retrieve number and size of elements for the vector op
+ auto *VTy = cast<FixedVectorType>(ParamTys.front());
+ unsigned NumElements = VTy->getNumElements();
+ unsigned ScalarSize = VTy->getScalarSizeInBits();
+ // For scalar sizes >128 bits, we fall back to the generic cost estimate
+ if (ScalarSize > SystemZ::VectorBits)
+ return -1;
+ // How many elements can a single vector register hold
+ unsigned MaxElemsPerVector = SystemZ::VectorBits / ScalarSize;
+ // How many vector regs are needed to represent the input elements (V)
+ unsigned VectorRegsNeeded = getNumVectorRegs(VTy);
+ // How many instructions are needed for the final sum of vector elems (S)
+ unsigned LastVectorHandling =
+ 2 * Log2_32_Ceil(std::min(NumElements, MaxElemsPerVector));
+ // We use vector adds to create a sum vector, which takes
+ // V/2 + V/4 + ... = V - 1 operations.
+ // Then, we need S operations to sum up the elements of that sum vector,
+ // for a total of V + S - 1 operations.
+ int Cost = VectorRegsNeeded + LastVectorHandling - 1;
+ assert(Cost > 0 && "Predicted cost of vector.reduce.add must be > 0");
+ return Cost;
+ }
return -1;
}
@@ -1294,7 +1322,7 @@ InstructionCost
SystemZTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind) {
InstructionCost Cost =
- getVectorIntrinsicInstrCost(ICA.getID(), ICA.getReturnType());
+ getVectorIntrinsicInstrCost(ICA.getID(), ICA.getReturnType(), ICA.getArgTypes());
if (Cost != -1)
return Cost;
return BaseT::getIntrinsicInstrCost(ICA, CostKind);
diff --git a/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll b/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
new file mode 100644
index 00000000000000..061e5ece44a4e7
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
@@ -0,0 +1,128 @@
+; RUN: opt < %s -mtriple=systemz-unknown -mcpu=z13 -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s
+
+define void @reduce(ptr %src, ptr %dst) {
+; CHECK-LABEL: 'reduce'
+; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %R2_64 = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %V2_64)
+; CHECK: Cost Model: Found an estimated cost of 3 for instruction: %R4_64 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %V4_64)
+; CHECK: Cost Model: Found an estimated cost of 5 for instruction: %R8_64 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %V8_64)
+; CHECK: Cost Model: Found an estimated cost of 9 for instruction: %R16_64 = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> %V16_64)
+; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %R2_32 = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %V2_32)
+; CHECK: Cost Model: Found an estimated cost of 4 for instruction: %R4_32 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %V4_32)
+; CHECK: Cost Model: Found an estimated cost of 5 for instruction: %R8_32 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %V8_32)
+; CHECK: Cost Model: Found an estimated cost of 7 for instruction: %R16_32 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %V16_32)
+; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %R2_16 = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> %V2_16)
+; CHECK: Cost Model: Found an estimated cost of 4 for instruction: %R4_16 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %V4_16)
+; CHECK: Cost Model: Found an estimated cost of 6 for instruction: %R8_16 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %V8_16)
+; CHECK: Cost Model: Found an estimated cost of 7 for instruction: %R16_16 = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %V16_16)
+; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %R2_8 = call i8 @llvm.vector.reduce.add.v2i8(<2 x i8> %V2_8)
+; CHECK: Cost Model: Found an estimated cost of 4 for instruction: %R4_8 = call i8 @llvm.vector.reduce.add.v4i8(<4 x i8> %V4_8)
+; CHECK: Cost Model: Found an estimated cost of 6 for instruction: %R8_8 = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %V8_8)
+; CHECK: Cost Model: Found an estimated cost of 8 for instruction: %R16_8 = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %V16_8)
+;
+; CHECK: Cost Model: Found an estimated cost of 15 for instruction: %R128_8 = call i8 @llvm.vector.reduce.add.v128i8(<128 x i8> %V128_8)
+; CHECK: Cost Model: Found an estimated cost of 20 for instruction: %R4_256 = call i256 @llvm.vector.reduce.add.v4i256(<4 x i256> %V4_256)
+
+ ; REDUCEADD64
+
+ %V2_64 = load <2 x i64>, ptr %src, align 8
+ %R2_64 = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %V2_64)
+ store volatile i64 %R2_64, ptr %dst, align 4
+
+ %V4_64 = load <4 x i64>, ptr %src, align 8
+ %R4_64 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %V4_64)
+ store volatile i64 %R4_64, ptr %dst, align 4
+
+ %V8_64 = load <8 x i64>, ptr %src, align 8
+ %R8_64 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %V8_64)
+ store volatile i64 %R8_64, ptr %dst, align 4
+
+ %V16_64 = load <16 x i64>, ptr %src, align 8
+ %R16_64 = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> %V16_64)
+ store volatile i64 %R16_64, ptr %dst, align 4
+
+ ; REDUCEADD32
+
+ %V2_32 = load <2 x i32>, ptr %src, align 8
+ %R2_32 = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %V2_32)
+ store volatile i32 %R2_32, ptr %dst, align 4
+
+ %V4_32 = load <4 x i32>, ptr %src, align 8
+ %R4_32 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %V4_32)
+ store volatile i32 %R4_32, ptr %dst, align 4
+
+ %V8_32 = load <8 x i32>, ptr %src, align 8
+ %R8_32 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %V8_32)
+ store volatile i32 %R8_32, ptr %dst, align 4
+
+ %V16_32 = load <16 x i32>, ptr %src, align 8
+ %R16_32 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %V16_32)
+ store volatile i32 %R16_32, ptr %dst, align 4
+
+ ; REDUCEADD16
+
+ %V2_16 = load <2 x i16>, ptr %src, align 8
+ %R2_16 = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> %V2_16)
+ store volatile i16 %R2_16, ptr %dst, align 4
+
+ %V4_16 = load <4 x i16>, ptr %src, align 8
+ %R4_16 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %V4_16)
+ store volatile i16 %R4_16, ptr %dst, align 4
+
+ %V8_16 = load <8 x i16>, ptr %src, align 8
+ %R8_16 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %V8_16)
+ store volatile i16 %R8_16, ptr %dst, align 4
+
+ %V16_16 = load <16 x i16>, ptr %src, align 8
+ %R16_16 = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %V16_16)
+ store volatile i16 %R16_16, ptr %dst, align 4
+
+ ; REDUCEADD8
+
+ %V2_8 = load <2 x i8>, ptr %src, align 8
+ %R2_8 = call i8 @llvm.vector.reduce.add.v2i8(<2 x i8> %V2_8)
+ store volatile i8 %R2_8, ptr %dst, align 4
+
+ %V4_8 = load <4 x i8>, ptr %src, align 8
+ %R4_8 = call i8 @llvm.vector.reduce.add.v4i8(<4 x i8> %V4_8)
+ store volatile i8 %R4_8, ptr %dst, align 4
+
+ %V8_8 = load <8 x i8>, ptr %src, align 8
+ %R8_8 = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %V8_8)
+ store volatile i8 %R8_8, ptr %dst, align 4
+
+ %V16_8 = load <16 x i8>, ptr %src, align 8
+ %R16_8 = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %V16_8)
+ store volatile i8 %R16_8, ptr %dst, align 4
+
+ ; EXTREME VALUES
+
+ %V128_8 = load <128 x i8>, ptr %src, align 8
+ %R128_8 = call i8 @llvm.vector.reduce.add.v128i8(<128 x i8> %V128_8)
+ store volatile i8 %R128_8, ptr %dst, align 4
+
+ %V4_256 = load <4 x i256>, ptr %src, align 8
+ %R4_256 = call i256 @llvm.vector.reduce.add.v4i256(<4 x i256> %V4_256)
+ store volatile i256 %R4_256, ptr %dst, align 8
+
+ ret void
+}
+
+declare i64 @llvm.vector.reduce.add.v2i64(<2 x i64>)
+declare i64 @llvm.vector.reduce.add.v4i64(<4 x i64>)
+declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)
+declare i64 @llvm.vector.reduce.add.v16i64(<16 x i64>)
+declare i32 @llvm.vector.reduce.add.v2i32(<2 x i32>)
+declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)
+declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)
+declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)
+declare i16 @llvm.vector.reduce.add.v2i16(<2 x i16>)
+declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)
+declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>)
+declare i16 @llvm.vector.reduce.add.v16i16(<16 x i16>)
+declare i8 @llvm.vector.reduce.add.v2i8(<2 x i8>)
+declare i8 @llvm.vector.reduce.add.v4i8(<4 x i8>)
+declare i8 @llvm.vector.reduce.add.v8i8(<8 x i8>)
+declare i8 @llvm.vector.reduce.add.v16i8(<16 x i8>)
+
+declare i8 @llvm.vector.reduce.add.v128i8(<128 x i8>)
+declare i256 @llvm.vector.reduce.add.v4i256(<4 x i256>)
@llvm/pr-subscribers-llvm-analysis

Author: Dominik Steenken (dominik-steenken)

(Same change summary and full diff as in the @llvm/pr-subscribers-backend-systemz comment above.)
✅ With the latest revision this PR passed the C/C++ code formatter.
  if (RetTy->isVectorTy() && ID == Intrinsic::bswap)
    return getNumVectorRegs(RetTy); // VPERM

  if (ID == Intrinsic::vector_reduce_add) {
    // Retrieve number and size of elements for the vector op
Comments should be prose with proper punctuation. (see https://llvm.org/docs/CodingStandards.html#commenting)
Hi, I've re-worded the comments and added closing punctuation. Please let me know if this addresses your concerns.
Thanks!
    // Then, we need S operations to sum up the elements of that sum vector,
    // for a total of V + S - 1 operations.
    int Cost = VectorRegsNeeded + LastVectorHandling - 1;
    assert(Cost > 0 && "Predicted cost of vector.reduce.add must be > 0");
Can we be certain this assert never triggers? I guess VectorRegsNeeded is at least 1, but why is LastVectorHandling also at least 1? This seems to imply both NumElements and MaxElemsPerVector are at least 2. This would not be true for v1i128 - are we sure this routine gets never used for that type?
A failing assertion is an internal compiler error in production, so we should try to avoid it wherever possible. In code like this, where we have some alternative (like returning -1 to fall back to generic code), we should probably use that rather than an assert if we run into any unexpected input.
I would have expected a vector.reduce.add.v1i128 to essentially be a noop and to be caught before we get to this point. A quick test seems to support this assumption, where a debug build produced the following code for a vector.reduce.add.v1i128:

    vl %v0, 0(%r2), 3
    vst %v0, 0(%r3)

without running into the assert mentioned above. However, I'm not sure how representative that is, and I don't think I can say with any certainty that it couldn't happen. Essentially, since this formula calculates the number of operations needed, it faithfully and correctly (see above) returns 0 for that case, which makes it a bad idea to have an assert that applies there. Point taken.

The original motivation behind the assert was the concern that some future change might affect the execution of this function in such a way that the computed cost would be -1, and the compiler would then silently fall back to the generic cost computation. That, combined with the coding guideline saying that I should use asserts to check assumptions (even then, the assumption to be checked ought to have been Cost >= 0).

But the tests that were added would also catch an unexpected fallback like that, so I think the best thing to do would be to just remove the assert entirely?
Agreed, I think this can just be removed.
done
LGTM, thanks!
@dominik-steenken Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report by email or in a comment on this PR. Please check whether problems have been caused by your change specifically, as builds can include changes from many authors. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. If you don't get any reports, no action is required from you. Your changes are working as expected, well done!