[SystemZ] Provide improved cost estimates #83873

Merged
merged 3 commits into llvm:main on Mar 11, 2024

Conversation

dominik-steenken
Contributor

This commit provides better cost estimates for the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all vector lengths and integer types up to i128. For integer types larger than i128, we fall back to the default cost estimate.

This has the effect of lowering the estimated costs of most common instances of the intrinsic. The expected performance impact is minimal, with a tendency to slightly improve performance on some benchmarks.

This commit also provides a test to check the proper computation of the new estimates, as well as the fallback for types larger than i128.
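
For illustration, here is a minimal standalone sketch of the new cost formula, reconstructed from the patch below. The helper names and the explicit register-count computation are illustrative, with 128 standing in for SystemZ::VectorBits:

    #include <algorithm>
    #include <cstdio>

    // 128-bit vector registers, standing in for SystemZ::VectorBits.
    static const unsigned VectorBits = 128;

    // Smallest L with 2^L >= N, mirroring llvm::Log2_32_Ceil.
    static unsigned log2Ceil(unsigned N) {
      unsigned L = 0;
      while ((1u << L) < N)
        ++L;
      return L;
    }

    // Predicted cost of llvm.vector.reduce.add, or -1 to signal
    // fallback to the generic cost model.
    static int reduceAddCost(unsigned NumElements, unsigned ScalarSize) {
      if (ScalarSize > VectorBits)
        return -1; // e.g. i256 elements: use the default estimate.
      unsigned MaxElemsPerVector = VectorBits / ScalarSize;
      // V: vector registers needed to hold all input elements.
      unsigned VectorRegsNeeded =
          (NumElements * ScalarSize + VectorBits - 1) / VectorBits;
      // S: instructions to sum up the lanes of the final vector register.
      unsigned LastVectorHandling =
          2 * log2Ceil(std::min(NumElements, MaxElemsPerVector));
      // V - 1 vector adds fold everything into one register, then S more
      // instructions reduce that register, for V + S - 1 in total.
      return VectorRegsNeeded + LastVectorHandling - 1;
    }

    int main() {
      std::printf("v16i8:  %d\n", reduceAddCost(16, 8));  // 1 + 8 - 1 = 8
      std::printf("v16i64: %d\n", reduceAddCost(16, 64)); // 8 + 2 - 1 = 9
      std::printf("v4i256: %d\n", reduceAddCost(4, 256)); // -1 (fallback)
    }

The printed values match the CHECK lines in the new test; for v4i256 the generic model takes over, which the test pins at a cost of 20.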


github-actions bot commented Mar 4, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Collaborator

llvmbot commented Mar 4, 2024

@llvm/pr-subscribers-backend-systemz

Author: Dominik Steenken (dominik-steenken)

Changes

This commit provides better cost estimates for the llvm.vector.reduce.add intrinsic on SystemZ. These apply to all vector lengths and integer types up to i128. For integer types larger than i128, we fall back to the default cost estimate.

This has the effect of lowering the estimated costs of most common instances of the intrinsic. The expected performance impact is minimal, with a tendency to slightly improve performance on some benchmarks.

This commit also provides a test to check the proper computation of the new estimates, as well as the fallback for types larger than i128.


Full diff: https://github.com/llvm/llvm-project/pull/83873.diff

2 Files Affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp (+30-2)
  • (added) llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll (+128)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index 9370fb51a96c56..2d0c3e55198225 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -20,6 +20,8 @@
 #include "llvm/CodeGen/TargetLowering.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/Support/Debug.h"
+#include "llvm/Support/MathExtras.h"
+
 using namespace llvm;
 
 #define DEBUG_TYPE "systemztti"
@@ -1284,9 +1286,35 @@ InstructionCost SystemZTTIImpl::getInterleavedMemoryOpCost(
   return NumVectorMemOps + NumPermutes;
 }
 
-static int getVectorIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy) {
+static int
+getVectorIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
+                            const SmallVectorImpl<Type *> &ParamTys) {
   if (RetTy->isVectorTy() && ID == Intrinsic::bswap)
     return getNumVectorRegs(RetTy); // VPERM
+
+  if (ID == Intrinsic::vector_reduce_add) {
+    // Retrieve number and size of elements for the vector op
+    auto *VTy = cast<FixedVectorType>(ParamTys.front());
+    unsigned NumElements = VTy->getNumElements();
+    unsigned ScalarSize = VTy->getScalarSizeInBits();
+    // For scalar sizes >128 bits, we fall back to the generic cost estimate
+    if (ScalarSize > SystemZ::VectorBits)
+      return -1;
+    // How many elements can a single vector register hold
+    unsigned MaxElemsPerVector = SystemZ::VectorBits / ScalarSize;
+    // How many vector regs are needed to represent the input elements (V)
+    unsigned VectorRegsNeeded = getNumVectorRegs(VTy);
+    // How many instructions are needed for the final sum of vector elems (S)
+    unsigned LastVectorHandling =
+        2 * Log2_32_Ceil(std::min(NumElements, MaxElemsPerVector));
+    // We use vector adds to create a sum vector, which takes
+    // V/2 + V/4 + ... = V - 1 operations.
+    // Then, we need S operations to sum up the elements of that sum vector,
+    // for a total of V + S - 1 operations.
+    int Cost = VectorRegsNeeded + LastVectorHandling - 1;
+    assert(Cost > 0 && "Predicted cost of vector.reduce.add must be > 0");
+    return Cost;
+  }
   return -1;
 }
 
@@ -1294,7 +1322,7 @@ InstructionCost
 SystemZTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                       TTI::TargetCostKind CostKind) {
   InstructionCost Cost =
-      getVectorIntrinsicInstrCost(ICA.getID(), ICA.getReturnType());
+      getVectorIntrinsicInstrCost(ICA.getID(), ICA.getReturnType(), ICA.getArgTypes());
   if (Cost != -1)
     return Cost;
   return BaseT::getIntrinsicInstrCost(ICA, CostKind);
diff --git a/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll b/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
new file mode 100644
index 00000000000000..061e5ece44a4e7
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
@@ -0,0 +1,128 @@
+; RUN: opt < %s -mtriple=systemz-unknown -mcpu=z13 -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s
+
+define void @reduce(ptr %src, ptr %dst) {
+; CHECK-LABEL: 'reduce'
+; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_64 = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %V2_64)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R4_64 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %V4_64)
+; CHECK:  Cost Model: Found an estimated cost of 5 for instruction: %R8_64 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %V8_64)
+; CHECK:  Cost Model: Found an estimated cost of 9 for instruction: %R16_64 = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> %V16_64)
+; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_32 = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %V2_32)
+; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R4_32 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %V4_32)
+; CHECK:  Cost Model: Found an estimated cost of 5 for instruction: %R8_32 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %V8_32)
+; CHECK:  Cost Model: Found an estimated cost of 7 for instruction: %R16_32 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %V16_32)
+; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_16 = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> %V2_16)
+; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R4_16 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %V4_16)
+; CHECK:  Cost Model: Found an estimated cost of 6 for instruction: %R8_16 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %V8_16)
+; CHECK:  Cost Model: Found an estimated cost of 7 for instruction: %R16_16 = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %V16_16)
+; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_8 = call i8 @llvm.vector.reduce.add.v2i8(<2 x i8> %V2_8)
+; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R4_8 = call i8 @llvm.vector.reduce.add.v4i8(<4 x i8> %V4_8)
+; CHECK:  Cost Model: Found an estimated cost of 6 for instruction: %R8_8 = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %V8_8)
+; CHECK:  Cost Model: Found an estimated cost of 8 for instruction: %R16_8 = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %V16_8)
+;
+; CHECK:  Cost Model: Found an estimated cost of 15 for instruction: %R128_8 = call i8 @llvm.vector.reduce.add.v128i8(<128 x i8> %V128_8)
+; CHECK:  Cost Model: Found an estimated cost of 20 for instruction: %R4_256 = call i256 @llvm.vector.reduce.add.v4i256(<4 x i256> %V4_256)
+
+  ; REDUCEADD64
+
+  %V2_64 = load <2 x i64>, ptr %src, align 8
+  %R2_64 = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %V2_64)
+  store volatile i64 %R2_64, ptr %dst, align 4
+
+  %V4_64 = load <4 x i64>, ptr %src, align 8
+  %R4_64 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %V4_64)
+  store volatile i64 %R4_64, ptr %dst, align 4
+
+  %V8_64 = load <8 x i64>, ptr %src, align 8
+  %R8_64 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %V8_64)
+  store volatile i64 %R8_64, ptr %dst, align 4
+
+  %V16_64 = load <16 x i64>, ptr %src, align 8
+  %R16_64 = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> %V16_64)
+  store volatile i64 %R16_64, ptr %dst, align 4
+
+  ; REDUCEADD32
+
+  %V2_32 = load <2 x i32>, ptr %src, align 8
+  %R2_32 = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %V2_32)
+  store volatile i32 %R2_32, ptr %dst, align 4
+
+  %V4_32 = load <4 x i32>, ptr %src, align 8
+  %R4_32 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %V4_32)
+  store volatile i32 %R4_32, ptr %dst, align 4
+
+  %V8_32 = load <8 x i32>, ptr %src, align 8
+  %R8_32 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %V8_32)
+  store volatile i32 %R8_32, ptr %dst, align 4
+
+  %V16_32 = load <16 x i32>, ptr %src, align 8
+  %R16_32 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %V16_32)
+  store volatile i32 %R16_32, ptr %dst, align 4
+
+  ; REDUCEADD16
+
+  %V2_16 = load <2 x i16>, ptr %src, align 8
+  %R2_16 = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> %V2_16)
+  store volatile i16 %R2_16, ptr %dst, align 4
+
+  %V4_16 = load <4 x i16>, ptr %src, align 8
+  %R4_16 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %V4_16)
+  store volatile i16 %R4_16, ptr %dst, align 4
+
+  %V8_16 = load <8 x i16>, ptr %src, align 8
+  %R8_16 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %V8_16)
+  store volatile i16 %R8_16, ptr %dst, align 4
+
+  %V16_16 = load <16 x i16>, ptr %src, align 8
+  %R16_16 = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %V16_16)
+  store volatile i16 %R16_16, ptr %dst, align 4
+
+  ; REDUCEADD8
+
+  %V2_8 = load <2 x i8>, ptr %src, align 8
+  %R2_8 = call i8 @llvm.vector.reduce.add.v2i8(<2 x i8> %V2_8)
+  store volatile i8 %R2_8, ptr %dst, align 4
+
+  %V4_8 = load <4 x i8>, ptr %src, align 8
+  %R4_8 = call i8 @llvm.vector.reduce.add.v4i8(<4 x i8> %V4_8)
+  store volatile i8 %R4_8, ptr %dst, align 4
+
+  %V8_8 = load <8 x i8>, ptr %src, align 8
+  %R8_8 = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %V8_8)
+  store volatile i8 %R8_8, ptr %dst, align 4
+
+  %V16_8 = load <16 x i8>, ptr %src, align 8
+  %R16_8 = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %V16_8)
+  store volatile i8 %R16_8, ptr %dst, align 4
+
+  ; EXTREME VALUES
+
+  %V128_8 = load <128 x i8>, ptr %src, align 8
+  %R128_8 = call i8 @llvm.vector.reduce.add.v128i8(<128 x i8> %V128_8)
+  store volatile i8 %R128_8, ptr %dst, align 4
+
+  %V4_256 = load <4 x i256>, ptr %src, align 8
+  %R4_256 = call i256 @llvm.vector.reduce.add.v4i256(<4 x i256> %V4_256)
+  store volatile i256 %R4_256, ptr %dst, align 8
+
+  ret void
+}
+
+declare i64 @llvm.vector.reduce.add.v2i64(<2 x i64>)
+declare i64 @llvm.vector.reduce.add.v4i64(<4 x i64>)
+declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)
+declare i64 @llvm.vector.reduce.add.v16i64(<16 x i64>)
+declare i32 @llvm.vector.reduce.add.v2i32(<2 x i32>)
+declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)
+declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)
+declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)
+declare i16 @llvm.vector.reduce.add.v2i16(<2 x i16>)
+declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)
+declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>)
+declare i16 @llvm.vector.reduce.add.v16i16(<16 x i16>)
+declare i8 @llvm.vector.reduce.add.v2i8(<2 x i8>)
+declare i8 @llvm.vector.reduce.add.v4i8(<4 x i8>)
+declare i8 @llvm.vector.reduce.add.v8i8(<8 x i8>)
+declare i8 @llvm.vector.reduce.add.v16i8(<16 x i8>)
+
+declare i8 @llvm.vector.reduce.add.v128i8(<128 x i8>)
+declare i256 @llvm.vector.reduce.add.v4i256(<4 x i256>)

@llvmbot
Collaborator

llvmbot commented Mar 4, 2024

@llvm/pr-subscribers-llvm-analysis


github-actions bot commented Mar 4, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

if (RetTy->isVectorTy() && ID == Intrinsic::bswap)
return getNumVectorRegs(RetTy); // VPERM

if (ID == Intrinsic::vector_reduce_add) {
// Retrieve number and size of elements for the vector op
Member


Comments should be prose with proper punctuation. (see https://llvm.org/docs/CodingStandards.html#commenting)

Contributor Author


Hi, I've reworded the comments and added closing punctuation. Please let me know if this addresses your concerns.

Member


Thanks!

uweigand self-requested a review on March 4, 2024
// Then, we need S operations to sum up the elements of that sum vector,
// for a total of V + S - 1 operations.
int Cost = VectorRegsNeeded + LastVectorHandling - 1;
assert(Cost > 0 && "Predicted cost of vector.reduce.add must be > 0");
Member


Can we be certain this assert never triggers? I guess VectorRegsNeeded is at least 1, but why is LastVectorHandling also at least 1? This seems to imply both NumElements and MaxElemsPerVector are at least 2. That would not be true for v1i128 - are we sure this routine never gets used for that type?

A failing assertion is an internal compiler error in production, so we should try to avoid it wherever possible. In code like this, where we have an alternative (like returning -1 to fall back to generic code), we should probably use that rather than an assert if we run into unexpected input.

Contributor Author


I would have expected a vector.reduce.add.v1i128 to essentially be a no-op and to be caught before we get to this point. A quick test seems to support this assumption: a debug build produced the following code for a vector.reduce.add.v1i128:

        vl      %v0, 0(%r2), 3
        vst     %v0, 0(%r3)

without running into the assert mentioned above. However, I'm not sure how representative that is, and I don't think I can say with any certainty that it couldn't happen. Essentially, since this formula calculates the number of operations needed, it faithfully and correctly (see above) returns 0 for that case, which makes it a bad idea to have an assert that applies there. Point taken.
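
Concretely, the arithmetic for v1i128 under the formula from the patch works out to:

    V = getNumVectorRegs(v1i128)        = 1
    MaxElemsPerVector = 128 / 128       = 1
    S = 2 * Log2_32_Ceil(min(1, 1))     = 0
    Cost = V + S - 1                    = 0   // would trip the `Cost > 0` assert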

The original motivation behind the assert was the concern that some future change might affect the execution of this function in such a way that the computed cost would be -1, and the compiler would then silently fall back to the generic cost computation. That concern, combined with the coding guideline saying that I should use asserts to check assumptions, led me to add it (even then, the assumption to be checked ought to have been Cost >= 0).

But the tests that were added would also catch an unexpected fallback like that, so I think the best thing to do would be to just remove the assert entirely?

Member


Agreed, I think this can just be removed.

Contributor Author


Done.

Member

uweigand left a comment


LGTM, thanks!

uweigand merged commit 718962f into llvm:main on Mar 11, 2024
4 checks passed

@dominik-steenken Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
