[SystemZ] Add custom handling of legal vectors with reduce-add. #88495

Merged
4 commits merged on Apr 12, 2024

Conversation

@dominik-steenken (Contributor) commented Apr 12, 2024

This commit skips the expansion of the vector.reduce.add intrinsic on vector-enabled SystemZ targets in order to introduce custom handling of vector.reduce.add for legal vector types using the VSUM instructions. This is limited to full vectors with scalar types up to i32 due to performance concerns.

It also adds testing for the generation of such custom handling, and adapts the related cost computation, as well as the testing for that.

The expected result is a performance boost in certain benchmarks that make heavy use of vector.reduce.add with other benchmarks remaining constant.

For instance, the assembly for vector.reduce.add<4 x i32> changes from

        vmrlg   %v0, %v24, %v24
        vaf     %v0, %v24, %v0
        vrepf   %v1, %v0, 1
        vaf     %v0, %v0, %v1
        vlgvf   %r2, %v0, 0

to

        vgbm    %v0, 0
        vsumqf  %v0, %v24, %v0
        vlgvf   %r2, %v0, 3
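
For reference, a source-level reduction like the following is the kind of input that produces this intrinsic. This is a hypothetical example, not part of the patch; it assumes clang with the vector extension, where `__builtin_reduce_add` lowers to `llvm.vector.reduce.add`:

```cpp
// Hypothetical example (not from the PR). Built with clang for a
// vector-enabled SystemZ target (e.g. -march=z16), this is expected to go
// through llvm.vector.reduce.add.v4i32 and thus select the VGBM/VSUMQF/VLGVF
// sequence shown above.
typedef int v4si __attribute__((vector_size(16)));

int sum_lanes(v4si v) {
  // Horizontal add of all four i32 lanes.
  return __builtin_reduce_add(v);
}
```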

@llvmbot (Collaborator) commented Apr 12, 2024

@llvm/pr-subscribers-backend-systemz

@llvm/pr-subscribers-llvm-analysis

Author: Dominik Steenken (dominik-steenken)

Changes

This commit skips the expansion of the vector.reduce.add intrinsic on vector-enabled SystemZ targets in order to introduce custom handling of vector.reduce.add for legal vector types using the VSUM instructions. This is limited to full vectors with scalar types up to i32 due to performance concerns.

It also adds testing for the generation of such custom handling, and adapts the related cost computation, as well as the testing for that.

The expected result is a performance boost in certain benchmarks that make heavy use of vector.reduce.add with other benchmarks remaining constant.

For instance, the assembly for vector.reduce.add<4 x i32> changes from

        vmrlg   %v0, %v24, %v24
        vaf     %v0, %v24, %v0
        vrepf   %v1, %v0, 1
        vaf     %v0, %v0, %v1
        vlgvf   %r2, %v0, 0

to

        vgbm    %v0, 0
        vsumqf  %v0, %v24, %v0
        vlgvf   %r2, %v0, 3

Patch is 20.19 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/88495.diff

6 Files Affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZISelLowering.cpp (+45)
  • (modified) llvm/lib/Target/SystemZ/SystemZISelLowering.h (+1)
  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp (+30-5)
  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h (+2)
  • (modified) llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll (+12-12)
  • (added) llvm/test/CodeGen/SystemZ/vec-reduce-add-01.ll (+289)
diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp b/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
index 3b85a6ac0371ed..8de4a5ab79396a 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
@@ -16,6 +16,7 @@
 #include "SystemZMachineFunctionInfo.h"
 #include "SystemZTargetMachine.h"
 #include "llvm/CodeGen/CallingConvLower.h"
+#include "llvm/CodeGen/ISDOpcodes.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 #include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
@@ -23,6 +24,7 @@
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/IntrinsicsS390.h"
 #include "llvm/Support/CommandLine.h"
+#include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/KnownBits.h"
 #include <cctype>
 #include <optional>
@@ -444,6 +446,11 @@ SystemZTargetLowering::SystemZTargetLowering(const TargetMachine &TM,
       setOperationAction(ISD::SRL, VT, Custom);
       setOperationAction(ISD::ROTL, VT, Custom);
 
+      // Add ISD::VECREDUCE_ADD as custom in order to implement
+      // it with VZERO+VSUM
+      if (Subtarget.hasVector()) {
+        setOperationAction(ISD::VECREDUCE_ADD, VT, Custom);
+      }
       // Map SETCCs onto one of VCE, VCH or VCHL, swapping the operands
       // and inverting the result as necessary.
       setOperationAction(ISD::SETCC, VT, Custom);
@@ -6133,6 +6140,8 @@ SDValue SystemZTargetLowering::LowerOperation(SDValue Op,
     return lowerOR(Op, DAG);
   case ISD::CTPOP:
     return lowerCTPOP(Op, DAG);
+  case ISD::VECREDUCE_ADD:
+    return lowerVECREDUCE_ADD(Op, DAG);
   case ISD::ATOMIC_FENCE:
     return lowerATOMIC_FENCE(Op, DAG);
   case ISD::ATOMIC_SWAP:
@@ -9505,3 +9514,39 @@ SDValue SystemZTargetLowering::lowerGET_ROUNDING(SDValue Op,
 
   return DAG.getMergeValues({RetVal, Chain}, dl);
 }
+
+SDValue SystemZTargetLowering::lowerVECREDUCE_ADD(SDValue Op,
+                                                  SelectionDAG &DAG) const {
+  EVT VT = Op.getValueType();
+  Op = Op.getOperand(0);
+  EVT OpVT = Op.getValueType();
+
+  assert(OpVT.isVector() && "Operand type for VECREDUCE_ADD is not a vector.");
+
+  SDLoc DL(Op);
+
+  // load a 0 vector for the third operand of VSUM.
+  SDValue Zero = DAG.getSplatBuildVector(OpVT, DL, DAG.getConstant(0, DL, VT));
+
+  // execute VSUM.
+  switch (OpVT.getScalarSizeInBits()) {
+  case 8:
+  case 16:
+    Op = DAG.getNode(SystemZISD::VSUM, DL, MVT::v4i32, Op,
+                     DAG.getBitcast(OpVT, Zero));
+    LLVM_FALLTHROUGH;
+  case 32:
+  case 64:
+    Op = DAG.getNode(SystemZISD::VSUM, DL, MVT::i128, Op,
+                     DAG.getBitcast(Op.getValueType(), Zero));
+    break;
+  case 128:
+    break; // VSUM over v1i128 should not happen and would be a noop
+  default:
+    llvm_unreachable("Unexpected scalar size.");
+  }
+  // Cast to original vector type, retrieve last element.
+  return DAG.getNode(
+      ISD::EXTRACT_VECTOR_ELT, DL, VT, DAG.getBitcast(OpVT, Op),
+      DAG.getConstant(OpVT.getVectorNumElements() - 1, DL, MVT::i32));
+}
diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.h b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
index baf4ba41654879..a9526ecffd4db6 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
@@ -691,6 +691,7 @@ class SystemZTargetLowering : public TargetLowering {
   SDValue lowerBITCAST(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerCTPOP(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerVECREDUCE_ADD(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerATOMIC_FENCE(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerATOMIC_LOAD(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerATOMIC_STORE(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
index e4adb7be564952..12c89413675e92 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
@@ -19,6 +19,7 @@
 #include "llvm/CodeGen/CostTable.h"
 #include "llvm/CodeGen/TargetLowering.h"
 #include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Intrinsics.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/MathExtras.h"
 
@@ -1295,18 +1296,14 @@ getVectorIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
   if (ID == Intrinsic::vector_reduce_add) {
     // Retrieve number and size of elements for the vector op.
     auto *VTy = cast<FixedVectorType>(ParamTys.front());
-    unsigned NumElements = VTy->getNumElements();
     unsigned ScalarSize = VTy->getScalarSizeInBits();
     // For scalar sizes >128 bits, we fall back to the generic cost estimate.
     if (ScalarSize > SystemZ::VectorBits)
       return -1;
-    // A single vector register can hold this many elements.
-    unsigned MaxElemsPerVector = SystemZ::VectorBits / ScalarSize;
     // This many vector regs are needed to represent the input elements (V).
     unsigned VectorRegsNeeded = getNumVectorRegs(VTy);
     // This many instructions are needed for the final sum of vector elems (S).
-    unsigned LastVectorHandling =
-        2 * Log2_32_Ceil(std::min(NumElements, MaxElemsPerVector));
+    unsigned LastVectorHandling = (ScalarSize < 32) ? 3 : 2;
     // We use vector adds to create a sum vector, which takes
     // V/2 + V/4 + ... = V - 1 operations.
     // Then, we need S operations to sum up the elements of that sum vector,
@@ -1326,3 +1323,31 @@ SystemZTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     return Cost;
   return BaseT::getIntrinsicInstrCost(ICA, CostKind);
 }
+
+bool SystemZTTIImpl::shouldExpandReduction(const IntrinsicInst *II) const {
+  // Always expand on Subtargets without vector instructions
+  if (!ST->hasVector())
+    return true;
+
+  // Always expand for operands that do not fill one vector reg
+  auto *Type = cast<FixedVectorType>(II->getOperand(0)->getType());
+  unsigned NumElts = Type->getNumElements();
+  unsigned ScalarSize = Type->getScalarSizeInBits();
+  unsigned MaxElts = SystemZ::VectorBits / ScalarSize;
+  if (NumElts < MaxElts)
+    return true;
+
+  // Otherwise
+  switch (II->getIntrinsicID()) {
+  // Do not expand vector.reduce.add
+  case Intrinsic::vector_reduce_add:
+    // Except for i64, since the performance benefit is dubious there
+    if (ScalarSize < 64) {
+      return false;
+    } else {
+      return true;
+    }
+  default:
+    return true;
+  }
+}
\ No newline at end of file
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
index 2cccdf6d17dacf..0153fb4f6ff485 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
@@ -126,6 +126,8 @@ class SystemZTTIImpl : public BasicTTIImplBase<SystemZTTIImpl> {
 
   InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                         TTI::TargetCostKind CostKind);
+  
+  bool shouldExpandReduction(const IntrinsicInst *II) const;
   /// @}
 };
 
diff --git a/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll b/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
index 061e5ece44a4e7..90b5b746c914ab 100644
--- a/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
+++ b/llvm/test/Analysis/CostModel/SystemZ/reduce-add.ll
@@ -7,19 +7,19 @@ define void @reduce(ptr %src, ptr %dst) {
 ; CHECK:  Cost Model: Found an estimated cost of 5 for instruction: %R8_64 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %V8_64)
 ; CHECK:  Cost Model: Found an estimated cost of 9 for instruction: %R16_64 = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> %V16_64)
 ; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_32 = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %V2_32)
-; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R4_32 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %V4_32)
-; CHECK:  Cost Model: Found an estimated cost of 5 for instruction: %R8_32 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %V8_32)
-; CHECK:  Cost Model: Found an estimated cost of 7 for instruction: %R16_32 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %V16_32)
-; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_16 = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> %V2_16)
-; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R4_16 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %V4_16)
-; CHECK:  Cost Model: Found an estimated cost of 6 for instruction: %R8_16 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %V8_16)
-; CHECK:  Cost Model: Found an estimated cost of 7 for instruction: %R16_16 = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %V16_16)
-; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R2_8 = call i8 @llvm.vector.reduce.add.v2i8(<2 x i8> %V2_8)
-; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R4_8 = call i8 @llvm.vector.reduce.add.v4i8(<4 x i8> %V4_8)
-; CHECK:  Cost Model: Found an estimated cost of 6 for instruction: %R8_8 = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %V8_8)
-; CHECK:  Cost Model: Found an estimated cost of 8 for instruction: %R16_8 = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %V16_8)
+; CHECK:  Cost Model: Found an estimated cost of 2 for instruction: %R4_32 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %V4_32)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R8_32 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %V8_32)
+; CHECK:  Cost Model: Found an estimated cost of 5 for instruction: %R16_32 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %V16_32)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R2_16 = call i16 @llvm.vector.reduce.add.v2i16(<2 x i16> %V2_16)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R4_16 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %V4_16)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R8_16 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %V8_16)
+; CHECK:  Cost Model: Found an estimated cost of 4 for instruction: %R16_16 = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %V16_16)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R2_8 = call i8 @llvm.vector.reduce.add.v2i8(<2 x i8> %V2_8)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R4_8 = call i8 @llvm.vector.reduce.add.v4i8(<4 x i8> %V4_8)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R8_8 = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %V8_8)
+; CHECK:  Cost Model: Found an estimated cost of 3 for instruction: %R16_8 = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %V16_8)
 ;
-; CHECK:  Cost Model: Found an estimated cost of 15 for instruction: %R128_8 = call i8 @llvm.vector.reduce.add.v128i8(<128 x i8> %V128_8)
+; CHECK:  Cost Model: Found an estimated cost of 10 for instruction: %R128_8 = call i8 @llvm.vector.reduce.add.v128i8(<128 x i8> %V128_8)
 ; CHECK:  Cost Model: Found an estimated cost of 20 for instruction: %R4_256 = call i256 @llvm.vector.reduce.add.v4i256(<4 x i256> %V4_256)
 
   ; REDUCEADD64
diff --git a/llvm/test/CodeGen/SystemZ/vec-reduce-add-01.ll b/llvm/test/CodeGen/SystemZ/vec-reduce-add-01.ll
new file mode 100644
index 00000000000000..56b151d7f9412a
--- /dev/null
+++ b/llvm/test/CodeGen/SystemZ/vec-reduce-add-01.ll
@@ -0,0 +1,289 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; Test vector add reduction instrinsic
+;
+; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z16 | FileCheck %s
+
+; 1 vector length
+declare i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %a)
+declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %a)
+declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %a)
+declare i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %a)
+declare i128 @llvm.vector.reduce.add.v1i128(<1 x i128> %a)
+; 2 vector lengths
+declare i8 @llvm.vector.reduce.add.v32i8(<32 x i8> %a)
+declare i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %a)
+declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %a)
+declare i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %a)
+declare i128 @llvm.vector.reduce.add.v2i128(<2 x i128> %a)
+; ; TODO
+; ; 4 vector lengths
+declare i8 @llvm.vector.reduce.add.v64i8(<64 x i8> %a)
+declare i16 @llvm.vector.reduce.add.v32i16(<32 x i16> %a)
+declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %a)
+declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %a)
+declare i128 @llvm.vector.reduce.add.v4i128(<4 x i128> %a)
+; ; Subvector lengths
+declare i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %a)
+declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %a)
+declare i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %a)
+declare i64 @llvm.vector.reduce.add.v1i64(<1 x i64> %a)
+
+define i8 @f1_1(<16 x i8> %a) {
+; CHECK-LABEL: f1_1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vgbm %v0, 0
+; CHECK-NEXT:    vsumb %v1, %v24, %v0
+; CHECK-NEXT:    vsumqf %v0, %v1, %v0
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %a)
+  ret i8 %redadd
+}
+
+define i16 @f1_2(<8 x i16> %a) {
+; CHECK-LABEL: f1_2:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vgbm %v0, 0
+; CHECK-NEXT:    vsumh %v1, %v24, %v0
+; CHECK-NEXT:    vsumqf %v0, %v1, %v0
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %a)
+  ret i16 %redadd
+}
+
+define i32 @f1_3(<4 x i32> %a) {
+; CHECK-LABEL: f1_3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vgbm %v0, 0
+; CHECK-NEXT:    vsumqf %v0, %v24, %v0
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %a)
+  ret i32 %redadd
+}
+
+define i64 @f1_4(<2 x i64> %a) {
+; CHECK-LABEL: f1_4:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vrepg %v0, %v24, 1
+; CHECK-NEXT:    vag %v0, %v24, %v0
+; CHECK-NEXT:    vlgvg %r2, %v0, 0
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %a)
+  ret i64 %redadd
+}
+
+define i128 @f1_5(<1 x i128> %a) {
+; CHECK-LABEL: f1_5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vst %v24, 0(%r2), 3
+; CHECK-NEXT:    br %r14
+  %redadd = call i128 @llvm.vector.reduce.add.v1i128(<1 x i128> %a)
+  ret i128 %redadd
+}
+
+define i8 @f2_1(<32 x i8> %a) {
+; CHECK-LABEL: f2_1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vab %v0, %v24, %v26
+; CHECK-NEXT:    vgbm %v1, 0
+; CHECK-NEXT:    vsumb %v0, %v0, %v1
+; CHECK-NEXT:    vsumqf %v0, %v0, %v1
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i8 @llvm.vector.reduce.add.v32i8(<32 x i8> %a)
+  ret i8 %redadd
+}
+
+define i16 @f2_2(<16 x i16> %a) {
+; CHECK-LABEL: f2_2:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vah %v0, %v24, %v26
+; CHECK-NEXT:    vgbm %v1, 0
+; CHECK-NEXT:    vsumh %v0, %v0, %v1
+; CHECK-NEXT:    vsumqf %v0, %v0, %v1
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %a)
+  ret i16 %redadd
+}
+
+define i32 @f2_3(<8 x i32> %a) {
+; CHECK-LABEL: f2_3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vaf %v0, %v24, %v26
+; CHECK-NEXT:    vgbm %v1, 0
+; CHECK-NEXT:    vsumqf %v0, %v0, %v1
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %a)
+  ret i32 %redadd
+}
+
+define i64 @f2_4(<4 x i64> %a) {
+; CHECK-LABEL: f2_4:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vag %v0, %v24, %v26
+; CHECK-NEXT:    vrepg %v1, %v0, 1
+; CHECK-NEXT:    vag %v0, %v0, %v1
+; CHECK-NEXT:    vlgvg %r2, %v0, 0
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %a)
+  ret i64 %redadd
+}
+
+define i128 @f2_5(<2 x i128> %a) {
+; CHECK-LABEL: f2_5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vl %v0, 16(%r3), 3
+; CHECK-NEXT:    vl %v1, 0(%r3), 3
+; CHECK-NEXT:    vaq %v0, %v1, %v0
+; CHECK-NEXT:    vst %v0, 0(%r2), 3
+; CHECK-NEXT:    br %r14
+  %redadd = call i128 @llvm.vector.reduce.add.v2i128(<2 x i128> %a)
+  ret i128 %redadd
+}
+
+define i8 @f3_1(<64 x i8> %a) {
+; CHECK-LABEL: f3_1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vab %v0, %v26, %v30
+; CHECK-NEXT:    vab %v1, %v24, %v28
+; CHECK-NEXT:    vab %v0, %v1, %v0
+; CHECK-NEXT:    vgbm %v1, 0
+; CHECK-NEXT:    vsumb %v0, %v0, %v1
+; CHECK-NEXT:    vsumqf %v0, %v0, %v1
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i8 @llvm.vector.reduce.add.v64i8(<64 x i8> %a)
+  ret i8 %redadd
+}
+
+define i16 @f3_2(<32 x i16> %a) {
+; CHECK-LABEL: f3_2:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vah %v0, %v26, %v30
+; CHECK-NEXT:    vah %v1, %v24, %v28
+; CHECK-NEXT:    vah %v0, %v1, %v0
+; CHECK-NEXT:    vgbm %v1, 0
+; CHECK-NEXT:    vsumh %v0, %v0, %v1
+; CHECK-NEXT:    vsumqf %v0, %v0, %v1
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i16 @llvm.vector.reduce.add.v32i16(<32 x i16> %a)
+  ret i16 %redadd
+}
+
+define i32 @f3_3(<16 x i32> %a) {
+; CHECK-LABEL: f3_3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vaf %v0, %v26, %v30
+; CHECK-NEXT:    vaf %v1, %v24, %v28
+; CHECK-NEXT:    vaf %v0, %v1, %v0
+; CHECK-NEXT:    vgbm %v1, 0
+; CHECK-NEXT:    vsumqf %v0, %v0, %v1
+; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %a)
+  ret i32 %redadd
+}
+
+define i64 @f3_4(<8 x i64> %a) {
+; CHECK-LABEL: f3_4:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vag %v0, %v26, %v30
+; CHECK-NEXT:    vag %v1, %v24, %v28
+; CHECK-NEXT:    vag %v0, %v1, %v0
+; CHECK-NEXT:    vrepg %v1, %v0, 1
+; CHECK-NEXT:    vag %v0, %v0, %v1
+; CHECK-NEXT:    vlgvg %r2, %v0, 0
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %a)
+  ret i64 %redadd
+}
+
+define i128 @f3_5(<4 x i128> %a) {
+; CHECK-LABEL: f3_5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vl %v0, 32(%r3), 3
+; CHECK-NEXT:    vl %v1, 0(%r3), 3
+; CHECK-NEXT:    vl %v2, 48(%r3), 3
+; CHECK-NEXT:    vl %v3, 16(%r3), 3
+; CHECK-NEXT:    vaq %v2, %v3, %v2
+; CHECK-NEXT:    vaq %v0, %v1, %v0
+; CHECK-NEXT:    vaq %v0, %v0, %v2
+; CHECK-NEXT:    vst %v0, 0(%r2), 3
+; CHECK-NEXT:    br %r14
+  %redadd = call i128 @llvm.vector.reduce.add.v4i128(<4 x i128> %a)
+  ret i128 %redadd
+}
+
+
+define i8 @f4_1(<8 x i8> %a) {
+; CHECK-LABEL: f4_1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpkg %v0, %v24, %v24
+; CHECK-NEXT:    vab %v0, %v24, %v0
+; CHECK-NEXT:    vpkf %v1, %v0, %v0
+; CHECK-NEXT:    vab %v0, %v0, %v1
+; CHECK-NEXT:    vrepb %v1, %v0, 1
+; CHECK-NEXT:    vab %v0, %v0, %v1
+; CHECK-NEXT:    vlgvb %r2, %v0, 0
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %a)
+  ret i8 %redadd
+}
+
+define i16 @f4_2(<4 x i16> %a) {
+; CHECK-LABEL: f4_2:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpkg %v0, %v24, %v24
+; CHECK-NEXT:    vah %v0, %v24, %v0
+; CHECK-NEXT:    vreph %v1, %v0, 1
+; CHECK-NEXT:    vah %v0, %v0, %v1
+; CHECK-NEXT:    vlgvh %r2, %v0, 0
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+  %redadd = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %a)
+  ret i16 %redadd
+}
+
+define i32 @f4_3(<2 x i32> %a) {
+; CHECK-LABEL: f4_3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vrepf %v0, %v24, 1
+; CHECK-NEXT:    vaf %v0, %v24, %v0
+; CHECK-NEXT:    vlgvf %r2, %v0, 0
+; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
+; CHECK-NEXT:    br %r14
+
+  %redadd = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %a)
+  ret i32 %redadd
+}
+
+define i64 @f4_4(<1 x i64> %a) {
+; CHECK...
[truncated]
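
As a worked example of the updated cost computation above (not text from the patch): reading the modified hunk in SystemZTargetTransformInfo.cpp, the estimate is (VectorRegsNeeded - 1) vector adds to form a single sum vector, plus LastVectorHandling instructions to reduce that vector, where LastVectorHandling is now 3 for scalar sizes below 32 bits and 2 otherwise. For `<16 x i8>` that gives 0 + 3 = 3, for `<16 x i32>` it gives 3 + 2 = 5, and for `<128 x i8>` it gives 7 + 3 = 10, matching the new CHECK lines in reduce-add.ll.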

github-actions bot commented Apr 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@dominik-steenken (Contributor, Author) commented:

@uweigand FYI

@uweigand (Member) left a comment:

Generally looks good, but see a couple of inline comments.

@@ -444,6 +446,11 @@ SystemZTargetLowering::SystemZTargetLowering(const TargetMachine &TM,
setOperationAction(ISD::SRL, VT, Custom);
setOperationAction(ISD::ROTL, VT, Custom);

// Add ISD::VECREDUCE_ADD as custom in order to implement
// it with VZERO+VSUM
if (Subtarget.hasVector()) {
uweigand (Member):

I don't think we need this check - this whole block is executed only if the vector type is "legal", which pre-supposes "hasVector".

dominik-steenken (Contributor, Author):

OK, I will remove the check.
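
Applied to the hunk quoted above, the change would look roughly like this (a sketch, not the merged diff):

```cpp
      setOperationAction(ISD::SRL, VT, Custom);
      setOperationAction(ISD::ROTL, VT, Custom);

      // Add ISD::VECREDUCE_ADD as custom in order to implement it with
      // VZERO+VSUM. This block only runs for legal vector types, which
      // already presupposes Subtarget.hasVector(), so no extra guard is
      // needed here.
      setOperationAction(ISD::VECREDUCE_ADD, VT, Custom);
```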

case 8:
case 16:
Op = DAG.getNode(SystemZISD::VSUM, DL, MVT::v4i32, Op,
DAG.getBitcast(OpVT, Zero));
uweigand (Member):

Doesn't Zero already have the correct type here, so we don't need the bitcast?

dominik-steenken (Contributor, Author):

Yes, for 8- and 16-bit scalars we do not need the bitcast. I will remove it.
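
In the 8- and 16-bit case quoted above, `Zero` is already built with `OpVT` by `getSplatBuildVector`, so the fix amounts to passing it directly (a sketch, not the merged diff):

```cpp
  case 8:
  case 16:
    // Zero already has type OpVT here, so no bitcast is needed.
    Op = DAG.getNode(SystemZISD::VSUM, DL, MVT::v4i32, Op, Zero);
    LLVM_FALLTHROUGH;
```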

return false;
} else {
return true;
}
uweigand (Member):

No braces around single-line statements. Also, this whole test is maybe simpler as `return ScalarSize >= 64;`?

dominik-steenken (Contributor, Author):

Agreed. That's definitely clearer.
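
Putting both review points together, the hook would read roughly as follows (a sketch of the expected shape after the review, not necessarily the exact merged code):

```cpp
bool SystemZTTIImpl::shouldExpandReduction(const IntrinsicInst *II) const {
  // Always expand on subtargets without vector instructions.
  if (!ST->hasVector())
    return true;

  // Always expand for operands that do not fill one vector register.
  auto *VTy = cast<FixedVectorType>(II->getOperand(0)->getType());
  unsigned ScalarSize = VTy->getScalarSizeInBits();
  unsigned MaxElts = SystemZ::VectorBits / ScalarSize;
  if (VTy->getNumElements() < MaxElts)
    return true;

  switch (II->getIntrinsicID()) {
  case Intrinsic::vector_reduce_add:
    // Keep expanding for i64 and wider, where the VSUM path is not a clear
    // win; the custom lowering only kicks in for smaller scalar types.
    return ScalarSize >= 64;
  default:
    return true;
  }
}
```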

default:
return true;
}
}
uweigand (Member):

Missing newline at the end of the file?

dominik-steenken (Contributor, Author):

Will add.

@dominik-steenken (Contributor, Author) commented:

@uweigand Thank you for the comments, I believe I have implemented them all.

@uweigand (Member) left a comment:

LGTM now, thanks!

@uweigand merged commit b794dc2 into llvm:main on Apr 12, 2024
3 of 4 checks passed
bazuzi pushed a commit to bazuzi/llvm-project that referenced this pull request Apr 15, 2024
[SystemZ] Add custom handling of legal vectors with reduce-add. (#88495)

This commit skips the expansion of the `vector.reduce.add` intrinsic on
vector-enabled SystemZ targets in order to introduce custom handling of
`vector.reduce.add` for legal vector types using the VSUM instructions.
This is limited to full vectors with scalar types up to `i32` due to
performance concerns.

It also adds testing for the generation of such custom handling, and
adapts the related cost computation, as well as the testing for that.

The expected result is a performance boost in certain benchmarks that
make heavy use of `vector.reduce.add` with other benchmarks remaining
constant.

For instance, the assembly for `vector.reduce.add<4 x i32>` changes from
```hlasm
        vmrlg   %v0, %v24, %v24
        vaf     %v0, %v24, %v0
        vrepf   %v1, %v0, 1
        vaf     %v0, %v0, %v1
        vlgvf   %r2, %v0, 0
```
to
```hlasm
        vgbm    %v0, 0
        vsumqf  %v0, %v24, %v0
        vlgvf   %r2, %v0, 3
```