
[RISCV] Improve llvm.reduce.fmaximum/minimum lowering #75484

Open
wants to merge 2 commits into main

Conversation

simeonkr
Contributor

The default lowering of VECREDUCE_FMAXIMUM/VECREDUCE_FMINIMUM on RISC-V involves splitting the vector multiple times and performing logarithmically many (in the vector length) FMINIMUM/FMAXIMUM operations. Given that even a single such operation expands into a large sequence of instructions, a better strategy is needed.

This patch transforms such reductions into an equivalent sequence of a reduction fmin/fmax, a reduction sum to detect any NaNs, and a scalar select to choose the correct result. Because these reduction operations are natively supported in RISC-V, this leads to a much more efficient sequence of instructions.

The transformation is performed in the DAG combiner, before the type legalizer has a chance to split the reduction and generate FMINIMUM/FMAXIMUM nodes.
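
To make the select-based sequence concrete, the following is a minimal scalar model of what the combined lowering computes (an illustrative sketch, not code from the patch; the function name is invented and a non-empty input is assumed):

#include <cmath>
#include <vector>

// Scalar model of the emitted sequence for llvm.vector.reduce.fminimum:
// a NaN-ignoring min reduction (vfredmin) plus a sum reduction
// (vfredusum) whose result is NaN whenever any element is NaN, followed
// by a select on whether the sum compares ordered with itself.
float reduce_fminimum_model(const std::vector<float> &V) {
  float Min = V[0];
  float Sum = -0.0f; // -0.0 is the fadd identity, matching the seed in the tests below
  for (float X : V) {
    Min = std::fmin(Min, X); // NaN-ignoring, like vfredmin
    Sum += X;                // NaN-propagating, like vfredusum
  }
  // Sum == Sum is false exactly when Sum is NaN; in that case return the
  // NaN itself, otherwise the min reduction is the result.
  return (Sum == Sum) ? Min : Sum;
}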


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository; in that case, you can instead tag reviewers by
name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "pinging" the PR with a comment saying "Ping". The common courtesy ping rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Collaborator

llvmbot commented Dec 14, 2023

@llvm/pr-subscribers-backend-risc-v

Author: Simeon K (simeonkr)


Patch is 34.11 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75484.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+18-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll (+948-2)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index cf1b11c14b6d0f..471e30bd4434d4 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -1368,7 +1368,8 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
 
   setTargetDAGCombine({ISD::INTRINSIC_VOID, ISD::INTRINSIC_W_CHAIN,
                        ISD::INTRINSIC_WO_CHAIN, ISD::ADD, ISD::SUB, ISD::AND,
-                       ISD::OR, ISD::XOR, ISD::SETCC, ISD::SELECT});
+                       ISD::OR, ISD::XOR, ISD::SETCC, ISD::SELECT,
+                       ISD::VECREDUCE_FMAXIMUM, ISD::VECREDUCE_FMINIMUM});
   if (Subtarget.is64Bit())
     setTargetDAGCombine(ISD::SRA);
 
@@ -15650,6 +15651,22 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
 
     return SDValue();
   }
+  case ISD::VECREDUCE_FMAXIMUM:
+  case ISD::VECREDUCE_FMINIMUM: {
+    EVT RT = N->getValueType(0);
+    SDValue N0 = N->getOperand(0);
+
+    // Reduction fmax/fmin + separate reduction sum to propagate NaNs
+    unsigned ReducedMinMaxOpc =
+      N->getOpcode() == ISD::VECREDUCE_FMAXIMUM ? ISD::VECREDUCE_FMAX : 
+                                                  ISD::VECREDUCE_FMIN;
+    SDValue MinMax = DAG.getNode(ReducedMinMaxOpc, DL, RT, N0);
+    if (N0->getFlags().hasNoNaNs())
+      return MinMax;
+    SDValue Sum = DAG.getNode(ISD::VECREDUCE_FADD, DL, RT, N0);
+    SDValue SumIsNonNan = DAG.getSetCC(DL, XLenVT, Sum, Sum, ISD::SETOEQ);
+    return DAG.getSelect(DL, RT, SumIsNonNan, MinMax, Sum);
+  }
   }
 
   return SDValue();
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
index 3f6aa72bc2e3b2..2de4b32dab197a 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -target-abi=ilp32d -mattr=+v,+zfh,+zvfh,+f,+d -verify-machineinstrs < %s | FileCheck %s
-; RUN: llc -mtriple=riscv64 -target-abi=lp64d -mattr=+v,+zfh,+zvfh,+f,+d  -verify-machineinstrs < %s | FileCheck %s
+; RUN: llc -mtriple=riscv32 -target-abi=ilp32d -mattr=+v,+zfh,+zvfh,+f,+d -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
+; RUN: llc -mtriple=riscv64 -target-abi=lp64d -mattr=+v,+zfh,+zvfh,+f,+d  -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
 
 declare half @llvm.vector.reduce.fadd.v1f16(half, <1 x half>)
 
@@ -1592,3 +1592,949 @@ define float @vreduce_nsz_fadd_v4f32(ptr %x, float %s) {
   %red = call reassoc nsz float @llvm.vector.reduce.fadd.v4f32(float %s, <4 x float> %v)
   ret float %red
 }
+
+declare half @llvm.vector.reduce.fminimum.v2f16(<2 x half>)
+
+define half @vreduce_fminimum_v2f16(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v2f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 2, e16, mf4, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB99_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB99_2:
+; CHECK-NEXT:    ret
+  %v = load <2 x half>, ptr %x
+  %red = call half @llvm.vector.reduce.fminimum.v2f16(<2 x half> %v)
+  ret half %red
+}
+
+declare half @llvm.vector.reduce.fminimum.v4f16(<4 x half>)
+
+define half @vreduce_fminimum_v4f16(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB100_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB100_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x half>, ptr %x
+  %red = call half @llvm.vector.reduce.fminimum.v4f16(<4 x half> %v)
+  ret half %red
+}
+
+define half @vreduce_fminimum_v4f16_nonans(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f16_nonans:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB101_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB101_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x half>, ptr %x
+  %red = call nnan half @llvm.vector.reduce.fminimum.v4f16(<4 x half> %v)
+  ret half %red
+}
+
+define half @vreduce_fminimum_v4f16_nonans_noinfs(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f16_nonans_noinfs:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB102_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB102_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x half>, ptr %x
+  %red = call nnan ninf half @llvm.vector.reduce.fminimum.v4f16(<4 x half> %v)
+  ret half %red
+}
+
+declare half @llvm.vector.reduce.fminimum.v128f16(<128 x half>)
+
+define half @vreduce_fminimum_v128f16(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v128f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    li a1, 64
+; CHECK-NEXT:    vsetvli zero, a1, e16, m8, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    addi a0, a0, 128
+; CHECK-NEXT:    vle16.v v16, (a0)
+; CHECK-NEXT:    vfadd.vv v24, v8, v16
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v0, a0
+; CHECK-NEXT:    vfredusum.vs v24, v24, v0
+; CHECK-NEXT:    vfmv.f.s fa0, v24
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB103_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfmin.vv v8, v8, v16
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB103_2:
+; CHECK-NEXT:    ret
+  %v = load <128 x half>, ptr %x
+  %red = call half @llvm.vector.reduce.fminimum.v128f16(<128 x half> %v)
+  ret half %red
+}
+
+declare float @llvm.vector.reduce.fminimum.v2f32(<2 x float>)
+
+define float @vreduce_fminimum_v2f32(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v2f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
+; CHECK-NEXT:    vle32.v v8, (a0)
+; CHECK-NEXT:    lui a0, 524288
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.s a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB104_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB104_2:
+; CHECK-NEXT:    ret
+  %v = load <2 x float>, ptr %x
+  %red = call float @llvm.vector.reduce.fminimum.v2f32(<2 x float> %v)
+  ret float %red
+}
+
+declare float @llvm.vector.reduce.fminimum.v4f32(<4 x float>)
+
+define float @vreduce_fminimum_v4f32(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vle32.v v8, (a0)
+; CHECK-NEXT:    lui a0, 524288
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.s a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB105_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB105_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x float>, ptr %x
+  %red = call float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %v)
+  ret float %red
+}
+
+define float @vreduce_fminimum_v4f32_nonans(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f32_nonans:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vle32.v v8, (a0)
+; CHECK-NEXT:    lui a0, 524288
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.s a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB106_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB106_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x float>, ptr %x
+  %red = call nnan float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %v)
+  ret float %red
+}
+
+define float @vreduce_fminimum_v4f32_nonans_noinfs(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f32_nonans_noinfs:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vle32.v v8, (a0)
+; CHECK-NEXT:    lui a0, 524288
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.s a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB107_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB107_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x float>, ptr %x
+  %red = call nnan ninf float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %v)
+  ret float %red
+}
+
+declare float @llvm.vector.reduce.fminimum.v128f32(<128 x float>)
+
+define float @vreduce_fminimum_v128f32(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v128f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    slli a1, a1, 4
+; CHECK-NEXT:    sub sp, sp, a1
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; CHECK-NEXT:    li a1, 32
+; CHECK-NEXT:    vsetvli zero, a1, e32, m8, ta, ma
+; CHECK-NEXT:    vle32.v v16, (a0)
+; CHECK-NEXT:    addi a1, a0, 384
+; CHECK-NEXT:    vle32.v v8, (a1)
+; CHECK-NEXT:    addi a1, a0, 256
+; CHECK-NEXT:    addi a0, a0, 128
+; CHECK-NEXT:    vle32.v v0, (a0)
+; CHECK-NEXT:    vle32.v v24, (a1)
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vs8r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT:    vfadd.vv v8, v0, v8
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v16, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT:    vfadd.vv v16, v16, v24
+; CHECK-NEXT:    vfadd.vv v8, v16, v8
+; CHECK-NEXT:    lui a0, 524288
+; CHECK-NEXT:    vmv.s.x v16, a0
+; CHECK-NEXT:    vfredusum.vs v8, v8, v16
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:    feq.s a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB108_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vfmin.vv v8, v0, v8
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vfmin.vv v16, v16, v24
+; CHECK-NEXT:    vfmin.vv v8, v16, v8
+; CHECK-NEXT:    vfredmin.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB108_2:
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    ret
+  %v = load <128 x float>, ptr %x
+  %red = call float @llvm.vector.reduce.fminimum.v128f32(<128 x float> %v)
+  ret float %red
+}
+
+declare double @llvm.vector.reduce.fminimum.v2f64(<2 x double>)
+
+define double @vreduce_fminimum_v2f64(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v2f64:
+; RV32:       # %bb.0:
+; RV32-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
+; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    fcvt.d.w fa5, zero
+; RV32-NEXT:    fneg.d fa5, fa5
+; RV32-NEXT:    vfmv.s.f v9, fa5
+; RV32-NEXT:    vfredusum.vs v9, v8, v9
+; RV32-NEXT:    vfmv.f.s fa0, v9
+; RV32-NEXT:    feq.d a0, fa0, fa0
+; RV32-NEXT:    beqz a0, .LBB109_2
+; RV32-NEXT:  # %bb.1:
+; RV32-NEXT:    vfredmin.vs v8, v8, v8
+; RV32-NEXT:    vfmv.f.s fa0, v8
+; RV32-NEXT:  .LBB109_2:
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: vreduce_fminimum_v2f64:
+; RV64:       # %bb.0:
+; RV64-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
+; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    li a0, -1
+; RV64-NEXT:    slli a0, a0, 63
+; RV64-NEXT:    vmv.s.x v9, a0
+; RV64-NEXT:    vfredusum.vs v9, v8, v9
+; RV64-NEXT:    vfmv.f.s fa0, v9
+; RV64-NEXT:    feq.d a0, fa0, fa0
+; RV64-NEXT:    beqz a0, .LBB109_2
+; RV64-NEXT:  # %bb.1:
+; RV64-NEXT:    vfredmin.vs v8, v8, v8
+; RV64-NEXT:    vfmv.f.s fa0, v8
+; RV64-NEXT:  .LBB109_2:
+; RV64-NEXT:    ret
+  %v = load <2 x double>, ptr %x
+  %red = call double @llvm.vector.reduce.fminimum.v2f64(<2 x double> %v)
+  ret double %red
+}
+
+declare double @llvm.vector.reduce.fminimum.v4f64(<4 x double>)
+
+define double @vreduce_fminimum_v4f64(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v4f64:
+; RV32:       # %bb.0:
+; RV32-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    fcvt.d.w fa5, zero
+; RV32-NEXT:    fneg.d fa5, fa5
+; RV32-NEXT:    vfmv.s.f v10, fa5
+; RV32-NEXT:    vfredusum.vs v10, v8, v10
+; RV32-NEXT:    vfmv.f.s fa0, v10
+; RV32-NEXT:    feq.d a0, fa0, fa0
+; RV32-NEXT:    beqz a0, .LBB110_2
+; RV32-NEXT:  # %bb.1:
+; RV32-NEXT:    vfredmin.vs v8, v8, v8
+; RV32-NEXT:    vfmv.f.s fa0, v8
+; RV32-NEXT:  .LBB110_2:
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: vreduce_fminimum_v4f64:
+; RV64:       # %bb.0:
+; RV64-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    li a0, -1
+; RV64-NEXT:    slli a0, a0, 63
+; RV64-NEXT:    vmv.s.x v10, a0
+; RV64-NEXT:    vfredusum.vs v10, v8, v10
+; RV64-NEXT:    vfmv.f.s fa0, v10
+; RV64-NEXT:    feq.d a0, fa0, fa0
+; RV64-NEXT:    beqz a0, .LBB110_2
+; RV64-NEXT:  # %bb.1:
+; RV64-NEXT:    vfredmin.vs v8, v8, v8
+; RV64-NEXT:    vfmv.f.s fa0, v8
+; RV64-NEXT:  .LBB110_2:
+; RV64-NEXT:    ret
+  %v = load <4 x double>, ptr %x
+  %red = call double @llvm.vector.reduce.fminimum.v4f64(<4 x double> %v)
+  ret double %red
+}
+
+define double @vreduce_fminimum_v4f64_nonans(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v4f64_nonans:
+; RV32:       # %bb.0:
+; RV32-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    fcvt.d.w fa5, zero
+; RV32-NEXT:    fneg.d fa5, fa5
+; RV32-NEXT:    vfmv.s.f v10, fa5
+; RV32-NEXT:    vfredusum.vs v10, v8, v10
+; RV32-NEXT:    vfmv.f.s fa0, v10
+; RV32-NEXT:    feq.d a0, fa0, fa0
+; RV32-NEXT:    beqz a0, .LBB111_2
+; RV32-NEXT:  # %bb.1:
+; RV32-NEXT:    vfredmin.vs v8, v8, v8
+; RV32-NEXT:    vfmv.f.s fa0, v8
+; RV32-NEXT:  .LBB111_2:
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: vreduce_fminimum_v4f64_nonans:
+; RV64:       # %bb.0:
+; RV64-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    li a0, -1
+; RV64-NEXT:    slli a0, a0, 63
+; RV64-NEXT:    vmv.s.x v10, a0
+; RV64-NEXT:    vfredusum.vs v10, v8, v10
+; RV64-NEXT:    vfmv.f.s fa0, v10
+; RV64-NEXT:    feq.d a0, fa0, fa0
+; RV64-NEXT:    beqz a0, .LBB111_2
+; RV64-NEXT:  # %bb.1:
+; RV64-NEXT:    vfredmin.vs v8, v8, v8
+; RV64-NEXT:    vfmv.f.s fa0, v8
+; RV64-NEXT:  .LBB111_2:
+; RV64-NEXT:    ret
+  %v = load <4 x double>, ptr %x
+  %red = call nnan double @llvm.vector.reduce.fminimum.v4f64(<4 x double> %v)
+  ret double %red
+}
+
+define double @vreduce_fminimum_v4f64_nonans_noinfs(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v4f64_nonans_noinfs:
+; RV32:       # %bb.0:
+; RV32-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    fcvt.d.w fa5, zero
+; RV32-NEXT:    fneg.d fa5, fa5
+; RV32-NEXT:    vfmv.s.f v10, fa5
+; RV32-NEXT:    vfredusum.vs v10, v8, v10
+; RV32-NEXT:    vfmv.f.s fa0, v10
+; RV32-NEXT:    feq.d a0, fa0, fa0
+; RV32-NEXT:    beqz a0, .LBB112_2
+; RV32-NEXT:  # %bb.1:
+; RV32-NEXT:    vfredmin.vs v8, v8, v8
+; RV32-NEXT:    vfmv.f.s fa0, v8
+; RV32-NEXT:  .LBB112_2:
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: vreduce_fminimum_v4f64_nonans_noinfs:
+; RV64:       # %bb.0:
+; RV64-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
+; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    li a0, -1
+; RV64-NEXT:    slli a0, a0, 63
+; RV64-NEXT:    vmv.s.x v10, a0
+; RV64-NEXT:    vfredusum.vs v10, v8, v10
+; RV64-NEXT:    vfmv.f.s fa0, v10
+; RV64-NEXT:    feq.d a0, fa0, fa0
+; RV64-NEXT:    beqz a0, .LBB112_2
+; RV64-NEXT:  # %bb.1:
+; RV64-NEXT:    vfredmin.vs v8, v8, v8
+; RV64-NEXT:    vfmv.f.s fa0, v8
+; RV64-NEXT:  .LBB112_2:
+; RV64-NEXT:    ret
+  %v = load <4 x double>, ptr %x
+  %red = call nnan ninf double @llvm.vector.reduce.fminimum.v4f64(<4 x double> %v)
+  ret double %red
+}
+
+declare double @llvm.vector.reduce.fminimum.v32f64(<32 x double>)
+
+define double @vreduce_fminimum_v32f64(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v32f64:
+; RV32:       # %bb.0:
+; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    addi a0, a0, 128
+; RV32-NEXT:    vle64.v v16, (a0)
+; RV32-NEXT:    vfadd.vv v24, v8, v16
+; RV32-NEXT:    fcvt.d.w fa5, zero
+; RV32-NEXT:    fneg.d fa5, fa5
+; RV32-NEXT:    vfmv.s.f v0, fa5
+; RV32-NEXT:    vfredusum.vs v24, v24, v0
+; RV32-NEXT:    vfmv.f.s fa0, v24
+; RV32-NEXT:    feq.d a0, fa0, fa0
+; RV32-NEXT:    beqz a0, .LBB113_2
+; RV32-NEXT:  # %bb.1:
+; RV32-NEXT:    vfmin.vv v8, v8, v16
+; RV32-NEXT:    vfredmin.vs v8, v8, v8
+; RV32-NEXT:    vfmv.f.s fa0, v8
+; RV32-NEXT:  .LBB113_2:
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: vreduce_fminimum_v32f64:
+; RV64:       # %bb.0:
+; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    addi a0, a0, 128
+; RV64-NEXT:    vle64.v v16, (a0)
+; RV64-NEXT:    vfadd.vv v24, v8, v16
+; RV64-NEXT:    li a0, -1
+; RV64-NEXT:    slli a0, a0, 63
+; RV64-NEXT:    vmv.s.x v0, a0
+; RV64-NEXT:    vfredusum.vs v24, v24, v0
+; RV64-NEXT:    vfmv.f.s fa0, v24
+; RV64-NEXT:    feq.d a0, fa0, fa0
+; RV64-NEXT:    beqz a0, .LBB113_2
+; RV64-NEXT:  # %bb.1:
+; RV64-NEXT:    vfmin.vv v8, v8, v16
+; RV64-NEXT:    vfredmin.vs v8, v8, v8
+; RV64-NEXT:    vfmv.f.s fa0, v8
+; RV64-NEXT:  .LBB113_2:
+; RV64-NEXT:    ret
+  %v = load <32 x double>, ptr %x
+  %red = call double @llvm.vector.reduce.fminimum.v32f64(<32 x double> %v)
+  ret double %red
+}
+
+declare half @llvm.vector.reduce.fmaximum.v2f16(<2 x half>)
+
+define half @vreduce_fmaximum_v2f16(ptr %x) {
+; CHECK-LABEL: vreduce_fmaximum_v2f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 2, e16, mf4, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB114_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmax.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB114_2:
+; CHECK-NEXT:    ret
+  %v = load <2 x half>, ptr %x
+  %red = call half @llvm.vector.reduce.fmaximum.v2f16(<2 x half> %v)
+  ret half %red
+}
+
+declare half @llvm.vector.reduce.fmaximum.v4f16(<4 x half>)
+
+define half @vreduce_fmaximum_v4f16(ptr %x) {
+; CHECK-LABEL: vreduce_fmaximum_v4f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT:    vle16.v v8, (a0)
+; CHECK-NEXT:    lui a0, 1048568
+; CHECK-NEXT:    vmv.s.x v9, a0
+; CHECK-NEXT:    vfredusum.vs v9, v8, v9
+; CHECK-NEXT:    vfmv.f.s fa0, v9
+; CHECK-NEXT:    feq.h a0, fa0, fa0
+; CHECK-NEXT:    beqz a0, .LBB115_2
+; CHECK-NEXT:  # %bb.1:
+; CHECK-NEXT:    vfredmax.vs v8, v8, v8
+; CHECK-NEXT:    vfmv.f.s fa0, v8
+; CHECK-NEXT:  .LBB115_2:
+; CHECK-NEXT:    ret
+  %v = load <4 x half>, ptr %x
+  %red = call half @llvm.vector.reduce.fmaximum.v4f16(<4 x half> %v)
+  ret half %red
+}
+
+define half @vreduce_fmaximum_v4f16_nonans(ptr %x) {
+; CHECK-LABEL: vreduce_fmaximum_v4f16_nonans:...
[truncated]


github-actions bot commented Dec 14, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@topperc
Collaborator

topperc commented Dec 14, 2023

Are you sure the FMINIMUM/FMAXIMUM came from the type legalizer? If the vector type is legal, the type legalizer shouldn't touch it. I would expect LegalizeVectorOps or LegalizeDAG to be where it gets broken down. So I think we could use custom lowering instead of a DAG combine.


// Reduction fmax/fmin + separate reduction sum to propagate NaNs
unsigned ReducedMinMaxOpc = N->getOpcode() == ISD::VECREDUCE_FMAXIMUM
                                ? ISD::VECREDUCE_FMAX
Collaborator

ISD::VECREDUCE_FMAX/FMIN aren't guaranteed to preserve the order of -0.0 and +0.0. If the generic DAG combiner ends up seeing a constant vector for some reason after this change, it might incorrectly fold it.
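
A self-contained illustration of the hazard (a hedged sketch, not code from the patch or from LLVM):

#include <cmath>
#include <cstdio>

int main() {
  // IEEE 754-2019 "minimum" semantics, which llvm.vector.reduce.fminimum
  // follows, order -0.0 below +0.0, so reducing {+0.0, -0.0} must yield
  // -0.0. fmin-style operations such as std::fmin (and likewise
  // ISD::VECREDUCE_FMIN) may return either zero.
  double Z = std::fmin(+0.0, -0.0);
  std::printf("fmin(+0.0, -0.0) = %g, signbit = %d\n", Z,
              (int)std::signbit(Z));
  return 0;
}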

@simeonkr
Contributor Author

simeonkr commented Dec 14, 2023

> Are you sure the FMINIMUM/FMAXIMUM came from the type legalizer? If the vector type is legal, the type legalizer shouldn't touch it. I would expect LegalizeVectorOps or LegalizeDAG to be where it gets broken down. So I think we could use custom lowering instead of a DAG combine.

It happens in DAGTypeLegalizer::SplitVecOp_VECREDUCE(), in the case of illegal vector types. I'm assuming we care about those, given that we cover them when testing other reduction operations. Without the patch, here are the DAGs after DAG combine and after type legalization, respectively:

Optimized lowered selection DAG: %bb.0 'vreduce_fmaximum_v128f32:'
SelectionDAG has 9 nodes:
  t0: ch,glue = EntryToken
        t2: i32,ch = CopyFromReg t0, Register:i32 %0
      t5: v128f32,ch = load<(load (s4096) from %ir.x)> t0, t2, undef:i32
    t6: f32 = vecreduce_fmaximum t5
  t8: ch,glue = CopyToReg t0, Register:f32 $f10_f, t6
  t9: ch = RISCVISD::RET_GLUE t8, Register:f32 $f10_f, t8:1

Type-legalized selection DAG: %bb.0 'vreduce_fmaximum_v128f32:'
SelectionDAG has 20 nodes:
  t0: ch,glue = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t22: v32f32,ch = load<(load (s1024) from %ir.x, align 512)> t0, t2, undef:i32
          t17: v32f32,ch = load<(load (s1024) from %ir.x + 256, align 256, basealign 512)> t0, t12, undef:i32
        t26: v32f32 = fmaximum t22, t17
            t23: i32 = add nuw t2, Constant:i32<128>
          t24: v32f32,ch = load<(load (s1024) from %ir.x + 128, basealign 512)> t0, t23, undef:i32
            t19: i32 = add nuw t12, Constant:i32<128>
          t20: v32f32,ch = load<(load (s1024) from %ir.x + 384, basealign 512)> t0, t19, undef:i32
        t27: v32f32 = fmaximum t24, t20
      t28: v32f32 = fmaximum t26, t27
    t29: f32 = vecreduce_fmaximum t28
  t8: ch,glue = CopyToReg t0, Register:f32 $f10_f, t29
  t12: i32 = add nuw t2, Constant:i32<256>
  t9: ch = RISCVISD::RET_GLUE t8, Register:f32 $f10_f, t8:1

Hence, if I handle the reduction in RISCVTargetLowering::LowerOperation(), as I originally tried to do, it is too late to avoid the FMAXIMUM/FMINIMUM nodes. That still leads to some improvement, but not a significant one.
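
For reference, the default expansion discussed above can be modeled in scalar code (an illustrative sketch assuming a power-of-two element count; the helper names are invented). Each round halves the vector and combines the halves with a full NaN- and signed-zero-aware fmaximum, and it is this per-round fmaximum that expands into a large instruction sequence:

#include <cmath>
#include <cstddef>
#include <vector>

// NaN-propagating, signed-zero-ordering maximum, as FMAXIMUM requires.
static float fmaximum(float A, float B) {
  if (std::isnan(A)) return A; // propagate NaN operands
  if (std::isnan(B)) return B;
  if (A == 0.0f && B == 0.0f)  // order -0.0 below +0.0
    return std::signbit(A) ? B : A;
  return A > B ? A : B;
}

// Model of the repeated splitting: log2(N) rounds of vector-sized
// fmaximum. (In the real DAG, splitting stops at the legal vector type
// and a final VECREDUCE_FMAXIMUM handles the remainder.)
float reduce_fmaximum_by_splitting(std::vector<float> V) {
  while (V.size() > 1) {
    std::size_t Half = V.size() / 2;
    for (std::size_t I = 0; I != Half; ++I)
      V[I] = fmaximum(V[I], V[Half + I]);
    V.resize(Half);
  }
  return V[0];
}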

@simeonkr
Contributor Author

simeonkr commented Jan 3, 2024

Ping

1 similar comment
@simeonkr
Contributor Author

Ping
