[RISCV] Improve llvm.reduce.fmaximum/minimum lowering #75484
base: main
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Simeon K (simeonkr)

Changes

The default lowering of VECREDUCE_FMAXIMUM/VECREDUCE_FMINIMUM for RISC-V involves splitting the vector multiple times and performing logarithmically many (in terms of vector length) FMINIMUM/FMAXIMUM operations. Given that even a single such operation expands to a large sequence of instructions, a better strategy is needed.

This patch transforms such reductions into an equivalent sequence of a reduction fmin/fmax, a reduction sum to detect any NaNs, and a scalar select to choose the correct result. Since these reduction operations are natively supported on RISC-V, this yields a much more efficient instruction sequence.

The transformation is performed in the DAG combiner, before the type legalizer has a chance to split the reduction and generate FMINIMUM/FMAXIMUM nodes.

Patch is 34.11 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75484.diff

2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index cf1b11c14b6d0f..471e30bd4434d4 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -1368,7 +1368,8 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
setTargetDAGCombine({ISD::INTRINSIC_VOID, ISD::INTRINSIC_W_CHAIN,
ISD::INTRINSIC_WO_CHAIN, ISD::ADD, ISD::SUB, ISD::AND,
- ISD::OR, ISD::XOR, ISD::SETCC, ISD::SELECT});
+ ISD::OR, ISD::XOR, ISD::SETCC, ISD::SELECT,
+ ISD::VECREDUCE_FMAXIMUM, ISD::VECREDUCE_FMINIMUM});
if (Subtarget.is64Bit())
setTargetDAGCombine(ISD::SRA);
@@ -15650,6 +15651,22 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
return SDValue();
}
+ case ISD::VECREDUCE_FMAXIMUM:
+ case ISD::VECREDUCE_FMINIMUM: {
+ EVT RT = N->getValueType(0);
+ SDValue N0 = N->getOperand(0);
+
+ // Reduction fmax/fmin + separate reduction sum to propagate NaNs
+ unsigned ReducedMinMaxOpc =
+ N->getOpcode() == ISD::VECREDUCE_FMAXIMUM ? ISD::VECREDUCE_FMAX :
+ ISD::VECREDUCE_FMIN;
+ SDValue MinMax = DAG.getNode(ReducedMinMaxOpc, DL, RT, N0);
+ if (N0->getFlags().hasNoNaNs())
+ return MinMax;
+ SDValue Sum = DAG.getNode(ISD::VECREDUCE_FADD, DL, RT, N0);
+ SDValue SumIsNonNan = DAG.getSetCC(DL, XLenVT, Sum, Sum, ISD::SETOEQ);
+ return DAG.getSelect(DL, RT, SumIsNonNan, MinMax, Sum);
+ }
}
return SDValue();
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
index 3f6aa72bc2e3b2..2de4b32dab197a 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -target-abi=ilp32d -mattr=+v,+zfh,+zvfh,+f,+d -verify-machineinstrs < %s | FileCheck %s
-; RUN: llc -mtriple=riscv64 -target-abi=lp64d -mattr=+v,+zfh,+zvfh,+f,+d -verify-machineinstrs < %s | FileCheck %s
+; RUN: llc -mtriple=riscv32 -target-abi=ilp32d -mattr=+v,+zfh,+zvfh,+f,+d -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV32
+; RUN: llc -mtriple=riscv64 -target-abi=lp64d -mattr=+v,+zfh,+zvfh,+f,+d -verify-machineinstrs < %s | FileCheck %s --check-prefixes=CHECK,RV64
declare half @llvm.vector.reduce.fadd.v1f16(half, <1 x half>)
@@ -1592,3 +1592,949 @@ define float @vreduce_nsz_fadd_v4f32(ptr %x, float %s) {
%red = call reassoc nsz float @llvm.vector.reduce.fadd.v4f32(float %s, <4 x float> %v)
ret float %red
}
+
+declare half @llvm.vector.reduce.fminimum.v2f16(<2 x half>)
+
+define half @vreduce_fminimum_v2f16(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB99_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB99_2:
+; CHECK-NEXT: ret
+ %v = load <2 x half>, ptr %x
+ %red = call half @llvm.vector.reduce.fminimum.v2f16(<2 x half> %v)
+ ret half %red
+}
+
+declare half @llvm.vector.reduce.fminimum.v4f16(<4 x half>)
+
+define half @vreduce_fminimum_v4f16(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB100_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB100_2:
+; CHECK-NEXT: ret
+ %v = load <4 x half>, ptr %x
+ %red = call half @llvm.vector.reduce.fminimum.v4f16(<4 x half> %v)
+ ret half %red
+}
+
+define half @vreduce_fminimum_v4f16_nonans(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f16_nonans:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB101_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB101_2:
+; CHECK-NEXT: ret
+ %v = load <4 x half>, ptr %x
+ %red = call nnan half @llvm.vector.reduce.fminimum.v4f16(<4 x half> %v)
+ ret half %red
+}
+
+define half @vreduce_fminimum_v4f16_nonans_noinfs(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f16_nonans_noinfs:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB102_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB102_2:
+; CHECK-NEXT: ret
+ %v = load <4 x half>, ptr %x
+ %red = call nnan ninf half @llvm.vector.reduce.fminimum.v4f16(<4 x half> %v)
+ ret half %red
+}
+
+declare half @llvm.vector.reduce.fminimum.v128f16(<128 x half>)
+
+define half @vreduce_fminimum_v128f16(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v128f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: li a1, 64
+; CHECK-NEXT: vsetvli zero, a1, e16, m8, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: addi a0, a0, 128
+; CHECK-NEXT: vle16.v v16, (a0)
+; CHECK-NEXT: vfadd.vv v24, v8, v16
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v0, a0
+; CHECK-NEXT: vfredusum.vs v24, v24, v0
+; CHECK-NEXT: vfmv.f.s fa0, v24
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB103_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfmin.vv v8, v8, v16
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB103_2:
+; CHECK-NEXT: ret
+ %v = load <128 x half>, ptr %x
+ %red = call half @llvm.vector.reduce.fminimum.v128f16(<128 x half> %v)
+ ret half %red
+}
+
+declare float @llvm.vector.reduce.fminimum.v2f32(<2 x float>)
+
+define float @vreduce_fminimum_v2f32(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v2f32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
+; CHECK-NEXT: vle32.v v8, (a0)
+; CHECK-NEXT: lui a0, 524288
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.s a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB104_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB104_2:
+; CHECK-NEXT: ret
+ %v = load <2 x float>, ptr %x
+ %red = call float @llvm.vector.reduce.fminimum.v2f32(<2 x float> %v)
+ ret float %red
+}
+
+declare float @llvm.vector.reduce.fminimum.v4f32(<4 x float>)
+
+define float @vreduce_fminimum_v4f32(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vle32.v v8, (a0)
+; CHECK-NEXT: lui a0, 524288
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.s a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB105_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB105_2:
+; CHECK-NEXT: ret
+ %v = load <4 x float>, ptr %x
+ %red = call float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %v)
+ ret float %red
+}
+
+define float @vreduce_fminimum_v4f32_nonans(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f32_nonans:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vle32.v v8, (a0)
+; CHECK-NEXT: lui a0, 524288
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.s a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB106_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB106_2:
+; CHECK-NEXT: ret
+ %v = load <4 x float>, ptr %x
+ %red = call nnan float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %v)
+ ret float %red
+}
+
+define float @vreduce_fminimum_v4f32_nonans_noinfs(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v4f32_nonans_noinfs:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vle32.v v8, (a0)
+; CHECK-NEXT: lui a0, 524288
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.s a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB107_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB107_2:
+; CHECK-NEXT: ret
+ %v = load <4 x float>, ptr %x
+ %red = call nnan ninf float @llvm.vector.reduce.fminimum.v4f32(<4 x float> %v)
+ ret float %red
+}
+
+declare float @llvm.vector.reduce.fminimum.v128f32(<128 x float>)
+
+define float @vreduce_fminimum_v128f32(ptr %x) {
+; CHECK-LABEL: vreduce_fminimum_v128f32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -16
+; CHECK-NEXT: .cfi_def_cfa_offset 16
+; CHECK-NEXT: csrr a1, vlenb
+; CHECK-NEXT: slli a1, a1, 4
+; CHECK-NEXT: sub sp, sp, a1
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; CHECK-NEXT: li a1, 32
+; CHECK-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; CHECK-NEXT: vle32.v v16, (a0)
+; CHECK-NEXT: addi a1, a0, 384
+; CHECK-NEXT: vle32.v v8, (a1)
+; CHECK-NEXT: addi a1, a0, 256
+; CHECK-NEXT: addi a0, a0, 128
+; CHECK-NEXT: vle32.v v0, (a0)
+; CHECK-NEXT: vle32.v v24, (a1)
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vs8r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT: vfadd.vv v8, v0, v8
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: slli a0, a0, 3
+; CHECK-NEXT: add a0, sp, a0
+; CHECK-NEXT: addi a0, a0, 16
+; CHECK-NEXT: vs8r.v v16, (a0) # Unknown-size Folded Spill
+; CHECK-NEXT: vfadd.vv v16, v16, v24
+; CHECK-NEXT: vfadd.vv v8, v16, v8
+; CHECK-NEXT: lui a0, 524288
+; CHECK-NEXT: vmv.s.x v16, a0
+; CHECK-NEXT: vfredusum.vs v8, v8, v16
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: feq.s a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB108_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vl8r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT: vfmin.vv v8, v0, v8
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: slli a0, a0, 3
+; CHECK-NEXT: add a0, sp, a0
+; CHECK-NEXT: addi a0, a0, 16
+; CHECK-NEXT: vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT: vfmin.vv v16, v16, v24
+; CHECK-NEXT: vfmin.vv v8, v16, v8
+; CHECK-NEXT: vfredmin.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB108_2:
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: slli a0, a0, 4
+; CHECK-NEXT: add sp, sp, a0
+; CHECK-NEXT: addi sp, sp, 16
+; CHECK-NEXT: ret
+ %v = load <128 x float>, ptr %x
+ %red = call float @llvm.vector.reduce.fminimum.v128f32(<128 x float> %v)
+ ret float %red
+}
+
+declare double @llvm.vector.reduce.fminimum.v2f64(<2 x double>)
+
+define double @vreduce_fminimum_v2f64(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v2f64:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; RV32-NEXT: vle64.v v8, (a0)
+; RV32-NEXT: fcvt.d.w fa5, zero
+; RV32-NEXT: fneg.d fa5, fa5
+; RV32-NEXT: vfmv.s.f v9, fa5
+; RV32-NEXT: vfredusum.vs v9, v8, v9
+; RV32-NEXT: vfmv.f.s fa0, v9
+; RV32-NEXT: feq.d a0, fa0, fa0
+; RV32-NEXT: beqz a0, .LBB109_2
+; RV32-NEXT: # %bb.1:
+; RV32-NEXT: vfredmin.vs v8, v8, v8
+; RV32-NEXT: vfmv.f.s fa0, v8
+; RV32-NEXT: .LBB109_2:
+; RV32-NEXT: ret
+;
+; RV64-LABEL: vreduce_fminimum_v2f64:
+; RV64: # %bb.0:
+; RV64-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; RV64-NEXT: vle64.v v8, (a0)
+; RV64-NEXT: li a0, -1
+; RV64-NEXT: slli a0, a0, 63
+; RV64-NEXT: vmv.s.x v9, a0
+; RV64-NEXT: vfredusum.vs v9, v8, v9
+; RV64-NEXT: vfmv.f.s fa0, v9
+; RV64-NEXT: feq.d a0, fa0, fa0
+; RV64-NEXT: beqz a0, .LBB109_2
+; RV64-NEXT: # %bb.1:
+; RV64-NEXT: vfredmin.vs v8, v8, v8
+; RV64-NEXT: vfmv.f.s fa0, v8
+; RV64-NEXT: .LBB109_2:
+; RV64-NEXT: ret
+ %v = load <2 x double>, ptr %x
+ %red = call double @llvm.vector.reduce.fminimum.v2f64(<2 x double> %v)
+ ret double %red
+}
+
+declare double @llvm.vector.reduce.fminimum.v4f64(<4 x double>)
+
+define double @vreduce_fminimum_v4f64(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v4f64:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; RV32-NEXT: vle64.v v8, (a0)
+; RV32-NEXT: fcvt.d.w fa5, zero
+; RV32-NEXT: fneg.d fa5, fa5
+; RV32-NEXT: vfmv.s.f v10, fa5
+; RV32-NEXT: vfredusum.vs v10, v8, v10
+; RV32-NEXT: vfmv.f.s fa0, v10
+; RV32-NEXT: feq.d a0, fa0, fa0
+; RV32-NEXT: beqz a0, .LBB110_2
+; RV32-NEXT: # %bb.1:
+; RV32-NEXT: vfredmin.vs v8, v8, v8
+; RV32-NEXT: vfmv.f.s fa0, v8
+; RV32-NEXT: .LBB110_2:
+; RV32-NEXT: ret
+;
+; RV64-LABEL: vreduce_fminimum_v4f64:
+; RV64: # %bb.0:
+; RV64-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; RV64-NEXT: vle64.v v8, (a0)
+; RV64-NEXT: li a0, -1
+; RV64-NEXT: slli a0, a0, 63
+; RV64-NEXT: vmv.s.x v10, a0
+; RV64-NEXT: vfredusum.vs v10, v8, v10
+; RV64-NEXT: vfmv.f.s fa0, v10
+; RV64-NEXT: feq.d a0, fa0, fa0
+; RV64-NEXT: beqz a0, .LBB110_2
+; RV64-NEXT: # %bb.1:
+; RV64-NEXT: vfredmin.vs v8, v8, v8
+; RV64-NEXT: vfmv.f.s fa0, v8
+; RV64-NEXT: .LBB110_2:
+; RV64-NEXT: ret
+ %v = load <4 x double>, ptr %x
+ %red = call double @llvm.vector.reduce.fminimum.v4f64(<4 x double> %v)
+ ret double %red
+}
+
+define double @vreduce_fminimum_v4f64_nonans(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v4f64_nonans:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; RV32-NEXT: vle64.v v8, (a0)
+; RV32-NEXT: fcvt.d.w fa5, zero
+; RV32-NEXT: fneg.d fa5, fa5
+; RV32-NEXT: vfmv.s.f v10, fa5
+; RV32-NEXT: vfredusum.vs v10, v8, v10
+; RV32-NEXT: vfmv.f.s fa0, v10
+; RV32-NEXT: feq.d a0, fa0, fa0
+; RV32-NEXT: beqz a0, .LBB111_2
+; RV32-NEXT: # %bb.1:
+; RV32-NEXT: vfredmin.vs v8, v8, v8
+; RV32-NEXT: vfmv.f.s fa0, v8
+; RV32-NEXT: .LBB111_2:
+; RV32-NEXT: ret
+;
+; RV64-LABEL: vreduce_fminimum_v4f64_nonans:
+; RV64: # %bb.0:
+; RV64-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; RV64-NEXT: vle64.v v8, (a0)
+; RV64-NEXT: li a0, -1
+; RV64-NEXT: slli a0, a0, 63
+; RV64-NEXT: vmv.s.x v10, a0
+; RV64-NEXT: vfredusum.vs v10, v8, v10
+; RV64-NEXT: vfmv.f.s fa0, v10
+; RV64-NEXT: feq.d a0, fa0, fa0
+; RV64-NEXT: beqz a0, .LBB111_2
+; RV64-NEXT: # %bb.1:
+; RV64-NEXT: vfredmin.vs v8, v8, v8
+; RV64-NEXT: vfmv.f.s fa0, v8
+; RV64-NEXT: .LBB111_2:
+; RV64-NEXT: ret
+ %v = load <4 x double>, ptr %x
+ %red = call nnan double @llvm.vector.reduce.fminimum.v4f64(<4 x double> %v)
+ ret double %red
+}
+
+define double @vreduce_fminimum_v4f64_nonans_noinfs(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v4f64_nonans_noinfs:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; RV32-NEXT: vle64.v v8, (a0)
+; RV32-NEXT: fcvt.d.w fa5, zero
+; RV32-NEXT: fneg.d fa5, fa5
+; RV32-NEXT: vfmv.s.f v10, fa5
+; RV32-NEXT: vfredusum.vs v10, v8, v10
+; RV32-NEXT: vfmv.f.s fa0, v10
+; RV32-NEXT: feq.d a0, fa0, fa0
+; RV32-NEXT: beqz a0, .LBB112_2
+; RV32-NEXT: # %bb.1:
+; RV32-NEXT: vfredmin.vs v8, v8, v8
+; RV32-NEXT: vfmv.f.s fa0, v8
+; RV32-NEXT: .LBB112_2:
+; RV32-NEXT: ret
+;
+; RV64-LABEL: vreduce_fminimum_v4f64_nonans_noinfs:
+; RV64: # %bb.0:
+; RV64-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; RV64-NEXT: vle64.v v8, (a0)
+; RV64-NEXT: li a0, -1
+; RV64-NEXT: slli a0, a0, 63
+; RV64-NEXT: vmv.s.x v10, a0
+; RV64-NEXT: vfredusum.vs v10, v8, v10
+; RV64-NEXT: vfmv.f.s fa0, v10
+; RV64-NEXT: feq.d a0, fa0, fa0
+; RV64-NEXT: beqz a0, .LBB112_2
+; RV64-NEXT: # %bb.1:
+; RV64-NEXT: vfredmin.vs v8, v8, v8
+; RV64-NEXT: vfmv.f.s fa0, v8
+; RV64-NEXT: .LBB112_2:
+; RV64-NEXT: ret
+ %v = load <4 x double>, ptr %x
+ %red = call nnan ninf double @llvm.vector.reduce.fminimum.v4f64(<4 x double> %v)
+ ret double %red
+}
+
+declare double @llvm.vector.reduce.fminimum.v32f64(<32 x double>)
+
+define double @vreduce_fminimum_v32f64(ptr %x) {
+; RV32-LABEL: vreduce_fminimum_v32f64:
+; RV32: # %bb.0:
+; RV32-NEXT: vsetivli zero, 16, e64, m8, ta, ma
+; RV32-NEXT: vle64.v v8, (a0)
+; RV32-NEXT: addi a0, a0, 128
+; RV32-NEXT: vle64.v v16, (a0)
+; RV32-NEXT: vfadd.vv v24, v8, v16
+; RV32-NEXT: fcvt.d.w fa5, zero
+; RV32-NEXT: fneg.d fa5, fa5
+; RV32-NEXT: vfmv.s.f v0, fa5
+; RV32-NEXT: vfredusum.vs v24, v24, v0
+; RV32-NEXT: vfmv.f.s fa0, v24
+; RV32-NEXT: feq.d a0, fa0, fa0
+; RV32-NEXT: beqz a0, .LBB113_2
+; RV32-NEXT: # %bb.1:
+; RV32-NEXT: vfmin.vv v8, v8, v16
+; RV32-NEXT: vfredmin.vs v8, v8, v8
+; RV32-NEXT: vfmv.f.s fa0, v8
+; RV32-NEXT: .LBB113_2:
+; RV32-NEXT: ret
+;
+; RV64-LABEL: vreduce_fminimum_v32f64:
+; RV64: # %bb.0:
+; RV64-NEXT: vsetivli zero, 16, e64, m8, ta, ma
+; RV64-NEXT: vle64.v v8, (a0)
+; RV64-NEXT: addi a0, a0, 128
+; RV64-NEXT: vle64.v v16, (a0)
+; RV64-NEXT: vfadd.vv v24, v8, v16
+; RV64-NEXT: li a0, -1
+; RV64-NEXT: slli a0, a0, 63
+; RV64-NEXT: vmv.s.x v0, a0
+; RV64-NEXT: vfredusum.vs v24, v24, v0
+; RV64-NEXT: vfmv.f.s fa0, v24
+; RV64-NEXT: feq.d a0, fa0, fa0
+; RV64-NEXT: beqz a0, .LBB113_2
+; RV64-NEXT: # %bb.1:
+; RV64-NEXT: vfmin.vv v8, v8, v16
+; RV64-NEXT: vfredmin.vs v8, v8, v8
+; RV64-NEXT: vfmv.f.s fa0, v8
+; RV64-NEXT: .LBB113_2:
+; RV64-NEXT: ret
+ %v = load <32 x double>, ptr %x
+ %red = call double @llvm.vector.reduce.fminimum.v32f64(<32 x double> %v)
+ ret double %red
+}
+
+declare half @llvm.vector.reduce.fmaximum.v2f16(<2 x half>)
+
+define half @vreduce_fmaximum_v2f16(ptr %x) {
+; CHECK-LABEL: vreduce_fmaximum_v2f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e16, mf4, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB114_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmax.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB114_2:
+; CHECK-NEXT: ret
+ %v = load <2 x half>, ptr %x
+ %red = call half @llvm.vector.reduce.fmaximum.v2f16(<2 x half> %v)
+ ret half %red
+}
+
+declare half @llvm.vector.reduce.fmaximum.v4f16(<4 x half>)
+
+define half @vreduce_fmaximum_v4f16(ptr %x) {
+; CHECK-LABEL: vreduce_fmaximum_v4f16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vle16.v v8, (a0)
+; CHECK-NEXT: lui a0, 1048568
+; CHECK-NEXT: vmv.s.x v9, a0
+; CHECK-NEXT: vfredusum.vs v9, v8, v9
+; CHECK-NEXT: vfmv.f.s fa0, v9
+; CHECK-NEXT: feq.h a0, fa0, fa0
+; CHECK-NEXT: beqz a0, .LBB115_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: vfredmax.vs v8, v8, v8
+; CHECK-NEXT: vfmv.f.s fa0, v8
+; CHECK-NEXT: .LBB115_2:
+; CHECK-NEXT: ret
+ %v = load <4 x half>, ptr %x
+ %red = call half @llvm.vector.reduce.fmaximum.v4f16(<4 x half> %v)
+ ret half %red
+}
+
+define half @vreduce_fmaximum_v4f16_nonans(ptr %x) {
+; CHECK-LABEL: vreduce_fmaximum_v4f16_nonans:...
[truncated]
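
For readers skimming the truncated diff, the sequence the combine builds corresponds roughly to the following LLVM IR sketch (illustrative only; the function name and the 4-element width are invented here, and the -0.0 start value simply mirrors the neutral element the generated checks materialize with lui/vmv.s.x):

; Not a test from the patch -- a hand-written sketch of the combined lowering.
declare float @llvm.vector.reduce.fmin.v4f32(<4 x float>)
declare float @llvm.vector.reduce.fadd.v4f32(float, <4 x float>)

define float @vreduce_fminimum_sketch(<4 x float> %v) {
  ; Reduction min (vfredmin): cheap, but does not order NaNs the way fminimum requires.
  %min = call float @llvm.vector.reduce.fmin.v4f32(<4 x float> %v)
  ; Unordered reduction sum (vfredusum): any NaN element makes %sum NaN.
  %sum = call reassoc float @llvm.vector.reduce.fadd.v4f32(float -0.0, <4 x float> %v)
  ; True exactly when %sum is not NaN.
  %ord = fcmp oeq float %sum, %sum
  ; No NaN seen: use the reduction min; otherwise return the NaN-valued sum.
  %res = select i1 %ord, float %min, float %sum
  ret float %res
}

In the checks above this appears as a vfredusum feeding a scalar feq, with a branch that skips the vfredmin/vfredmax entirely when a NaN was detected.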
Are you sure the FMINIMUM/FMAXIMUM came from the type legalizer? If the vector type is legal, the type legalizer shouldn't touch it. I would expect LegalizeVectorOps or LegalizeDAG to be where it gets broken down. So I think we could use custom lowering instead of a DAG combine.
// Reduction fmax/fmin + separate reduction sum to propagate NaNs
unsigned ReducedMinMaxOpc = N->getOpcode() == ISD::VECREDUCE_FMAXIMUM
                                ? ISD::VECREDUCE_FMAX
ISD::VECREDUCE_FMAX/FMIN aren't guaranteed to preserve the order of -0.0 and +0.0. If the generic DAG combiner ends up seeing a constant vector for some reason after this change, it might incorrectly fold it.
It happens in
Hence if I handle the reduction in
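
To make the ±0.0 point concrete, a hypothetical constant-vector case (function names invented for illustration, not part of the patch) would behave as follows:

declare float @llvm.vector.reduce.fminimum.v2f32(<2 x float>)
declare float @llvm.vector.reduce.fmin.v2f32(<2 x float>)

; fminimum orders -0.0 below +0.0, so this must fold to -0.0.
define float @zeros_fminimum() {
  %r = call float @llvm.vector.reduce.fminimum.v2f32(<2 x float> <float 0.0, float -0.0>)
  ret float %r
}

; reduce.fmin has fminnum-style semantics: when the operands compare equal,
; either one may be returned, so a constant fold here may legitimately yield
; +0.0 -- a miscompile if this node had replaced the fminimum above.
define float @zeros_fmin() {
  %r = call float @llvm.vector.reduce.fmin.v2f32(<2 x float> <float 0.0, float -0.0>)
  ret float %r
}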
Ping
1 similar comment
Ping