
AMDGPU: Optimize set_rounding if input is known to fit in 2 bits #88588

Merged: 6 commits merged into llvm:main from the amdgpu-optimize-set-rounding branch on May 3, 2024

Conversation

@arsenm arsenm (Contributor) commented Apr 12, 2024

We don't need to figure out the weird extended rounding modes or handle offsets to keep the lookup table within 64 bits.

https://reviews.llvm.org/D153258

Depends on #88587

Use a shift of a magic constant and some offsetting to convert from flt_rounds values.

I don't know why the enum defines Dynamic = 7. The standard suggests -1 is the "cannot determine" value. If we could start the extended values at 4, we wouldn't need the extra compare, sub, and select.

https://reviews.llvm.org/D153257
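
For reference, a minimal standalone sketch of the two lowering shapes described above, assuming the same shift-and-mask layout as the patch; decodeFull and decodeKnownStandard are illustrative names, not LLVM helpers, and the table value is simply the 64-bit constant materialized by the s_mov_b32 pair in the tests below:

#include <cstdint>

// Packed table of 4-bit MODE.fp_round entries, one nibble per FLT_ROUNDS index.
// Value taken from the s_mov_b32 pair (0x1c84a50f low, 0xb73e62d9 high) in the tests.
constexpr uint64_t kTable = 0xb73e62d91c84a50full;

// General path: standard values 0-3 index directly; extended values start at 8
// and are shifted down by 4 first (the compare/sub/select in the lowering).
constexpr uint32_t decodeFull(uint32_t FltRounds) {
  uint32_t Index = FltRounds < 4 ? FltRounds : FltRounds - 4;
  return static_cast<uint32_t>(kTable >> (Index * 4)) & 0xf;
}

// Fast path this PR adds: when known bits prove the input is 0-3, the
// offsetting disappears and only the low half of the table is needed, so a
// 32-bit shift suffices (the setreg ignores the extracted high bits anyway).
constexpr uint32_t decodeKnownStandard(uint32_t FltRounds /* 0..3 */) {
  return (static_cast<uint32_t>(kTable) >> (FltRounds * 4)) & 0xf;
}

With a constant input the lookup folds away entirely at compile time, which is the first branch in lowerSET_ROUNDING.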
@llvmbot llvmbot (Collaborator) commented Apr 12, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-ir

Author: Matt Arsenault (arsenm)

Changes

We don't need to figure out the weird extended rounding modes or handle offsets to keep the lookup table within 64 bits.

https://reviews.llvm.org/D153258

Depends on #88587


Patch is 79.86 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/88588.diff

8 Files Affected:

  • (modified) llvm/docs/AMDGPUUsage.rst (+6)
  • (modified) llvm/docs/LangRef.rst (+2)
  • (modified) llvm/docs/ReleaseNotes.rst (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+93)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+1)
  • (modified) llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.cpp (+113)
  • (modified) llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.h (+12)
  • (added) llvm/test/CodeGen/AMDGPU/llvm.set.rounding.ll (+1715)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 22c1d1f186ea54..aacb026f965fc7 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1157,6 +1157,12 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                    register do not exactly match the FLT_ROUNDS values,
                                                    so a conversion is performed.
 
+  :ref:`llvm.set.rounding<int_set_rounding>`       Input value expected to be one of the valid results
+                                                   from '``llvm.get.rounding``'. Rounding mode is
+                                                   undefined if not passed a valid input. This should be
+                                                   a wave uniform value. In case of a divergent input
+                                                   value, the first active lane's value will be used.
+
   :ref:`llvm.get.fpenv<int_get_fpenv>`             Returns the current value of the AMDGPU floating point environment.
                                                    This stores information related to the current rounding mode,
                                                    denormalization mode, enabled traps, and floating point exceptions.
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index f6ada292b93b10..e07b131619fde9 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -26653,6 +26653,8 @@ specified by C standard:
 Other values may be used to represent additional rounding modes, supported by a
 target. These values are target-specific.
 
+.. _int_set_rounding:
+
 '``llvm.set.rounding``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/llvm/docs/ReleaseNotes.rst b/llvm/docs/ReleaseNotes.rst
index d2d542752b555e..13d8b54b44e082 100644
--- a/llvm/docs/ReleaseNotes.rst
+++ b/llvm/docs/ReleaseNotes.rst
@@ -74,6 +74,8 @@ Changes to the AMDGPU Backend
 
 * Implemented the ``llvm.get.fpenv`` and ``llvm.set.fpenv`` intrinsics.
 
+* Implemented :ref:`llvm.get.rounding <int_get_rounding>` and :ref:`llvm.set.rounding <int_set_rounding>`
+
 Changes to the ARM Backend
 --------------------------
 * FEAT_F32MM is no longer activated by default when using `+sve` on v8.6-A or greater. The feature is still available and can be used by adding `+f32mm` to the command line options.
@@ -133,7 +135,7 @@ Changes to the C API
   functions for accessing the values in a blockaddress constant.
 
 * Added ``LLVMConstStringInContext2`` function, which better matches the C++
-  API by using ``size_t`` for string length. Deprecated ``LLVMConstStringInContext``. 
+  API by using ``size_t`` for string length. Deprecated ``LLVMConstStringInContext``.
 
 * Added the following functions for accessing a function's prefix data:
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 14948ef9ea8d17..a76481bb726f99 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -877,6 +877,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
 
   setOperationAction(ISD::STACKSAVE, MVT::Other, Custom);
   setOperationAction(ISD::GET_ROUNDING, MVT::i32, Custom);
+  setOperationAction(ISD::SET_ROUNDING, MVT::Other, Custom);
   setOperationAction(ISD::GET_FPENV, MVT::i64, Custom);
   setOperationAction(ISD::SET_FPENV, MVT::i64, Custom);
 
@@ -4056,6 +4057,96 @@ SDValue SITargetLowering::lowerGET_ROUNDING(SDValue Op,
   return DAG.getMergeValues({Result, GetReg.getValue(1)}, SL);
 }
 
+SDValue SITargetLowering::lowerSET_ROUNDING(SDValue Op,
+                                            SelectionDAG &DAG) const {
+  SDLoc SL(Op);
+
+  SDValue NewMode = Op.getOperand(1);
+  assert(NewMode.getValueType() == MVT::i32);
+
+  // Index a table of 4-bit entries mapping from the C FLT_ROUNDS values to the
+  // hardware MODE.fp_round values.
+  if (auto *ConstMode = dyn_cast<ConstantSDNode>(NewMode)) {
+      uint32_t ClampedVal = std::min(
+          static_cast<uint32_t>(ConstMode->getZExtValue()),
+          static_cast<uint32_t>(AMDGPU::TowardZeroF32_TowardNegativeF64));
+      NewMode = DAG.getConstant(
+          AMDGPU::decodeFltRoundToHWConversionTable(ClampedVal), SL, MVT::i32);
+  } else {
+    // If we know the input can only be one of the supported standard modes in
+    // the range 0-3, we can use a simplified mapping to hardware values.
+    KnownBits KB = DAG.computeKnownBits(NewMode);
+    const bool UseReducedTable = KB.countMinLeadingZeros() >= 30;
+    // The supported standard values are 0-3. The extended values start at 8. We
+    // need to offset by 4 if the value is in the extended range.
+
+    if (UseReducedTable) {
+      // Truncate to the low 32-bits.
+      SDValue BitTable = DAG.getConstant(
+        AMDGPU::FltRoundToHWConversionTable & 0xffff, SL, MVT::i32);
+
+      SDValue Two = DAG.getConstant(2, SL, MVT::i32);
+      SDValue RoundModeTimesNumBits =
+        DAG.getNode(ISD::SHL, SL, MVT::i32, NewMode, Two);
+
+      SDValue TableValue =
+        DAG.getNode(ISD::SRL, SL, MVT::i32, BitTable, RoundModeTimesNumBits);
+      NewMode = DAG.getNode(ISD::TRUNCATE, SL, MVT::i32, TableValue);
+
+      // TODO: SimplifyDemandedBits on the setreg source here can likely reduce
+      // the table extracted bits into inline immediates.
+    } else {
+      // is_standard = value < 4;
+      // table_index = is_standard ? value : (value - 4)
+      // MODE.fp_round = (bit_table >> table_index) & 0xf
+      SDValue BitTable =
+        DAG.getConstant(AMDGPU::FltRoundToHWConversionTable, SL, MVT::i64);
+
+      SDValue Four = DAG.getConstant(4, SL, MVT::i32);
+      SDValue IsStandardValue =
+        DAG.getSetCC(SL, MVT::i1, NewMode, Four, ISD::SETULT);
+      SDValue OffsetEnum = DAG.getNode(ISD::SUB, SL, MVT::i32, NewMode, Four);
+
+      SDValue IndexVal = DAG.getNode(ISD::SELECT, SL, MVT::i32, IsStandardValue,
+                                     NewMode, OffsetEnum);
+
+      SDValue Two = DAG.getConstant(2, SL, MVT::i32);
+      SDValue RoundModeTimesNumBits =
+        DAG.getNode(ISD::SHL, SL, MVT::i32, IndexVal, Two);
+
+      SDValue TableValue =
+        DAG.getNode(ISD::SRL, SL, MVT::i64, BitTable, RoundModeTimesNumBits);
+      SDValue TruncTable = DAG.getNode(ISD::TRUNCATE, SL, MVT::i32, TableValue);
+
+      // No need to mask out the high bits since the setreg will ignore them
+      // anyway.
+      NewMode = TruncTable;
+    }
+
+    // Insert a readfirstlane in case the value is a VGPR. We could do this
+    // earlier and keep more operations scalar, but that interferes with
+    // combining the source.
+    SDValue ReadFirstLaneID =
+      DAG.getTargetConstant(Intrinsic::amdgcn_readfirstlane, SL, MVT::i32);
+    NewMode = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, SL, MVT::i32,
+                          ReadFirstLaneID, NewMode);
+  }
+
+  // N.B. The setreg will be later folded into s_round_mode on supported
+  // targets.
+  SDValue IntrinID =
+      DAG.getTargetConstant(Intrinsic::amdgcn_s_setreg, SL, MVT::i32);
+  uint32_t BothRoundHwReg =
+      AMDGPU::Hwreg::HwregEncoding::encode(AMDGPU::Hwreg::ID_MODE, 0, 4);
+  SDValue RoundBothImm = DAG.getTargetConstant(BothRoundHwReg, SL, MVT::i32);
+
+  SDValue SetReg =
+      DAG.getNode(ISD::INTRINSIC_VOID, SL, Op->getVTList(), Op.getOperand(0),
+                  IntrinID, RoundBothImm, NewMode);
+
+  return SetReg;
+}
+
 SDValue SITargetLowering::lowerPREFETCH(SDValue Op, SelectionDAG &DAG) const {
   if (Op->isDivergent())
     return SDValue();
@@ -5743,6 +5834,8 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
     return LowerSTACKSAVE(Op, DAG);
   case ISD::GET_ROUNDING:
     return lowerGET_ROUNDING(Op, DAG);
+  case ISD::SET_ROUNDING:
+    return lowerSET_ROUNDING(Op, DAG);
   case ISD::PREFETCH:
     return lowerPREFETCH(Op, DAG);
   case ISD::FP_EXTEND:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 9856a2923d38f7..08aa2a5991631d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -422,6 +422,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSTACKSAVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerGET_ROUNDING(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerSET_ROUNDING(SDValue Op, SelectionDAG &DAG) const;
 
   SDValue lowerPREFETCH(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.cpp b/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.cpp
index 2684a1e3c3358a..f03fd0b1f4088a 100644
--- a/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.cpp
+++ b/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.cpp
@@ -174,3 +174,116 @@ static_assert(decodeIndexFltRoundConversionTable(getModeRegisterRoundMode(
 static_assert(decodeIndexFltRoundConversionTable(getModeRegisterRoundMode(
                   HWTowardNegative, HWTowardPositive)) ==
               TowardNegativeF32_TowardPositiveF64);
+
+// Decode FLT_ROUNDS into the hardware value where the two rounding modes are
+// the same and use a standard value
+static constexpr uint64_t encodeFltRoundsToHWTableSame(uint32_t HWVal,
+                                                       uint32_t FltRoundsVal) {
+  if (FltRoundsVal > TowardNegative)
+    FltRoundsVal -= ExtendedFltRoundOffset;
+
+  return static_cast<uint64_t>(getModeRegisterRoundMode(HWVal, HWVal))
+         << (FltRoundsVal << 2);
+}
+
+/// Decode FLT_ROUNDS into the hardware value where the two rounding modes
+/// different and use an extended value.
+static constexpr uint64_t encodeFltRoundsToHWTable(uint32_t HWF32Val,
+                                                   uint32_t HWF64Val,
+                                                   uint32_t FltRoundsVal) {
+  if (FltRoundsVal > TowardNegative)
+    FltRoundsVal -= ExtendedFltRoundOffset;
+  return static_cast<uint64_t>(getModeRegisterRoundMode(HWF32Val, HWF64Val))
+         << (FltRoundsVal << 2);
+}
+
+constexpr uint64_t AMDGPU::FltRoundToHWConversionTable =
+    encodeFltRoundsToHWTableSame(HWTowardZero, TowardZeroF32_TowardZeroF64) |
+    encodeFltRoundsToHWTableSame(HWNearestTiesToEven,
+                                 NearestTiesToEvenF32_NearestTiesToEvenF64) |
+    encodeFltRoundsToHWTableSame(HWTowardPositive,
+                                 TowardPositiveF32_TowardPositiveF64) |
+    encodeFltRoundsToHWTableSame(HWTowardNegative,
+                                 TowardNegativeF32_TowardNegativeF64) |
+
+    encodeFltRoundsToHWTable(HWTowardZero, HWNearestTiesToEven,
+                             TowardZeroF32_NearestTiesToEvenF64) |
+    encodeFltRoundsToHWTable(HWTowardZero, HWTowardPositive,
+                             TowardZeroF32_TowardPositiveF64) |
+    encodeFltRoundsToHWTable(HWTowardZero, HWTowardNegative,
+                             TowardZeroF32_TowardNegativeF64) |
+
+    encodeFltRoundsToHWTable(HWNearestTiesToEven, HWTowardZero,
+                             NearestTiesToEvenF32_TowardZeroF64) |
+    encodeFltRoundsToHWTable(HWNearestTiesToEven, HWTowardPositive,
+                             NearestTiesToEvenF32_TowardPositiveF64) |
+    encodeFltRoundsToHWTable(HWNearestTiesToEven, HWTowardNegative,
+                             NearestTiesToEvenF32_TowardNegativeF64) |
+
+    encodeFltRoundsToHWTable(HWTowardPositive, HWTowardZero,
+                             TowardPositiveF32_TowardZeroF64) |
+    encodeFltRoundsToHWTable(HWTowardPositive, HWNearestTiesToEven,
+                             TowardPositiveF32_NearestTiesToEvenF64) |
+    encodeFltRoundsToHWTable(HWTowardPositive, HWTowardNegative,
+                             TowardPositiveF32_TowardNegativeF64) |
+
+    encodeFltRoundsToHWTable(HWTowardNegative, HWTowardZero,
+                             TowardNegativeF32_TowardZeroF64) |
+    encodeFltRoundsToHWTable(HWTowardNegative, HWNearestTiesToEven,
+                             TowardNegativeF32_NearestTiesToEvenF64) |
+    encodeFltRoundsToHWTable(HWTowardNegative, HWTowardPositive,
+                             TowardNegativeF32_TowardPositiveF64);
+
+// Verify evaluation of FltRoundToHWConversionTable
+
+static_assert(decodeFltRoundToHWConversionTable(AMDGPUFltRounds::TowardZero) ==
+              getModeRegisterRoundMode(HWTowardZero, HWTowardZero));
+static_assert(
+    decodeFltRoundToHWConversionTable(AMDGPUFltRounds::NearestTiesToEven) ==
+    getModeRegisterRoundMode(HWNearestTiesToEven, HWNearestTiesToEven));
+static_assert(
+    decodeFltRoundToHWConversionTable(AMDGPUFltRounds::TowardPositive) ==
+    getModeRegisterRoundMode(HWTowardPositive, HWTowardPositive));
+static_assert(
+    decodeFltRoundToHWConversionTable(AMDGPUFltRounds::TowardNegative) ==
+    getModeRegisterRoundMode(HWTowardNegative, HWTowardNegative));
+
+static_assert(
+    decodeFltRoundToHWConversionTable(NearestTiesToEvenF32_TowardPositiveF64) ==
+    getModeRegisterRoundMode(HWNearestTiesToEven, HWTowardPositive));
+static_assert(
+    decodeFltRoundToHWConversionTable(NearestTiesToEvenF32_TowardNegativeF64) ==
+    getModeRegisterRoundMode(HWNearestTiesToEven, HWTowardNegative));
+static_assert(
+    decodeFltRoundToHWConversionTable(NearestTiesToEvenF32_TowardZeroF64) ==
+    getModeRegisterRoundMode(HWNearestTiesToEven, HWTowardZero));
+
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardPositiveF32_NearestTiesToEvenF64) ==
+    getModeRegisterRoundMode(HWTowardPositive, HWNearestTiesToEven));
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardPositiveF32_TowardNegativeF64) ==
+    getModeRegisterRoundMode(HWTowardPositive, HWTowardNegative));
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardPositiveF32_TowardZeroF64) ==
+    getModeRegisterRoundMode(HWTowardPositive, HWTowardZero));
+
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardNegativeF32_NearestTiesToEvenF64) ==
+    getModeRegisterRoundMode(HWTowardNegative, HWNearestTiesToEven));
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardNegativeF32_TowardPositiveF64) ==
+    getModeRegisterRoundMode(HWTowardNegative, HWTowardPositive));
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardNegativeF32_TowardZeroF64) ==
+    getModeRegisterRoundMode(HWTowardNegative, HWTowardZero));
+
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardZeroF32_NearestTiesToEvenF64) ==
+    getModeRegisterRoundMode(HWTowardZero, HWNearestTiesToEven));
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardZeroF32_TowardPositiveF64) ==
+    getModeRegisterRoundMode(HWTowardZero, HWTowardPositive));
+static_assert(
+    decodeFltRoundToHWConversionTable(TowardZeroF32_TowardNegativeF64) ==
+    getModeRegisterRoundMode(HWTowardZero, HWTowardNegative));
diff --git a/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.h b/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.h
index 9fbd74c3eede32..1bfb5add50f7ec 100644
--- a/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.h
+++ b/llvm/lib/Target/AMDGPU/SIModeRegisterDefaults.h
@@ -144,6 +144,18 @@ static constexpr uint32_t F64FltRoundOffset = 2;
 // values.
 extern const uint64_t FltRoundConversionTable;
 
+// Bit indexed table to convert from FLT_ROUNDS values to hardware rounding mode
+// values
+extern const uint64_t FltRoundToHWConversionTable;
+
+/// Read the hardware rounding mode equivalent of a AMDGPUFltRounds value.
+constexpr uint32_t decodeFltRoundToHWConversionTable(uint32_t FltRounds) {
+  uint32_t IndexVal = FltRounds;
+  if (IndexVal > TowardNegative)
+    IndexVal -= ExtendedFltRoundOffset;
+  return (FltRoundToHWConversionTable >> (IndexVal << 2)) & 0xf;
+}
+
 } // end namespace AMDGPU
 
 } // end namespace llvm
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.set.rounding.ll b/llvm/test/CodeGen/AMDGPU/llvm.set.rounding.ll
new file mode 100644
index 00000000000000..ca90f6fd88514a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.set.rounding.ll
@@ -0,0 +1,1715 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
+; RUN: llc -march=amdgcn -mcpu=tahiti < %s | FileCheck -check-prefixes=GCN,GFX678,GFX6 %s
+; RUN: llc -march=amdgcn -mcpu=hawaii < %s | FileCheck -check-prefixes=GCN,GFX678,GFX7 %s
+; RUN: llc -march=amdgcn -mcpu=fiji < %s | FileCheck -check-prefixes=GCN,GFX678,GFX8 %s
+; RUN: llc -march=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9 %s
+; RUN: llc -march=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX1011,GFX10 %s
+; RUN: llc -march=amdgcn -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX1011,GFX11 %s
+
+declare void @llvm.set.rounding(i32)
+declare i32 @llvm.get.rounding()
+
+define amdgpu_gfx void @s_set_rounding(i32 inreg %rounding) {
+; GFX678-LABEL: s_set_rounding:
+; GFX678:       ; %bb.0:
+; GFX678-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX678-NEXT:    s_add_i32 s34, s4, -4
+; GFX678-NEXT:    s_cmp_lt_u32 s4, 4
+; GFX678-NEXT:    s_cselect_b32 s34, s4, s34
+; GFX678-NEXT:    s_lshl_b32 s36, s34, 2
+; GFX678-NEXT:    s_mov_b32 s34, 0x1c84a50f
+; GFX678-NEXT:    s_mov_b32 s35, 0xb73e62d9
+; GFX678-NEXT:    s_lshr_b64 s[34:35], s[34:35], s36
+; GFX678-NEXT:    s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s34
+; GFX678-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-LABEL: s_set_rounding:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s34, s4, -4
+; GFX9-NEXT:    s_cmp_lt_u32 s4, 4
+; GFX9-NEXT:    s_cselect_b32 s34, s4, s34
+; GFX9-NEXT:    s_lshl_b32 s36, s34, 2
+; GFX9-NEXT:    s_mov_b32 s34, 0x1c84a50f
+; GFX9-NEXT:    s_mov_b32 s35, 0xb73e62d9
+; GFX9-NEXT:    s_lshr_b64 s[34:35], s[34:35], s36
+; GFX9-NEXT:    s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s34
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: s_set_rounding:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    s_add_i32 s34, s4, -4
+; GFX10-NEXT:    s_cmp_lt_u32 s4, 4
+; GFX10-NEXT:    s_cselect_b32 s34, s4, s34
+; GFX10-NEXT:    s_lshl_b32 s36, s34, 2
+; GFX10-NEXT:    s_mov_b32 s34, 0x1c84a50f
+; GFX10-NEXT:    s_mov_b32 s35, 0xb73e62d9
+; GFX10-NEXT:    s_lshr_b64 s[34:35], s[34:35], s36
+; GFX10-NEXT:    s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s34
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: s_set_rounding:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s0, s4, -4
+; GFX11-NEXT:    s_cmp_lt_u32 s4, 4
+; GFX11-NEXT:    s_cselect_b32 s0, s4, s0
+; GFX11-NEXT:    s_lshl_b32 s2, s0, 2
+; GFX11-NEXT:    s_mov_b32 s0, 0x1c84a50f
+; GFX11-NEXT:    s_mov_b32 s1, 0xb73e62d9
+; GFX11-NEXT:    s_lshr_b64 s[0:1], s[0:1], s2
+; GFX11-NEXT:    s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s0
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
+  call void @llvm.set.rounding(i32 %rounding)
+  ret void
+}
+
+define amdgpu_kernel void @s_set_rounding_kernel(i32 inreg %rounding) {
+; GFX6-LABEL: s_set_rounding_kernel:
+; GFX6:       ; %bb.0:
+; GFX6-NEXT:    s_load_dword s2, s[0:1], 0x9
+; GFX6-NEXT:    s_mov_b32 s0, 0x1c84a50f
+; GFX6-NEXT:    s_mov_b32 s1, 0xb73e62d9
+; GFX6-NEXT:    ;;#ASMSTART
+; GFX6-NEXT:    ;;#ASMEND
+; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX6-NEXT:    s_add_i32 s3, s2, -4
+; GFX6-NEXT:    s_cmp_lt_u32 s2, 4
+; GFX6-NEXT:    s_cselect_b32 s2, s2, s3
+; GFX6-NEXT:    s_lshl_b32 s2, s2, 2
+; GFX6-NEXT:    s_lshr_b64 s[0:1], s[0:1], s2
+; GFX6-NEXT:    s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s0
+; GFX6-NEXT:    s_endpgm
+;
+; GFX7-LABEL: s_set_rounding_kernel:
+; GFX7:       ; %bb.0:
+; GFX7-NEXT:    s_load_dword s2, s[0:1], 0x9
+; GFX7-NEXT:    s_mov_b32 s0, 0x1c84a50f
+; GFX7-NEXT:    s_mov_b32 s1, 0xb73e62d9
+; GFX7-NEXT:    ;;#ASMSTART
+; GFX7-NEXT:    ;;#ASMEND
+; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX7-NEXT:    s_add_i32 s3, s2, -4
+; GFX7-NEXT:    s_cmp_lt_u32 s2, 4
+; GFX7-NEXT:    s_cselect_b32 s2, s2, s3
+; GFX7-NEXT:    s_lshl_b32 s2, s2, 2
+; GFX7-NEXT:    s_lshr_b64 s[0:1], s[0:1], s2
+; GFX7-NEXT:    s_setreg_b32 hwreg(HW_REG_MODE, 0, 4), s0
+; GFX7-NEXT:    s_endpgm
+;
+; GFX8-LABEL: s_set_rounding_kernel:
+; GFX8:       ; %bb.0:
+; GFX8-NEXT:    s_load_dwo...
[truncated]
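
As a sanity check on the magic pair 0x1c84a50f / 0xb73e62d9 loaded into s[34:35] in the checks above (the packed conversion table, low half first), the four standard FLT_ROUNDS entries reproduce the low 16 bits of the constant, assuming the MODE.fp_round encoding is 0 = nearest-even, 1 = toward +inf, 2 = toward -inf, 3 = toward zero with the f32 mode in the low two bits of each nibble (an assumption here, not spelled out in the truncated diff):

#include <cstdint>

// Assumed hardware round-mode encodings (see the lead-in; not quoted from the patch).
enum : uint32_t { HWNearestEven = 0, HWTowardPositive = 1,
                  HWTowardNegative = 2, HWTowardZero = 3 };

// One table nibble combines the f32 and f64 modes: f32 | (f64 << 2).
constexpr uint32_t nibble(uint32_t HW) { return HW | (HW << 2); }

// FLT_ROUNDS order: 0 = toward zero, 1 = nearest-even, 2 = +inf, 3 = -inf.
static_assert((nibble(HWTowardZero)     << 0  |
               nibble(HWNearestEven)    << 4  |
               nibble(HWTowardPositive) << 8  |
               nibble(HWTowardNegative) << 12) == 0xa50f,
              "low 16 bits of 0x1c84a50f");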


github-actions bot commented Apr 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jhuber6 jhuber6 (Contributor) left a comment

Seems pretty straightforward given we already have llvm.get.rounding implemented.

; GFX11-LABEL: s_set_rounding_i3_zeroext:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: v_cmp_lt_u16_e64 s0, s4, 4
Contributor

Not related to your patch, but this is horrible codegen, using v_cmp to compare sgpr values.

@arsenm arsenm force-pushed the amdgpu-optimize-set-rounding branch from 1a0d915 to cfdf94a on May 3, 2024 08:00
@arsenm arsenm merged commit b4e751e into llvm:main May 3, 2024
3 of 4 checks passed
@arsenm arsenm deleted the amdgpu-optimize-set-rounding branch May 3, 2024 09:17
joker-eph added a commit that referenced this pull request May 4, 2024
Revert "AMDGPU: Try to fix build error with old gcc"
This reverts commit c7ad12d.

Revert "AMDGPU: Use umin in set.rounding expansion"
This reverts commit a56f0b5.

Revert "AMDGPU: Optimize set_rounding if input is known to fit in 2 bits (#88588)"
This reverts commit b4e751e.

Revert "AMDGPU: Implement llvm.set.rounding (#88587)"
This reverts commit 9731b77.
sookach pushed a commit to sookach/llvm-project that referenced this pull request May 4, 2024
Revert "AMDGPU: Try to fix build error with old gcc"
This reverts commit c7ad12d.

Revert "AMDGPU: Use umin in set.rounding expansion"
This reverts commit a56f0b5.

Revert "AMDGPU: Optimize set_rounding if input is known to fit in 2 bits (llvm#88588)"
This reverts commit b4e751e.

Revert "AMDGPU: Implement llvm.set.rounding (llvm#88587)"
This reverts commit 9731b77.