AMDGPU: Avoid introducing unnecessary fabs in fast fdiv lowering #172553

arsenm · 2025-12-16T20:57:49Z

If the sign bit of the denominator is known 0, do not emit the fabs.
Also, extend this to handle min/max with fabs inputs.

I originally tried to do this as the general combine on fabs, but
it proved to be too much trouble at this time. This is mostly
complexity introduced by expanding the various min/maxes into
canonicalizes, and then not being able to assume the sign bit
of canonicalize (fabs x) without nnan.

This defends against future code size regressions in the atan2 and
atan2pi library functions.
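
For illustration, the conservative rule this change applies to min/max can be sketched on a toy expression tree. This is only a minimal model, not the LLVM code: `Node`, `Opcode`, `MaxDepth`, and this `signBitIsZeroFP` are hypothetical stand-ins for `SDValue`, the `ISD` opcodes, `MaxRecursionDepth`, and `SelectionDAG::SignBitIsZeroFP`. It also assumes fabs is treated as known non-negative, which the updated tests imply.

```cpp
#include <vector>

// Toy opcodes; hypothetical stand-ins for the ISD:: opcodes in the diff.
enum class Opcode { Leaf, FAbs, FMin, FMax };

// Toy expression node; a hypothetical stand-in for SDValue.
struct Node {
  Opcode Op;
  std::vector<const Node *> Operands;
};

// Assumed recursion cap, playing the role of MaxRecursionDepth.
constexpr unsigned MaxDepth = 6;

// Returns true only when the sign bit of N is provably zero.
bool signBitIsZeroFP(const Node &N, unsigned Depth = 0) {
  if (Depth >= MaxDepth)
    return false; // Limit search depth, as in the real function.

  switch (N.Op) {
  case Opcode::FAbs:
    return true; // fabs always clears the sign bit.
  case Opcode::FMin:
  case Opcode::FMax:
    // Without permission to ignore the sign bit of a NaN, both operands
    // must be known non-negative, even for max.
    return signBitIsZeroFP(*N.Operands[1], Depth + 1) &&
           signBitIsZeroFP(*N.Operands[0], Depth + 1);
  case Opcode::Leaf:
    return false; // Nothing is known about an arbitrary input.
  }
  return false;
}
```

In this model, max(fabs x, fabs y) is reported as having a zero sign bit, while max(fabs x, y) is not, because y could be a NaN with its sign bit set.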

llvmbot · 2025-12-16T20:58:45Z

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/172553.diff

3 Files Affected:

(modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+15)
(modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+4-1)
(modified) llvm/test/CodeGen/AMDGPU/fabs-known-signbit-combine-fast-fdiv-lowering.ll (+6-6)

diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 69491c6f2c565..4482df15242d9 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -2791,6 +2791,7 @@ bool SelectionDAG::SignBitIsZero(SDValue Op, unsigned Depth) const {
   return MaskedValueIsZero(Op, APInt::getSignMask(BitWidth), Depth);
 }
 
+// TODO: Should have argument to specify if sign bit of nan is ignorable.
 bool SelectionDAG::SignBitIsZeroFP(SDValue Op, unsigned Depth) const {
   if (Depth >= MaxRecursionDepth)
     return false; // Limit search depth.
@@ -2812,6 +2813,20 @@ bool SelectionDAG::SignBitIsZeroFP(SDValue Op, unsigned Depth) const {
   case ISD::FEXP2:
   case ISD::FEXP10:
     return Op->getFlags().hasNoNaNs();
+  case ISD::FMINNUM:
+  case ISD::FMINNUM_IEEE:
+  case ISD::FMINIMUM:
+  case ISD::FMINIMUMNUM:
+    return SignBitIsZeroFP(Op.getOperand(1), Depth + 1) &&
+           SignBitIsZeroFP(Op.getOperand(0), Depth + 1);
+  case ISD::FMAXNUM:
+  case ISD::FMAXNUM_IEEE:
+  case ISD::FMAXIMUM:
+  case ISD::FMAXIMUMNUM:
+    // TODO: If we can ignore the sign bit of nans, only one side being known 0
+    // is sufficient.
+    return SignBitIsZeroFP(Op.getOperand(1), Depth + 1) &&
+           SignBitIsZeroFP(Op.getOperand(0), Depth + 1);
   default:
     return false;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ff50fdfe9b09f..afdeed658b76e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -12336,7 +12336,10 @@ SDValue SITargetLowering::lowerFDIV_FAST(SDValue Op, SelectionDAG &DAG) const {
   SDValue LHS = Op.getOperand(1);
   SDValue RHS = Op.getOperand(2);
 
-  SDValue r1 = DAG.getNode(ISD::FABS, SL, MVT::f32, RHS, Flags);
+  // TODO: The combiner should probably handle elimination of redundant fabs.
+  SDValue r1 = DAG.SignBitIsZeroFP(RHS)
+                   ? RHS
+                   : DAG.getNode(ISD::FABS, SL, MVT::f32, RHS, Flags);
 
   const APFloat K0Val(0x1p+96f);
   const SDValue K0 = DAG.getConstantFP(K0Val, SL, MVT::f32);
diff --git a/llvm/test/CodeGen/AMDGPU/fabs-known-signbit-combine-fast-fdiv-lowering.ll b/llvm/test/CodeGen/AMDGPU/fabs-known-signbit-combine-fast-fdiv-lowering.ll
index 038252e4cb1e4..750f390e79110 100644
--- a/llvm/test/CodeGen/AMDGPU/fabs-known-signbit-combine-fast-fdiv-lowering.ll
+++ b/llvm/test/CodeGen/AMDGPU/fabs-known-signbit-combine-fast-fdiv-lowering.ll
@@ -73,7 +73,7 @@ define float @fdiv_fast_daz_rhs_signbit_known_zero_maxnum_fabs(float %x, float %
 ; CHECK-NEXT:    v_max_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    s_mov_b32 s4, 0x6f800000
 ; CHECK-NEXT:    v_mov_b32_e32 v2, 0x2f800000
-; CHECK-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, s4
+; CHECK-NEXT:    v_cmp_lt_f32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v2, 1.0, v2, vcc
 ; CHECK-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    v_rcp_f32_e32 v1, v1
@@ -97,7 +97,7 @@ define float @fdiv_fast_daz_rhs_signbit_known_zero_minnum_fabs(float %x, float %
 ; CHECK-NEXT:    v_min_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    s_mov_b32 s4, 0x6f800000
 ; CHECK-NEXT:    v_mov_b32_e32 v2, 0x2f800000
-; CHECK-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, s4
+; CHECK-NEXT:    v_cmp_lt_f32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v2, 1.0, v2, vcc
 ; CHECK-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    v_rcp_f32_e32 v1, v1
@@ -122,7 +122,7 @@ define float @fdiv_fast_daz_rhs_signbit_known_zero_maximum_fabs(float %x, float
 ; CHECK-NEXT:    v_cndmask_b32_e32 v1, v4, v3, vcc
 ; CHECK-NEXT:    s_mov_b32 s4, 0x6f800000
 ; CHECK-NEXT:    v_mov_b32_e32 v2, 0x2f800000
-; CHECK-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, s4
+; CHECK-NEXT:    v_cmp_lt_f32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v2, 1.0, v2, vcc
 ; CHECK-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    v_rcp_f32_e32 v1, v1
@@ -147,7 +147,7 @@ define float @fdiv_fast_daz_rhs_signbit_known_zero_minimum_fabs(float %x, float
 ; CHECK-NEXT:    v_cndmask_b32_e32 v1, v4, v3, vcc
 ; CHECK-NEXT:    s_mov_b32 s4, 0x6f800000
 ; CHECK-NEXT:    v_mov_b32_e32 v2, 0x2f800000
-; CHECK-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, s4
+; CHECK-NEXT:    v_cmp_lt_f32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v2, 1.0, v2, vcc
 ; CHECK-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    v_rcp_f32_e32 v1, v1
@@ -171,7 +171,7 @@ define float @fdiv_fast_daz_rhs_signbit_known_zero_maximumnum_fabs(float %x, flo
 ; CHECK-NEXT:    v_max_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    s_mov_b32 s4, 0x6f800000
 ; CHECK-NEXT:    v_mov_b32_e32 v2, 0x2f800000
-; CHECK-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, s4
+; CHECK-NEXT:    v_cmp_lt_f32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v2, 1.0, v2, vcc
 ; CHECK-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    v_rcp_f32_e32 v1, v1
@@ -195,7 +195,7 @@ define float @fdiv_fast_daz_rhs_signbit_known_zero_minimumnum_fabs(float %x, flo
 ; CHECK-NEXT:    v_min_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    s_mov_b32 s4, 0x6f800000
 ; CHECK-NEXT:    v_mov_b32_e32 v2, 0x2f800000
-; CHECK-NEXT:    v_cmp_gt_f32_e64 vcc, |v1|, s4
+; CHECK-NEXT:    v_cmp_lt_f32_e32 vcc, s4, v1
 ; CHECK-NEXT:    v_cndmask_b32_e32 v2, 1.0, v2, vcc
 ; CHECK-NEXT:    v_mul_f32_e32 v1, v1, v2
 ; CHECK-NEXT:    v_rcp_f32_e32 v1, v1

The compiler knows how to select the right division path depending on the denormal mode (and based on the 2.5 ulp limit implied by the OpenCL defaults). This results in almost identical code. Currently the new result has a code size regression due to an unnecessary use of a droppable fabs modifier (which llvm#172553 avoids).
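
To see why the fabs is droppable, here is a scalar sketch of the fast f32 division expansion. It is an approximation under stated assumptions, not the lowering itself: `fdivFastModel` and `rcpModel` are hypothetical names, the reciprocal is modelled as `1.0f / x`, and the steps after the fabs are not part of this diff, so their order here is assumed. The constants are taken from the diff and the test checks: K0 = 2^96 (`0x6f800000`) and K1 = 2^-32 (`0x2f800000`).

```cpp
#include <cmath>

// Stand-in for the hardware reciprocal (v_rcp_f32); an assumption, since
// the real instruction is only approximate.
inline float rcpModel(float X) { return 1.0f / X; }

// Scalar model of the fast fdiv expansion. RHSSignBitKnownZero plays the
// role of DAG.SignBitIsZeroFP(RHS) in the patched lowering.
float fdivFastModel(float LHS, float RHS, bool RHSSignBitKnownZero) {
  const float K0 = std::ldexp(1.0f, 96);  // 0x6f800000
  const float K1 = std::ldexp(1.0f, -32); // 0x2f800000

  // The fabs only feeds the magnitude comparison, so it can be skipped
  // when RHS is already known to be non-negative.
  float R1 = RHSSignBitKnownZero ? RHS : std::fabs(RHS);

  // Scale very large denominators down before taking the reciprocal.
  float Scale = (R1 > K0) ? K1 : 1.0f;
  float Recip = rcpModel(RHS * Scale);

  // Apply the same scale to the quotient to compensate.
  return Scale * (LHS * Recip);
}
```

For a non-negative denominator both settings of `RHSSignBitKnownZero` take the same path and give the same result, which is why the `|v1|` source modifier disappears from the compare in the updated checks.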

arsenm mentioned this pull request Dec 16, 2025

AMDGPU: Add baseline test for redundant fabs on fdiv expansion #172552

Merged

arsenm added backend:AMDGPU floating-point Floating-point math labels Dec 16, 2025 — with Graphite App

arsenm requested review from Pierre-vh, cdevadas, jayfoad, rampitec and shiltian December 16, 2025 20:58

arsenm marked this pull request as ready for review December 16, 2025 20:58

llvmbot added the llvm:SelectionDAG SelectionDAGISel as well label Dec 16, 2025

arsenm mentioned this pull request Dec 16, 2025

device-libs: Remove DAZ_OPT check in atan2/atan2pi ROCm/llvm-project#863

Merged

rampitec approved these changes Dec 16, 2025

Base automatically changed from users/arsenm/amdgpu/add-baseline-test-dag-fabs-fold to main December 16, 2025 22:26

arsenm force-pushed the users/arsenm/dag/fold-fabs-if-signbit-known-0 branch from ae61164 to e8d0ee0 Compare December 16, 2025 22:29

arsenm merged commit 68aea8e into main Dec 16, 2025
10 checks passed

arsenm deleted the users/arsenm/dag/fold-fabs-if-signbit-known-0 branch December 16, 2025 23:22
