Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Fix legalization of frem(-0.0, y) #70448

Closed
wants to merge 2 commits into from
Closed

[AMDGPU] Fix legalization of frem(-0.0, y) #70448

wants to merge 2 commits into from

Conversation

jayfoad
Copy link
Contributor

@jayfoad jayfoad commented Oct 27, 2023

We legalized frem(x, y) -> x - y * trunc(x / y). When x is -0.0 this
evaluates to +0.0 but the result should be -0.0.

Fix this by legalizing to copysign(x - y * trunc(x / y), x).

We legalized frem(x, y) -> x - y * trunc(x / y). When x is -0.0 this
evaluates to +0.0 but the result should be -0.0.

Fix this by legalizing to copysign(x - y * trunc(x / y), x).
@llvmbot
Copy link
Collaborator

llvmbot commented Oct 27, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Author: Jay Foad (jayfoad)

Changes

We legalized frem(x, y) -> x - y * trunc(x / y). When x is -0.0 this
evaluates to +0.0 but the result should be -0.0.

Fix this by legalizing to copysign(x - y * trunc(x / y), x).


Patch is 139.68 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70448.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+6-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+13-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll (+164-21)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+544-336)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+8-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index adf4e0139e03c1d..28f6ad62f2c37ed 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -2267,19 +2267,23 @@ SDValue AMDGPUTargetLowering::LowerSDIVREM(SDValue Op,
   return DAG.getMergeValues(Res, DL);
 }
 
-// (frem x, y) -> (fma (fneg (ftrunc (fdiv x, y))), y, x)
+// (frem x, y) -> (fcopysign (fma (fneg (ftrunc (fdiv x, y))), y, x), x)
+// The fcopysign is only required to get the correct result -0.0 when x is -0.0
+// (and y is non-zero). With NSZ it can be dropped.
 SDValue AMDGPUTargetLowering::LowerFREM(SDValue Op, SelectionDAG &DAG) const {
   SDLoc SL(Op);
   EVT VT = Op.getValueType();
   auto Flags = Op->getFlags();
   SDValue X = Op.getOperand(0);
   SDValue Y = Op.getOperand(1);
+  bool NSZ = mayIgnoreSignedZero(Op);
 
   SDValue Div = DAG.getNode(ISD::FDIV, SL, VT, X, Y, Flags);
   SDValue Trunc = DAG.getNode(ISD::FTRUNC, SL, VT, Div, Flags);
   SDValue Neg = DAG.getNode(ISD::FNEG, SL, VT, Trunc, Flags);
   // TODO: For f32 use FMAD instead if !hasFastFMA32?
-  return DAG.getNode(ISD::FMA, SL, VT, Neg, Y, X, Flags);
+  SDValue FMA = DAG.getNode(ISD::FMA, SL, VT, Neg, Y, X, Flags);
+  return NSZ ? FMA : DAG.getNode(ISD::FCOPYSIGN , SL, VT, FMA, X);
 }
 
 SDValue AMDGPUTargetLowering::LowerFCEIL(SDValue Op, SelectionDAG &DAG) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 3d70ed150df12f8..0775d8a3c2396f7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2341,6 +2341,14 @@ bool AMDGPULegalizerInfo::legalizeFceil(
   return true;
 }
 
+static bool allowNoSignedZeros(const MachineFunction &MF, unsigned Flags) {
+  return (Flags & MachineInstr::FmNsz) ||
+      MF.getTarget().Options.NoSignedZerosFPMath;
+}
+
+// Legalize frem(x, y) -> copysign(x - y * trunc(x / y), x)
+// The copysign is only required to get the correct result -0.0 when x is -0.0
+// (and y is non-zero). With NSZ it can be dropped.
 bool AMDGPULegalizerInfo::legalizeFrem(
   MachineInstr &MI, MachineRegisterInfo &MRI,
   MachineIRBuilder &B) const {
@@ -2349,11 +2357,15 @@ bool AMDGPULegalizerInfo::legalizeFrem(
     Register Src1Reg = MI.getOperand(2).getReg();
     auto Flags = MI.getFlags();
     LLT Ty = MRI.getType(DstReg);
+    bool NSZ = allowNoSignedZeros(B.getMF(), Flags);
 
     auto Div = B.buildFDiv(Ty, Src0Reg, Src1Reg, Flags);
     auto Trunc = B.buildIntrinsicTrunc(Ty, Div, Flags);
     auto Neg = B.buildFNeg(Ty, Trunc, Flags);
-    B.buildFMA(DstReg, Neg, Src1Reg, Src0Reg, Flags);
+    DstOp FMADst = NSZ ? DstOp(DstReg) : DstOp(Ty);
+    auto FMA = B.buildFMA(FMADst, Neg, Src1Reg, Src0Reg, Flags);
+    if (!NSZ)
+      B.buildFCopysign(DstReg, FMA, Src0Reg);
     MI.eraseFromParent();
     return true;
 }
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll
index 0417b97a53c0ff3..da7fca95d5eef07 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/frem.ll
@@ -29,7 +29,10 @@ define amdgpu_kernel void @frem_f16(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    v_div_fixup_f32 v2, v2, v1, v0
 ; CI-NEXT:    v_trunc_f32_e32 v2, v2
-; CI-NEXT:    v_fma_f32 v0, -v2, v1, v0
+; CI-NEXT:    v_fma_f32 v1, -v2, v1, v0
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_and_b32_e32 v0, 0x80000000, v0
+; CI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; CI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 2, 2), 0
 ; CI-NEXT:    v_cvt_f16_f32_e32 v0, v0
 ; CI-NEXT:    buffer_store_short v0, off, s[4:7], 0
@@ -46,12 +49,15 @@ define amdgpu_kernel void @frem_f16(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; VI-NEXT:    v_cvt_f32_f16_e32 v0, s2
 ; VI-NEXT:    v_cvt_f32_f16_e32 v2, s0
 ; VI-NEXT:    v_mov_b32_e32 v1, s0
+; VI-NEXT:    s_and_b32 s0, s2, 0xffff8000
 ; VI-NEXT:    v_rcp_f32_e32 v2, v2
 ; VI-NEXT:    v_mul_f32_e32 v0, v0, v2
 ; VI-NEXT:    v_cvt_f16_f32_e32 v0, v0
 ; VI-NEXT:    v_div_fixup_f16 v0, v0, v1, s2
 ; VI-NEXT:    v_trunc_f16_e32 v0, v0
-; VI-NEXT:    v_fma_f16 v2, -v0, v1, s2
+; VI-NEXT:    v_fma_f16 v0, -v0, v1, s2
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fff, v0
+; VI-NEXT:    v_or_b32_e32 v2, s0, v0
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
 ; VI-NEXT:    flat_store_short v[0:1], v2
@@ -126,7 +132,10 @@ define amdgpu_kernel void @unsafe_frem_f16(ptr addrspace(1) %out, ptr addrspace(
 ; CI-NEXT:    v_rcp_f32_e32 v2, v1
 ; CI-NEXT:    v_mul_f32_e32 v2, v0, v2
 ; CI-NEXT:    v_trunc_f32_e32 v2, v2
-; CI-NEXT:    v_fma_f32 v0, -v2, v1, v0
+; CI-NEXT:    v_fma_f32 v1, -v2, v1, v0
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_and_b32_e32 v0, 0x80000000, v0
+; CI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; CI-NEXT:    v_cvt_f16_f32_e32 v0, v0
 ; CI-NEXT:    buffer_store_short v0, off, s[4:7], 0
 ; CI-NEXT:    s_endpgm
@@ -143,7 +152,10 @@ define amdgpu_kernel void @unsafe_frem_f16(ptr addrspace(1) %out, ptr addrspace(
 ; VI-NEXT:    v_rcp_f16_e32 v0, s0
 ; VI-NEXT:    v_mul_f16_e32 v0, s2, v0
 ; VI-NEXT:    v_trunc_f16_e32 v0, v0
-; VI-NEXT:    v_fma_f16 v2, -v0, s0, v1
+; VI-NEXT:    v_fma_f16 v0, -v0, s0, v1
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fff, v0
+; VI-NEXT:    s_and_b32 s0, s2, 0xffff8000
+; VI-NEXT:    v_or_b32_e32 v2, s0, v0
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
 ; VI-NEXT:    flat_store_short v[0:1], v2
@@ -178,11 +190,14 @@ define amdgpu_kernel void @frem_f32(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; CI-NEXT:    v_fma_f32 v1, -v1, v4, v2
 ; CI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
 ; CI-NEXT:    v_div_fmas_f32 v1, v1, v3, v4
+; CI-NEXT:    s_and_b32 s0, s2, 0x80000000
 ; CI-NEXT:    s_mov_b32 s6, -1
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    v_div_fixup_f32 v1, v1, v0, s2
 ; CI-NEXT:    v_trunc_f32_e32 v1, v1
 ; CI-NEXT:    v_fma_f32 v0, -v1, v0, s2
+; CI-NEXT:    v_and_b32_e32 v0, 0x7fffffff, v0
+; CI-NEXT:    v_or_b32_e32 v0, s0, v0
 ; CI-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; CI-NEXT:    s_endpgm
 ;
@@ -207,9 +222,12 @@ define amdgpu_kernel void @frem_f32(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; VI-NEXT:    v_fma_f32 v1, -v1, v4, v2
 ; VI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
 ; VI-NEXT:    v_div_fmas_f32 v1, v1, v3, v4
+; VI-NEXT:    s_and_b32 s0, s2, 0x80000000
 ; VI-NEXT:    v_div_fixup_f32 v1, v1, v0, s2
 ; VI-NEXT:    v_trunc_f32_e32 v1, v1
-; VI-NEXT:    v_fma_f32 v2, -v1, v0, s2
+; VI-NEXT:    v_fma_f32 v0, -v1, v0, s2
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fffffff, v0
+; VI-NEXT:    v_or_b32_e32 v2, s0, v0
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
 ; VI-NEXT:    flat_store_dword v[0:1], v2
@@ -282,6 +300,9 @@ define amdgpu_kernel void @unsafe_frem_f32(ptr addrspace(1) %out, ptr addrspace(
 ; CI-NEXT:    v_mul_f32_e32 v0, s2, v0
 ; CI-NEXT:    v_trunc_f32_e32 v0, v0
 ; CI-NEXT:    v_fma_f32 v0, -v0, s0, v1
+; CI-NEXT:    v_and_b32_e32 v0, 0x7fffffff, v0
+; CI-NEXT:    s_and_b32 s0, s2, 0x80000000
+; CI-NEXT:    v_or_b32_e32 v0, s0, v0
 ; CI-NEXT:    buffer_store_dword v0, off, s[4:7], 0
 ; CI-NEXT:    s_endpgm
 ;
@@ -297,7 +318,10 @@ define amdgpu_kernel void @unsafe_frem_f32(ptr addrspace(1) %out, ptr addrspace(
 ; VI-NEXT:    v_rcp_f32_e32 v0, s0
 ; VI-NEXT:    v_mul_f32_e32 v0, s2, v0
 ; VI-NEXT:    v_trunc_f32_e32 v0, v0
-; VI-NEXT:    v_fma_f32 v2, -v0, s0, v1
+; VI-NEXT:    v_fma_f32 v0, -v0, s0, v1
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fffffff, v0
+; VI-NEXT:    s_and_b32 s0, s2, 0x80000000
+; VI-NEXT:    v_or_b32_e32 v2, s0, v0
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
 ; VI-NEXT:    flat_store_dword v[0:1], v2
@@ -325,6 +349,9 @@ define amdgpu_kernel void @frem_f64(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; CI-NEXT:    v_mov_b32_e32 v1, s1
 ; CI-NEXT:    v_div_scale_f64 v[2:3], s[0:1], v[0:1], v[0:1], s[2:3]
 ; CI-NEXT:    v_div_scale_f64 v[8:9], vcc, s[2:3], v[0:1], s[2:3]
+; CI-NEXT:    s_mov_b32 s0, 0
+; CI-NEXT:    s_brev_b32 s1, 1
+; CI-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
 ; CI-NEXT:    v_rcp_f64_e32 v[4:5], v[2:3]
 ; CI-NEXT:    v_fma_f64 v[6:7], -v[2:3], v[4:5], 1.0
 ; CI-NEXT:    v_fma_f64 v[4:5], v[4:5], v[6:7], v[4:5]
@@ -336,6 +363,9 @@ define amdgpu_kernel void @frem_f64(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; CI-NEXT:    v_div_fixup_f64 v[2:3], v[2:3], v[0:1], s[2:3]
 ; CI-NEXT:    v_trunc_f64_e32 v[2:3], v[2:3]
 ; CI-NEXT:    v_fma_f64 v[0:1], -v[2:3], v[0:1], s[2:3]
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_or_b32_e32 v0, s0, v0
+; CI-NEXT:    v_or_b32_e32 v1, s1, v1
 ; CI-NEXT:    buffer_store_dwordx2 v[0:1], off, s[4:7], 0
 ; CI-NEXT:    s_endpgm
 ;
@@ -351,6 +381,9 @@ define amdgpu_kernel void @frem_f64(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
 ; VI-NEXT:    v_div_scale_f64 v[2:3], s[0:1], v[0:1], v[0:1], s[2:3]
 ; VI-NEXT:    v_div_scale_f64 v[8:9], vcc, s[2:3], v[0:1], s[2:3]
+; VI-NEXT:    s_mov_b32 s0, 0
+; VI-NEXT:    s_brev_b32 s1, 1
+; VI-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
 ; VI-NEXT:    v_rcp_f64_e32 v[4:5], v[2:3]
 ; VI-NEXT:    v_fma_f64 v[6:7], -v[2:3], v[4:5], 1.0
 ; VI-NEXT:    v_fma_f64 v[4:5], v[4:5], v[6:7], v[4:5]
@@ -364,6 +397,9 @@ define amdgpu_kernel void @frem_f64(ptr addrspace(1) %out, ptr addrspace(1) %in1
 ; VI-NEXT:    v_fma_f64 v[0:1], -v[2:3], v[0:1], s[2:3]
 ; VI-NEXT:    v_mov_b32_e32 v2, s4
 ; VI-NEXT:    v_mov_b32_e32 v3, s5
+; VI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; VI-NEXT:    v_or_b32_e32 v0, s0, v0
+; VI-NEXT:    v_or_b32_e32 v1, s1, v1
 ; VI-NEXT:    flat_store_dwordx2 v[2:3], v[0:1]
 ; VI-NEXT:    s_endpgm
    %r0 = load double, ptr addrspace(1) %in1, align 8
@@ -453,6 +489,12 @@ define amdgpu_kernel void @unsafe_frem_f64(ptr addrspace(1) %out, ptr addrspace(
 ; CI-NEXT:    v_fma_f64 v[0:1], v[6:7], v[0:1], v[4:5]
 ; CI-NEXT:    v_trunc_f64_e32 v[0:1], v[0:1]
 ; CI-NEXT:    v_fma_f64 v[0:1], -v[0:1], s[0:1], v[2:3]
+; CI-NEXT:    s_mov_b32 s0, 0
+; CI-NEXT:    s_brev_b32 s1, 1
+; CI-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_or_b32_e32 v0, s0, v0
+; CI-NEXT:    v_or_b32_e32 v1, s1, v1
 ; CI-NEXT:    buffer_store_dwordx2 v[0:1], off, s[4:7], 0
 ; CI-NEXT:    s_endpgm
 ;
@@ -476,8 +518,14 @@ define amdgpu_kernel void @unsafe_frem_f64(ptr addrspace(1) %out, ptr addrspace(
 ; VI-NEXT:    v_fma_f64 v[0:1], v[6:7], v[0:1], v[4:5]
 ; VI-NEXT:    v_trunc_f64_e32 v[0:1], v[0:1]
 ; VI-NEXT:    v_fma_f64 v[0:1], -v[0:1], s[0:1], v[2:3]
+; VI-NEXT:    s_mov_b32 s0, 0
+; VI-NEXT:    s_brev_b32 s1, 1
+; VI-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
 ; VI-NEXT:    v_mov_b32_e32 v2, s4
 ; VI-NEXT:    v_mov_b32_e32 v3, s5
+; VI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; VI-NEXT:    v_or_b32_e32 v0, s0, v0
+; VI-NEXT:    v_or_b32_e32 v1, s1, v1
 ; VI-NEXT:    flat_store_dwordx2 v[2:3], v[0:1]
 ; VI-NEXT:    s_endpgm
                              ptr addrspace(1) %in2) #1 {
@@ -515,7 +563,10 @@ define amdgpu_kernel void @frem_v2f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    v_div_fmas_f32 v2, v2, v4, v5
 ; CI-NEXT:    v_div_fixup_f32 v2, v2, v1, v0
 ; CI-NEXT:    v_trunc_f32_e32 v2, v2
-; CI-NEXT:    v_fma_f32 v0, -v2, v1, v0
+; CI-NEXT:    v_fma_f32 v1, -v2, v1, v0
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_and_b32_e32 v0, 0x80000000, v0
+; CI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; CI-NEXT:    v_cvt_f32_f16_e32 v1, s3
 ; CI-NEXT:    v_cvt_f32_f16_e32 v2, s6
 ; CI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 2, 2), 0
@@ -536,7 +587,10 @@ define amdgpu_kernel void @frem_v2f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    v_div_fixup_f32 v3, v3, v2, v1
 ; CI-NEXT:    v_trunc_f32_e32 v3, v3
-; CI-NEXT:    v_fma_f32 v1, -v3, v2, v1
+; CI-NEXT:    v_fma_f32 v2, -v3, v2, v1
+; CI-NEXT:    v_and_b32_e32 v2, 0x7fffffff, v2
+; CI-NEXT:    v_and_b32_e32 v1, 0x80000000, v1
+; CI-NEXT:    v_or_b32_e32 v1, v2, v1
 ; CI-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; CI-NEXT:    v_or_b32_e32 v0, v0, v1
@@ -559,6 +613,7 @@ define amdgpu_kernel void @frem_v2f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; VI-NEXT:    v_rcp_f32_e32 v2, v2
 ; VI-NEXT:    s_lshr_b32 s1, s2, 16
 ; VI-NEXT:    v_rcp_f32_e32 v3, v3
+; VI-NEXT:    s_and_b32 s0, s2, 0xffff8000
 ; VI-NEXT:    v_mul_f32_e32 v0, v0, v2
 ; VI-NEXT:    v_cvt_f16_f32_e32 v0, v0
 ; VI-NEXT:    v_mov_b32_e32 v2, s3
@@ -566,13 +621,19 @@ define amdgpu_kernel void @frem_v2f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; VI-NEXT:    v_trunc_f16_e32 v0, v0
 ; VI-NEXT:    v_fma_f16 v0, -v0, v1, s2
 ; VI-NEXT:    v_cvt_f32_f16_e32 v1, s1
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fff, v0
+; VI-NEXT:    v_or_b32_e32 v0, s0, v0
+; VI-NEXT:    s_and_b32 s0, s1, 0xffff8000
 ; VI-NEXT:    v_mul_f32_e32 v1, v1, v3
 ; VI-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; VI-NEXT:    v_div_fixup_f16 v1, v1, v2, s1
 ; VI-NEXT:    v_trunc_f16_e32 v1, v1
 ; VI-NEXT:    v_fma_f16 v1, -v1, v2, s1
+; VI-NEXT:    v_and_b32_e32 v1, 0x7fff, v1
+; VI-NEXT:    v_or_b32_e32 v1, s0, v1
+; VI-NEXT:    v_and_b32_e32 v1, 0xffff, v1
 ; VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; VI-NEXT:    v_or_b32_e32 v2, v0, v1
+; VI-NEXT:    v_or_b32_sdwa v2, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_mov_b32_e32 v0, s4
 ; VI-NEXT:    v_mov_b32_e32 v1, s5
 ; VI-NEXT:    flat_store_dword v[0:1], v2
@@ -614,7 +675,10 @@ define amdgpu_kernel void @frem_v4f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    v_div_fmas_f32 v2, v2, v4, v5
 ; CI-NEXT:    v_div_fixup_f32 v2, v2, v1, v0
 ; CI-NEXT:    v_trunc_f32_e32 v2, v2
-; CI-NEXT:    v_fma_f32 v0, -v2, v1, v0
+; CI-NEXT:    v_fma_f32 v1, -v2, v1, v0
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_and_b32_e32 v0, 0x80000000, v0
+; CI-NEXT:    v_or_b32_e32 v0, v1, v0
 ; CI-NEXT:    v_cvt_f32_f16_e32 v1, s8
 ; CI-NEXT:    v_cvt_f32_f16_e32 v2, s10
 ; CI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 2, 2), 0
@@ -633,7 +697,10 @@ define amdgpu_kernel void @frem_v4f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    v_div_fmas_f32 v3, v3, v5, v6
 ; CI-NEXT:    v_div_fixup_f32 v3, v3, v2, v1
 ; CI-NEXT:    v_trunc_f32_e32 v3, v3
-; CI-NEXT:    v_fma_f32 v1, -v3, v2, v1
+; CI-NEXT:    v_fma_f32 v2, -v3, v2, v1
+; CI-NEXT:    v_and_b32_e32 v2, 0x7fffffff, v2
+; CI-NEXT:    v_and_b32_e32 v1, 0x80000000, v1
+; CI-NEXT:    v_or_b32_e32 v1, v2, v1
 ; CI-NEXT:    v_cvt_f32_f16_e32 v2, s3
 ; CI-NEXT:    v_cvt_f32_f16_e32 v3, s1
 ; CI-NEXT:    v_cvt_f16_f32_e32 v1, v1
@@ -651,7 +718,10 @@ define amdgpu_kernel void @frem_v4f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    v_div_fmas_f32 v4, v4, v6, v7
 ; CI-NEXT:    v_div_fixup_f32 v4, v4, v3, v2
 ; CI-NEXT:    v_trunc_f32_e32 v4, v4
-; CI-NEXT:    v_fma_f32 v2, -v4, v3, v2
+; CI-NEXT:    v_fma_f32 v3, -v4, v3, v2
+; CI-NEXT:    v_and_b32_e32 v3, 0x7fffffff, v3
+; CI-NEXT:    v_and_b32_e32 v2, 0x80000000, v2
+; CI-NEXT:    v_or_b32_e32 v2, v3, v2
 ; CI-NEXT:    v_cvt_f32_f16_e32 v3, s9
 ; CI-NEXT:    v_cvt_f32_f16_e32 v4, s11
 ; CI-NEXT:    v_cvt_f16_f32_e32 v2, v2
@@ -673,7 +743,10 @@ define amdgpu_kernel void @frem_v4f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    v_div_fixup_f32 v5, v5, v4, v3
 ; CI-NEXT:    v_trunc_f32_e32 v5, v5
-; CI-NEXT:    v_fma_f32 v3, -v5, v4, v3
+; CI-NEXT:    v_fma_f32 v4, -v5, v4, v3
+; CI-NEXT:    v_and_b32_e32 v4, 0x7fffffff, v4
+; CI-NEXT:    v_and_b32_e32 v3, 0x80000000, v3
+; CI-NEXT:    v_or_b32_e32 v3, v4, v3
 ; CI-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; CI-NEXT:    v_lshlrev_b32_e32 v1, 16, v3
 ; CI-NEXT:    v_or_b32_e32 v1, v2, v1
@@ -708,30 +781,44 @@ define amdgpu_kernel void @frem_v4f16(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; VI-NEXT:    v_cvt_f32_f16_e32 v1, s6
 ; VI-NEXT:    v_cvt_f32_f16_e32 v5, s9
 ; VI-NEXT:    s_lshr_b32 s7, s3, 16
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fff, v0
 ; VI-NEXT:    v_mul_f32_e32 v1, v1, v3
 ; VI-NEXT:    v_cvt_f16_f32_e32 v1, v1
 ; VI-NEXT:    v_mov_b32_e32 v3, s1
 ; VI-NEXT:    v_rcp_f32_e32 v5, v5
+; VI-NEXT:    s_and_b32 s0, s2, 0xffff8000
 ; VI-NEXT:    v_div_fixup_f16 v1, v1, v2, s6
 ; VI-NEXT:    v_trunc_f16_e32 v1, v1
 ; VI-NEXT:    v_fma_f16 v1, -v1, v2, s6
 ; VI-NEXT:    v_cvt_f32_f16_e32 v2, s3
-; VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
-; VI-NEXT:    v_or_b32_e32 v0, v0, v1
+; VI-NEXT:    v_or_b32_e32 v0, s0, v0
+; VI-NEXT:    v_and_b32_e32 v1, 0x7fff, v1
+; VI-NEXT:    s_and_b32 s0, s6, 0xffff8000
 ; VI-NEXT:    v_mul_f32_e32 v2, v2, v4
 ; VI-NEXT:    v_cvt_f16_f32_e32 v2, v2
 ; VI-NEXT:    v_mov_b32_e32 v4, s9
+; VI-NEXT:    v_or_b32_e32 v1, s0, v1
+; VI-NEXT:    s_and_b32 s0, s3, 0xffff8000
 ; VI-NEXT:    v_div_fixup_f16 v2, v2, v3, s3
 ; VI-NEXT:    v_trunc_f16_e32 v2, v2
 ; VI-NEXT:    v_fma_f16 v2, -v2, v3, s3
 ; VI-NEXT:    v_cvt_f32_f16_e32 v3, s7
+; VI-NEXT:    v_and_b32_e32 v2, 0x7fff, v2
+; VI-NEXT:    v_or_b32_e32 v2, s0, v2
+; VI-NEXT:    s_and_b32 s0, s7, 0xffff8000
 ; VI-NEXT:    v_mul_f32_e32 v3, v3, v5
 ; VI-NEXT:    v_cvt_f16_f32_e32 v3, v3
+; VI-NEXT:    v_and_b32_e32 v1, 0xffff, v1
+; VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; VI-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_div_fixup_f16 v3, v3, v4, s7
 ; VI-NEXT:    v_trunc_f16_e32 v3, v3
 ; VI-NEXT:    v_fma_f16 v3, -v3, v4, s7
-; VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v3
-; VI-NEXT:    v_or_b32_e32 v1, v2, v1
+; VI-NEXT:    v_and_b32_e32 v3, 0x7fff, v3
+; VI-NEXT:    v_or_b32_e32 v3, s0, v3
+; VI-NEXT:    v_and_b32_e32 v1, 0xffff, v3
+; VI-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
+; VI-NEXT:    v_or_b32_sdwa v1, v2, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; VI-NEXT:    v_mov_b32_e32 v2, s4
 ; VI-NEXT:    v_mov_b32_e32 v3, s5
 ; VI-NEXT:    flat_store_dwordx2 v[2:3], v[0:1]
@@ -766,10 +853,13 @@ define amdgpu_kernel void @frem_v2f32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    v_fma_f32 v1, -v1, v4, v2
 ; CI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
 ; CI-NEXT:    v_div_fmas_f32 v1, v1, v3, v4
+; CI-NEXT:    s_and_b32 s0, s2, 0x80000000
 ; CI-NEXT:    v_div_fixup_f32 v1, v1, v0, s2
 ; CI-NEXT:    v_trunc_f32_e32 v1, v1
 ; CI-NEXT:    v_fma_f32 v0, -v1, v0, s2
+; CI-NEXT:    v_and_b32_e32 v0, 0x7fffffff, v0
 ; CI-NEXT:    v_mov_b32_e32 v1, s1
+; CI-NEXT:    v_or_b32_e32 v0, s0, v0
 ; CI-NEXT:    v_div_scale_f32 v2, s[0:1], v1, v1, s3
 ; CI-NEXT:    v_div_scale_f32 v3, vcc, s3, v1, s3
 ; CI-NEXT:    v_rcp_f32_e32 v4, v2
@@ -782,11 +872,14 @@ define amdgpu_kernel void @frem_v2f32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; CI-NEXT:    v_fma_f32 v2, -v2, v5, v3
 ; CI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
 ; CI-NEXT:    v_div_fmas_f32 v2, v2, v4, v5
+; CI-NEXT:    s_and_b32 s0, s3, 0x80000000
 ; CI-NEXT:    s_mov_b32 s6, -1
 ; CI-NEXT:    s_mov_b32 s7, 0xf000
 ; CI-NEXT:    v_div_fixup_f32 v2, v2, v1, s3
 ; CI-NEXT:    v_trunc_f32_e32 v2, v2
 ; CI-NEXT:    v_fma_f32 v1, -v2, v1, s3
+; CI-NEXT:    v_and_b32_e32 v1, 0x7fffffff, v1
+; CI-NEXT:    v_or_b32_e32 v1, s0, v1
 ; CI-NEXT:    buffer_store_dwordx2 v[0:1], off, s[4:7], 0
 ; CI-NEXT:    s_endpgm
 ;
@@ -811,10 +904,13 @@ define amdgpu_kernel void @frem_v2f32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; VI-NEXT:    v_fma_f32 v1, -v1, v4, v2
 ; VI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
 ; VI-NEXT:    v_div_fmas_f32 v1, v1, v3, v4
+; VI-NEXT:    s_and_b32 s0, s2, 0x80000000
 ; VI-NEXT:    v_div_fixup_f32 v1, v1, v0, s2
 ; VI-NEXT:    v_trunc_f32_e32 v1, v1
 ; VI-NEXT:    v_fma_f32 v0, -v1, v0, s2
+; VI-NEXT:    v_and_b32_e32 v0, 0x7fffffff, v0
 ; VI-NEXT:    v_mov_b32_e32 v1, s1
+; VI-NEXT:    v_or_b32_e32 v0, s0, v0
 ; VI-NEXT:    v_div_scale_f32 v2, s[0:1], v1, v1, s3
 ; VI-NEXT:    v_div_scale_f32 v3, vcc, s3, v1, s3
 ; VI-NEXT:    v_rcp_f32_e32 v4, v2
@@ -827,10 +923,13 @@ define amdgpu_kernel void @frem_v2f32(ptr addrspace(1) %out, ptr addrspace(1) %i
 ; VI-NEXT:    v_fma_f32 v2, -v2, v5, v3
 ; VI-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4...
[truncated]

; CI-NEXT: v_fma_f32 v1, -v2, v1, v0
; CI-NEXT: v_and_b32_e32 v1, 0x7fffffff, v1
; CI-NEXT: v_and_b32_e32 v0, 0x80000000, v0
; CI-NEXT: v_or_b32_e32 v0, v1, v0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the fix is applied here in frem_f16 and in unsafe_frem_f16 but not in fast_frem_f16.

@jayfoad
Copy link
Contributor Author

jayfoad commented Oct 27, 2023

@b-sumner FYI.

@github-actions
Copy link

github-actions bot commented Oct 27, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@jayfoad
Copy link
Contributor Author

jayfoad commented Oct 30, 2023

It has been pointed out to me that this also changes the sign of the result when the division x / y overflows.

Previously: frem(huge, tiny) -> infinity with opposite sign from x
Now: frem(huge, tiny) -> infinity with same sign as x

I'm not sure if this is a good thing or a bad thing.

@arsenm
Copy link
Contributor

arsenm commented Nov 8, 2023

I think the frem lowering is just broken in general for large values

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure frem is just generally broken, and this at least fixes one way it's broken. We never wired up OpenCL to use frem for example, so I didn't know anything was actually using this.

@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 8, 2023

The motivation for this has gone away. I wanted it for Vulkan/SPIR-V but it turns out that SPIR-V does not specify the sign of a zero result from OpFRem.

@jayfoad jayfoad closed this Nov 8, 2023
@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 14, 2023

Actually there is still interest in this, either for its own sake (to follow the LLVM definition in LangRef) or because the SPIR-V definition may change to match the LLVM definition.


// Legalize frem(x, y) -> copysign(x - y * trunc(x / y), x)
// The copysign is only required to get the correct result -0.0 when x is -0.0
// (and y is non-zero). With NSZ it can be dropped.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also drop it if isKnownNeverZero(x)?

@arsenm
Copy link
Contributor

arsenm commented Nov 30, 2023

Should probably do this if it follows the LangRef. But again, this lowering is wrong to begin with

@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 30, 2023

Actually there is still interest in this, either for its own sake (to follow the LLVM definition in LangRef) or because the SPIR-V definition may change to match the LLVM definition.

The last I heard from Khronos is that the SPIR-V definition is not going to change - the sign of a zero result will remain undefined.

@jayfoad jayfoad closed this by deleting the head repository Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants