
[AMDGPU] Disable atomic optimization of fadd/fsub with result #96479

Merged: jayfoad merged 3 commits into llvm:main from the atomicrmw-fadd-result branch on Jul 3, 2024

Conversation

@jayfoad (Contributor) commented Jun 24, 2024

An atomic fadd instruction like this should return %x:

; value at %ptr is %x
%r = atomicrmw fadd ptr %ptr, float %y

After atomic optimization, if %y is uniform, the result is calculated
as %r = %x + %y * +0.0. This has a couple of problems:

  1. If %y is Inf or NaN, this will return NaN instead of %x.
  2. If %x is -0.0 and %y is positive, this will return +0.0 instead of
    -0.0.

Avoid these problems by disabling the "%y is uniform" path if there are
any uses of the result.

An atomic fadd instruction like this should return %x:

  ; value at %ptr is %x
  %r = atomicrmw fadd ptr %ptr, float %y

After atomic optimization, the result is calculated as
%r = %x + %y * +0.0. This has a couple of problems:

1. If %y is Inf or NaN, this will return NaN instead of %x.
2. If %x is -0.0 and %y is positive, this will return +0.0 instead of
   -0.0.

Avoid these problems by only optimizing fadd/fsub if there are no uses
of the result.
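For illustration, here is a minimal standalone C++ sketch (not part of the patch; the values are made up) of the two failure modes of reconstructing the result as %x + %y * +0.0:

  #include <cmath>
  #include <cstdio>
  #include <limits>

  // Standalone illustration (not from the patch): reconstructing the per-lane
  // result of an atomicrmw fadd as x + y * +0.0f does not always give back
  // the old memory value x.
  int main() {
    // Problem 1: if y is Inf (or NaN), y * +0.0f is NaN, so the reconstructed
    // result is NaN instead of the old value x.
    float x = 1.0f;
    float y = std::numeric_limits<float>::infinity();
    std::printf("%f\n", x + y * 0.0f); // prints nan, expected 1.000000

    // Problem 2: if x is -0.0 and y is positive, y * +0.0f is +0.0f, and
    // -0.0f + +0.0f is +0.0f, so the sign of zero is lost.
    x = -0.0f;
    y = 2.0f;
    std::printf("signbit = %d\n", (int)std::signbit(x + y * 0.0f)); // 0, expected 1
    return 0;
  }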
@llvmbot (Collaborator) commented Jun 24, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes

An atomic fadd instruction like this should return %x:

; value at %ptr is %x
%r = atomicrmw fadd ptr %ptr, float %y

After atomic optimization, the result is calculated as
%r = %x + %y * +0.0. This has a couple of problems:

  1. If %y is Inf or NaN, this will return NaN instead of %x.
  2. If %x is -0.0 and %y is positive, this will return +0.0 instead of
    -0.0.

Avoid these problems by only optimizing fadd/fsub if there are no uses
of the result.


Patch is 176.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96479.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp (+8-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+15-76)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+16-68)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll (+14-30)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll (+36-862)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+323-1036)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
index 38cc5a9bef969..83b7a8dfe2d00 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
@@ -202,11 +202,17 @@ void AMDGPUAtomicOptimizerImpl::visitAtomicRMWInst(AtomicRMWInst &I) {
   case AtomicRMWInst::Min:
   case AtomicRMWInst::UMax:
   case AtomicRMWInst::UMin:
-  case AtomicRMWInst::FAdd:
-  case AtomicRMWInst::FSub:
   case AtomicRMWInst::FMax:
   case AtomicRMWInst::FMin:
     break;
+  case AtomicRMWInst::FAdd:
+  case AtomicRMWInst::FSub:
+    if (!I.use_empty()) {
+      // Bail out because the way we would calculate the result value is
+      // incorrect in the presence of NaNs and infinities.
+      return;
+    }
+    break;
   }
 
   // Only 32 and 64 bit floating point atomic ops are supported.
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll
index e48d281f37c9a..6bb6029bb6791 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll
@@ -149,91 +149,30 @@ define amdgpu_ps float @global_atomic_fadd_f32_rtn_atomicrmw(ptr addrspace(1) %p
 }
 
 define amdgpu_ps float @global_atomic_fadd_f32_saddr_rtn_atomicrmw(ptr addrspace(1) inreg %ptr, float %data) #0 {
+  ; GFX90A_GFX940-LABEL: name: global_atomic_fadd_f32_saddr_rtn_atomicrmw
+  ; GFX90A_GFX940: bb.1 (%ir-block.0):
+  ; GFX90A_GFX940-NEXT:   liveins: $sgpr0, $sgpr1, $vgpr0
+  ; GFX90A_GFX940-NEXT: {{  $}}
+  ; GFX90A_GFX940-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
+  ; GFX90A_GFX940-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
+  ; GFX90A_GFX940-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GFX90A_GFX940-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX90A_GFX940-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX90A_GFX940-NEXT:   [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN:%[0-9]+]]:vgpr_32 = GLOBAL_ATOMIC_ADD_F32_SADDR_RTN [[V_MOV_B32_e32_]], [[COPY2]], [[REG_SEQUENCE]], 0, 1, implicit $exec :: (load store syncscope("wavefront") monotonic (s32) on %ir.ptr, addrspace 1)
+  ; GFX90A_GFX940-NEXT:   $vgpr0 = COPY [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN]]
+  ; GFX90A_GFX940-NEXT:   SI_RETURN_TO_EPILOG implicit $vgpr0
+  ;
   ; GFX11-LABEL: name: global_atomic_fadd_f32_saddr_rtn_atomicrmw
   ; GFX11: bb.1 (%ir-block.0):
-  ; GFX11-NEXT:   successors: %bb.2(0x40000000), %bb.4(0x40000000)
   ; GFX11-NEXT:   liveins: $sgpr0, $sgpr1, $vgpr0
   ; GFX11-NEXT: {{  $}}
   ; GFX11-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GFX11-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
   ; GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GFX11-NEXT:   [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr0
-  ; GFX11-NEXT:   [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
-  ; GFX11-NEXT:   [[SI_PS_LIVE:%[0-9]+]]:sreg_32_xm0_xexec = SI_PS_LIVE
-  ; GFX11-NEXT:   [[SI_IF:%[0-9]+]]:sreg_32_xm0_xexec = SI_IF [[SI_PS_LIVE]], %bb.4, implicit-def $exec, implicit-def $scc, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.2
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.2 (%ir-block.5):
-  ; GFX11-NEXT:   successors: %bb.3(0x40000000), %bb.5(0x40000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $exec_lo
-  ; GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
-  ; GFX11-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
-  ; GFX11-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
-  ; GFX11-NEXT:   [[V_MBCNT_LO_U32_B32_e64_:%[0-9]+]]:vgpr_32 = V_MBCNT_LO_U32_B32_e64 [[COPY4]], [[COPY5]], implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 -2147483648
-  ; GFX11-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
-  ; GFX11-NEXT:   [[V_SET_INACTIVE_B32_:%[0-9]+]]:vgpr_32 = V_SET_INACTIVE_B32 [[COPY2]], [[COPY6]], implicit-def dead $scc, implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 2147483648
-  ; GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_2]]
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[COPY7]], [[V_SET_INACTIVE_B32_]], 273, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_SET_INACTIVE_B32_]], 0, [[V_MOV_B32_dpp]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_2]]
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp1:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[COPY8]], [[V_ADD_F32_e64_]], 274, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_]], 0, [[V_MOV_B32_dpp1]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_2]]
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp2:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[COPY9]], [[V_ADD_F32_e64_1]], 276, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_2:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_1]], 0, [[V_MOV_B32_dpp2]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_2]]
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp3:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[COPY10]], [[V_ADD_F32_e64_2]], 280, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_3:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_2]], 0, [[V_MOV_B32_dpp3]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 -1
-  ; GFX11-NEXT:   [[V_PERMLANEX16_B32_e64_:%[0-9]+]]:vgpr_32 = V_PERMLANEX16_B32_e64 0, [[V_ADD_F32_e64_3]], 0, [[S_MOV_B32_3]], 0, [[S_MOV_B32_3]], [[V_ADD_F32_e64_3]], 0, implicit $exec
-  ; GFX11-NEXT:   [[COPY11:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_2]]
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp4:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[COPY11]], [[V_PERMLANEX16_B32_e64_]], 228, 10, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_4:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_3]], 0, [[V_MOV_B32_dpp4]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[COPY12:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_2]]
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp5:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[COPY12]], [[V_ADD_F32_e64_4]], 273, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_4:%[0-9]+]]:sreg_32 = S_MOV_B32 15
-  ; GFX11-NEXT:   [[V_READLANE_B32_:%[0-9]+]]:sreg_32 = V_READLANE_B32 [[V_ADD_F32_e64_4]], [[S_MOV_B32_4]]
-  ; GFX11-NEXT:   [[S_MOV_B32_5:%[0-9]+]]:sreg_32 = S_MOV_B32 16
-  ; GFX11-NEXT:   [[V_WRITELANE_B32_:%[0-9]+]]:vgpr_32 = V_WRITELANE_B32 [[V_READLANE_B32_]], [[S_MOV_B32_5]], [[V_MOV_B32_dpp5]]
-  ; GFX11-NEXT:   [[S_MOV_B32_6:%[0-9]+]]:sreg_32 = S_MOV_B32 31
-  ; GFX11-NEXT:   [[V_READLANE_B32_1:%[0-9]+]]:sreg_32 = V_READLANE_B32 [[V_ADD_F32_e64_4]], [[S_MOV_B32_6]]
-  ; GFX11-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[V_READLANE_B32_1]]
-  ; GFX11-NEXT:   [[STRICT_WWM:%[0-9]+]]:vgpr_32 = STRICT_WWM [[COPY13]], implicit $exec
-  ; GFX11-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
-  ; GFX11-NEXT:   [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_32_xm0_xexec = V_CMP_EQ_U32_e64 [[V_MBCNT_LO_U32_B32_e64_]], [[COPY14]], implicit $exec
-  ; GFX11-NEXT:   [[SI_IF1:%[0-9]+]]:sreg_32_xm0_xexec = SI_IF [[V_CMP_EQ_U32_e64_]], %bb.5, implicit-def $exec, implicit-def $scc, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.3
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.3 (%ir-block.36):
-  ; GFX11-NEXT:   successors: %bb.5(0x80000000)
-  ; GFX11-NEXT: {{  $}}
   ; GFX11-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
-  ; GFX11-NEXT:   [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN:%[0-9]+]]:vgpr_32 = GLOBAL_ATOMIC_ADD_F32_SADDR_RTN [[V_MOV_B32_e32_]], [[STRICT_WWM]], [[REG_SEQUENCE]], 0, 1, implicit $exec :: (load store syncscope("wavefront") monotonic (s32) on %ir.ptr, addrspace 1)
-  ; GFX11-NEXT:   S_BRANCH %bb.5
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.4.Flow:
-  ; GFX11-NEXT:   successors: %bb.6(0x80000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI %41, %bb.5, [[DEF]], %bb.1
-  ; GFX11-NEXT:   SI_END_CF [[SI_IF]], implicit-def $exec, implicit-def $scc, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.6
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.5 (%ir-block.39):
-  ; GFX11-NEXT:   successors: %bb.4(0x80000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[PHI1:%[0-9]+]]:vgpr_32 = PHI [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN]], %bb.3, [[DEF]], %bb.2
-  ; GFX11-NEXT:   SI_END_CF [[SI_IF1]], implicit-def $exec, implicit-def $scc, implicit $exec
-  ; GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[PHI1]], implicit $exec
-  ; GFX11-NEXT:   [[STRICT_WWM1:%[0-9]+]]:vgpr_32 = STRICT_WWM [[V_WRITELANE_B32_]], implicit $exec
-  ; GFX11-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[V_READFIRSTLANE_B32_]]
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_5:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[COPY15]], 0, [[STRICT_WWM1]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.4
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.6 (%ir-block.47):
-  ; GFX11-NEXT:   $vgpr0 = COPY [[PHI]]
+  ; GFX11-NEXT:   [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN:%[0-9]+]]:vgpr_32 = GLOBAL_ATOMIC_ADD_F32_SADDR_RTN [[V_MOV_B32_e32_]], [[COPY2]], [[REG_SEQUENCE]], 0, 1, implicit $exec :: (load store syncscope("wavefront") monotonic (s32) on %ir.ptr, addrspace 1)
+  ; GFX11-NEXT:   $vgpr0 = COPY [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN]]
   ; GFX11-NEXT:   SI_RETURN_TO_EPILOG implicit $vgpr0
   %ret = atomicrmw fadd ptr addrspace(1) %ptr, float %data syncscope("wavefront") monotonic
   ret float %ret
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll b/llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll
index 3454e9d1019e5..976fa13df1e22 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll
@@ -155,82 +155,30 @@ define amdgpu_ps float @global_atomic_fadd_f32_rtn_atomicrmw(ptr addrspace(1) %p
 }
 
 define amdgpu_ps float @global_atomic_fadd_f32_saddr_rtn_atomicrmw(ptr addrspace(1) inreg %ptr, float %data) #0 {
+  ; GFX90A_GFX940-LABEL: name: global_atomic_fadd_f32_saddr_rtn_atomicrmw
+  ; GFX90A_GFX940: bb.0 (%ir-block.0):
+  ; GFX90A_GFX940-NEXT:   liveins: $sgpr0, $sgpr1, $vgpr0
+  ; GFX90A_GFX940-NEXT: {{  $}}
+  ; GFX90A_GFX940-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+  ; GFX90A_GFX940-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+  ; GFX90A_GFX940-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+  ; GFX90A_GFX940-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GFX90A_GFX940-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX90A_GFX940-NEXT:   [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN:%[0-9]+]]:vgpr_32 = GLOBAL_ATOMIC_ADD_F32_SADDR_RTN killed [[V_MOV_B32_e32_]], [[COPY]], killed [[REG_SEQUENCE]], 0, 1, implicit $exec :: (load store syncscope("wavefront") monotonic (s32) on %ir.ptr, addrspace 1)
+  ; GFX90A_GFX940-NEXT:   $vgpr0 = COPY [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN]]
+  ; GFX90A_GFX940-NEXT:   SI_RETURN_TO_EPILOG $vgpr0
+  ;
   ; GFX11-LABEL: name: global_atomic_fadd_f32_saddr_rtn_atomicrmw
   ; GFX11: bb.0 (%ir-block.0):
-  ; GFX11-NEXT:   successors: %bb.1(0x40000000), %bb.3(0x40000000)
   ; GFX11-NEXT:   liveins: $sgpr0, $sgpr1, $vgpr0
   ; GFX11-NEXT: {{  $}}
   ; GFX11-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
   ; GFX11-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr1
   ; GFX11-NEXT:   [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY1]], %subreg.sub1
-  ; GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_64 = COPY [[REG_SEQUENCE]]
-  ; GFX11-NEXT:   [[SI_PS_LIVE:%[0-9]+]]:sreg_32 = SI_PS_LIVE
-  ; GFX11-NEXT:   [[DEF:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
-  ; GFX11-NEXT:   [[SI_IF:%[0-9]+]]:sreg_32 = SI_IF killed [[SI_PS_LIVE]], %bb.3, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.1
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.1 (%ir-block.5):
-  ; GFX11-NEXT:   successors: %bb.2(0x40000000), %bb.4(0x40000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $exec_lo
-  ; GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
-  ; GFX11-NEXT:   [[V_MBCNT_LO_U32_B32_e64_:%[0-9]+]]:vgpr_32 = V_MBCNT_LO_U32_B32_e64 [[COPY4]], [[S_MOV_B32_]], implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 -2147483648
-  ; GFX11-NEXT:   [[V_SET_INACTIVE_B32_:%[0-9]+]]:vgpr_32 = V_SET_INACTIVE_B32 [[COPY]], killed [[S_MOV_B32_1]], implicit-def dead $scc, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 -2147483648, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_MOV_B32_e32_]], [[V_SET_INACTIVE_B32_]], 273, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_SET_INACTIVE_B32_]], 0, killed [[V_MOV_B32_dpp]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp1:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_MOV_B32_e32_]], [[V_ADD_F32_e64_]], 274, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_1:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_]], 0, killed [[V_MOV_B32_dpp1]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp2:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_MOV_B32_e32_]], [[V_ADD_F32_e64_1]], 276, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_2:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_1]], 0, killed [[V_MOV_B32_dpp2]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp3:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_MOV_B32_e32_]], [[V_ADD_F32_e64_2]], 280, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_3:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_2]], 0, killed [[V_MOV_B32_dpp3]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 -1
-  ; GFX11-NEXT:   [[V_PERMLANEX16_B32_e64_:%[0-9]+]]:vgpr_32 = V_PERMLANEX16_B32_e64 0, [[V_ADD_F32_e64_3]], 0, [[S_MOV_B32_2]], 0, [[S_MOV_B32_2]], [[V_ADD_F32_e64_3]], 0, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp4:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_MOV_B32_e32_]], killed [[V_PERMLANEX16_B32_e64_]], 228, 10, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_4:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, [[V_ADD_F32_e64_3]], 0, killed [[V_MOV_B32_dpp4]], 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   [[V_MOV_B32_dpp5:%[0-9]+]]:vgpr_32 = V_MOV_B32_dpp [[V_MOV_B32_e32_]], [[V_ADD_F32_e64_4]], 273, 15, 15, 0, implicit $exec
-  ; GFX11-NEXT:   [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 15
-  ; GFX11-NEXT:   [[V_READLANE_B32_:%[0-9]+]]:sreg_32 = V_READLANE_B32 [[V_ADD_F32_e64_4]], killed [[S_MOV_B32_3]]
-  ; GFX11-NEXT:   [[S_MOV_B32_4:%[0-9]+]]:sreg_32 = S_MOV_B32 16
-  ; GFX11-NEXT:   [[V_WRITELANE_B32_:%[0-9]+]]:vgpr_32 = V_WRITELANE_B32 killed [[V_READLANE_B32_]], killed [[S_MOV_B32_4]], [[V_MOV_B32_dpp5]]
-  ; GFX11-NEXT:   [[S_MOV_B32_5:%[0-9]+]]:sreg_32 = S_MOV_B32 31
-  ; GFX11-NEXT:   [[V_READLANE_B32_1:%[0-9]+]]:sreg_32 = V_READLANE_B32 [[V_ADD_F32_e64_4]], killed [[S_MOV_B32_5]]
-  ; GFX11-NEXT:   early-clobber %2:sgpr_32 = STRICT_WWM killed [[V_READLANE_B32_1]], implicit $exec
-  ; GFX11-NEXT:   [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_32 = V_CMP_EQ_U32_e64 killed [[V_MBCNT_LO_U32_B32_e64_]], [[S_MOV_B32_]], implicit $exec
-  ; GFX11-NEXT:   [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
-  ; GFX11-NEXT:   [[SI_IF1:%[0-9]+]]:sreg_32 = SI_IF killed [[V_CMP_EQ_U32_e64_]], %bb.4, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.2
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.2 (%ir-block.36):
-  ; GFX11-NEXT:   successors: %bb.4(0x80000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
-  ; GFX11-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY %2
-  ; GFX11-NEXT:   [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN:%[0-9]+]]:vgpr_32 = GLOBAL_ATOMIC_ADD_F32_SADDR_RTN killed [[V_MOV_B32_e32_1]], [[COPY5]], [[COPY3]], 0, 1, implicit $exec :: (load store syncscope("wavefront") monotonic (s32) on %ir.ptr, addrspace 1)
-  ; GFX11-NEXT:   S_BRANCH %bb.4
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.3.Flow:
-  ; GFX11-NEXT:   successors: %bb.5(0x80000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[DEF]], %bb.0, %7, %bb.4
-  ; GFX11-NEXT:   SI_END_CF [[SI_IF]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.5
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.4 (%ir-block.39):
-  ; GFX11-NEXT:   successors: %bb.3(0x80000000)
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT:   [[PHI1:%[0-9]+]]:vgpr_32 = PHI [[DEF1]], %bb.1, [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN]], %bb.2
-  ; GFX11-NEXT:   SI_END_CF [[SI_IF1]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
-  ; GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[PHI1]], implicit $exec
-  ; GFX11-NEXT:   early-clobber %44:vgpr_32 = STRICT_WWM [[V_WRITELANE_B32_]], implicit $exec
-  ; GFX11-NEXT:   [[V_ADD_F32_e64_5:%[0-9]+]]:vgpr_32 = nofpexcept V_ADD_F32_e64 0, killed [[V_READFIRSTLANE_B32_]], 0, killed %44, 0, 0, implicit $mode, implicit $exec
-  ; GFX11-NEXT:   S_BRANCH %bb.3
-  ; GFX11-NEXT: {{  $}}
-  ; GFX11-NEXT: bb.5 (%ir-block.47):
-  ; GFX11-NEXT:   $vgpr0 = COPY [[PHI]]
+  ; GFX11-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+  ; GFX11-NEXT:   [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN:%[0-9]+]]:vgpr_32 = GLOBAL_ATOMIC_ADD_F32_SADDR_RTN killed [[V_MOV_B32_e32_]], [[COPY]], killed [[REG_SEQUENCE]], 0, 1, implicit $exec :: (load store syncscope("wavefront") monotonic (s32) on %ir.ptr, addrspace 1)
+  ; GFX11-NEXT:   $vgpr0 = COPY [[GLOBAL_ATOMIC_ADD_F32_SADDR_RTN]]
   ; GFX11-NEXT:   SI_RETURN_TO_EPILOG $vgpr0
   %ret = atomicrmw fadd ptr addrspace(1) %ptr, float %data syncscope("wavefront") monotonic
   ret float %ret
diff --git a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll
index 9fc0b5c57cc3a..ee48b9df67d63 100644
--- a/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll
+++ b/llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll
@@ -4,43 +4,27 @@
 define amdgpu_kernel void @global_atomic_fadd_ret_f32_wrong_subtarget(ptr addrspace(1) %ptr) #1 {
 ; GCN-LABEL: global_atomic_fadd_ret_f32_wrong_subtarget:
 ; GCN:       ; %bb.0:
-; GCN-NEXT:    s_mov_b64 s[4:5], exec
-; GCN-NEXT:    v_mbcnt_lo_u32_b32 v0, s4, 0
-; GCN-NEXT:    v_mbcnt_hi_u32_b32 v0, s5, v0
-; GCN-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
-; GCN-NEXT:    ; implicit-def: $vgpr1
-; GCN-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GCN-NEXT:    s_cbranch_execz .LBB0_4
-; GCN-NEXT:  ; %bb.1:
 ; GCN-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x0
-; GCN-NEXT:    s_bcnt1_i32_b64 s7, s[4:5]
-; GCN-NEXT:    v_cvt_f32_ubyte0_e32 v1, s7
-; GCN-NEXT:    s_mov_b64 s[4:5], 0
-; GCN-NEXT:    v_mul_f32_e32 v2, 4.0, v1
+; GCN-NEXT:    s_mov_b64 s[2:3], 0
+; GCN-NEXT:    v_mov_b32_e32 v0, 0
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    s_load_dword s6, s[0:1], 0x0
-; GCN-NEXT:    v_mov_b32_e32 v3, 0
+; GCN-NEXT:    s_load_dword s4, s[0:1], 0x0
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    v_mov_b32_e32 v1, s6
-; GCN-NEXT:  .LBB0_2: ; %atomicrmw.start
+; GCN-NEXT:    v_mov_b32_e32 v1, s4
+; GCN-NEXT:  .LBB0_1: ; %atomicrmw.start
 ; GCN-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GCN-NEXT:    v_mov_b32_e32 v5, v1
-; GCN-NEXT:    v_add_f32_e32 v4, v5, v2
-; GCN-NEXT:    global_atomic_cmpswap v1, v3, v[4:5], s[0:1] glc
+; GCN-NEXT:    v_mov_b32_e32 v2, v1
+; GCN-NEXT:    v_add_f32_e32 v1, 4.0, v2
+; GCN-NEXT:    global_atomic_cmpswap v1, v0, v[1:2], s[0:1] glc
 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-NEXT:    buffer_wbinvl1
-; GCN-NEXT:    v_cmp_eq_u32_e32 vcc, v1, v5
-; GCN-NEXT:    s_or_b64 s[4:5], vcc, s[4:...
[truncated]

@jayfoad (Contributor, Author) commented Jun 24, 2024

%r = %x + %y * +0.0

We actually calculate %r = %x + %y * uitofp(MbCnt), and the problem is in the first active lane, where MbCnt is 0. I think we could avoid all (?) of these problems by first calculating %y * uitofp(MbCnt) and then overwriting the first active lane with -0.0 before adding %x.

There might be opportunities to simplify this if %y is known not to be NaN or infinity. There are definitely opportunities to simplify if we don't care about NaNs or infinities or signed zeroes -- but unfortunately the IR atomicrmw instruction does not have fast math flags.
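A rough scalar model of that reconstruction and of the suggested repair could look like the following (hypothetical helper names; the real pass builds IR rather than calling C++ functions, so treat this only as a sketch):

  // Scalar model (hypothetical names, not the pass itself) of the uniform-%y
  // path. MbCnt is the number of active lanes below this one, so it is 0 for
  // the first active lane, whose result should simply be the old value %x.
  float reconstructCurrent(float OldValue /*%x*/, float Y /*uniform %y*/,
                           unsigned MbCnt) {
    // Broken in the first active lane: Y * 0.0f may be NaN (Y = Inf or NaN),
    // or +0.0f, which loses the sign of a -0.0 OldValue.
    return OldValue + Y * static_cast<float>(MbCnt);
  }

  // Sketch of the suggested repair: compute the per-lane offset first, then
  // force the first active lane's offset to -0.0, the fadd identity, before
  // the final add.
  float reconstructCurrentFixed(float OldValue, float Y, unsigned MbCnt) {
    float Offset = Y * static_cast<float>(MbCnt);
    if (MbCnt == 0)
      Offset = -0.0f; // OldValue + -0.0f == OldValue for every OldValue
    return OldValue + Offset;
  }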

@arsenm (Contributor) commented Jun 24, 2024

I thought everything worked as long as we used -0 as the fadd identity value for the initial value

@arsenm (Contributor) commented Jun 24, 2024

but unfortunately the IR atomicrmw instruction does not have fast math flags.

If we really have to consider this, it's better to rely on the function attributes as a placeholder until we have a per-instruction solution to atomicrmw

@jayfoad (Contributor, Author) commented Jun 24, 2024

I thought everything worked as long as we used -0 as the fadd identity value for the initial value

As far as I know everything works for getting the correct result in memory. It's only the extra step to work out the per-lane result of the atomicrmw instruction that has this problem. And maybe it's only wrong when %y is uniform -- I have not thought too much about the "scan" path when %y is divergent.

Anyway I want to get this correct before worrying too much about reoptimising it. Currently it is causing some Vulkan CTS tests to fail.

  case AtomicRMWInst::FSub:
    if (!I.use_empty()) {
      // Bail out because the way we would calculate the result value is
      // incorrect in the presence of NaNs and infinities.
A reviewer (Contributor) commented:
Add a todo to fix the 0->-0 case and check no-nan/no-infs?

@jayfoad (Contributor, Author) commented Jun 24, 2024

And maybe it's only wrong when %y is uniform -- I have not thought too much about the "scan" path when %y is divergent.

I think the divergent path is OK, so I've changed the patch to just avoid the uniform path for fadd/fsub-with-result.

Add a todo to fix the 0->-0 case and check no-nan/no-infs?

I've added an explanatory FIXME comment by the code that generates the offending fmul.
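For context on why the divergent ("Iterative"/scan) path is OK even for uniform inputs, here is a scalar model (made-up names, not the actual pass) of a scan-based lowering, assuming -0.0 is used as the fadd identity as mentioned above, so the first active lane gets the old value back exactly:

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  // Scalar model (made-up names, not the actual pass) of a scan-based
  // lowering: perform one atomicrmw fadd of the wave-wide sum, then give each
  // lane OldValue plus an exclusive prefix sum of the lane values, seeded
  // with -0.0f (the fadd identity), so the first active lane sees OldValue
  // unchanged -- even when OldValue is -0.0, NaN or Inf.
  int main() {
    std::vector<float> LaneValues = {2.0f, 2.0f, 2.0f, 2.0f}; // uniform %y
    float Memory = -0.0f;                                     // %x at %ptr

    std::vector<float> ExclScan(LaneValues.size());
    float WaveSum = -0.0f; // fadd identity
    for (std::size_t Lane = 0; Lane < LaneValues.size(); ++Lane) {
      ExclScan[Lane] = WaveSum; // sum over strictly lower active lanes
      WaveSum += LaneValues[Lane];
    }

    float OldValue = Memory; // the single atomic returns the old value
    Memory += WaveSum;       // memory ends up at %x plus the wave-wide sum

    for (std::size_t Lane = 0; Lane < LaneValues.size(); ++Lane)
      std::printf("lane %zu result = %g\n", Lane, OldValue + ExclScan[Lane]);
    // lane 0 result = -0 (old value, sign preserved); later lanes see the
    // partial sums, matching per-lane atomicrmw semantics.
    return 0;
  }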

@jayfoad (Contributor, Author) commented Jun 27, 2024

Eager ping! since this is a correctness issue causing CTS test failures for us.

@jayfoad (Contributor, Author) commented Jul 1, 2024

Ping!

@jayfoad requested a review from @b-sumner on Jul 3, 2024 at 10:23
@jayfoad (Contributor, Author) commented Jul 3, 2024

@b-sumner @pravinjagtap if the performance of the "uniform fadd/fsub with result" case is important then there might be ways to reinstate the uniform case with an extra bug fix, like writing -0.0 into the first active lane of %y * +0.0.

@jayfoad merged commit b76dd4e into llvm:main on Jul 3, 2024
5 of 7 checks passed
@jayfoad deleted the atomicrmw-fadd-result branch on Jul 3, 2024 at 10:35
@arsenm (Contributor) commented Jul 3, 2024

@b-sumner @pravinjagtap if the performance of the "uniform fadd/fsub with result" case is important then there might be ways to reinstate the uniform case with an extra bug fix, like writing -0.0 into the first active lane of %y * +0.0.

If you're not going to work on this, can you open an issue for it

@jayfoad (Contributor, Author) commented Jul 3, 2024

@b-sumner @pravinjagtap if the performance of the "uniform fadd/fsub with result" case is important then there might be ways to reinstate the uniform case with an extra bug fix, like writing -0.0 into the first active lane of %y * +0.0.

If you're not going to work on this, can you open an issue for it

#97554

@b-sumner commented Jul 3, 2024

We have very important supercomputer customers waiting for this who are going to be dissatisfied if it only works when the result is not used. We need an optimized implementation for the returned-result case. If the CTS being broken makes this happen faster, then I think it should remain broken.

Well, I'm not sure which implementation of the optimization we're talking about. Is it the WWM one, the other one, or both?

@jayfoad (Contributor, Author) commented Jul 3, 2024

We have very important supercomputer customers waiting for this who are going to be dissatisfied if it only works when the result is not used. We need an optimized implementation for the returned-result case. If the CTS being broken makes this happen faster, then I think it should remain broken.

The golden rule of compiler development is correctness trumps performance. You're welcome to have a fast but broken implementation downstream. Or we could work together on fixing the fast path so it is also correct :)

Well, I'm not sure which implementation of the optimization we're talking about. Is it the WWM one, the other one, or both?

The bug was in the uniform path, which does not need to generate any DPP or "Iterative" code. The fix was to treat uniform inputs the same as divergent inputs, so they will generate some DPP or "Iterative" code.

@b-sumner commented Jul 3, 2024

The golden rule of compiler development is correctness trumps performance. You're welcome to have a fast but broken implementation downstream. Or we could work together on fixing the fast path so it is also correct :)

Absolutely. We need both correctness and performance.

The bug was in the uniform path, which does not need to generate any DPP or "Iterative" code. The fix was to treat uniform inputs the same as divergent inputs, so they will generate some DPP or "Iterative" code.

Makes sense, when the result is needed.

kirillpyasecky pushed a commit to kirillpyasecky/llvm-project that referenced this pull request on Jul 3, 2024: [AMDGPU] Disable atomic optimization of fadd/fsub with result (#96479)

!I.use_empty()) {
// Disable the uniform return value calculation using fmul because it
// mishandles infinities, NaNs and signed zeros. FIXME.
ValDivergent = true;
A reviewer (Contributor) commented:

If I understand correctly, are we forcing atomic fadd/fsub to take either the Iterative or DPP approach by setting ValDivergent = true for all cases (including the mishandled cases and normal/valid uniform float values)?

@jayfoad (Contributor, Author) replied:
Yes

kbluck pushed a commit to kbluck/llvm-project that referenced this pull request on Jul 6, 2024: [AMDGPU] Disable atomic optimization of fadd/fsub with result (#96479)
