Skip to content

Conversation

@cdevadas
Copy link
Collaborator

This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.

This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.
Copy link
Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@cdevadas cdevadas requested review from jayfoad and rampitec November 21, 2025 08:12
@cdevadas cdevadas marked this pull request as ready for review November 21, 2025 08:12
@llvmbot
Copy link
Member

llvmbot commented Nov 21, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Christudasan Devadasan (cdevadas)

Changes

This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.


Patch is 2.50 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169016.diff

149 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+48)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.abs.ll (+130-27)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant.96.ll (+29-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/minmaxabs-i64.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+14)
  • (modified) llvm/test/CodeGen/AMDGPU/add-max.ll (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/add_u64.ll (+36-12)
  • (modified) llvm/test/CodeGen/AMDGPU/addrspacecast-gas.ll (+7)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16-conversions.ll (+14)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16-math.ll (+49)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+13)
  • (modified) llvm/test/CodeGen/AMDGPU/bitop3.ll (+130-36)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx1250.ll (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/calling-conventions.ll (+95-6)
  • (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (+17)
  • (modified) llvm/test/CodeGen/AMDGPU/code-size-estimate.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_read2-gfx1250.ll (+33-1)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_write2.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.bf16.ll (+47)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-load-saddr-to-vaddr.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll (+166)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-load.ll (+212)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-store.ll (+103)
  • (modified) llvm/test/CodeGen/AMDGPU/fmax3.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin3.ll (+8)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx942.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+49)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll (+98-10)
  • (modified) llvm/test/CodeGen/AMDGPU/global-address.ll (+9-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-xcnt.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2bf16.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/literal64.ll (+23-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitop3.ll (+47-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cluster.id.ll (+12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cluster.load.async.to.lds.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cluster.load.ll (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cluster.workgroup.id.ll (+28)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cluster.workgroup.max.flat.id.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cluster.workgroup.max.id.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.f16.fp8.ll (+54)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.fp8.dpp.ll (+9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.fp8.e5m3.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.fp8.f16.ll (+64)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.pk.f16.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.sat.pk.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scale.pk.ll (+30)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk16.gfx1250.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.pk8.ll (+36)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.sr.pk.gfx1250.ll (+36)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.scalef32.sr.pk16.ll (+16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.sr.pk.bf16.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.flat.prefetch.ll (+9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.async.to.lds.ll (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.prefetch.ll (+9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.store.async.from.lds.ll (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.private.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.shared.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.monitor.gfx1250.ll (+12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.tr.gfx1250.w32.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane.gfx1250.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+13)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll (+230-106)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.store.ll (+377-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+13)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.store.bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.bf16.ll (+8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rsq.bf16.ll (+8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.wait.gfx1250.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.rtn.ll (+13)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+14)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll (+230-106)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.store.ll (+20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+14)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.tanh.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.tensor.load.store.ll (+12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma.gfx1250.w32.ll (+246)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma.imm.gfx1250.w32.ll (+174)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma.imod.gfx1250.w32.ll (+220)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma.index.gfx1250.w32.ll (+46)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.cos.bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.prefetch.ll (+213-49)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sin.bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sqrt.bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+44)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (+25)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-work-group-id-intrinsics.ll (+12-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mad-mix-bf16.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_u64_u32.ll (+14)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+24)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+48)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-cluster.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-lastuse.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-nontemporal.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll (+91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (+88)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-cluster.ll (+91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-lastuse.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-nontemporal.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-singlethread.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+87)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-wavefront.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-wavefront.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-agent.ll (+91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-cluster.ll (+91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-lastuse.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-nontemporal.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-singlethread.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-system.ll (+87)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-wavefront.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-workgroup.ll (+92)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+37)
  • (modified) llvm/test/CodeGen/AMDGPU/minmax.ll (+141-38)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+19)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+78)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-implicit-kernargs.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/preload-kernargs.ll (+34)
  • (modified) llvm/test/CodeGen/AMDGPU/reassoc-mul-add-1-to-mad.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/s-cluster-barrier.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/scale-offset-global.ll (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/scale-offset-scratch.ll (+21)
  • (modified) llvm/test/CodeGen/AMDGPU/scale-offset-smem.ll (+21)
  • (modified) llvm/test/CodeGen/AMDGPU/sub_u64.ll (+37-12)
  • (modified) llvm/test/CodeGen/AMDGPU/v_ashr_pk.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/whole-wave-functions.ll (+282-281)
  • (modified) llvm/test/CodeGen/AMDGPU/workgroup-id-in-arch-sgprs.ll (+6)
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index aa5ea77f17291..ffbb111d42221 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -772,6 +772,17 @@ void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,
                                          PreloadedScratchRsrcReg,
                                          ScratchRsrcReg, ScratchWaveOffsetReg);
   }
+
+  if (ST.hasWaitXCnt()) {
+    // Set REPLAY_MODE (bit 25) in MODE register to enable multi-group XNACK
+    // replay. This aligns hardware behavior with the compiler's s_wait_xcnt
+    // insertion logic, which assumes multi-group mode by default.
+    unsigned RegEncoding =
+        AMDGPU::Hwreg::HwregEncoding::encode(AMDGPU::Hwreg::ID_MODE, 25, 1);
+    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_SETREG_IMM32_B32))
+        .addImm(1)
+        .addImm(RegEncoding);
+  }
 }
 
 // Emit scratch RSRC setup code, assuming `ScratchRsrcReg != AMDGPU::NoReg`
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
index b5d593a9c15ed..58586129fb4e9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
@@ -41,6 +41,7 @@ define amdgpu_kernel void @raw_buffer_atomic_add_noret_f64(<4 x i32> %rsrc, doub
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_add_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -71,6 +72,7 @@ define amdgpu_ps void @raw_buffer_atomic_add_rtn_f64(<4 x i32> inreg %rsrc, doub
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_add_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_add_f64 v[0:1], v2, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -114,6 +116,7 @@ define amdgpu_kernel void @raw_buffer_atomic_add_rtn_f64_off4_slc(<4 x i32> %rsr
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_add_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -160,6 +163,7 @@ define amdgpu_kernel void @raw_ptr_buffer_atomic_add_noret_f64(ptr addrspace(8)
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_add_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -190,6 +194,7 @@ define amdgpu_ps void @raw_ptr_buffer_atomic_add_rtn_f64(ptr addrspace(8) inreg
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_add_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_add_f64 v[0:1], v2, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -233,6 +238,7 @@ define amdgpu_kernel void @raw_ptr_buffer_atomic_add_rtn_f64_off4_slc(ptr addrsp
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_add_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -279,6 +285,7 @@ define amdgpu_kernel void @struct_buffer_atomic_add_noret_f64(<4 x i32> %rsrc, d
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_add_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -309,6 +316,7 @@ define amdgpu_ps void @struct_buffer_atomic_add_rtn_f64(<4 x i32> inreg %rsrc, d
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_add_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_add_f64 v[0:1], v2, s[0:3], null idxen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -352,6 +360,7 @@ define amdgpu_kernel void @struct_buffer_atomic_add_rtn_f64_off4_slc(<4 x i32> %
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_add_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -397,6 +406,7 @@ define amdgpu_kernel void @struct_ptr_buffer_atomic_add_noret_f64(ptr addrspace(
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_add_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -427,6 +437,7 @@ define amdgpu_ps void @struct_ptr_buffer_atomic_add_rtn_f64(ptr addrspace(8) inr
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_add_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_add_f64 v[0:1], v2, s[0:3], null idxen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -470,6 +481,7 @@ define amdgpu_kernel void @struct_ptr_buffer_atomic_add_rtn_f64_off4_slc(ptr add
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_add_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -515,6 +527,7 @@ define amdgpu_kernel void @raw_buffer_atomic_min_noret_f64(<4 x i32> %rsrc, doub
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_min_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -545,6 +558,7 @@ define amdgpu_ps void @raw_buffer_atomic_min_rtn_f64(<4 x i32> inreg %rsrc, doub
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_min_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_min_num_f64 v[0:1], v2, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -588,6 +602,7 @@ define amdgpu_kernel void @raw_buffer_atomic_min_rtn_f64_off4_slc(<4 x i32> %rsr
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_min_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -634,6 +649,7 @@ define amdgpu_kernel void @raw_ptr_buffer_atomic_min_noret_f64(ptr addrspace(8)
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_min_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -664,6 +680,7 @@ define amdgpu_ps void @raw_ptr_buffer_atomic_min_rtn_f64(ptr addrspace(8) inreg
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_min_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_min_num_f64 v[0:1], v2, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -707,6 +724,7 @@ define amdgpu_kernel void @raw_ptr_buffer_atomic_min_rtn_f64_off4_slc(ptr addrsp
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_min_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -753,6 +771,7 @@ define amdgpu_kernel void @struct_buffer_atomic_min_noret_f64(<4 x i32> %rsrc, d
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_min_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -783,6 +802,7 @@ define amdgpu_ps void @struct_buffer_atomic_min_rtn_f64(<4 x i32> inreg %rsrc, d
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_min_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_min_num_f64 v[0:1], v2, s[0:3], null idxen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -826,6 +846,7 @@ define amdgpu_kernel void @struct_buffer_atomic_min_rtn_f64_off4_slc(<4 x i32> %
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_min_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -871,6 +892,7 @@ define amdgpu_kernel void @struct_ptr_buffer_atomic_min_noret_f64(ptr addrspace(
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_min_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -901,6 +923,7 @@ define amdgpu_ps void @struct_ptr_buffer_atomic_min_rtn_f64(ptr addrspace(8) inr
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_min_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_min_num_f64 v[0:1], v2, s[0:3], null idxen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -944,6 +967,7 @@ define amdgpu_kernel void @struct_ptr_buffer_atomic_min_rtn_f64_off4_slc(ptr add
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_min_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -989,6 +1013,7 @@ define amdgpu_kernel void @raw_buffer_atomic_max_noret_f64(<4 x i32> %rsrc, doub
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_max_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1019,6 +1044,7 @@ define amdgpu_ps void @raw_buffer_atomic_max_rtn_f64(<4 x i32> inreg %rsrc, doub
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_max_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_max_num_f64 v[0:1], v2, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -1062,6 +1088,7 @@ define amdgpu_kernel void @raw_buffer_atomic_max_rtn_f64_off4_slc(<4 x i32> %rsr
 ;
 ; GFX1250-LABEL: raw_buffer_atomic_max_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1108,6 +1135,7 @@ define amdgpu_kernel void @raw_ptr_buffer_atomic_max_noret_f64(ptr addrspace(8)
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_max_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1138,6 +1166,7 @@ define amdgpu_ps void @raw_ptr_buffer_atomic_max_rtn_f64(ptr addrspace(8) inreg
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_max_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_max_num_f64 v[0:1], v2, s[0:3], null offen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -1181,6 +1210,7 @@ define amdgpu_kernel void @raw_ptr_buffer_atomic_max_rtn_f64_off4_slc(ptr addrsp
 ;
 ; GFX1250-LABEL: raw_ptr_buffer_atomic_max_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1227,6 +1257,7 @@ define amdgpu_kernel void @struct_buffer_atomic_max_noret_f64(<4 x i32> %rsrc, d
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_max_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1257,6 +1288,7 @@ define amdgpu_ps void @struct_buffer_atomic_max_rtn_f64(<4 x i32> inreg %rsrc, d
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_max_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_max_num_f64 v[0:1], v2, s[0:3], null idxen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -1300,6 +1332,7 @@ define amdgpu_kernel void @struct_buffer_atomic_max_rtn_f64_off4_slc(<4 x i32> %
 ;
 ; GFX1250-LABEL: struct_buffer_atomic_max_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1345,6 +1378,7 @@ define amdgpu_kernel void @struct_ptr_buffer_atomic_max_noret_f64(ptr addrspace(
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_max_noret_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1375,6 +1409,7 @@ define amdgpu_ps void @struct_ptr_buffer_atomic_max_rtn_f64(ptr addrspace(8) inr
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_max_rtn_f64:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    buffer_atomic_max_num_f64 v[0:1], v2, s[0:3], null idxen th:TH_ATOMIC_RETURN
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    flat_store_b64 v[0:1], v[0:1]
@@ -1418,6 +1453,7 @@ define amdgpu_kernel void @struct_ptr_buffer_atomic_max_rtn_f64_off4_slc(ptr add
 ;
 ; GFX1250-LABEL: struct_ptr_buffer_atomic_max_rtn_f64_off4_slc:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_clause 0x1
 ; GFX1250-NEXT:    s_load_b96 s[8:10], s[4:5], 0x34
 ; GFX1250-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
@@ -1486,6 +1522,7 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(ptr addrspace(1) %pt
 ;
 ; GFX1250-LABEL: global_atomic_fadd_f64_noret_pat:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_mov_b32 s0, exec_lo
 ; GFX1250-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX1250-NEXT:    v_mbcnt_lo_u32_b32 v0, s0, 0
@@ -1558,6 +1595,7 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_agent(ptr addrspace(
 ;
 ; GFX1250-LABEL: global_atomic_fadd_f64_noret_pat_agent:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_mov_b32 s0, exec_lo
 ; GFX1250-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX1250-NEXT:    v_mbcnt_lo_u32_b32 v0, s0, 0
@@ -1632,6 +1670,7 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_system(ptr addrspace
 ;
 ; GFX1250-LABEL: global_atomic_fadd_f64_noret_pat_system:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_mov_b32 s0, exec_lo
 ; GFX1250-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX1250-NEXT:    v_mbcnt_lo_u32_b32 v0, s0, 0
@@ -1704,6 +1743,7 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_flush(ptr addrspace(
 ;
 ; GFX1250-LABEL: global_atomic_fadd_f64_noret_pat_flush:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_mov_b32 s0, exec_lo
 ; GFX1250-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX1250-NEXT:    v_mbcnt_lo_u32_b32 v0, s0, 0
@@ -1894,6 +1934,7 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_agent_safe(ptr addrs
 ;
 ; GFX1250-LABEL: global_atomic_fadd_f64_noret_pat_agent_safe:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_mov_b32 s0, exec_lo
 ; GFX1250-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX1250-NEXT:    v_mbcnt_lo_u32_b32 v0, s0, 0
@@ -1949,6 +1990,7 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(ptr %ptr) #1 {
 ;
 ; GFX1250-LABEL: flat_atomic_fadd_f64_noret_pat:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
 ; GFX1250-NEXT:    v_mov_b64_e32 v[0:1], 4.0
 ; GFX1250-NEXT:    v_mov_b32_e32 v2, 0
@@ -1991,6 +2033,7 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_agent(ptr %ptr) #1 {
 ;
 ; GFX1250-LABEL: flat_atomic_fadd_f64_noret_pat_agent:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
 ; GFX1250-NEXT:    v_mov_b64_e32 v[0:1], 4.0
 ; GFX1250-NEXT:    v_mov_b32_e32 v2, 0
@@ -2035,6 +2078,7 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(ptr %ptr) #1 {
 ;
 ; GFX1250-LABEL: flat_atomic_fadd_f64_noret_pat_system:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
 ; GFX1250-NEXT:    v_mov_b64_e32 v[0:1], 4.0
 ; GFX1250-NEXT:    v_mov_b32_e32 v2, 0
@@ -2197,6 +2241,7 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_agent_safe(ptr %ptr) {
 ;
 ; GFX1250-LABEL: flat_atomic_fadd_f64_noret_pat_agent_safe:
 ; GFX1250:       ; %bb.0: ; %main_body
+; GFX1250-NEXT:    s_setreg_imm32_b32 hwreg(HW_REG_WAVE_MODE, 25, 1), 1
 ; GFX1250-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
 ; GFX1250-NEXT:    v_mov_b64_e32 v[0:1], 4.0
 ; GFX1250-NEXT:    v_mov_b32_e32 v2, 0
@@ -2255,6 +2300,7 @@ define amdgpu_kerne...
[truncated]

@github-actions
Copy link

🐧 Linux x64 Test Results

  • 186436 tests passed
  • 4864 tests skipped

Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator Author

cdevadas commented Nov 21, 2025

Merge activity

  • Nov 21, 2:10 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Nov 21, 2:12 PM UTC: @cdevadas merged this pull request with Graphite.

@cdevadas cdevadas merged commit a2dc4e0 into main Nov 21, 2025
15 checks passed
@cdevadas cdevadas deleted the users/cdevadas/enable-multi-group-xnack-replay branch November 21, 2025 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants