
Conversation

@arsenm (Contributor) commented Nov 15, 2025

The main improvement is to the mfma tests. There are some
mild regressions scattered around, and a few major ones.
The worst regressions are in some of the bitcast tests;
these are cases where the SGPR argument list is exhausted
and later arguments are passed in VGPRs, and the copies from
those VGPRs are misidentified as divergent. Most of the
shufflevector tests are also regressions: they end up with
cleaner MIR, but then get poor regalloc decisions.
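As a hedged illustration of what this selection change distinguishes (this example is not taken from the patch), a build_vector is divergent when its elements depend on a per-lane value such as the workitem ID, and uniform when built from scalar kernel arguments:

```llvm
; Hypothetical sketch (not from this patch). The first build_vector is
; divergent because its elements depend on the per-lane workitem ID, so
; with this change it is selected into a VGPR vector register class;
; the second is uniform and still uses an SGPR class.
define amdgpu_kernel void @example(ptr addrspace(1) %out, i32 %s0, i32 %s1) {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %tid1 = add i32 %tid, 1
  %d0 = insertelement <2 x i32> poison, i32 %tid, i32 0
  %div = insertelement <2 x i32> %d0, i32 %tid1, i32 1   ; divergent build_vector
  %u0 = insertelement <2 x i32> poison, i32 %s0, i32 0
  %uni = insertelement <2 x i32> %u0, i32 %s1, i32 1     ; uniform build_vector
  %sum = add <2 x i32> %div, %uni
  store <2 x i32> %sum, ptr addrspace(1) %out
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()
```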


@llvmbot (Member) commented Nov 15, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

AMDGPU: Select vector reg class for divergent build_vector


test regressions


Patch is 7.13 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/168169.diff

114 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+7-3)
  • (modified) llvm/test/CodeGen/AMDGPU/a-v-flat-atomic-cmpxchg.ll (+159-133)
  • (modified) llvm/test/CodeGen/AMDGPU/a-v-flat-atomicrmw.ll (+239-236)
  • (modified) llvm/test/CodeGen/AMDGPU/a-v-global-atomic-cmpxchg.ll (+35-31)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+25415-26603)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.160bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.224bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.288bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.352bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.384bit.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.448bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+9853-9220)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll (+2050-1660)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.640bit.ll (+2617-2120)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll (+3092-2563)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.768bit.ll (+3100-2535)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.832bit.ll (+3292-2660)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.896bit.ll (+3648-2958)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll (+3736-2991)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-intrinsic-mmo-type.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/cluster_stores.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-fmul-sel.ll (+36-76)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll (+24-28)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+36-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul-to-ldexp.ll (+23-23)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+58-58)
  • (modified) llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f64.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-xcnt.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+9-7)
  • (modified) llvm/test/CodeGen/AMDGPU/isel-amdgpu-cs-chain-preserve-cc.ll (+56-76)
  • (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+9-12)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+3-6)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+10-20)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+5-10)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+35-35)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+14-17)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+22-32)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+38-43)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+28-31)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+3-6)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.iglp.opt.ll (+83-83)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.atomic.dim.gfx90a.ll (+3-9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+22-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.make.buffer.rsrc.ll (+16-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.iterative.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll (+342-342)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-agent.ll (-91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-cluster.ll (-91)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-singlethread.ll (-92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-system.ll (-87)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-wavefront.ll (-92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-workgroup.ll (-92)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+750-934)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-no-register-aliasing.ll (+149-157)
  • (modified) llvm/test/CodeGen/AMDGPU/mmra.ll (+23-29)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+139-136)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar_to_vector.gfx11plus.ll (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/schedule-amdgpu-trackers.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv64.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-to-vreg1-copy.ll (+11-9)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector-physreg-copy.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i64.v2i64.ll (+24-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2p0.v2p0.ll (+24-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3f32.v2f32.ll (+87-82)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3f32.v3f32.ll (+110-90)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3f32.v4f32.ll (+197-180)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i32.v2i32.ll (+87-82)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i32.v3i32.ll (+110-90)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i32.v4i32.ll (+197-180)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i64.v2i64.ll (+54-38)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p0.v2p0.ll (+54-38)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p3.v2p3.ll (+87-82)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p3.v3p3.ll (+110-90)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p3.v4p3.ll (+197-180)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v2f32.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+763-721)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v4f32.ll (+278-240)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v2i32.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+763-721)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v4i32.ll (+278-240)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i64.v2i64.ll (+190-174)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i64.v3i64.ll (+106-98)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i64.v4i64.ll (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p0.v2p0.ll (+190-174)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p0.v3p0.ll (+106-98)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p0.v4p0.ll (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v2p3.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+763-721)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v4p3.ll (+278-240)
  • (modified) llvm/test/CodeGen/AMDGPU/sint_to_fp.f64.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.f64.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_range_metadata.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-large-tuple-alloc-error.ll (+8-6)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll (+75-162)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 9308934c8baf8..ac0cb549d020b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -726,10 +726,14 @@ void AMDGPUDAGToDAGISel::Select(SDNode *N) {
       break;
     }
 
+    const SIRegisterInfo *TRI = Subtarget->getRegisterInfo();
     assert(VT.getVectorElementType().bitsEq(MVT::i32));
-    unsigned RegClassID =
-        SIRegisterInfo::getSGPRClassForBitWidth(NumVectorElts * 32)->getID();
-    SelectBuildVector(N, RegClassID);
+    const TargetRegisterClass *RegClass =
+        N->isDivergent()
+            ? TRI->getDefaultVectorSuperClassForBitWidth(NumVectorElts * 32)
+            : SIRegisterInfo::getSGPRClassForBitWidth(NumVectorElts * 32);
+
+    SelectBuildVector(N, RegClass->getID());
     return;
   }
   case ISD::VECTOR_SHUFFLE:
diff --git a/llvm/test/CodeGen/AMDGPU/a-v-flat-atomic-cmpxchg.ll b/llvm/test/CodeGen/AMDGPU/a-v-flat-atomic-cmpxchg.ll
index bc341f2baa804..e882769f97ac1 100644
--- a/llvm/test/CodeGen/AMDGPU/a-v-flat-atomic-cmpxchg.ll
+++ b/llvm/test/CodeGen/AMDGPU/a-v-flat-atomic-cmpxchg.ll
@@ -95,13 +95,13 @@ define void @flat_atomic_cmpxchg_i32_ret_a_a__a(ptr %ptr) #0 {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a0
+; CHECK-NEXT:    ; def a1
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a1
+; CHECK-NEXT:    ; def a0
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a1
-; CHECK-NEXT:    v_accvgpr_read_b32 v3, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
 ; CHECK-NEXT:    buffer_wbl2
 ; CHECK-NEXT:    flat_atomic_cmpswap v0, v[0:1], v[2:3] offset:40 glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -126,13 +126,13 @@ define void @flat_atomic_cmpxchg_i32_ret_a_a__v(ptr %ptr) #0 {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a0
+; CHECK-NEXT:    ; def a1
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a1
+; CHECK-NEXT:    ; def a0
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a1
-; CHECK-NEXT:    v_accvgpr_read_b32 v3, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
 ; CHECK-NEXT:    buffer_wbl2
 ; CHECK-NEXT:    flat_atomic_cmpswap v0, v[0:1], v[2:3] offset:40 glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -156,12 +156,14 @@ define void @flat_atomic_cmpxchg_i32_ret_v_a__v(ptr %ptr) #0 {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a0
+; CHECK-NEXT:    ; def v2
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
+; CHECK-NEXT:    v_accvgpr_write_b32 a1, v2
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def v3
+; CHECK-NEXT:    ; def a0
 ; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
 ; CHECK-NEXT:    buffer_wbl2
 ; CHECK-NEXT:    flat_atomic_cmpswap v0, v[0:1], v[2:3] offset:40 glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -299,12 +301,13 @@ define void @flat_atomic_cmpxchg_i32_ret_av_a__av(ptr %ptr) #0 {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a0
+; CHECK-NEXT:    ; def a1
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def v3
+; CHECK-NEXT:    ; def a0
 ; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
 ; CHECK-NEXT:    buffer_wbl2
 ; CHECK-NEXT:    flat_atomic_cmpswap v0, v[0:1], v[2:3] offset:40 glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -328,12 +331,13 @@ define void @flat_atomic_cmpxchg_i32_ret_a_av__av(ptr %ptr) #0 {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a0
+; CHECK-NEXT:    ; def a1
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v3, a0
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def v2
+; CHECK-NEXT:    ; def a0
 ; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
 ; CHECK-NEXT:    buffer_wbl2
 ; CHECK-NEXT:    flat_atomic_cmpswap v0, v[0:1], v[2:3] offset:40 glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
@@ -533,50 +537,55 @@ define void @flat_atomic_cmpxchg_i64_ret_a_a__a(ptr %ptr) #0 {
 ; CHECK-LABEL: flat_atomic_cmpxchg_i64_ret_a_a__a:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    v_add_co_u32_e32 v4, vcc, 0x50, v0
+; CHECK-NEXT:    v_add_co_u32_e32 v0, vcc, 0x50, v0
+; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
+; CHECK-NEXT:    v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a[0:1]
+; CHECK-NEXT:    ; def a[2:3]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
-; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
-; CHECK-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v5, a3
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a[0:1]
+; CHECK-NEXT:    ; def a[4:5]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
-; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v5
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a4
+; CHECK-NEXT:    v_accvgpr_read_b32 v4, a2
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a5
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v1
 ; CHECK-NEXT:    ; implicit-def: $agpr0_agpr1
 ; CHECK-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; CHECK-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB15_2
 ; CHECK-NEXT:  ; %bb.1: ; %atomicrmw.global
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a4
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a5
+; CHECK-NEXT:    v_accvgpr_read_b32 v4, a2
+; CHECK-NEXT:    v_accvgpr_read_b32 v5, a3
 ; CHECK-NEXT:    buffer_wbl2
-; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
+; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[0:1], v[2:5] glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    buffer_invl2
 ; CHECK-NEXT:    buffer_wbinvl1_vol
 ; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
 ; CHECK-NEXT:    v_accvgpr_write_b32 a0, v0
 ; CHECK-NEXT:    v_accvgpr_write_b32 a1, v1
-; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:  .LBB15_2: ; %Flow
 ; CHECK-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB15_4
 ; CHECK-NEXT:  ; %bb.3: ; %atomicrmw.private
-; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
-; CHECK-NEXT:    v_cndmask_b32_e32 v6, -1, v4, vcc
-; CHECK-NEXT:    buffer_load_dword v4, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v5, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[0:1]
+; CHECK-NEXT:    v_cndmask_b32_e32 v6, -1, v0, vcc
+; CHECK-NEXT:    buffer_load_dword v0, v6, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v1, v6, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    s_waitcnt vmcnt(1)
-; CHECK-NEXT:    v_accvgpr_write_b32 a0, v4
+; CHECK-NEXT:    v_accvgpr_write_b32 a0, v0
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[4:5], v[2:3]
-; CHECK-NEXT:    v_cndmask_b32_e32 v1, v5, v1, vcc
-; CHECK-NEXT:    v_accvgpr_write_b32 a1, v5
-; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    buffer_store_dword v1, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[4:5]
+; CHECK-NEXT:    v_cndmask_b32_e32 v3, v1, v3, vcc
+; CHECK-NEXT:    v_accvgpr_write_b32 a1, v1
+; CHECK-NEXT:    v_cndmask_b32_e32 v0, v0, v2, vcc
+; CHECK-NEXT:    buffer_store_dword v3, v6, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    buffer_store_dword v0, v6, s[0:3], 0 offen
 ; CHECK-NEXT:  .LBB15_4: ; %atomicrmw.phi
 ; CHECK-NEXT:    s_or_b64 exec, exec, s[4:5]
@@ -598,50 +607,55 @@ define void @flat_atomic_cmpxchg_i64_ret_a_a__v(ptr %ptr) #0 {
 ; CHECK-LABEL: flat_atomic_cmpxchg_i64_ret_a_a__v:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, 0x50, v0
+; CHECK-NEXT:    v_add_co_u32_e32 v0, vcc, 0x50, v0
+; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
+; CHECK-NEXT:    v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def a[0:1]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
-; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
-; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, 0, v1, vcc
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v5, a1
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; def a[0:1]
+; CHECK-NEXT:    ; def a[2:3]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
-; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v7
-; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; CHECK-NEXT:    v_accvgpr_read_b32 v7, a3
+; CHECK-NEXT:    v_accvgpr_read_b32 v4, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v6, a2
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v1
+; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
 ; CHECK-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; CHECK-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB16_2
 ; CHECK-NEXT:  ; %bb.1: ; %atomicrmw.global
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a2
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a3
+; CHECK-NEXT:    v_accvgpr_read_b32 v4, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v5, a1
 ; CHECK-NEXT:    buffer_wbl2
-; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[4:5], v[6:7], v[0:3] glc
+; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    buffer_invl2
 ; CHECK-NEXT:    buffer_wbinvl1_vol
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
+; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; CHECK-NEXT:    ; implicit-def: $vgpr6_vgpr7
-; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
 ; CHECK-NEXT:  .LBB16_2: ; %Flow
 ; CHECK-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB16_4
 ; CHECK-NEXT:  ; %bb.3: ; %atomicrmw.private
-; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[6:7]
-; CHECK-NEXT:    v_cndmask_b32_e32 v6, -1, v6, vcc
-; CHECK-NEXT:    buffer_load_dword v4, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v5, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[0:1]
+; CHECK-NEXT:    v_cndmask_b32_e32 v0, -1, v0, vcc
+; CHECK-NEXT:    buffer_load_dword v2, v0, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v3, v0, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[4:5], v[2:3]
-; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    v_cndmask_b32_e32 v1, v5, v1, vcc
-; CHECK-NEXT:    buffer_store_dword v0, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_store_dword v1, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
+; CHECK-NEXT:    v_cndmask_b32_e32 v4, v2, v6, vcc
+; CHECK-NEXT:    v_cndmask_b32_e32 v1, v3, v7, vcc
+; CHECK-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:  .LBB16_4: ; %atomicrmw.phi
 ; CHECK-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; use v[4:5]
+; CHECK-NEXT:    ; use v[2:3]
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
@@ -658,48 +672,51 @@ define void @flat_atomic_cmpxchg_i64_ret_v_a__v(ptr %ptr) #0 {
 ; CHECK-LABEL: flat_atomic_cmpxchg_i64_ret_v_a__v:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, 0x50, v0
+; CHECK-NEXT:    v_add_co_u32_e32 v4, vcc, 0x50, v0
 ; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
-; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, 0, v1, vcc
+; CHECK-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def a[0:1]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
-; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v7
+; CHECK-NEXT:    v_accvgpr_read_b32 v7, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v6, a0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v5
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def v[2:3]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; CHECK-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB17_2
 ; CHECK-NEXT:  ; %bb.1: ; %atomicrmw.global
+; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
 ; CHECK-NEXT:    buffer_wbl2
-; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[4:5], v[6:7], v[0:3] glc
+; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    buffer_invl2
 ; CHECK-NEXT:    buffer_wbinvl1_vol
-; CHECK-NEXT:    ; implicit-def: $vgpr6_vgpr7
+; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; CHECK-NEXT:    ; implicit-def: $vgpr6_vgpr7
 ; CHECK-NEXT:  .LBB17_2: ; %Flow
 ; CHECK-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB17_4
 ; CHECK-NEXT:  ; %bb.3: ; %atomicrmw.private
-; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[6:7]
-; CHECK-NEXT:    v_cndmask_b32_e32 v6, -1, v6, vcc
-; CHECK-NEXT:    buffer_load_dword v4, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v5, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; CHECK-NEXT:    v_cndmask_b32_e32 v4, -1, v4, vcc
+; CHECK-NEXT:    buffer_load_dword v0, v4, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v1, v4, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[4:5], v[2:3]
-; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    v_cndmask_b32_e32 v1, v5, v1, vcc
-; CHECK-NEXT:    buffer_store_dword v0, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_store_dword v1, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
+; CHECK-NEXT:    v_cndmask_b32_e32 v3, v0, v6, vcc
+; CHECK-NEXT:    v_cndmask_b32_e32 v2, v1, v7, vcc
+; CHECK-NEXT:    buffer_store_dword v3, v4, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_store_dword v2, v4, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:  .LBB17_4: ; %atomicrmw.phi
 ; CHECK-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; use v[4:5]
+; CHECK-NEXT:    ; use v[0:1]
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
@@ -716,48 +733,51 @@ define void @flat_atomic_cmpxchg_i64_ret_a_v__v(ptr %ptr) #0 {
 ; CHECK-LABEL: flat_atomic_cmpxchg_i64_ret_a_v__v:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, 0x50, v0
+; CHECK-NEXT:    v_add_co_u32_e32 v4, vcc, 0x50, v0
 ; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
-; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, 0, v1, vcc
+; CHECK-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def a[0:1]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
-; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v7
+; CHECK-NEXT:    v_accvgpr_read_b32 v7, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v6, a0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v5
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def v[0:1]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
 ; CHECK-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; CHECK-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB18_2
 ; CHECK-NEXT:  ; %bb.1: ; %atomicrmw.global
+; CHECK-NEXT:    v_accvgpr_read_b32 v2, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v3, a1
 ; CHECK-NEXT:    buffer_wbl2
-; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[4:5], v[6:7], v[0:3] glc
+; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[4:5], v[0:3] glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    buffer_invl2
 ; CHECK-NEXT:    buffer_wbinvl1_vol
+; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; CHECK-NEXT:    ; implicit-def: $vgpr6_vgpr7
-; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:  .LBB18_2: ; %Flow
 ; CHECK-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB18_4
 ; CHECK-NEXT:  ; %bb.3: ; %atomicrmw.private
-; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[6:7]
-; CHECK-NEXT:    v_cndmask_b32_e32 v6, -1, v6, vcc
-; CHECK-NEXT:    buffer_load_dword v4, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_load_dword v5, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; CHECK-NEXT:    v_cndmask_b32_e32 v4, -1, v4, vcc
+; CHECK-NEXT:    buffer_load_dword v2, v4, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_load_dword v3, v4, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
-; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[4:5], v[2:3]
-; CHECK-NEXT:    v_cndmask_b32_e32 v0, v4, v0, vcc
-; CHECK-NEXT:    v_cndmask_b32_e32 v1, v5, v1, vcc
-; CHECK-NEXT:    buffer_store_dword v0, v6, s[0:3], 0 offen
-; CHECK-NEXT:    buffer_store_dword v1, v6, s[0:3], 0 offen offset:4
+; CHECK-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[6:7]
+; CHECK-NEXT:    v_cndmask_b32_e32 v0, v2, v0, vcc
+; CHECK-NEXT:    v_cndmask_b32_e32 v1, v3, v1, vcc
+; CHECK-NEXT:    buffer_store_dword v0, v4, s[0:3], 0 offen
+; CHECK-NEXT:    buffer_store_dword v1, v4, s[0:3], 0 offen offset:4
 ; CHECK-NEXT:  .LBB18_4: ; %atomicrmw.phi
 ; CHECK-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; CHECK-NEXT:    ;;#ASMSTART
-; CHECK-NEXT:    ; use v[4:5]
+; CHECK-NEXT:    ; use v[2:3]
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
@@ -947,48 +967,51 @@ define void @flat_atomic_cmpxchg_i64_ret_av_a__av(ptr %ptr) #0 {
 ; CHECK-LABEL: flat_atomic_cmpxchg_i64_ret_av_a__av:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, 0x50, v0
+; CHECK-NEXT:    v_add_co_u32_e32 v4, vcc, 0x50, v0
 ; CHECK-NEXT:    s_mov_b64 s[4:5], src_private_base
-; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, 0, v1, vcc
+; CHECK-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def a[0:1]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
-; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
-; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v7
+; CHECK-NEXT:    v_accvgpr_read_b32 v7, a1
+; CHECK-NEXT:    v_accvgpr_read_b32 v6, a0
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v5
 ; CHECK-NEXT:    ;;#ASMSTART
 ; CHECK-NEXT:    ; def v[2:3]
 ; CHECK-NEXT:    ;;#ASMEND
-; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; CHECK-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; CHECK-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; CHECK-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB22_2
 ; CHECK-NEXT:  ; %bb.1: ; %atomicrmw.global
+; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
+; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
 ; CHECK-NEXT:    buffer_wbl2
-; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[4:5], v[6:7], v[0:3] glc
+; CHECK-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    buffer_invl2
 ; CHECK-NEXT:    buffer_wbinvl1_vol
-; CHECK-NEXT:    ; implicit-def: $vgpr6_vgpr7
+; CHECK-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; CHECK-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; CHECK-NEXT:    ; implicit-def: $vgpr6_...
[truncated]

@arsenm arsenm marked this pull request as ready for review November 15, 2025 02:10
@arsenm arsenm force-pushed the users/arsenm/amdgpu/select-vector-classes-divergent-build-vector branch from 111d5aa to 25683ac on November 15, 2025 03:00
@arsenm arsenm force-pushed the users/arsenm/amdgpu/check-isVGPRImm-build-vector-v2i16-bitcast branch from ee34e82 to 9a8a0ec on November 15, 2025 04:25
@arsenm arsenm force-pushed the users/arsenm/amdgpu/select-vector-classes-divergent-build-vector branch from 25683ac to 6a0fc72 on November 15, 2025 04:25
@shiltian (Contributor) left a comment
It doesn't seem like there is anything significantly different, but in some cases GISel and SelectionDAG generate the same code, which is nice.

This probably should have turned into a regular integer constant
earlier. This is to defend against future regressions.
@arsenm arsenm force-pushed the users/arsenm/amdgpu/check-isVGPRImm-build-vector-v2i16-bitcast branch from 9a8a0ec to f3c3a66 on November 15, 2025 05:07
@arsenm arsenm force-pushed the users/arsenm/amdgpu/select-vector-classes-divergent-build-vector branch from 6a0fc72 to 93256f5 on November 15, 2025 05:08
Base automatically changed from users/arsenm/amdgpu/check-isVGPRImm-build-vector-v2i16-bitcast to main November 15, 2025 05:42
@arsenm arsenm merged commit fbf74b2 into main Nov 15, 2025
12 of 13 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/select-vector-classes-divergent-build-vector branch November 15, 2025 05:53
