
Conversation

PrasoonMishra

This patch implements an optimization to partition MUBUF load/store offsets into vector and scalar components for better scheduling and reduced VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores in OFFEN addressing mode when soffset is initially zero.

Test coverage is provided by buffer-offset-to-soffset-loads.ll and buffer-offset-to-soffset-stores.ll, which validate i8, i16, i32, vector (v2/v4), and float variants with both positive and negative cases.
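
For reference, a minimal IR reproducer of the target pattern, adapted from the tests added in this patch (the kernel and value names are illustrative):

  define amdgpu_kernel void @divergent_plus_uniform_offset(ptr addrspace(1) %out, i32 %soffset) {
    %desc = call <4 x i32> asm "", "=s"()
    %tid = call i32 @llvm.amdgcn.workitem.id.x()
    %sum = add i32 %tid, %soffset   ; divergent + uniform
    %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
    store i32 %val, ptr addrspace(1) %out
    ret void
  }

  declare i32 @llvm.amdgcn.workitem.id.x()
  declare i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32>, i32, i32, i32)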


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR with a comment saying "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Sep 26, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Prasoon Mishra (PrasoonMishra)

Changes

This patch implements an optimization to partition MUBUF load/store offsets into vector and scalar components for better scheduling and reduced VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores in OFFEN addressing mode when soffset is initially zero.

Test coverage is provided by buffer-offset-to-soffset-loads.ll and buffer-offset-to-soffset-stores.ll, which validate i8, i16, i32, vector (v2/v4), and float variants with both positive and negative cases.


Patch is 43.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/160939.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+83-5)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h (+1)
  • (added) llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll (+398)
  • (added) llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-stores.ll (+399)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.load.ll (+8-4)
  • (modified) llvm/test/CodeGen/AMDGPU/smrd.ll (+12-12)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 2192a72bb27b7..3a67170ccc598 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -328,9 +328,6 @@ bool AMDGPUDAGToDAGISel::matchLoadD16FromBuildVector(SDNode *N) const {
 }
 
 void AMDGPUDAGToDAGISel::PreprocessISelDAG() {
-  if (!Subtarget->d16PreservesUnusedBits())
-    return;
-
   SelectionDAG::allnodes_iterator Position = CurDAG->allnodes_end();
 
   bool MadeChange = false;
@@ -341,8 +338,23 @@ void AMDGPUDAGToDAGISel::PreprocessISelDAG() {
 
     switch (N->getOpcode()) {
     case ISD::BUILD_VECTOR:
-      // TODO: Match load d16 from shl (extload:i16), 16
-      MadeChange |= matchLoadD16FromBuildVector(N);
+      // D16 optimization requires subtarget support
+      if (Subtarget->d16PreservesUnusedBits()) {
+        // TODO: Match load d16 from shl (extload:i16), 16
+        MadeChange |= matchLoadD16FromBuildVector(N);
+      }
+      break;
+    case AMDGPUISD::BUFFER_LOAD:
+    case AMDGPUISD::BUFFER_LOAD_UBYTE:
+    case AMDGPUISD::BUFFER_LOAD_USHORT:
+    case AMDGPUISD::BUFFER_LOAD_BYTE:
+    case AMDGPUISD::BUFFER_LOAD_SHORT:
+      MadeChange |= sinkUniformAddendIntoSOffset(N, false);
+      break;
+    case AMDGPUISD::BUFFER_STORE:
+    case AMDGPUISD::BUFFER_STORE_BYTE:
+    case AMDGPUISD::BUFFER_STORE_SHORT:
+      MadeChange |= sinkUniformAddendIntoSOffset(N, true);
       break;
     default:
       break;
@@ -356,6 +368,72 @@ void AMDGPUDAGToDAGISel::PreprocessISelDAG() {
   }
 }
 
+/// Sink uniform addends in buffer address calculations into SOffset.
+///
+/// Transforms buffer loads/stores with voffset = add(uniform, divergent)
+/// into voffset = divergent, soffset = uniform for better address coalescing.
+/// Only applies when the result will use OFFEN addressing mode.
+bool AMDGPUDAGToDAGISel::sinkUniformAddendIntoSOffset(SDNode *N, bool IsStore) {
+
+  // Buffer operand layout:
+  // Load:  (chain, rsrc, vindex, voffset, soffset, offset, cachepolicy, idxen)
+  // Store: (chain, vdata, rsrc, vindex, voffset, soffset, offset, cachepolicy, idxen)
+  const unsigned VIndexIdx = IsStore ? 3 : 2;
+  const unsigned VOffsetIdx = IsStore ? 4 : 3;
+  const unsigned SOffsetIdx = IsStore ? 5 : 4;
+  const unsigned IdxEnIdx = IsStore ? 8 : 7;
+
+  if (N->getNumOperands() <= IdxEnIdx)
+    return false;
+
+  SDValue VIndex = N->getOperand(VIndexIdx);
+  SDValue VOffset = N->getOperand(VOffsetIdx);
+  SDValue SOffset = N->getOperand(SOffsetIdx);
+  SDValue IdxEn = N->getOperand(IdxEnIdx);
+
+  // Only optimize OFFEN mode: vindex=0, idxen=0 guarantees this
+  if (!isNullConstant(VIndex) || !isNullConstant(IdxEn))
+    return false;
+
+  // Only optimize when soffset is currently zero
+  // TODO: Handle non-zero soffset by combining with uniform addend
+  if (!isNullConstant(SOffset))
+    return false;
+
+  // voffset must be ADD of uniform and divergent values
+  if (VOffset.getOpcode() != ISD::ADD)
+    return false;
+
+  // Identify uniform and divergent addends
+  auto IsUniform = [](SDValue V) {
+    return isa<ConstantSDNode>(V) || !V.getNode()->isDivergent();
+  };
+
+  SDValue LHS = VOffset.getOperand(0);
+  SDValue RHS = VOffset.getOperand(1);
+  bool LHSUniform = IsUniform(LHS);
+  bool RHSUniform = IsUniform(RHS);
+
+  // Need exactly one uniform and one divergent operand
+  if (LHSUniform == RHSUniform)
+    return false;
+
+  SDValue UniformAddend = LHSUniform ? LHS : RHS;
+  SDValue DivergentAddend = LHSUniform ? RHS : LHS;
+
+  // Perform the transformation: sink uniform part into soffset
+  // SIFixSGPRCopies will handle any SGPR register class fixups if needed.
+  SmallVector<SDValue, 8> NewOps(N->op_values());
+  NewOps[VOffsetIdx] = DivergentAddend;
+  NewOps[SOffsetIdx] = UniformAddend;
+
+  LLVM_DEBUG(dbgs() << "Sinking uniform addend into SOffset for buffer "
+                    << (IsStore ? "store" : "load") << '\n');
+
+  CurDAG->UpdateNodeOperands(N, NewOps);
+  return true;
+}
+
 bool AMDGPUDAGToDAGISel::isInlineImmediate(const SDNode *N) const {
   if (N->isUndef())
     return true;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
index 4fa0d3f72e1c7..8b4e8803955e3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h
@@ -81,6 +81,7 @@ class AMDGPUDAGToDAGISel : public SelectionDAGISel {
 
   bool runOnMachineFunction(MachineFunction &MF) override;
   bool matchLoadD16FromBuildVector(SDNode *N) const;
+  bool sinkUniformAddendIntoSOffset(SDNode *N, bool IsStore);
   void PreprocessISelDAG() override;
   void Select(SDNode *N) override;
   void PostprocessISelDAG() override;
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll b/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll
new file mode 100644
index 0000000000000..a77768c631e3c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll
@@ -0,0 +1,398 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK,GFX900 %s
+
+; Test comprehensive patterns for ADD(divergent, uniform) optimization in buffer loads
+
+; Basic workitem.id.x + uniform
+define amdgpu_kernel void @test_basic_workitem_uniform(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_basic_workitem_uniform:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Reversed operands (uniform + divergent)
+define amdgpu_kernel void @test_reversed_operands(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_reversed_operands:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %soffset, %voffset  ; Reversed: uniform + divergent
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Multiple buffer loads with same pattern
+define amdgpu_kernel void @test_multiple_loads(ptr addrspace(1) %output, i32 %soffset1, i32 %soffset2) {
+; CHECK-LABEL: test_multiple_loads:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v1, v0, s[4:7], s2 offen
+; CHECK-NEXT:    buffer_load_dword v2, v0, s[4:7], s3 offen
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_add_u32_e32 v1, v1, v2
+; CHECK-NEXT:    global_store_dword v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+
+  %sum1 = add i32 %voffset, %soffset1
+  %val1 = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum1, i32 0, i32 0)
+
+  %sum2 = add i32 %voffset, %soffset2
+  %val2 = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum2, i32 0, i32 0)
+
+  %result = add i32 %val1, %val2
+  store i32 %result, ptr addrspace(1) %output
+  ret void
+}
+
+; Different buffer load variants - byte load
+define amdgpu_kernel void @test_buffer_load_byte(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_byte:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_ubyte v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  %ext = zext i8 %val to i32
+  store i32 %ext, ptr addrspace(1) %output
+  ret void
+}
+
+; Different buffer load variants - short load
+define amdgpu_kernel void @test_buffer_load_short(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_short:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_ushort v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i16 @llvm.amdgcn.raw.buffer.load.i16(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  %ext = zext i16 %val to i32
+  store i32 %ext, ptr addrspace(1) %output
+  ret void
+}
+
+; Vector loads - v2i32
+define amdgpu_kernel void @test_buffer_load_v2i32(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_v2i32:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dwordx2 v[0:1], v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store <2 x i32> %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Vector loads - v4i32
+define amdgpu_kernel void @test_buffer_load_v4i32(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_v4i32:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v4, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dwordx4 v[0:3], v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store <4 x i32> %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Float loads
+define amdgpu_kernel void @test_buffer_load_float(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_float:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store float %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Complex divergent expression + uniform
+define amdgpu_kernel void @test_complex_divergent(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_complex_divergent:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    v_add_u32_e32 v0, v0, v1
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid_x = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid_y = call i32 @llvm.amdgcn.workitem.id.y()
+  %divergent = add i32 %tid_x, %tid_y  ; Still divergent
+  %sum = add i32 %divergent, %soffset  ; divergent + uniform
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - both operands divergent
+define amdgpu_kernel void @test_both_divergent(ptr addrspace(1) %output) {
+; CHECK-LABEL: test_both_divergent:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_add_u32_e32 v0, v0, v1
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid_x = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid_y = call i32 @llvm.amdgcn.workitem.id.y()
+  %sum = add i32 %tid_x, %tid_y
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - both operands uniform
+define amdgpu_kernel void @test_both_uniform(ptr addrspace(1) %output, i32 %soffset1, i32 %soffset2) {
+; CHECK-LABEL: test_both_uniform:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_add_i32 s2, s2, s3
+; CHECK-NEXT:    v_mov_b32_e32 v0, s2
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[4:7], 0 offen
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %sum = add i32 %soffset1, %soffset2
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Nested in control flow
+define amdgpu_kernel void @test_control_flow(ptr addrspace(1) %output, i32 %soffset, i32 %condition) {
+; CHECK-LABEL: test_control_flow:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_cmp_lg_u32 s3, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB11_4
+; CHECK-NEXT:  ; %bb.1: ; %else
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    global_store_dword v1, v1, s[0:1]
+; CHECK-NEXT:    s_cbranch_execnz .LBB11_3
+; CHECK-NEXT:  .LBB11_2: ; %then
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[4:7], s2 offen
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:  .LBB11_3: ; %end
+; CHECK-NEXT:    s_endpgm
+; CHECK-NEXT:  .LBB11_4:
+; CHECK-NEXT:    s_branch .LBB11_2
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %cmp = icmp eq i32 %condition, 0
+  br i1 %cmp, label %then, label %else
+
+then:
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  br label %end
+
+else:
+  store i32 0, ptr addrspace(1) %output
+  br label %end
+
+end:
+  ret void
+}
+
+; Multiple uses of the ADD result - should still optimize buffer load
+define amdgpu_kernel void @test_multiple_uses(ptr addrspace(1) %output1, ptr addrspace(1) %output2, i32 %soffset) {
+; CHECK-LABEL: test_multiple_uses:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x34
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v1, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    v_add_u32_e32 v0, s6, v0
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v2, v1, s[0:1]
+; CHECK-NEXT:    global_store_dword v2, v0, s[2:3]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output1
+  store i32 %sum, ptr addrspace(1) %output2
+  ret void
+}
+
+; Chain of operations - workitem.id -> mul -> add -> buffer_load
+define amdgpu_kernel void @test_operation_chain(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_operation_chain:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    v_mul_u32_u24_e32 v0, 4, v0
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %scaled = mul i32 %tid, 4  ; Still divergent
+  %sum = add i32 %scaled, %soffset  ; divergent + uniform
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - Buffer load with non-zero soffset field already
+define amdgpu_kernel void @test_existing_soffset(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_existing_soffset:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_add_u32_e32 v0, s6, v0
+; CHECK-NEXT:    s_movk_i32 s6, 0x64
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 100, i32 0)  ; Non-zero soffset
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - Structured buffer loads
+define amdgpu_kerne...
[truncated]

Contributor

@krzysz00 krzysz00 left a comment


I figure this makes sense - though you might also want to look for ((divergent + uniform) + constant)

Does this need a GlobalISel version too? I forget where DAGtoDAG runs.

... the other cases are voffset = sgpr where you'll want to reset voffset to 0.

(The end goal here is that, if we're !IDXEN, voffset == 0, and we know that everything's naturally aligned, we can promote BUFFER_LOAD_* to S_BUFFER_LOAD_*)
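As an illustration of the ((divergent + uniform) + constant) shape mentioned above, in the same IR form as the tests (whether the constant should then fold into the immediate offset field is left open here):

  %tid  = call i32 @llvm.amdgcn.workitem.id.x()
  %base = add i32 %tid, %soffset   ; divergent + uniform
  %sum  = add i32 %base, 16        ; + constant
  %val  = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)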


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.h

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 3a67170cc..085365996 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -377,7 +377,8 @@ bool AMDGPUDAGToDAGISel::sinkUniformAddendIntoSOffset(SDNode *N, bool IsStore) {
 
   // Buffer operand layout:
   // Load:  (chain, rsrc, vindex, voffset, soffset, offset, cachepolicy, idxen)
-  // Store: (chain, vdata, rsrc, vindex, voffset, soffset, offset, cachepolicy, idxen)
+  // Store: (chain, vdata, rsrc, vindex, voffset, soffset, offset, cachepolicy,
+  // idxen)
   const unsigned VIndexIdx = IsStore ? 3 : 2;
   const unsigned VOffsetIdx = IsStore ? 4 : 3;
   const unsigned SOffsetIdx = IsStore ? 5 : 4;

     case ISD::BUILD_VECTOR:
-      // TODO: Match load d16 from shl (extload:i16), 16
-      MadeChange |= matchLoadD16FromBuildVector(N);
+      // D16 optimization requires subtarget support
Contributor


This is a weird place to put the optimization. This doesn't have the special generalized RMW property that the d16 loads do.

Buffers don't just come out of nowhere; you could do this on the IR.

Contributor


If we were doing this on the MIR, I'd note that we do [vgpr + add] separation in ISelLowering/InstructionLegalizer.

So if anything I'd expect this to be with the other instruction-selection machinery.

Contributor


Isn't that for the legacy intrinsics? The new intrinsics have the explicit soffset and voffset operands

Contributor


No

Even with the new ones, if I have /*voffset=*/%x + 16 in the IR, ISel will pattern-match that + 16 part into offset:16 on the instruction
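
For example, roughly (illustrative IR; the offset:16 folding is as described in the comment above, not re-verified here):

  %sum = add i32 %x, 16
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
  ; expected to select to something like: buffer_load_dword v0, v0, s[0:3], 0 offen offset:16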

Contributor


So I claim that voffset => soffset should either go there or be an IR pass after uniformity analysis like you suggested.

(Making it an IR pass also means you can do s_buffer_load promotion there)

Author

@PrasoonMishra PrasoonMishra Sep 29, 2025


@arsenm Thank you for the clarification. I wasn't aware of the RMW property distinction that justifies the D16 optimization in PreprocessISelDAG.
As suggested, an IR pass after uniformity analysis will be better. I can also enable the s_buffer_load promotion in the same pass, as mentioned by @krzysz00. I'll rework this as an IR transformation.
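
A rough before/after sketch of what such an IR rewrite could produce on the basic test pattern, assuming it simply moves the uniform addend into the intrinsic's soffset operand:

  ; before
  %sum = add i32 %tid, %soffset
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)

  ; after: divergent part stays in voffset, uniform part moves to soffset
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %tid, i32 %soffset, i32 0)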

@PrasoonMishra
Author

Closing this PR to rework as an IR transformation as discussed. Will open a new PR with the IR approach.
