
Conversation

@krzysz00
Contributor


This commit adds the rewrite

```
llvm.amdgcn.tensor.{load.to/store.from}.lds(
  <4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer,
  <4 x i32> zeroinitializer, i32 [cachepolicy])
=>
llvm.amdgcn.tensor.{load.to/store.from}.lds.d2(
  <4 x i32> %d0, <8 x i32> %d1, i32 [cachepolicy])
```

This is justified because, with the short encoding that passes the NULL SGPR for registers 2 and 3, the hardware acts as if those registers were 0, including in gather mode.

It is always safe not to run this transformation.

(Note: tests were LLM'd and then tweaked.)
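
As a concrete illustration (argument names are placeholders; the actual coverage is in the test file in the diff below), this is the shape of IR the combine matches and what it emits:

```
declare void @llvm.amdgcn.tensor.load.to.lds(<4 x i32>, <8 x i32>, <4 x i32>, <4 x i32>, i32 immarg)

define void @example(<4 x i32> inreg %d0, <8 x i32> inreg %d1) {
  ; Before the combine: the last two tensor DMA groups are all-zero.
  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, i32 0)
  ; After -passes=instcombine this becomes the short two-group form (same cache policy):
  ;   call void @llvm.amdgcn.tensor.load.to.lds.d2(<4 x i32> %d0, <8 x i32> %d1, i32 0)
  ret void
}
```
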
@llvmbot added the backend:AMDGPU, llvm:instcombine, and llvm:transforms labels on Dec 10, 2025
@llvmbot
Member

llvmbot commented Dec 10, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Krzysztof Drewniak (krzysz00)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/171540.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+20)
  • (added) llvm/test/Transforms/InstCombine/AMDGPU/tensor-load-store-lds.ll (+125)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 47926734d64d4..d3525e1eca304 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -1737,6 +1737,26 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
     NewII->takeName(&II);
     return IC.replaceInstUsesWith(II, NewII);
   }
+  case Intrinsic::amdgcn_tensor_load_to_lds:
+  case Intrinsic::amdgcn_tensor_store_from_lds: {
+    Value *D2 = II.getArgOperand(2);
+    Value *D3 = II.getArgOperand(3);
+    // We know that not passing the second and third tensor DMA groups is
+    // equivalent to passing zeroes for those registers, so we rewrite to the
+    // shorter form here.
+    if (!match(D2, m_Zero()) || !match(D3, m_Zero()))
+      return std::nullopt;
+
+    auto ShortIntrinsic = IID == Intrinsic::amdgcn_tensor_load_to_lds
+                              ? Intrinsic::amdgcn_tensor_load_to_lds_d2
+                              : Intrinsic::amdgcn_tensor_store_from_lds_d2;
+    CallInst *NewII = IC.Builder.CreateIntrinsic(
+        ShortIntrinsic,
+        {II.getArgOperand(0), II.getArgOperand(1), II.getArgOperand(4)}, &II);
+    NewII->takeName(&II);
+    NewII->copyMetadata(II);
+    return IC.eraseInstFromFunction(II);
+  }
   }
   if (const AMDGPU::ImageDimIntrinsicInfo *ImageDimIntr =
             AMDGPU::getImageDimIntrinsicInfo(II.getIntrinsicID())) {
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/tensor-load-store-lds.ll b/llvm/test/Transforms/InstCombine/AMDGPU/tensor-load-store-lds.ll
new file mode 100644
index 0000000000000..e9cf704a8026e
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/tensor-load-store-lds.ll
@@ -0,0 +1,125 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=instcombine < %s | FileCheck %s
+
+; --------------------------------------------------------------------
+; tensor_load_to_lds: D2 and D3 are zero -> convert to _d2 variant
+; --------------------------------------------------------------------
+
+define void @test_tensor_load_to_lds_d2_d3_zero(<4 x i32> inreg %d0, <8 x i32> inreg %d1) {
+; CHECK-LABEL: define void @test_tensor_load_to_lds_d2_d3_zero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.load.to.lds.d2(<4 x i32> [[D0]], <8 x i32> [[D1]], i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, i32 0)
+  ret void
+}
+
+; --------------------------------------------------------------------
+; non-matching patterns for tensor_load_to_lds simplification
+; --------------------------------------------------------------------
+
+define void @test_tensor_load_to_lds_d2_zero_d3_nonzero(<4 x i32> inreg %d0, <8 x i32> inreg %d1, <4 x i32> inreg %d3) {
+; CHECK-LABEL: define void @test_tensor_load_to_lds_d2_zero_d3_nonzero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]], <4 x i32> inreg [[D3:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> [[D0]], <8 x i32> [[D1]], <4 x i32> zeroinitializer, <4 x i32> [[D3]], i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> %d3, i32 0)
+  ret void
+}
+
+define void @test_tensor_load_to_lds_d2_nonzero_d3_zero(<4 x i32> inreg %d0, <8 x i32> inreg %d1, <4 x i32> inreg %d2) {
+; CHECK-LABEL: define void @test_tensor_load_to_lds_d2_nonzero_d3_zero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]], <4 x i32> inreg [[D2:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> [[D0]], <8 x i32> [[D1]], <4 x i32> [[D2]], <4 x i32> zeroinitializer, i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> %d2, <4 x i32> zeroinitializer, i32 0)
+  ret void
+}
+
+define void @test_tensor_load_to_lds_d2_d3_nonzero(<4 x i32> inreg %d0, <8 x i32> inreg %d1, <4 x i32> inreg %d2, <4 x i32> inreg %d3) {
+; CHECK-LABEL: define void @test_tensor_load_to_lds_d2_d3_nonzero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]], <4 x i32> inreg [[D2:%.*]], <4 x i32> inreg [[D3:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> [[D0]], <8 x i32> [[D1]], <4 x i32> [[D2]], <4 x i32> [[D3]], i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> %d2, <4 x i32> %d3, i32 0)
+  ret void
+}
+
+; --------------------------------------------------------------------
+; tensor_store_from_lds: D2 and D3 are zero -> convert to _d2 variant
+; --------------------------------------------------------------------
+
+define void @test_tensor_store_from_lds_d2_d3_zero(<4 x i32> inreg %d0, <8 x i32> inreg %d1) {
+; CHECK-LABEL: define void @test_tensor_store_from_lds_d2_d3_zero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.store.from.lds.d2(<4 x i32> [[D0]], <8 x i32> [[D1]], i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, i32 0)
+  ret void
+}
+
+; --------------------------------------------------------------------
+; non-matching patterns for tensor_store_from_lds simplification
+; --------------------------------------------------------------------
+
+define void @test_tensor_store_from_lds_d2_zero_d3_nonzero(<4 x i32> inreg %d0, <8 x i32> inreg %d1, <4 x i32> inreg %d3) {
+; CHECK-LABEL: define void @test_tensor_store_from_lds_d2_zero_d3_nonzero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]], <4 x i32> inreg [[D3:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> [[D0]], <8 x i32> [[D1]], <4 x i32> zeroinitializer, <4 x i32> [[D3]], i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> %d3, i32 0)
+  ret void
+}
+
+define void @test_tensor_store_from_lds_d2_nonzero_d3_zero(<4 x i32> inreg %d0, <8 x i32> inreg %d1, <4 x i32> inreg %d2) {
+; CHECK-LABEL: define void @test_tensor_store_from_lds_d2_nonzero_d3_zero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]], <4 x i32> inreg [[D2:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> [[D0]], <8 x i32> [[D1]], <4 x i32> [[D2]], <4 x i32> zeroinitializer, i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> %d2, <4 x i32> zeroinitializer, i32 0)
+  ret void
+}
+
+define void @test_tensor_store_from_lds_d2_d3_nonzero(<4 x i32> inreg %d0, <8 x i32> inreg %d1, <4 x i32> inreg %d2, <4 x i32> inreg %d3) {
+; CHECK-LABEL: define void @test_tensor_store_from_lds_d2_d3_nonzero(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]], <4 x i32> inreg [[D2:%.*]], <4 x i32> inreg [[D3:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> [[D0]], <8 x i32> [[D1]], <4 x i32> [[D2]], <4 x i32> [[D3]], i32 0)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> %d2, <4 x i32> %d3, i32 0)
+  ret void
+}
+
+; --------------------------------------------------------------------
+; ensure cachepolicy is preserved
+; --------------------------------------------------------------------
+
+define void @test_tensor_load_to_lds_d2_d3_zero_cachepolicy(<4 x i32> inreg %d0, <8 x i32> inreg %d1) {
+; CHECK-LABEL: define void @test_tensor_load_to_lds_d2_d3_zero_cachepolicy(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.load.to.lds.d2(<4 x i32> [[D0]], <8 x i32> [[D1]], i32 1)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, i32 1)
+  ret void
+}
+
+define void @test_tensor_store_from_lds_d2_d3_zero_cachepolicy(<4 x i32> inreg %d0, <8 x i32> inreg %d1) {
+; CHECK-LABEL: define void @test_tensor_store_from_lds_d2_d3_zero_cachepolicy(
+; CHECK-SAME: <4 x i32> inreg [[D0:%.*]], <8 x i32> inreg [[D1:%.*]]) {
+; CHECK-NEXT:    call void @llvm.amdgcn.tensor.store.from.lds.d2(<4 x i32> [[D0]], <8 x i32> [[D1]], i32 1)
+; CHECK-NEXT:    ret void
+;
+  call void @llvm.amdgcn.tensor.store.from.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> zeroinitializer, <4 x i32> zeroinitializer, i32 1)
+  ret void
+}
+
+declare void @llvm.amdgcn.tensor.load.to.lds(<4 x i32>, <8 x i32>, <4 x i32>, <4 x i32>, i32 immarg)
+declare void @llvm.amdgcn.tensor.store.from.lds(<4 x i32>, <8 x i32>, <4 x i32>, <4 x i32>, i32 immarg)

@github-actions

github-actions bot commented Dec 10, 2025

🐧 Linux x64 Test Results

  • 187268 tests passed
  • 4939 tests skipped

✅ The build succeeded and all tests passed.

@github-actions

github-actions bot commented Dec 10, 2025

🪟 Windows x64 Test Results

  • 128541 tests passed
  • 2804 tests skipped

✅ The build succeeded and all tests passed.

// We know that not passing the second and third tensor DMA groups is
// equivalent to passing zeroes for those registers, so we rewrite to the
// shorter form here.
if (!match(D2, m_Zero()) || !match(D3, m_Zero()))
Contributor


Can you also do this for undef?

Contributor Author


Yep, we're now matching undef/poison too
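
For illustration only (a sketch of the case described here, not the exact test that was added), a call whose trailing groups are poison should now fold the same way:

```
define void @poison_groups(<4 x i32> inreg %d0, <8 x i32> inreg %d1) {
  ; D2/D3 are poison; the combine now treats poison (and undef) trailing
  ; groups like zeroinitializer, so the rewrite to the .d2 form still fires.
  call void @llvm.amdgcn.tensor.load.to.lds(<4 x i32> %d0, <8 x i32> %d1, <4 x i32> poison, <4 x i32> poison, i32 0)
  ; expected after instcombine:
  ;   call void @llvm.amdgcn.tensor.load.to.lds.d2(<4 x i32> %d0, <8 x i32> %d1, i32 0)
  ret void
}
```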

}

declare void @llvm.amdgcn.tensor.load.to.lds(<4 x i32>, <8 x i32>, <4 x i32>, <4 x i32>, i32 immarg)
declare void @llvm.amdgcn.tensor.store.from.lds(<4 x i32>, <8 x i32>, <4 x i32>, <4 x i32>, i32 immarg)
Contributor


Test poison case?

Contributor Author


Done (but no undef test because we seem to not like those)

@krzysz00 requested a review from arsenm on December 10, 2025 at 18:28
@krzysz00 merged commit e7dd7b8 into llvm:main on Dec 15, 2025
10 checks passed
