
Conversation

@macurtis-amd
Contributor

With recent refactoring, LDS promotion worklists for all allocas are populated upfront. In some cases, this results in the same User appearing in multiple lists. Then, as each list is processed, a User might get deleted via removeFromParent, potentially leaving a dangling pointer in a subsequent worklist.

Currently this only occurs for memcpy and memmove. Prior to refactoring, these were handled via DeferredIntrs and were processed after the last use of the then-singular worklist.

This change moves processing of DeferredIntrs to after all worklists have been processed.
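The hazard described above, and the shape of the fix, can be sketched in plain standalone C++ (`Instr`, the ownership vector, and `processWorklists` are illustrative stand-ins, not the pass's actual types): record would-be deletions in a deduplicated deferred list while the worklists are still live, and erase only afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Illustrative stand-in for an IR instruction owned by its parent.
struct Instr { int id; };

// Several worklists may reference the same Instr. Erasing it while the
// first list is being processed would leave a dangling pointer in the
// next list. Instead, record it in a deduplicated deferred list and
// erase only after every worklist has been processed.
int processWorklists(std::vector<std::vector<Instr *>> &Worklists,
                     std::vector<std::unique_ptr<Instr>> &Owner) {
  std::vector<Instr *> Deferred;
  for (auto &WL : Worklists)
    for (Instr *I : WL)
      if (std::find(Deferred.begin(), Deferred.end(), I) == Deferred.end())
        Deferred.push_back(I); // no erasure yet: the lists stay valid
  int Erased = 0;
  for (Instr *I : Deferred) {
    auto It = std::find_if(Owner.begin(), Owner.end(),
                           [I](const std::unique_ptr<Instr> &P) {
                             return P.get() == I;
                           });
    if (It != Owner.end()) {
      Owner.erase(It); // each deferred instruction is destroyed exactly once
      ++Erased;
    }
  }
  return Erased;
}
```

Even with the same instruction referenced from two worklists, the deduplicated deferred list erases it once and no list is walked after its entries are freed.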

@llvmbot
Member

llvmbot commented Dec 18, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (macurtis-amd)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/172771.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+17-8)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll (+70)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index 83b463c630d71..361a74d57c784 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -159,7 +159,10 @@ class AMDGPUPromoteAllocaImpl {
   void analyzePromoteToVector(AllocaAnalysis &AA) const;
   void promoteAllocaToVector(AllocaAnalysis &AA);
   void analyzePromoteToLDS(AllocaAnalysis &AA) const;
-  bool tryPromoteAllocaToLDS(AllocaAnalysis &AA, bool SufficientLDS);
+  bool tryPromoteAllocaToLDS(AllocaAnalysis &AA, bool SufficientLDS,
+                             SmallVector<IntrinsicInst *> &DeferredIntrs);
+  void finishDeferredAllocaToLDSPromotion(
+      SmallVector<IntrinsicInst *> &DeferredIntrs);
 
   void scoreAlloca(AllocaAnalysis &AA) const;
 
@@ -414,6 +417,7 @@ bool AMDGPUPromoteAllocaImpl::run(Function &F, bool PromoteToLDS) {
   // clang-format on
 
   bool Changed = false;
+  SmallVector<IntrinsicInst *> DeferredIntrs;
   for (AllocaAnalysis &AA : Allocas) {
     if (AA.Vector.Ty) {
       const unsigned AllocaCost =
@@ -435,9 +439,11 @@ bool AMDGPUPromoteAllocaImpl::run(Function &F, bool PromoteToLDS) {
       }
     }
 
-    if (AA.LDS.Enable && tryPromoteAllocaToLDS(AA, SufficientLDS))
+    if (AA.LDS.Enable &&
+        tryPromoteAllocaToLDS(AA, SufficientLDS, DeferredIntrs))
       Changed = true;
   }
+  finishDeferredAllocaToLDSPromotion(DeferredIntrs);
 
   // NOTE: tryPromoteAllocaToVector removes the alloca, so Allocas contains
   // dangling pointers. If we want to reuse it past this point, the loop above
@@ -1550,8 +1556,9 @@ bool AMDGPUPromoteAllocaImpl::hasSufficientLocalMem(const Function &F) {
 }
 
 // FIXME: Should try to pick the most likely to be profitable allocas first.
-bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
-                                                    bool SufficientLDS) {
+bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(
+    AllocaAnalysis &AA, bool SufficientLDS,
+    SmallVector<IntrinsicInst *> &DeferredIntrs) {
   LLVM_DEBUG(dbgs() << "Trying to promote to LDS: " << *AA.Alloca << '\n');
 
   // Not likely to have sufficient local memory for promotion.
@@ -1620,8 +1627,6 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
   AA.Alloca->replaceAllUsesWith(Offset);
   AA.Alloca->eraseFromParent();
 
-  SmallVector<IntrinsicInst *> DeferredIntrs;
-
   PointerType *NewPtrTy = PointerType::get(Context, AMDGPUAS::LOCAL_ADDRESS);
 
   for (Value *V : AA.LDS.Worklist) {
@@ -1730,7 +1735,13 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
     }
   }
 
+  return true;
+}
+
+void AMDGPUPromoteAllocaImpl::finishDeferredAllocaToLDSPromotion(
+    SmallVector<IntrinsicInst *> &DeferredIntrs) {
   for (IntrinsicInst *Intr : DeferredIntrs) {
+    IRBuilder<> Builder(Intr);
     Builder.SetInsertPoint(Intr);
     Intrinsic::ID ID = Intr->getIntrinsicID();
     assert(ID == Intrinsic::memcpy || ID == Intrinsic::memmove);
@@ -1748,6 +1759,4 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
 
     Intr->eraseFromParent();
   }
-
-  return true;
 }
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll
new file mode 100644
index 0000000000000..915e0910a5047
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll
@@ -0,0 +1,70 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -passes=amdgpu-promote-alloca < %s | FileCheck %s
+
+; This tests the case where both pointer operands of a memcpy are promoted to LDS.
+; See `@llvm.memcpy.p5.p5.i64(... %alloca1, ... %alloca, ...)` below.
+
+
+%struct.barney = type { i8, double }
+
+; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)
+define amdgpu_kernel void @zot() local_unnamed_addr #0 {
+; CHECK-LABEL: @zot(
+; CHECK-NEXT:  bb:
+; CHECK-NEXT:    [[TMP0:%.*]] = call noalias nonnull dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP0]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !invariant.load [[META0:![0-9]+]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP0]], i64 2
+; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr addrspace(4) [[TMP3]], align 4, !range [[RNG1:![0-9]+]], !invariant.load [[META0]]
+; CHECK-NEXT:    [[TMP5:%.*]] = lshr i32 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP6:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x()
+; CHECK-NEXT:    [[TMP7:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.y()
+; CHECK-NEXT:    [[TMP8:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.z()
+; CHECK-NEXT:    [[TMP9:%.*]] = mul nuw nsw i32 [[TMP5]], [[TMP4]]
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i32 [[TMP9]], [[TMP6]]
+; CHECK-NEXT:    [[TMP11:%.*]] = mul nuw nsw i32 [[TMP7]], [[TMP4]]
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[TMP10]], [[TMP11]]
+; CHECK-NEXT:    [[TMP13:%.*]] = add i32 [[TMP12]], [[TMP8]]
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds [1024 x [[STRUCT_BARNEY:%.*]]], ptr addrspace(3) @zot.alloca, i32 0, i32 [[TMP13]]
+; CHECK-NEXT:    [[TMP15:%.*]] = call noalias nonnull dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
+; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP15]], i64 1
+; CHECK-NEXT:    [[TMP17:%.*]] = load i32, ptr addrspace(4) [[TMP16]], align 4, !invariant.load [[META0]]
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP15]], i64 2
+; CHECK-NEXT:    [[TMP19:%.*]] = load i32, ptr addrspace(4) [[TMP18]], align 4, !range [[RNG1]], !invariant.load [[META0]]
+; CHECK-NEXT:    [[TMP20:%.*]] = lshr i32 [[TMP17]], 16
+; CHECK-NEXT:    [[TMP21:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x()
+; CHECK-NEXT:    [[TMP22:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.y()
+; CHECK-NEXT:    [[TMP23:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.z()
+; CHECK-NEXT:    [[TMP24:%.*]] = mul nuw nsw i32 [[TMP20]], [[TMP19]]
+; CHECK-NEXT:    [[TMP25:%.*]] = mul i32 [[TMP24]], [[TMP21]]
+; CHECK-NEXT:    [[TMP26:%.*]] = mul nuw nsw i32 [[TMP22]], [[TMP19]]
+; CHECK-NEXT:    [[TMP27:%.*]] = add i32 [[TMP25]], [[TMP26]]
+; CHECK-NEXT:    [[TMP28:%.*]] = add i32 [[TMP27]], [[TMP23]]
+; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds [1024 x [[STRUCT_BARNEY]]], ptr addrspace(3) @zot.alloca1, i32 0, i32 [[TMP28]]
+; CHECK-NEXT:    store i32 0, ptr addrspace(5) null, align 2147483648
+; CHECK-NEXT:    call void @llvm.memcpy.p3.p3.i64(ptr addrspace(3) align 16 dereferenceable(16) [[TMP29]], ptr addrspace(3) align 16 dereferenceable(16) [[TMP14]], i64 16, i1 false)
+; CHECK-NEXT:    call void @llvm.memcpy.p3.p0.i64(ptr addrspace(3) align 16 dereferenceable(16) [[TMP14]], ptr align 1 dereferenceable(16) poison, i64 16, i1 false)
+; CHECK-NEXT:    [[LOAD:%.*]] = load volatile ptr, ptr addrspace(5) null, align 2147483648
+; CHECK-NEXT:    br label [[BB2:%.*]]
+; CHECK:       bb2:
+; CHECK-NEXT:    call void @llvm.memcpy.p0.p3.i64(ptr align 1 dereferenceable(16) @hoge, ptr addrspace(3) align 16 dereferenceable(16) [[TMP29]], i64 16, i1 false)
+; CHECK-NEXT:    br label [[BB2]]
+;
+bb:
+  %alloca = alloca %struct.barney, align 16, addrspace(5)
+  %alloca1 = alloca %struct.barney, align 16, addrspace(5)
+  store i32 0, ptr addrspace(5) null, align 2147483648
+  call void @llvm.memcpy.p5.p5.i64(ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca1, ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca, i64 16, i1 false)
+  call void @llvm.memcpy.p5.p0.i64(ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca, ptr noundef nonnull align 1 dereferenceable(16) poison, i64 16, i1 false)
+  %load = load volatile ptr, ptr addrspace(5) null, align 2147483648
+  br label %bb2
+
+bb2:                                              ; preds = %bb2, %bb
+  call void @llvm.memcpy.p0.p5.i64(ptr noundef nonnull align 1 dereferenceable(16) @hoge, ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca1, i64 16, i1 false)
+  br label %bb2
+}
+
+declare ptr @hoge() local_unnamed_addr #1
+
+attributes #0 = { nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none) "amdgpu-agpr-alloc"="0" "amdgpu-no-cluster-id-x" "amdgpu-no-cluster-id-y" "amdgpu-no-cluster-id-z" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+attributes #1 = { "uniform-work-group-size"="false" }

@ronlieb ronlieb self-requested a review December 18, 2025 01:07
@macurtis-amd
Contributor Author

FWIW, I also have an alternative fix that is a bit more robust/general IMO: macurtis-amd@21a3e3c.

It also allows processing of memcpy/memmove in the main switch along with the other intrinsics.

@PrasoonMishra
Copy link
Contributor

PrasoonMishra commented Dec 18, 2025

Thanks for this fix.

In your fix, when a memcpy uses two different allocas (one as the destination, one as the source), it gets added to DeferredIntrs twice, once while processing each alloca's worklist. The second erase will lead to a crash.

define amdgpu_kernel void @test() #0 {
  %a1 = alloca [4 x i32], addrspace(5)
  %a2 = alloca [4 x i32], addrspace(5)
  call void @llvm.memcpy.p5.p5.i64(ptr addrspace(5) %a1, ptr addrspace(5) %a2, i64 16, i1 false)
  ret void
}
attributes #0 = { "amdgpu-flat-work-group-size"="64,64" }
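The SetVector behavior the eventual fix relies on can be sketched with a minimal insertion-ordered set (`MiniSetVector` is a hypothetical stand-in for `llvm::SetVector`, not its real interface): a second insertion of the same memcpy is a no-op, so the deferred instruction can only be queued, and later erased, once.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Minimal stand-in for llvm::SetVector: insertion-ordered, duplicates ignored.
template <typename T> class MiniSetVector {
  std::vector<T> Vec;        // preserves insertion order for iteration
  std::unordered_set<T> Set; // fast membership check for deduplication
public:
  bool insert(const T &V) {
    if (!Set.insert(V).second)
      return false; // already present: the second add is a no-op
    Vec.push_back(V);
    return true;
  }
  auto begin() const { return Vec.begin(); }
  auto end() const { return Vec.end(); }
  std::size_t size() const { return Vec.size(); }
};
```

With a plain SmallVector, the memcpy above would appear twice in the deferred list and be erased twice; with the set-vector semantics it appears exactly once.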

Contributor

@PrasoonMishra PrasoonMishra left a comment

See my comment for details; needs deduplication of DeferredIntrs.

@macurtis-amd
Contributor Author

In your fix, when a memcpy uses two different allocas (one as the destination, one as the source), it gets added to DeferredIntrs twice, once while processing each alloca's worklist. The second erase will lead to a crash.

Good catch. I've changed DeferredIntrs to a SetVector to prevent duplicates. Thanks.

Contributor

@PrasoonMishra PrasoonMishra left a comment

LGTM.

Contributor

@ronlieb ronlieb left a comment


thx for the fixes, looking forward to it landing soon

@macurtis-amd macurtis-amd merged commit e741cd8 into llvm:main Dec 18, 2025
9 of 10 checks passed
ronlieb pushed a commit to ROCm/llvm-project that referenced this pull request Dec 18, 2025
ronlieb pushed a commit to ROCm/llvm-project that referenced this pull request Dec 18, 2025
Contributor

@arsenm arsenm left a comment

Should this be using ValueHandle / WeakVH?
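For context on this suggestion: WeakVH is an LLVM handle that automatically becomes null when the Value it tracks is deleted, so stale worklist entries can be detected rather than dereferenced. The idea can be approximated in standard C++ with std::weak_ptr (`Instr` and `processLive` are illustrative; LLVM Values are not actually shared_ptr-managed):

```cpp
#include <memory>
#include <vector>

// Illustrative stand-in for an IR instruction.
struct Instr { int id; };

// The weak-handle idea behind WeakVH, approximated with std::weak_ptr:
// a worklist of weak handles can skip entries whose instruction was
// already deleted, instead of dereferencing a dangling pointer.
int processLive(const std::vector<std::weak_ptr<Instr>> &Worklist) {
  int Live = 0;
  for (const auto &H : Worklist)
    if (auto I = H.lock()) // null if the instruction was erased
      ++Live;
  return Live;
}
```

Under this scheme, deleting an instruction mid-pass simply makes its remaining worklist entries lock to null rather than dangle.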

%struct.barney = type { i8, double }

; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)
define amdgpu_kernel void @zot() local_unnamed_addr #0 {

Suggested change
define amdgpu_kernel void @zot() local_unnamed_addr #0 {
define amdgpu_kernel void @zot() #0 {


%struct.barney = type { i8, double }

; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)

Suggested change
; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)

Comment on lines +67 to +70
declare ptr @hoge() local_unnamed_addr #1

attributes #0 = { nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none) "amdgpu-agpr-alloc"="0" "amdgpu-no-cluster-id-x" "amdgpu-no-cluster-id-y" "amdgpu-no-cluster-id-z" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
attributes #1 = { "uniform-work-group-size"="false" }

Remove unnecessary attributes (which is probably all of them)

mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Dec 19, 2025
macurtis-amd added a commit that referenced this pull request Dec 19, 2025
Remove unnecessary attributes in test case as requested in post-merge
feedback (#172771).