
Conversation

@macurtis-amd
Contributor

With recent refactoring, LDS promotion worklists for all allocas are populated upfront. In some cases, this results in the same User appearing in multiple lists. Then, as each list is processed, a User might get deleted via removeFromParent, potentially leaving a dangling pointer in a subsequent worklist.

Currently this only occurs for memcpy and memmove. Prior to refactoring, these were handled via DeferredIntrs and were processed after the last use of the then-singular worklist.

This change moves processing of DeferredIntrs to after all worklists have been processed.
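The hazard described above, and the shape of the fix, can be sketched in plain standalone C++ (`Instr`, the ownership vector, and `processWorklists` are illustrative stand-ins, not the pass's actual types): record would-be deletions in a deduplicated deferred list while the worklists are still live, and erase only afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Illustrative stand-in for an IR instruction owned by its parent.
struct Instr { int id; };

// Several worklists may reference the same Instr. Erasing it while the
// first list is being processed would leave a dangling pointer in the
// next list. Instead, record it in a deduplicated deferred list and
// erase only after every worklist has been processed.
int processWorklists(std::vector<std::vector<Instr *>> &Worklists,
                     std::vector<std::unique_ptr<Instr>> &Owner) {
  std::vector<Instr *> Deferred;
  for (auto &WL : Worklists)
    for (Instr *I : WL)
      if (std::find(Deferred.begin(), Deferred.end(), I) == Deferred.end())
        Deferred.push_back(I); // no erasure yet: the lists stay valid
  int Erased = 0;
  for (Instr *I : Deferred) {
    auto It = std::find_if(Owner.begin(), Owner.end(),
                           [I](const std::unique_ptr<Instr> &P) {
                             return P.get() == I;
                           });
    if (It != Owner.end()) {
      Owner.erase(It); // each deferred instruction is destroyed exactly once
      ++Erased;
    }
  }
  return Erased;
}
```

Even with the same instruction referenced from two worklists, the deduplicated deferred list erases it once and no list is walked after its entries are freed.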

@llvmbot
Member

llvmbot commented Dec 18, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (macurtis-amd)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/172771.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+17-8)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll (+70)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index 83b463c630d71..361a74d57c784 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -159,7 +159,10 @@ class AMDGPUPromoteAllocaImpl {
   void analyzePromoteToVector(AllocaAnalysis &AA) const;
   void promoteAllocaToVector(AllocaAnalysis &AA);
   void analyzePromoteToLDS(AllocaAnalysis &AA) const;
-  bool tryPromoteAllocaToLDS(AllocaAnalysis &AA, bool SufficientLDS);
+  bool tryPromoteAllocaToLDS(AllocaAnalysis &AA, bool SufficientLDS,
+                             SmallVector<IntrinsicInst *> &DeferredIntrs);
+  void finishDeferredAllocaToLDSPromotion(
+      SmallVector<IntrinsicInst *> &DeferredIntrs);
 
   void scoreAlloca(AllocaAnalysis &AA) const;
 
@@ -414,6 +417,7 @@ bool AMDGPUPromoteAllocaImpl::run(Function &F, bool PromoteToLDS) {
   // clang-format on
 
   bool Changed = false;
+  SmallVector<IntrinsicInst *> DeferredIntrs;
   for (AllocaAnalysis &AA : Allocas) {
     if (AA.Vector.Ty) {
       const unsigned AllocaCost =
@@ -435,9 +439,11 @@ bool AMDGPUPromoteAllocaImpl::run(Function &F, bool PromoteToLDS) {
       }
     }
 
-    if (AA.LDS.Enable && tryPromoteAllocaToLDS(AA, SufficientLDS))
+    if (AA.LDS.Enable &&
+        tryPromoteAllocaToLDS(AA, SufficientLDS, DeferredIntrs))
       Changed = true;
   }
+  finishDeferredAllocaToLDSPromotion(DeferredIntrs);
 
   // NOTE: tryPromoteAllocaToVector removes the alloca, so Allocas contains
   // dangling pointers. If we want to reuse it past this point, the loop above
@@ -1550,8 +1556,9 @@ bool AMDGPUPromoteAllocaImpl::hasSufficientLocalMem(const Function &F) {
 }
 
 // FIXME: Should try to pick the most likely to be profitable allocas first.
-bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
-                                                    bool SufficientLDS) {
+bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(
+    AllocaAnalysis &AA, bool SufficientLDS,
+    SmallVector<IntrinsicInst *> &DeferredIntrs) {
   LLVM_DEBUG(dbgs() << "Trying to promote to LDS: " << *AA.Alloca << '\n');
 
   // Not likely to have sufficient local memory for promotion.
@@ -1620,8 +1627,6 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
   AA.Alloca->replaceAllUsesWith(Offset);
   AA.Alloca->eraseFromParent();
 
-  SmallVector<IntrinsicInst *> DeferredIntrs;
-
   PointerType *NewPtrTy = PointerType::get(Context, AMDGPUAS::LOCAL_ADDRESS);
 
   for (Value *V : AA.LDS.Worklist) {
@@ -1730,7 +1735,13 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
     }
   }
 
+  return true;
+}
+
+void AMDGPUPromoteAllocaImpl::finishDeferredAllocaToLDSPromotion(
+    SmallVector<IntrinsicInst *> &DeferredIntrs) {
   for (IntrinsicInst *Intr : DeferredIntrs) {
+    IRBuilder<> Builder(Intr);
     Builder.SetInsertPoint(Intr);
     Intrinsic::ID ID = Intr->getIntrinsicID();
     assert(ID == Intrinsic::memcpy || ID == Intrinsic::memmove);
@@ -1748,6 +1759,4 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaAnalysis &AA,
 
     Intr->eraseFromParent();
   }
-
-  return true;
 }
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll
new file mode 100644
index 0000000000000..915e0910a5047
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-user-mult.ll
@@ -0,0 +1,70 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -passes=amdgpu-promote-alloca < %s | FileCheck %s
+
+; This tests the case where both pointer operands of a memcpy are promoted to LDS.
+; See `@llvm.memcpy.p5.p5.i64(... %alloca1, ... %alloca, ...)` below.
+
+
+%struct.barney = type { i8, double }
+
+; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)
+define amdgpu_kernel void @zot() local_unnamed_addr #0 {
+; CHECK-LABEL: @zot(
+; CHECK-NEXT:  bb:
+; CHECK-NEXT:    [[TMP0:%.*]] = call noalias nonnull dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP0]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !invariant.load [[META0:![0-9]+]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP0]], i64 2
+; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr addrspace(4) [[TMP3]], align 4, !range [[RNG1:![0-9]+]], !invariant.load [[META0]]
+; CHECK-NEXT:    [[TMP5:%.*]] = lshr i32 [[TMP2]], 16
+; CHECK-NEXT:    [[TMP6:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x()
+; CHECK-NEXT:    [[TMP7:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.y()
+; CHECK-NEXT:    [[TMP8:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.z()
+; CHECK-NEXT:    [[TMP9:%.*]] = mul nuw nsw i32 [[TMP5]], [[TMP4]]
+; CHECK-NEXT:    [[TMP10:%.*]] = mul i32 [[TMP9]], [[TMP6]]
+; CHECK-NEXT:    [[TMP11:%.*]] = mul nuw nsw i32 [[TMP7]], [[TMP4]]
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[TMP10]], [[TMP11]]
+; CHECK-NEXT:    [[TMP13:%.*]] = add i32 [[TMP12]], [[TMP8]]
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds [1024 x [[STRUCT_BARNEY:%.*]]], ptr addrspace(3) @zot.alloca, i32 0, i32 [[TMP13]]
+; CHECK-NEXT:    [[TMP15:%.*]] = call noalias nonnull dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
+; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP15]], i64 1
+; CHECK-NEXT:    [[TMP17:%.*]] = load i32, ptr addrspace(4) [[TMP16]], align 4, !invariant.load [[META0]]
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i32, ptr addrspace(4) [[TMP15]], i64 2
+; CHECK-NEXT:    [[TMP19:%.*]] = load i32, ptr addrspace(4) [[TMP18]], align 4, !range [[RNG1]], !invariant.load [[META0]]
+; CHECK-NEXT:    [[TMP20:%.*]] = lshr i32 [[TMP17]], 16
+; CHECK-NEXT:    [[TMP21:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x()
+; CHECK-NEXT:    [[TMP22:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.y()
+; CHECK-NEXT:    [[TMP23:%.*]] = call range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.z()
+; CHECK-NEXT:    [[TMP24:%.*]] = mul nuw nsw i32 [[TMP20]], [[TMP19]]
+; CHECK-NEXT:    [[TMP25:%.*]] = mul i32 [[TMP24]], [[TMP21]]
+; CHECK-NEXT:    [[TMP26:%.*]] = mul nuw nsw i32 [[TMP22]], [[TMP19]]
+; CHECK-NEXT:    [[TMP27:%.*]] = add i32 [[TMP25]], [[TMP26]]
+; CHECK-NEXT:    [[TMP28:%.*]] = add i32 [[TMP27]], [[TMP23]]
+; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds [1024 x [[STRUCT_BARNEY]]], ptr addrspace(3) @zot.alloca1, i32 0, i32 [[TMP28]]
+; CHECK-NEXT:    store i32 0, ptr addrspace(5) null, align 2147483648
+; CHECK-NEXT:    call void @llvm.memcpy.p3.p3.i64(ptr addrspace(3) align 16 dereferenceable(16) [[TMP29]], ptr addrspace(3) align 16 dereferenceable(16) [[TMP14]], i64 16, i1 false)
+; CHECK-NEXT:    call void @llvm.memcpy.p3.p0.i64(ptr addrspace(3) align 16 dereferenceable(16) [[TMP14]], ptr align 1 dereferenceable(16) poison, i64 16, i1 false)
+; CHECK-NEXT:    [[LOAD:%.*]] = load volatile ptr, ptr addrspace(5) null, align 2147483648
+; CHECK-NEXT:    br label [[BB2:%.*]]
+; CHECK:       bb2:
+; CHECK-NEXT:    call void @llvm.memcpy.p0.p3.i64(ptr align 1 dereferenceable(16) @hoge, ptr addrspace(3) align 16 dereferenceable(16) [[TMP29]], i64 16, i1 false)
+; CHECK-NEXT:    br label [[BB2]]
+;
+bb:
+  %alloca = alloca %struct.barney, align 16, addrspace(5)
+  %alloca1 = alloca %struct.barney, align 16, addrspace(5)
+  store i32 0, ptr addrspace(5) null, align 2147483648
+  call void @llvm.memcpy.p5.p5.i64(ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca1, ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca, i64 16, i1 false)
+  call void @llvm.memcpy.p5.p0.i64(ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca, ptr noundef nonnull align 1 dereferenceable(16) poison, i64 16, i1 false)
+  %load = load volatile ptr, ptr addrspace(5) null, align 2147483648
+  br label %bb2
+
+bb2:                                              ; preds = %bb2, %bb
+  call void @llvm.memcpy.p0.p5.i64(ptr noundef nonnull align 1 dereferenceable(16) @hoge, ptr addrspace(5) noundef align 16 dereferenceable(16) %alloca1, i64 16, i1 false)
+  br label %bb2
+}
+
+declare ptr @hoge() local_unnamed_addr #1
+
+attributes #0 = { nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none) "amdgpu-agpr-alloc"="0" "amdgpu-no-cluster-id-x" "amdgpu-no-cluster-id-y" "amdgpu-no-cluster-id-z" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
+attributes #1 = { "uniform-work-group-size"="false" }

@ronlieb ronlieb self-requested a review December 18, 2025 01:07
@macurtis-amd
Contributor Author

FWIW, I also have an alternative fix that is a bit more robust/general IMO: macurtis-amd@21a3e3c.

It also allows processing of memcpy/memmove in the main switch along with the other intrinsics.

@PrasoonMishra
Copy link
Contributor

PrasoonMishra commented Dec 18, 2025

Thanks for this fix.

In your fix, when a memcpy uses two different allocas (one as the destination, one as the source), it gets added to DeferredIntrs twice, once while processing each alloca's worklist. The second erase will lead to a crash.

define amdgpu_kernel void @test() #0 {
  %a1 = alloca [4 x i32], addrspace(5)
  %a2 = alloca [4 x i32], addrspace(5)
  call void @llvm.memcpy.p5.p5.i64(ptr addrspace(5) %a1, ptr addrspace(5) %a2, i64 16, i1 false)
  ret void
}
attributes #0 = { "amdgpu-flat-work-group-size"="64,64" }
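The SetVector behavior the eventual fix relies on can be sketched with a minimal insertion-ordered set (`MiniSetVector` is a hypothetical stand-in for `llvm::SetVector`, not its real interface): a second insertion of the same memcpy is a no-op, so the deferred instruction can only be queued, and later erased, once.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Minimal stand-in for llvm::SetVector: insertion-ordered, duplicates ignored.
template <typename T> class MiniSetVector {
  std::vector<T> Vec;        // preserves insertion order for iteration
  std::unordered_set<T> Set; // fast membership check for deduplication
public:
  bool insert(const T &V) {
    if (!Set.insert(V).second)
      return false; // already present: the second add is a no-op
    Vec.push_back(V);
    return true;
  }
  auto begin() const { return Vec.begin(); }
  auto end() const { return Vec.end(); }
  std::size_t size() const { return Vec.size(); }
};
```

With a plain SmallVector, the memcpy above would appear twice in the deferred list and be erased twice; with the set-vector semantics it appears exactly once.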

Contributor

@PrasoonMishra PrasoonMishra left a comment

See my comment for details; needs deduplication of DeferredIntrs.

@macurtis-amd
Contributor Author

In your fix, when a memcpy uses two different allocas (one as the destination, one as the source), it gets added to DeferredIntrs twice, once while processing each alloca's worklist. The second erase will lead to a crash.

Good catch. I've changed DeferredIntrs to a SetVector to prevent duplicates. Thanks.

Contributor

@PrasoonMishra PrasoonMishra left a comment

LGTM.

Contributor

@ronlieb ronlieb left a comment


thx for the fixes, looking forward to it landing soon

@macurtis-amd macurtis-amd merged commit e741cd8 into llvm:main Dec 18, 2025
9 of 10 checks passed
ronlieb pushed a commit to ROCm/llvm-project that referenced this pull request Dec 18, 2025
ronlieb pushed a commit to ROCm/llvm-project that referenced this pull request Dec 18, 2025
Contributor

@arsenm arsenm left a comment

Should this be using ValueHandle / WeakVH?
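For context on this suggestion: WeakVH is an LLVM handle that automatically becomes null when the Value it tracks is deleted, so stale worklist entries can be detected rather than dereferenced. The idea can be approximated in standard C++ with std::weak_ptr (`Instr` and `processLive` are illustrative; LLVM Values are not actually shared_ptr-managed):

```cpp
#include <memory>
#include <vector>

// Illustrative stand-in for an IR instruction.
struct Instr { int id; };

// The weak-handle idea behind WeakVH, approximated with std::weak_ptr:
// a worklist of weak handles can skip entries whose instruction was
// already deleted, instead of dereferencing a dangling pointer.
int processLive(const std::vector<std::weak_ptr<Instr>> &Worklist) {
  int Live = 0;
  for (const auto &H : Worklist)
    if (auto I = H.lock()) // null if the instruction was erased
      ++Live;
  return Live;
}
```

Under this scheme, deleting an instruction mid-pass simply makes its remaining worklist entries lock to null rather than dangle.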

%struct.barney = type { i8, double }

; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)
define amdgpu_kernel void @zot() local_unnamed_addr #0 {

Suggested change
define amdgpu_kernel void @zot() local_unnamed_addr #0 {
define amdgpu_kernel void @zot() #0 {


%struct.barney = type { i8, double }

; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)

Suggested change
; Function Attrs: nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none)

Comment on lines +67 to +70
declare ptr @hoge() local_unnamed_addr #1

attributes #0 = { nofree norecurse noreturn nounwind memory(readwrite, target_mem0: none, target_mem1: none) "amdgpu-agpr-alloc"="0" "amdgpu-no-cluster-id-x" "amdgpu-no-cluster-id-y" "amdgpu-no-cluster-id-z" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-flat-scratch-init" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "uniform-work-group-size"="false" }
attributes #1 = { "uniform-work-group-size"="false" }

Remove unnecessary attributes (which is probably all of them)

mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Dec 19, 2025
macurtis-amd added a commit that referenced this pull request Dec 19, 2025
Remove unnecessary attributes in test case as requested in post-merge
feedback (#172771).