Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU][AtomicOptimizer] Fix DT update for divergent values with Iterative strategy #87605

Merged
merged 2 commits into from
Apr 18, 2024

Conversation

Pierre-vh
Copy link
Contributor

We take the terminator from EntryBB and put it in ComputeEnd. Make sure we also move the DT edges, we previously only did it assuming a non-conditional branch.

Fixes SWDEV-453943

We take the terminator from EntryBB and put it in ComputeEnd. Make sure we also move the DT
edges, we previously only did it assuming a non-conditional branch.
@llvmbot
Copy link
Collaborator

llvmbot commented Apr 4, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

We take the terminator from EntryBB and put it in ComputeEnd. Make sure we also move the DT edges, we previously only did it assuming a non-conditional branch.

Fixes SWDEV-453943


Full diff: https://github.com/llvm/llvm-project/pull/87605.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp (+13-5)
  • (added) llvm/test/CodeGen/AMDGPU/atomic_optimization_split_dt_update.ll (+153)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
index dbb3de76b4ddae..8d3b2e82817052 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
@@ -743,7 +743,7 @@ void AMDGPUAtomicOptimizerImpl::optimizeAtomic(Instruction &I,
 
   Function *F = I.getFunction();
   LLVMContext &C = F->getContext();
-  
+
   // For atomic sub, perform scan with add operation and allow one lane to
   // subtract the reduced value later.
   AtomicRMWInst::BinOp ScanOp = Op;
@@ -876,7 +876,7 @@ void AMDGPUAtomicOptimizerImpl::optimizeAtomic(Instruction &I,
   BasicBlock *Predecessor = nullptr;
   if (ValDivergent && ScanImpl == ScanOptions::Iterative) {
     // Move terminator from I's block to ComputeEnd block.
-    Instruction *Terminator = EntryBB->getTerminator();
+    BranchInst *Terminator = cast<BranchInst>(EntryBB->getTerminator());
     B.SetInsertPoint(ComputeEnd);
     Terminator->removeFromParent();
     B.Insert(Terminator);
@@ -887,10 +887,18 @@ void AMDGPUAtomicOptimizerImpl::optimizeAtomic(Instruction &I,
     B.CreateBr(ComputeLoop);
 
     // Update the dominator tree for new control flow.
-    DTU.applyUpdates(
+    SmallVector<DominatorTree::UpdateType, 6> DomTreeUpdates(
         {{DominatorTree::Insert, EntryBB, ComputeLoop},
-         {DominatorTree::Insert, ComputeLoop, ComputeEnd},
-         {DominatorTree::Delete, EntryBB, SingleLaneTerminator->getParent()}});
+         {DominatorTree::Insert, ComputeLoop, ComputeEnd}});
+
+    // We're moving the terminator from EntryBB to ComputeEnd, make sure we move
+    // the DT edges as well.
+    for (auto *Succ : Terminator->successors()) {
+      DomTreeUpdates.push_back({DominatorTree::Insert, ComputeEnd, Succ});
+      DomTreeUpdates.push_back({DominatorTree::Delete, EntryBB, Succ});
+    }
+
+    DTU.applyUpdates(DomTreeUpdates);
 
     Predecessor = ComputeEnd;
   } else {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimization_split_dt_update.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimization_split_dt_update.ll
new file mode 100644
index 00000000000000..3c99c2909c2101
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimization_split_dt_update.ll
@@ -0,0 +1,153 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -O3 -mtriple=amdgcn-- -mcpu=gfx908 %s -o - -verify-dom-info | FileCheck %s
+
+; Check we're properly adding an edge from ComputeEnd to the "End" block added by
+; SplitBlockAndInsertIfThen
+;
+; If the edge is not added, domtree verification will fail.
+
+declare i32 @quux()
+
+define amdgpu_kernel void @ham(ptr addrspace(4) %arg) {
+; CHECK-LABEL: ham:
+; CHECK:       ; %bb.0: ; %bb
+; CHECK-NEXT:    s_mov_b32 s48, SCRATCH_RSRC_DWORD0
+; CHECK-NEXT:    s_mov_b32 s49, SCRATCH_RSRC_DWORD1
+; CHECK-NEXT:    s_mov_b32 s50, -1
+; CHECK-NEXT:    s_mov_b32 s51, 0xe00000
+; CHECK-NEXT:    s_add_u32 s48, s48, s9
+; CHECK-NEXT:    s_addc_u32 s49, s49, 0
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[2:3]
+; CHECK-NEXT:    s_mov_b32 s33, s8
+; CHECK-NEXT:    s_add_u32 s8, s36, 44
+; CHECK-NEXT:    s_addc_u32 s9, s37, 0
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[0:1]
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_add_u32 s0, s0, quux@gotpcrel32@lo+4
+; CHECK-NEXT:    s_addc_u32 s1, s1, quux@gotpcrel32@hi+12
+; CHECK-NEXT:    s_load_dwordx2 s[42:43], s[0:1], 0x0
+; CHECK-NEXT:    v_lshlrev_b32_e32 v2, 20, v2
+; CHECK-NEXT:    v_lshlrev_b32_e32 v1, 10, v1
+; CHECK-NEXT:    s_mov_b64 s[34:35], s[4:5]
+; CHECK-NEXT:    v_or3_b32 v40, v0, v1, v2
+; CHECK-NEXT:    s_mov_b64 s[0:1], s[48:49]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
+; CHECK-NEXT:    s_mov_b32 s12, s6
+; CHECK-NEXT:    s_mov_b32 s13, s7
+; CHECK-NEXT:    s_mov_b32 s14, s33
+; CHECK-NEXT:    v_mov_b32_e32 v31, v40
+; CHECK-NEXT:    s_mov_b64 s[2:3], s[50:51]
+; CHECK-NEXT:    s_mov_b32 s32, 0
+; CHECK-NEXT:    s_mov_b32 s40, s7
+; CHECK-NEXT:    s_mov_b32 s41, s6
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[42:43]
+; CHECK-NEXT:    v_mov_b32_e32 v41, v0
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v41
+; CHECK-NEXT:    s_and_saveexec_b64 s[44:45], vcc
+; CHECK-NEXT:    s_cbranch_execz .LBB0_2
+; CHECK-NEXT:  ; %bb.1: ; %bb1
+; CHECK-NEXT:    s_add_u32 s8, s36, 44
+; CHECK-NEXT:    s_mov_b64 s[0:1], s[48:49]
+; CHECK-NEXT:    s_addc_u32 s9, s37, 0
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
+; CHECK-NEXT:    s_mov_b32 s12, s41
+; CHECK-NEXT:    s_mov_b32 s13, s40
+; CHECK-NEXT:    s_mov_b32 s14, s33
+; CHECK-NEXT:    v_mov_b32_e32 v31, v40
+; CHECK-NEXT:    s_mov_b64 s[2:3], s[50:51]
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[42:43]
+; CHECK-NEXT:    v_mov_b32_e32 v41, v0
+; CHECK-NEXT:  .LBB0_2: ; %bb3
+; CHECK-NEXT:    s_or_b64 exec, exec, s[44:45]
+; CHECK-NEXT:    s_load_dwordx2 s[42:43], s[36:37], 0x24
+; CHECK-NEXT:    s_mov_b64 s[44:45], 0
+; CHECK-NEXT:    v_mov_b32_e32 v42, 0
+; CHECK-NEXT:    s_mov_b64 s[46:47], 0
+; CHECK-NEXT:  .LBB0_3: ; %bb4
+; CHECK-NEXT:    ; =>This Loop Header: Depth=1
+; CHECK-NEXT:    ; Child Loop BB0_5 Depth 2
+; CHECK-NEXT:    s_add_u32 s8, s36, 44
+; CHECK-NEXT:    s_addc_u32 s9, s37, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_add_u32 s0, s0, quux@gotpcrel32@lo+4
+; CHECK-NEXT:    s_addc_u32 s1, s1, quux@gotpcrel32@hi+12
+; CHECK-NEXT:    s_load_dwordx2 s[6:7], s[0:1], 0x0
+; CHECK-NEXT:    s_mov_b64 s[0:1], s[48:49]
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[38:39]
+; CHECK-NEXT:    s_mov_b64 s[10:11], s[34:35]
+; CHECK-NEXT:    s_mov_b32 s12, s41
+; CHECK-NEXT:    s_mov_b32 s13, s40
+; CHECK-NEXT:    s_mov_b32 s14, s33
+; CHECK-NEXT:    v_mov_b32_e32 v31, v40
+; CHECK-NEXT:    s_mov_b64 s[2:3], s[50:51]
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[6:7]
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; CHECK-NEXT:    s_or_b64 s[46:47], vcc, s[46:47]
+; CHECK-NEXT:    s_andn2_b64 exec, exec, s[46:47]
+; CHECK-NEXT:    s_cbranch_execnz .LBB0_3
+; CHECK-NEXT:  ; %bb.4: ; %bb7
+; CHECK-NEXT:    ; in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    s_or_b64 exec, exec, s[46:47]
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[42:43], 0x0
+; CHECK-NEXT:    s_mov_b64 s[2:3], exec
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:  .LBB0_5: ; %ComputeLoop
+; CHECK-NEXT:    ; Parent Loop BB0_3 Depth=1
+; CHECK-NEXT:    ; => This Inner Loop Header: Depth=2
+; CHECK-NEXT:    s_ff1_i32_b64 s5, s[2:3]
+; CHECK-NEXT:    v_readlane_b32 s8, v41, s5
+; CHECK-NEXT:    s_lshl_b64 s[6:7], 1, s5
+; CHECK-NEXT:    s_add_i32 s4, s4, s8
+; CHECK-NEXT:    s_andn2_b64 s[2:3], s[2:3], s[6:7]
+; CHECK-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB0_5
+; CHECK-NEXT:  ; %bb.6: ; %ComputeEnd
+; CHECK-NEXT:    ; in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    v_mbcnt_lo_u32_b32 v0, exec_lo, 0
+; CHECK-NEXT:    v_mbcnt_hi_u32_b32 v0, exec_hi, v0
+; CHECK-NEXT:    v_cmp_eq_u32_e32 vcc, 0, v0
+; CHECK-NEXT:    s_or_b64 s[44:45], vcc, s[44:45]
+; CHECK-NEXT:    s_mov_b64 s[46:47], 0
+; CHECK-NEXT:    s_andn2_b64 exec, exec, s[44:45]
+; CHECK-NEXT:    s_cbranch_execnz .LBB0_3
+; CHECK-NEXT:  ; %bb.7: ; in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    s_or_b64 exec, exec, s[44:45]
+; CHECK-NEXT:    v_mov_b32_e32 v0, s4
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    global_atomic_add v42, v0, s[0:1]
+; CHECK-NEXT:    s_mov_b64 s[44:45], 0
+; CHECK-NEXT:    s_branch .LBB0_3
+; CHECK-NEXT:  ; %bb.8: ; %DummyReturnBlock
+; CHECK-NEXT:    s_endpgm
+bb:
+  %call = tail call i32 @quux()
+  %icmp = icmp eq i32 %call, 0
+  br i1 %icmp, label %bb1, label %bb3
+
+bb1:                                              ; preds = %bb
+  %call2 = tail call i32 @quux()
+  br label %bb3
+
+bb3:                                              ; preds = %bb1, %bb
+  %phi = phi i32 [ %call2, %bb1 ], [ %call, %bb ]
+  br label %bb4
+
+bb4:                                              ; preds = %bb8, %bb3
+  %call5 = tail call i32 @quux()
+  %icmp6 = icmp eq i32 %call5, 0
+  br i1 %icmp6, label %bb8, label %bb7
+
+bb7:                                              ; preds = %bb4
+  %load = load ptr, ptr addrspace(4) %arg, align 8
+  %addrspacecast = addrspacecast ptr %load to ptr addrspace(1)
+  %atomicrmw = atomicrmw add ptr addrspace(1) %addrspacecast, i32 %phi syncscope("agent-one-as") monotonic, align 4
+  br label %bb8
+
+bb8:                                              ; preds = %bb7, %bb4
+  br label %bb4
+}


// We're moving the terminator from EntryBB to ComputeEnd, make sure we move
// the DT edges as well.
for (auto *Succ : Terminator->successors()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loops twice at most (for a conditional branch)
The loop may be a bit overkill but I think it just makes more sense, see the comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you use ComputeEnd->successors() here? Then you would not have to explain how Terminator is known to be a BranchInst.

@@ -876,7 +876,7 @@ void AMDGPUAtomicOptimizerImpl::optimizeAtomic(Instruction &I,
BasicBlock *Predecessor = nullptr;
if (ValDivergent && ScanImpl == ScanOptions::Iterative) {
// Move terminator from I's block to ComputeEnd block.
Instruction *Terminator = EntryBB->getTerminator();
BranchInst *Terminator = cast<BranchInst>(EntryBB->getTerminator());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know that EntryBB ends with a branch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Incidentally EntryBB is a bad name for this since it sounds like the entry block of the whole function.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment + changed var name

@Pierre-vh Pierre-vh requested a review from jayfoad April 8, 2024 06:14
@Pierre-vh
Copy link
Contributor Author

ping

; SplitBlockAndInsertIfThen
;
; If the edge is not added, domtree verification will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge the test with the existing ones?

@Pierre-vh Pierre-vh merged commit fcdb220 into llvm:main Apr 18, 2024
4 checks passed
@Pierre-vh Pierre-vh deleted the atomicopt-dt-fix branch April 18, 2024 07:22
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 2, 2024
…rative strategy (llvm#87605)

We take the terminator from EntryBB and put it in ComputeEnd. Make sure
we also move the DT edges, we previously only did it assuming a
non-conditional branch.

Fixes SWDEV-453943

Change-Id: I80aeef1da2fb2893b35959fdddb09f699c7997fa
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants