
AMDGPU: Fix temporal divergence introduced by machine-sink and performance regression introduced by D155343 #67456

Closed

Conversation

petar-avramovic
Collaborator

D155343 introduced performance regressions (SWDEV-414443) by disallowing machine instruction sinking in some cases. It originally fixed SWDEV-407790.

Revert D155343 to fix the performance regression SWDEV-414443, and introduce a temporal divergence lowering pass that also fixes SWDEV-407790.

Add a target hook to TargetInstrInfo that can be used to fix temporal divergence introduced by sinking an instruction outside a cycle with a divergent exit. The AMDGPU implementation uses machine uniformity analysis to detect temporally divergent uses.

@llvmbot
Collaborator

llvmbot commented Sep 26, 2023

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-adt

Changes

Patch is 66.88 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67456.diff

12 Files Affected:

  • (modified) llvm/include/llvm/ADT/GenericUniformityImpl.h (+43)
  • (modified) llvm/include/llvm/ADT/GenericUniformityInfo.h (+6)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+8)
  • (modified) llvm/lib/Analysis/UniformityAnalysis.cpp (+7-2)
  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+10-11)
  • (modified) llvm/lib/CodeGen/MachineUniformityAnalysis.cpp (+7-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+60)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+1094)
  • (modified) llvm/test/CodeGen/AMDGPU/sink-after-control-flow.mir (+1-1)
diff --git a/llvm/include/llvm/ADT/GenericUniformityImpl.h b/llvm/include/llvm/ADT/GenericUniformityImpl.h
index ddd0746ccd91632..755e1161b41b521 100644
--- a/llvm/include/llvm/ADT/GenericUniformityImpl.h
+++ b/llvm/include/llvm/ADT/GenericUniformityImpl.h
@@ -341,6 +341,9 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   using DivergenceDescriptorT =
       typename SyncDependenceAnalysisT::DivergenceDescriptor;
   using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap;
+  using UseOutsideCycleWithDivergentExitInfoT =
+      typename std::tuple<ConstValueRefT, const InstructionT *,
+                          SmallVector<BlockT *, 4>>;
 
   GenericUniformityAnalysisImpl(const DominatorTreeT &DT, const CycleInfoT &CI,
                                 const TargetTransformInfo *TTI)
@@ -396,6 +399,9 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
 
   void print(raw_ostream &out) const;
 
+  iterator_range<const UseOutsideCycleWithDivergentExitInfoT *>
+  uses_outside_cycles_with_divergent_exit() const;
+
 protected:
   /// \brief Value/block pair representing a single phi input.
   struct PhiInput {
@@ -427,6 +433,8 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
 
   // Recognized cycles with divergent exits.
   SmallPtrSet<const CycleT *, 16> DivergentExitCycles;
+  SmallVector<UseOutsideCycleWithDivergentExitInfoT, 4>
+      UsesOutsideCyclesWithDivergentExit;
 
   // Cycles assumed to be divergent.
   //
@@ -470,6 +478,10 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   /// \brief Whether \p Def is divergent when read in \p ObservingBlock.
   bool isTemporalDivergent(const BlockT &ObservingBlock,
                            const InstructionT &Def) const;
+
+  void recordUseOutsideCycleWithDivergentExit(ConstValueRefT Src,
+                                              const InstructionT *UserInstr,
+                                              const CycleT &DefCycle);
 };
 
 template <typename ImplT>
@@ -1210,6 +1222,20 @@ void GenericUniformityAnalysisImpl<ContextT>::print(raw_ostream &OS) const {
   }
 }
 
+template <typename ContextT>
+using UseOutsideCycleWithDivergentExitInfoT =
+    typename std::tuple<typename ContextT::ConstValueRefT,
+                        const typename ContextT::InstructionT *,
+                        SmallVector<typename ContextT::BlockT *, 4>>;
+
+template <typename ContextT>
+iterator_range<const UseOutsideCycleWithDivergentExitInfoT<ContextT> *>
+GenericUniformityAnalysisImpl<
+    ContextT>::uses_outside_cycles_with_divergent_exit() const {
+  return make_range(UsesOutsideCyclesWithDivergentExit.begin(),
+                    UsesOutsideCyclesWithDivergentExit.end());
+}
+
 template <typename ContextT>
 bool GenericUniformityInfo<ContextT>::hasDivergence() const {
   return DA->hasDivergence();
@@ -1248,6 +1274,13 @@ void GenericUniformityInfo<ContextT>::print(raw_ostream &out) const {
   DA->print(out);
 }
 
+template <typename ContextT>
+iterator_range<const UseOutsideCycleWithDivergentExitInfoT<ContextT> *>
+GenericUniformityInfo<ContextT>::uses_outside_cycles_with_divergent_exit()
+    const {
+  return DA->uses_outside_cycles_with_divergent_exit();
+}
+
 template <typename ContextT>
 void llvm::ModifiedPostOrder<ContextT>::computeStackPO(
     SmallVectorImpl<const BlockT *> &Stack, const CycleInfoT &CI,
@@ -1367,6 +1400,16 @@ void llvm::ModifiedPostOrder<ContextT>::compute(const CycleInfoT &CI) {
   computeStackPO(Stack, CI, nullptr, Finalized);
 }
 
+template <typename ContextT>
+void GenericUniformityAnalysisImpl<ContextT>::
+    recordUseOutsideCycleWithDivergentExit(ConstValueRefT Src,
+                                           const InstructionT *UserInstr,
+                                           const CycleT &DefCycle) {
+  SmallVector<BlockT *, 4> TmpExitBlocks;
+  DefCycle.getExitBlocks(TmpExitBlocks);
+  UsesOutsideCyclesWithDivergentExit.push_back({Src, UserInstr, TmpExitBlocks});
+}
+
 } // namespace llvm
 
 #undef DEBUG_TYPE
diff --git a/llvm/include/llvm/ADT/GenericUniformityInfo.h b/llvm/include/llvm/ADT/GenericUniformityInfo.h
index e53afccc020b469..a76813d4bb964a1 100644
--- a/llvm/include/llvm/ADT/GenericUniformityInfo.h
+++ b/llvm/include/llvm/ADT/GenericUniformityInfo.h
@@ -39,6 +39,9 @@ template <typename ContextT> class GenericUniformityInfo {
 
   using CycleInfoT = GenericCycleInfo<ContextT>;
   using CycleT = typename CycleInfoT::CycleT;
+  using UseOutsideCycleWithDivergentExitInfoT =
+      typename std::tuple<ConstValueRefT, const InstructionT *,
+                          SmallVector<BlockT *, 4>>;
 
   GenericUniformityInfo(const DominatorTreeT &DT, const CycleInfoT &CI,
                         const TargetTransformInfo *TTI = nullptr);
@@ -78,6 +81,9 @@ template <typename ContextT> class GenericUniformityInfo {
 
   void print(raw_ostream &Out) const;
 
+  iterator_range<const UseOutsideCycleWithDivergentExitInfoT *>
+  uses_outside_cycles_with_divergent_exit() const;
+
 private:
   using ImplT = GenericUniformityAnalysisImpl<ContextT>;
 
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 98679b4dcf3cbfb..2135484448ef4ad 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -19,6 +19,8 @@
 #include "llvm/ADT/Uniformity.h"
 #include "llvm/CodeGen/MIRFormatter.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineCycleAnalysis.h"
+#include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
@@ -150,6 +152,12 @@ class TargetInstrInfo : public MCInstrInfo {
     return false;
   }
 
+  virtual void fixTemporalDivergence(MachineFunction &MF,
+                                     MachineDominatorTree *DT,
+                                     MachineCycleInfo *CI) const {
+    return;
+  }
+
 protected:
   /// For instructions with opcodes for which the M_REMATERIALIZABLE flag is
   /// set, this hook lets the target specify whether the instruction is actually
diff --git a/llvm/lib/Analysis/UniformityAnalysis.cpp b/llvm/lib/Analysis/UniformityAnalysis.cpp
index 2d617db431c5888..df1299610469d30 100644
--- a/llvm/lib/Analysis/UniformityAnalysis.cpp
+++ b/llvm/lib/Analysis/UniformityAnalysis.cpp
@@ -80,12 +80,17 @@ template <>
 void llvm::GenericUniformityAnalysisImpl<
     SSAContext>::propagateTemporalDivergence(const Instruction &I,
                                              const Cycle &DefCycle) {
-  if (isDivergent(I))
-    return;
   for (auto *User : I.users()) {
     auto *UserInstr = cast<Instruction>(User);
     if (DefCycle.contains(UserInstr->getParent()))
       continue;
+
+    recordUseOutsideCycleWithDivergentExit(cast<Value>(&I), UserInstr,
+                                           DefCycle);
+
+    if (isDivergent(I))
+      continue;
+
     markDivergent(*UserInstr);
   }
 }
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 480ac23d43ad879..ef35ec57a56971a 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -300,7 +300,8 @@ static bool blockPrologueInterferes(const MachineBasicBlock *BB,
       if (!Reg)
         continue;
       if (MO.isUse()) {
-        if (Reg.isPhysical() && MRI && MRI->isConstantPhysReg(Reg))
+        if (Reg.isPhysical() &&
+            (TII->isIgnorableUse(MO) || (MRI && MRI->isConstantPhysReg(Reg))))
           continue;
         if (PI->modifiesRegister(Reg, TRI))
           return true;
@@ -776,6 +777,12 @@ bool MachineSinking::runOnMachineFunction(MachineFunction &MF) {
     MRI->clearKillFlags(I);
   RegsToClearKillFlags.clear();
 
+  // In some tests there is a new basic block; need to recompute
+  DT->calculate(MF);
+  CI->clear();
+  CI->compute(MF);
+  TII->fixTemporalDivergence(MF, DT, CI);
+
   return EverMadeChange;
 }
 
@@ -1245,24 +1252,16 @@ MachineSinking::FindSuccToSinkTo(MachineInstr &MI, MachineBasicBlock *MBB,
   if (MBB == SuccToSinkTo)
     return nullptr;
 
-  if (!SuccToSinkTo)
-    return nullptr;
-
   // It's not safe to sink instructions to EH landing pad. Control flow into
   // landing pad is implicitly defined.
-  if (SuccToSinkTo->isEHPad())
+  if (SuccToSinkTo && SuccToSinkTo->isEHPad())
     return nullptr;
 
   // It ought to be okay to sink instructions into an INLINEASM_BR target, but
   // only if we make sure that MI occurs _before_ an INLINEASM_BR instruction in
   // the source block (which this code does not yet do). So for now, forbid
   // doing so.
-  if (SuccToSinkTo->isInlineAsmBrIndirectTarget())
-    return nullptr;
-
-  MachineBasicBlock::const_iterator InsertPos =
-      SuccToSinkTo->SkipPHIsAndLabels(SuccToSinkTo->begin());
-  if (blockPrologueInterferes(SuccToSinkTo, InsertPos, MI, TRI, TII, MRI))
+  if (SuccToSinkTo && SuccToSinkTo->isInlineAsmBrIndirectTarget())
     return nullptr;
 
   return SuccToSinkTo;
diff --git a/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp b/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp
index 3e0fe2b1ba087fe..5bd87d069285857 100644
--- a/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp
+++ b/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp
@@ -117,11 +117,16 @@ void llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::
     if (!Op.getReg().isVirtual())
       continue;
     auto Reg = Op.getReg();
-    if (isDivergent(Reg))
-      continue;
+
     for (MachineInstr &UserInstr : RegInfo.use_instructions(Reg)) {
       if (DefCycle.contains(UserInstr.getParent()))
         continue;
+
+      recordUseOutsideCycleWithDivergentExit(Reg, &UserInstr, DefCycle);
+
+      if (isDivergent(Reg))
+        continue;
+
       markDivergent(UserInstr);
     }
   }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2799a3e78b04d22..5db140bb0fc6c94 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
 #include "llvm/CodeGen/RegisterScavenging.h"
 #include "llvm/CodeGen/ScheduleDAG.h"
 #include "llvm/IR/DiagnosticInfo.h"
@@ -171,6 +172,65 @@ bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
          isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent());
 }
 
+// Get pointers to build instruction after MI (skips PHIs if needed)
+static std::pair<MachineBasicBlock *, MachineBasicBlock::iterator>
+getInsertAfterPtrs(MachineInstr *MI) {
+  MachineBasicBlock *InsertMBB = MI->getParent();
+  return std::make_pair(
+      InsertMBB, InsertMBB->SkipPHIsAndLabels(std::next(MI->getIterator())));
+}
+
+static void replaceUseRegisterWith(const MachineInstr *MI, Register Reg,
+                                   Register Newreg) {
+  for (unsigned i = 0; i < MI->getNumOperands(); ++i) {
+    const MachineOperand &Op = MI->getOperand(i);
+    if (Op.isReg() && Op.getReg() == Reg) {
+      const_cast<MachineInstr *>(MI)->getOperand(i).setReg(Newreg);
+    }
+  }
+}
+
+void SIInstrInfo::fixTemporalDivergence(MachineFunction &MF,
+                                        MachineDominatorTree *DT,
+                                        MachineCycleInfo *CI) const {
+  MachineUniformityInfo MUI =
+      computeMachineUniformityInfo(MF, *CI, DT->getBase(), true);
+
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const GCNSubtarget &Subtarget = MF.getSubtarget<GCNSubtarget>();
+  const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
+  const SIRegisterInfo &TRI = *Subtarget.getRegisterInfo();
+
+  // Temporal divergence lowering is required for uniform UniformSourceReg and
+  // divergent UserInstr. UserInstr is uniform only when cycle has uniform exit.
+  for (auto [SrcReg, UserInstr, CycleExitBlocks] :
+       MUI.uses_outside_cycles_with_divergent_exit()) {
+    if (!MUI.isUniform(SrcReg) || !MUI.isDivergent(UserInstr))
+      continue;
+
+    MachineInstr *UniformSourceInstr = MRI.getVRegDef(SrcReg);
+
+    // FixMe: SrcReg is lane mask in this case. Find a better way to detect it.
+    if (UniformSourceInstr->getOpcode() == AMDGPU::SI_IF_BREAK ||
+        UserInstr->getOpcode() == AMDGPU::SI_IF)
+      continue;
+
+    unsigned Size = TRI.getRegSizeInBits(*MRI.getRegClassOrNull(SrcReg));
+    Register VgprDst =
+        MRI.createVirtualRegister(TRI.getVGPRClassForBitWidth(Size));
+
+    auto [MBB, AfterUniformSourceReg] = getInsertAfterPtrs(UniformSourceInstr);
+    BuildMI(*MBB, AfterUniformSourceReg, {}, TII.get(AMDGPU::COPY))
+        .addDef(VgprDst)
+        .addReg(SrcReg)
+        .addReg(AMDGPU::EXEC, RegState::Implicit);
+
+    replaceUseRegisterWith(UserInstr, SrcReg, VgprDst);
+  }
+
+  return;
+}
+
 bool SIInstrInfo::areLoadsFromSameBasePtr(SDNode *Load0, SDNode *Load1,
                                           int64_t &Offset0,
                                           int64_t &Offset1) const {
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e85917a4c0f3296..4446a962d7cca44 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -222,6 +222,9 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
 
   bool isIgnorableUse(const MachineOperand &MO) const override;
 
+  void fixTemporalDivergence(MachineFunction &MF, MachineDominatorTree *DT,
+                             MachineCycleInfo *CI) const override;
+
   bool areLoadsFromSameBasePtr(SDNode *Load0, SDNode *Load1, int64_t &Offset0,
                                int64_t &Offset1) const override;
 
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
index e2456b74f7ef1fa..b8e74bc7db09a1a 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
@@ -21,6 +21,7 @@ define void @machinesink_loop_variable_out_of_divergent_loop(i32 %arg, i1 %cmp49
 ; CHECK-NEXT:  .LBB0_1: ; %Flow
 ; CHECK-NEXT:    ; in Loop: Header=BB0_3 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s8
+; CHECK-NEXT:    v_add_nc_u32_e32 v4, -4, v4
 ; CHECK-NEXT:  .LBB0_2: ; %Flow1
 ; CHECK-NEXT:    ; in Loop: Header=BB0_3 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
@@ -53,7 +54,6 @@ define void @machinesink_loop_variable_out_of_divergent_loop(i32 %arg, i1 %cmp49
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    v_add_nc_u32_e32 v4, s9, v2
 ; CHECK-NEXT:    v_cmp_ge_u32_e64 s4, v4, v0
-; CHECK-NEXT:    v_add_nc_u32_e32 v4, -4, v4
 ; CHECK-NEXT:    s_or_b32 s8, s4, s8
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s8
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_1
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
index cc14b4a80d58a7d..037a285794120da 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
@@ -42,7 +42,6 @@ body:             |
   ; CHECK-NEXT:   successors: %bb.2(0x40000000), %bb.5(0x40000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   INLINEASM &"", 1 /* sideeffect attdialect */
-  ; CHECK-NEXT:   [[V_ADD_U32_e64_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[COPY]], [[COPY1]], 0, implicit $exec
   ; CHECK-NEXT:   [[SI_IF_BREAK:%[0-9]+]]:sreg_32 = SI_IF_BREAK killed [[SI_IF1]], [[SI_IF]], implicit-def dead $scc
   ; CHECK-NEXT:   SI_LOOP [[SI_IF_BREAK]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
   ; CHECK-NEXT:   S_BRANCH %bb.5
@@ -52,6 +51,7 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[COPY]], %bb.4
   ; CHECK-NEXT:   SI_END_CF [[SI_IF_BREAK]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; CHECK-NEXT:   [[V_ADD_U32_e64_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[COPY]], [[COPY1]], 0, implicit $exec
   ; CHECK-NEXT:   INLINEASM &"", 1 /* sideeffect attdialect */, implicit [[V_ADD_U32_e64_]]
   ; CHECK-NEXT:   S_BRANCH %bb.2
   ; CHECK-NEXT: {{  $}}
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
new file mode 100644
index 000000000000000..0c631bdcdd374a8
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
@@ -0,0 +1,1094 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1031 < %s | FileCheck %s
+
+; ModuleID = 'kernel_round1_passing.bc'
+source_filename = "/tmp/comgr-295d04/input/CompileSource"
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8"
+target triple = "amdgcn-amd-amdhsa"
+
+@kernel_round1.first_words_data = external hidden unnamed_addr addrspace(3) global [896 x i8], align 1
+@kernel_round1.collisionsData = external hidden unnamed_addr addrspace(3) global [3840 x i32], align 4
+@kernel_round1.collisionsNum = external hidden addrspace(3) global i32, align 4
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i64 @_Z13get_global_idj(i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent nounwind
+declare hidden i32 @_Z10atomic_addPU3AS1Vjj(ptr addrspace(1) noundef, i32 noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent nounwind
+declare hidden i32 @_Z10atomic_subPU3AS1Vjj(ptr addrspace(1) noundef, i32 noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i64 @_Z12get_local_idj(i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent nounwind
+declare hidden void @_Z7barrierj(i32 noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i32 @_Z3minjj(i32 noundef, i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent nounwind
+declare hidden i32 @_Z10atomic_incPU3AS3Vj(ptr addrspace(3) noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i64 @_Z14get_local_sizej(i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent norecurse nounwind
+define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture noundef readonly align 1 %0, ptr addrspace(1) nocapture noundef writeonly align 1 %1, ptr addrspace(1) nocapture noundef readonly align 4 %2, ptr addrspace(1) noundef align 4 %3, ptr addrspace(1) nocapture noundef readnone align 4 %4) local_unnamed_addr #2 !kernel_arg_addr_space !5 !kernel_arg_access_qual !6 !kernel_arg_type !7 !kernel_arg_base_type !7 !kernel_arg_type_qual !8 !kernel_arg_name !9 !reqd_work_group_size !10 {
+; CHECK-LABEL: kernel_round1:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_add_u32 s10, s10, s15
+; CHECK-NEXT:    s_mov_b32 s32, 0
+; CHECK-NEXT:    s_addc_u32 s11, s11, 0
+; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s10
+; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s11
+; CHECK-NEXT:    s_load_dwordx8 s[44:51], s[6:7], 0x0
+; CHECK-NEXT:    s_add_u32 s0, s0, s15
+; CHECK-NEXT:    s_mov_b64 s[34:35], s[6:7]
+; CHECK-NEXT:    s_addc_u32 s1, s1, 0
+; CHECK-NEXT:    v_mov_b32_e32 v41, v0
+; CHECK-NEXT:    s_add_u32 s42, s34, 40
+; CHECK-NEXT:    v_mov_b32_e32 v31, v0
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
+; CHECK-NEXT:    s_addc_u32 s43, s35, 0
+; CHECK-NEXT:    s_mov_b64 s[10:11], s[36:37]
+; CHECK-NEXT:    s_mov_b64 s[8:9], s[42:43]
+; CHECK-NEXT:    s_mov_b32 s33, s14
+; CHECK-NEXT:    s_mov_b32 s40, s13
+; CHECK-NEXT:    s_mov_b32 s41, s12
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[4:5]
+; CHECK-NEXT:    s_getpc_b64 s[6:7]
+; CHECK-NEXT:    s_add_u32 s6, ...
[truncated]

Comment on lines 863 to 866
; CHECK-NEXT: [[S_ADD_I32_2:%[0-9]+]]:sreg_32 = S_ADD_I32 %108, 1, implicit-def dead $scc
; CHECK-NEXT: [[S_ADD_I32_3:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_ADD_I32_2]], 2, implicit-def dead $scc
; CHECK-NEXT: [[V_ADD_U32_e64_3:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[V_OR3_B32_e64_]], [[S_ADD_I32_3]], 0, implicit $exec
; CHECK-NEXT: [[V_ADD_U32_e64_3:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 %150, 1, 0, implicit $exec
; CHECK-NEXT: [[V_ADD_U32_e64_4:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[V_ADD_U32_e64_3]], 2, 0, implicit $exec
; CHECK-NEXT: [[V_ADD_U32_e64_5:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[V_OR3_B32_e64_]], [[V_ADD_U32_e64_4]], 0, implicit $exec
Collaborator Author

moveToVALU

@ssahasra
Collaborator

What is the relation between #67456 (this PR) and #67033? Does this PR supersede the other one?

@petar-avramovic
Collaborator Author

What is the relation between #67456 (this PR) and #67033? Does this PR supersede the other one?

Yes, this PR is an updated version of #67033, revised per all the comments; the main difference is that this PR uses a target hook in machine-sink to fix temporal divergence, while #67033 uses a separate machine function pass.

Collaborator

@nhaehnle left a comment

Thanks for the rework. I still think this needs to be improved to be done right.

The test case still feels way too large. It will suffer from random unrelated changes. It ought to be possible to write a more focused MIR test that really captures the essence of what's happening: A uniform value inside of a cycle is defined in an SGPR vreg, which is then used by a VALU instruction that would ordinarily be a candidate for sinking.

The solution itself also doesn't feel like it's From The Book. Consider:

  • Recomputing DomTree, Cycle Info, and Machine Uniformity during the fixup feels like a lot of effort
  • Machine Uniformity on wave-level MIR seems like a fundamentally dubious proposition in the first place
  • The resulting code isn't great. In the new test case, it would almost certainly be better to not sink the v_add and leave it inside the loop.

Let's have some ambition to do better than this.

The old logic that prevented sinking of instructions was a convoluted way of preventing the VALU instruction from being moved past an EXEC change. But that didn't really capture the essence of what the bug was.

What I think is closer to the essence of the bug is that the VALU instruction reads from an SGPR, and sinking it changes the value that is being read.

Take a look at FindSuccToSinkTo. There's a piece of logic there that will refuse to sink an instruction that uses a physical register, unless that physical register is constant. That feels very similar to the kind of problem from the SWDEV-407790 test. Unfortunately, it's difficult to fit an analogous test for the SGPR condition into the same loop over operands, because we need to know the sink target in order to perform the check.

Idea: Let's have a TII->isSafeToSinkTo callback which, in AMDGPU, checks if there are SGPR uses. If so, and the sink target is outside of a cycle that contains the SGPR def, then deny the sinking.
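
A rough sketch of that idea in C++ (hook name, signature, and helpers here are assumptions based on this discussion, not the final patch):

// Sketch only: deny sinking a VALU instruction out of a cycle that
// contains the def of one of its SGPR (uniform) sources.
bool SIInstrInfo::isSafeToSink(MachineInstr &MI,
                               MachineBasicBlock *SuccToSinkTo,
                               MachineCycleInfo *CI) const {
  const MachineRegisterInfo &MRI = MI.getMF()->getRegInfo();
  for (const MachineOperand &MO : MI.operands()) {
    if (!MO.isReg() || !MO.isUse() || !MO.getReg().isVirtual())
      continue;
    if (!RI.isSGPRReg(MRI, MO.getReg()))
      continue; // only SGPR sources matter here
    MachineInstr *SgprDef = MRI.getVRegDef(MO.getReg());
    MachineCycle *FromCycle = CI->getCycle(SgprDef->getParent());
    // SGPR def is inside a cycle and the sink target is outside it: deny.
    if (FromCycle && !FromCycle->contains(SuccToSinkTo))
      return false;
  }
  return true;
}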

This is a good first start:

  • It will still allow the sinking where we want it to be allowed:
    • Sinking in sink-after-control-flow.mir is allowed because there are no cycles involved
    • Sinking in machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir and .ll are allowed because the relevant SGPR def is outside of the cycle
  • By contrast, sinking in machine-sink-temporal-divergence-swdev407790.ll is prevented
    • This is good for correctness
    • It is better for performance than adding the extra v_mov

It is not perfect, because there are potentially cases where sinking would have been beneficial:

  • If adding an extra v_mov would allow multiple instructions to be sunk.
  • If the sink candidate is more expensive than a v_mov (e.g., transcendental)

Now in general, we should not be relying on machine sinking that much anyway. For the most part, LLVM IR passes should clean that up. Machine sinking should only help cleanup "debris" from instruction selection. It seems unlikely to me that such debris would show up for the two cases above, so while the initial idea could perhaps be refined, I'd lean towards just not worrying about it for now.

A final thought: All of blockPrologueInterferes becomes wasteful as far as I can tell. Block prologue instructions are instructions that define EXEC. If we skip them because EXEC is an ignorable use, then the check isn't doing us any good. Let's just remove it.

With that in mind, can you please:

  1. Change the main commit in this PR so that it adds the TII->isSafeToSinkTo callback and checks it at the end of FindSuccToSinkTo.
  2. Add a final commit which removes all traces of blockPrologueInterferes. (Make it a separate commit so that it's easier to revert if it turns out I was wrong.)

@petar-avramovic
Collaborator Author

Updated with a version that does not allow the sink. I tried to implement lightweight temporal divergence detection, in case machine uniformity analysis looks too expensive. TODO: remove blockPrologueInterferes.

Collaborator

@nhaehnle left a comment

You didn't yet update the tests, did you?

Anyway, this already looks much better, thanks! Still some issues with the logic and compile time, please see inline.

// SgprDef defined inside cycle
MachineCycle *FromCycle = CI->getCycle(SgprDef->getParent());
if (FromCycle == nullptr)
  return true;
Collaborator

Should be continue (also below) since this needs to look at all operands.

                               MachineBasicBlock *SuccToSinkTo,
                               MachineCycleInfo *CI) const {
  CI->clear();
  CI->compute(*MI.getMF());
Collaborator

Don't recompute this each time; it leads to at least quadratic compile time. It should be the responsibility of the caller to ensure that the CI is up to date.


// Allow sinking if MI edits lane mask (divergent i1 in sgpr).
if (MI.getOpcode() == AMDGPU::SI_IF_BREAK)
  return true;
Collaborator

The opcode check is independent of the operands, so it should be done once up-front.

return true;

// SuccToSinkTo is not in the cycle.
if (FromCycle != CI->getCycle(SuccToSinkTo)) {
Collaborator

This is a good start, but it doesn't yet quite capture the essence of what is happening. Consider the possible ways in which cycles may be nested. We can think of it as sinking MI successfully out of ever more cycles containing the SGPR def. It may be helpful to draw some diagrams on a piece of paper.

Suggestion (a code sketch follows the list):

  • Get the cycle of SuccToSinkTo --> ToCycle.
  • While FromCycle does not contain ToCycle:
    • If FromCycle has a divergent exit, return false: we've shown that sinking is unsafe
    • Otherwise, it is safe to sink out of FromCycle, but we need to continue checking: replace FromCycle by the parent cycle and repeat
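
A sketch of that walk, continuing the isSafeToSink sketch above (hasDivergentExit stands in for whatever divergent-exit check is used; it is an assumed helper, not existing API):

const MachineCycle *ToCycle = CI->getCycle(SuccToSinkTo);
const MachineCycle *FromCycle = CI->getCycle(SgprDef->getParent());
// Walk outward from the cycle containing the SGPR def until reaching a
// cycle that also contains the sink target (or running out of cycles).
while (FromCycle && !FromCycle->contains(ToCycle)) {
  // Sinking across a divergent exit changes the value that is read.
  if (hasDivergentExit(FromCycle))
    return false;
  // Safe to sink out of FromCycle itself; keep checking enclosing cycles.
  FromCycle = FromCycle->getParentCycle();
}
return true;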

Collaborator Author

Added a small targeted MIR test.

@petar-avramovic
Collaborator Author

blockPrologueInterferes in tryToSinkCopy seems to be required.
For example, see _amdgpu_ps_main2 from llvm/test/CodeGen/AMDGPU/sink-after-control-flow-postra.mir:
blockPrologueInterferes stops the sink of
renamable $sgpr0_sgpr1 = COPY $sgpr6_sgpr7
past its use $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec

Collaborator

@nhaehnle left a comment

Thanks, this looks pretty close to landing IMO.

blockPrologueInterferes in tryToSinkCopy seems to be required,
For example see _amdgpu_ps_main2 from llvm/test/CodeGen/AMDGPU/sink-after-control-flow-postra.mir
blockPrologueInterferes stops sink of
renamable $sgpr0_sgpr1 = COPY $sgpr6_sgpr7
past its use $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

@@ -734,6 +735,8 @@ bool MachineSinking::runOnMachineFunction(MachineFunction &MF) {
 
       MadeChange = true;
       ++NumSplit;
+      CI->clear();
+      CI->compute(MF);
Collaborator

Thanks. Please take a note to improve compile-time further as a separate follow-up change. Basically, it should be possible to update cycle-info when splitting a critical edge without recomputing the whole thing (adding an onEdgeSplit method like for MBFI).

I haven't thought about it too deeply, but I think what it boils down to is that the new basic block is a member of the smallest cycle containing both the source and the destination of the critical edge.
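
A rough sketch of that incremental update (onEdgeSplit, getSmallestCommonCycle, and addBlock are assumed names for illustration, not existing LLVM API):

// Called when the critical edge Src -> Dst is split by inserting NewBB.
void MachineCycleInfo::onEdgeSplit(MachineBasicBlock &Src,
                                   MachineBasicBlock &Dst,
                                   MachineBasicBlock &NewBB) {
  // NewBB belongs to the smallest cycle containing both edge endpoints,
  // and transitively to every cycle enclosing that one.
  MachineCycle *C = getSmallestCommonCycle(getCycle(&Src), getCycle(&Dst));
  for (; C; C = C->getParentCycle())
    C->addBlock(NewBB); // assumed mutator on the cycle's block set
}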

-  MachineBasicBlock::const_iterator InsertPos =
-      SuccToSinkTo->SkipPHIsAndLabels(SuccToSinkTo->begin());
-  if (blockPrologueInterferes(SuccToSinkTo, InsertPos, MI, TRI, TII, MRI))
+  if (!TII->isSafeToSink(MI, SuccToSinkTo, CI))
Collaborator

So isSafeToSink can be called with SuccToSinkTo == nullptr? That seems like a bad idea.

MachineCycle *ToCycle = CI->getCycle(SuccToSinkTo);
// Check if there is a FromCycle that contains SgprDef's basic block but
// does not contain SuccToSinkTo and also has divergent exit condition.
while (FromCycle != ToCycle) {
Collaborator

This needs to be FromCycle && !FromCycle->contains(ToCycle).

To see why, consider the case where there are two cycles right after each other:

  |
  v/-\
  A  |
  |\-/
  |
  |
  v/-\
  B  |
  |\-/
  |

It could be the case that SgprDef is in A and SuccToSinkTo is in B.

Introduced by 5b657f5, which moved LICM after AMDGPUCodeGenPrepare. Some instructions are no longer
sunk during IR optimizations but in machine-sinking instead.
If a VGPR instruction that uses an SGPR defined inside the cycle is sunk outside
of the cycle, we end up with an unhandled case of temporal divergence.

Add a test for the theoretical case when a SALU instruction (representing a
uniform value) is sunk outside of the cycle.
Add a test where a SALU instruction can be sunk because it edits a lane mask.

Temporal divergence that was present in the input or introduced by IR
transforms, like code-sinking or LICM, is handled in SIFixSGPRCopies
by changing the SGPR source instruction to a VGPR instruction.
After 5b657f5, which moved LICM after AMDGPUCodeGenPrepare,
machine-sinking can introduce temporal divergence by sinking
instructions outside of the cycle.
Add an isSafeToSink callback in TargetInstrInfo.
@petar-avramovic
Collaborator Author

blockPrologueInterferes in tryToSinkCopy seems to be required,
For example see _amdgpu_ps_main2 from llvm/test/CodeGen/AMDGPU/sink-after-control-flow-postra.mir
blockPrologueInterferes stops sink of
renamable $sgpr0_sgpr1 = COPY $sgpr6_sgpr7
past its use $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

See https://reviews.llvm.org/D121277. Target prologue instructions are not checked for "a def can't be sunk past its use" but are skipped as part of SkipPHIsAndLabels. They have to be checked somewhere. Maybe we could remove some checks from blockPrologueInterferes?

@arsenm
Contributor

arsenm commented Oct 5, 2023

I think we need to stop adding these extremely specific hooks that just identify problematic situations and hack around them in individual transforms. I think we need to move towards fully modeling the uniform and divergent CFGs in MachineBasicBlock, and use dedicated restricted pseudo-copies where there's a potential temporal divergence issue

@nhaehnle
Collaborator

nhaehnle commented Oct 6, 2023

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

https://reviews.llvm.org/D121277. Target-prologue instructions are not checked for "def can't be sunk past its use" but skipped as part of SkipPHIsAndLabels. They have to be checked somewhere. Maybe we could remove some checks from blockPrologueInterferes?

It looks like this new change totally supersedes what D121277 was trying to do. Perhaps it can be reverted in its entirety?

I think we need to stop adding these extremely specific hooks that just identify problematic situations and hack around them in individual transforms. I think we need to move towards fully modeling the uniform and divergent CFGs in MachineBasicBlock, and use dedicated restricted pseudo-copies where there's a potential temporal divergence issue

I agree. Do we actually have a proposal for how this should look? I've seen some ideas and had some myself, but haven't seen anything fully coherent yet.

@petar-avramovic
Collaborator Author

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

https://reviews.llvm.org/D121277. Target-prologue instructions are not checked for "def can't be sunk past its use" but skipped as part of SkipPHIsAndLabels. They have to be checked somewhere. Maybe we could remove some checks from blockPrologueInterferes?

It looks like this new change totally supersedes what D121277 was trying to do. Perhaps it can be reverted in its entirety?

This change is independent of D121277. It supersedes and reverts D155343, which also used blockPrologueInterferes.

I think we need to move towards fully modeling the uniform and divergent CFGs in MachineBasicBlock,

What do we need for this? We already have pseudo branches for divergent control flow. The only other thing I could think of would be a dedicated SGPR register class used only for lane masks.

and use dedicated restricted pseudo-copies where there's a potential temporal divergence issue

Where/how would we insert them? I am convinced that we can detect temporally divergent uses at any point in the pipeline.

Collaborator

@nhaehnle left a comment

Okay, I see now that we do need an explicit check for the block prologue for post-RA sinking. So I think this change should be good to go.

petar-avramovic added a commit that referenced this pull request Oct 6, 2023
Add isSafeToSink callback in TargetInstrInfo.
@petar-avramovic
Collaborator Author

Thanks for the review; closing this PR for now. I pushed the commits separately to keep them distinct.
Please let me know if you have any suggestions for how to handle temporal divergence issues; I can work on a patch for that.

@arsenm
Contributor

arsenm commented Oct 6, 2023

I agree. Do we actually have a proposal for how this should look? I've seen some ideas and had some myself, but haven't seen anything fully coherent yet.

No, not really. I was most recently thinking of splitting isConvergent into some notion of read-convergent and write-convergent, and then adding a corresponding flag to basic blocks that acts sort of like IsEHPad. You could then compose read/write convergence to approximately represent vector vs. scalar vs. WWM ops with different convergence types.

@petar-avramovic mentioned this pull request Dec 1, 2023
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Dec 5, 2023
Add isSafeToSink callback in TargetInstrInfo.