
AMDGPU: Fix temporal divergence introduced by machine-sink and performance regression introduced by D155343 #67456

Closed

Conversation

petar-avramovic
Collaborator

D155343 introduced performance regressions (SWDEV-414443) by disallowing machine instruction sinking in some cases. It originally fixed SWDEV-407790.

Revert D155343 to fix the performance regression SWDEV-414443, and introduce a temporal divergence lowering pass that also fixes SWDEV-407790.

Add a target hook to TargetInstrInfo that can be used to fix temporal divergence introduced by sinking an instruction outside a cycle with a divergent exit. The AMDGPU implementation uses machine uniformity analysis to detect temporally divergent uses.

@llvmbot
Collaborator

llvmbot commented Sep 26, 2023

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-adt

Changes

Patch is 66.88 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67456.diff

12 Files Affected:

  • (modified) llvm/include/llvm/ADT/GenericUniformityImpl.h (+43)
  • (modified) llvm/include/llvm/ADT/GenericUniformityInfo.h (+6)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+8)
  • (modified) llvm/lib/Analysis/UniformityAnalysis.cpp (+7-2)
  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+10-11)
  • (modified) llvm/lib/CodeGen/MachineUniformityAnalysis.cpp (+7-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+60)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+1094)
  • (modified) llvm/test/CodeGen/AMDGPU/sink-after-control-flow.mir (+1-1)
diff --git a/llvm/include/llvm/ADT/GenericUniformityImpl.h b/llvm/include/llvm/ADT/GenericUniformityImpl.h
index ddd0746ccd91632..755e1161b41b521 100644
--- a/llvm/include/llvm/ADT/GenericUniformityImpl.h
+++ b/llvm/include/llvm/ADT/GenericUniformityImpl.h
@@ -341,6 +341,9 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   using DivergenceDescriptorT =
       typename SyncDependenceAnalysisT::DivergenceDescriptor;
   using BlockLabelMapT = typename SyncDependenceAnalysisT::BlockLabelMap;
+  using UseOutsideCycleWithDivergentExitInfoT =
+      typename std::tuple<ConstValueRefT, const InstructionT *,
+                          SmallVector<BlockT *, 4>>;
 
   GenericUniformityAnalysisImpl(const DominatorTreeT &DT, const CycleInfoT &CI,
                                 const TargetTransformInfo *TTI)
@@ -396,6 +399,9 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
 
   void print(raw_ostream &out) const;
 
+  iterator_range<const UseOutsideCycleWithDivergentExitInfoT *>
+  uses_outside_cycles_with_divergent_exit() const;
+
 protected:
   /// \brief Value/block pair representing a single phi input.
   struct PhiInput {
@@ -427,6 +433,8 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
 
   // Recognized cycles with divergent exits.
   SmallPtrSet<const CycleT *, 16> DivergentExitCycles;
+  SmallVector<UseOutsideCycleWithDivergentExitInfoT, 4>
+      UsesOutsideCyclesWithDivergentExit;
 
   // Cycles assumed to be divergent.
   //
@@ -470,6 +478,10 @@ template <typename ContextT> class GenericUniformityAnalysisImpl {
   /// \brief Whether \p Def is divergent when read in \p ObservingBlock.
   bool isTemporalDivergent(const BlockT &ObservingBlock,
                            const InstructionT &Def) const;
+
+  void recordUseOutsideCycleWithDivergentExit(ConstValueRefT Src,
+                                              const InstructionT *UserInstr,
+                                              const CycleT &DefCycle);
 };
 
 template <typename ImplT>
@@ -1210,6 +1222,20 @@ void GenericUniformityAnalysisImpl<ContextT>::print(raw_ostream &OS) const {
   }
 }
 
+template <typename ContextT>
+using UseOutsideCycleWithDivergentExitInfoT =
+    typename std::tuple<typename ContextT::ConstValueRefT,
+                        const typename ContextT::InstructionT *,
+                        SmallVector<typename ContextT::BlockT *, 4>>;
+
+template <typename ContextT>
+iterator_range<const UseOutsideCycleWithDivergentExitInfoT<ContextT> *>
+GenericUniformityAnalysisImpl<
+    ContextT>::uses_outside_cycles_with_divergent_exit() const {
+  return make_range(UsesOutsideCyclesWithDivergentExit.begin(),
+                    UsesOutsideCyclesWithDivergentExit.end());
+}
+
 template <typename ContextT>
 bool GenericUniformityInfo<ContextT>::hasDivergence() const {
   return DA->hasDivergence();
@@ -1248,6 +1274,13 @@ void GenericUniformityInfo<ContextT>::print(raw_ostream &out) const {
   DA->print(out);
 }
 
+template <typename ContextT>
+iterator_range<const UseOutsideCycleWithDivergentExitInfoT<ContextT> *>
+GenericUniformityInfo<ContextT>::uses_outside_cycles_with_divergent_exit()
+    const {
+  return DA->uses_outside_cycles_with_divergent_exit();
+}
+
 template <typename ContextT>
 void llvm::ModifiedPostOrder<ContextT>::computeStackPO(
     SmallVectorImpl<const BlockT *> &Stack, const CycleInfoT &CI,
@@ -1367,6 +1400,16 @@ void llvm::ModifiedPostOrder<ContextT>::compute(const CycleInfoT &CI) {
   computeStackPO(Stack, CI, nullptr, Finalized);
 }
 
+template <typename ContextT>
+void GenericUniformityAnalysisImpl<ContextT>::
+    recordUseOutsideCycleWithDivergentExit(ConstValueRefT Src,
+                                           const InstructionT *UserInstr,
+                                           const CycleT &DefCycle) {
+  SmallVector<BlockT *, 4> TmpExitBlocks;
+  DefCycle.getExitBlocks(TmpExitBlocks);
+  UsesOutsideCyclesWithDivergentExit.push_back({Src, UserInstr, TmpExitBlocks});
+}
+
 } // namespace llvm
 
 #undef DEBUG_TYPE
diff --git a/llvm/include/llvm/ADT/GenericUniformityInfo.h b/llvm/include/llvm/ADT/GenericUniformityInfo.h
index e53afccc020b469..a76813d4bb964a1 100644
--- a/llvm/include/llvm/ADT/GenericUniformityInfo.h
+++ b/llvm/include/llvm/ADT/GenericUniformityInfo.h
@@ -39,6 +39,9 @@ template <typename ContextT> class GenericUniformityInfo {
 
   using CycleInfoT = GenericCycleInfo<ContextT>;
   using CycleT = typename CycleInfoT::CycleT;
+  using UseOutsideCycleWithDivergentExitInfoT =
+      typename std::tuple<ConstValueRefT, const InstructionT *,
+                          SmallVector<BlockT *, 4>>;
 
   GenericUniformityInfo(const DominatorTreeT &DT, const CycleInfoT &CI,
                         const TargetTransformInfo *TTI = nullptr);
@@ -78,6 +81,9 @@ template <typename ContextT> class GenericUniformityInfo {
 
   void print(raw_ostream &Out) const;
 
+  iterator_range<const UseOutsideCycleWithDivergentExitInfoT *>
+  uses_outside_cycles_with_divergent_exit() const;
+
 private:
   using ImplT = GenericUniformityAnalysisImpl<ContextT>;
 
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 98679b4dcf3cbfb..2135484448ef4ad 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -19,6 +19,8 @@
 #include "llvm/ADT/Uniformity.h"
 #include "llvm/CodeGen/MIRFormatter.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineCycleAnalysis.h"
+#include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
@@ -150,6 +152,12 @@ class TargetInstrInfo : public MCInstrInfo {
     return false;
   }
 
+  virtual void fixTemporalDivergence(MachineFunction &MF,
+                                     MachineDominatorTree *DT,
+                                     MachineCycleInfo *CI) const {
+    return;
+  }
+
 protected:
   /// For instructions with opcodes for which the M_REMATERIALIZABLE flag is
   /// set, this hook lets the target specify whether the instruction is actually
diff --git a/llvm/lib/Analysis/UniformityAnalysis.cpp b/llvm/lib/Analysis/UniformityAnalysis.cpp
index 2d617db431c5888..df1299610469d30 100644
--- a/llvm/lib/Analysis/UniformityAnalysis.cpp
+++ b/llvm/lib/Analysis/UniformityAnalysis.cpp
@@ -80,12 +80,17 @@ template <>
 void llvm::GenericUniformityAnalysisImpl<
     SSAContext>::propagateTemporalDivergence(const Instruction &I,
                                              const Cycle &DefCycle) {
-  if (isDivergent(I))
-    return;
   for (auto *User : I.users()) {
     auto *UserInstr = cast<Instruction>(User);
     if (DefCycle.contains(UserInstr->getParent()))
       continue;
+
+    recordUseOutsideCycleWithDivergentExit(cast<Value>(&I), UserInstr,
+                                           DefCycle);
+
+    if (isDivergent(I))
+      continue;
+
     markDivergent(*UserInstr);
   }
 }
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 480ac23d43ad879..ef35ec57a56971a 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -300,7 +300,8 @@ static bool blockPrologueInterferes(const MachineBasicBlock *BB,
       if (!Reg)
         continue;
       if (MO.isUse()) {
-        if (Reg.isPhysical() && MRI && MRI->isConstantPhysReg(Reg))
+        if (Reg.isPhysical() &&
+            (TII->isIgnorableUse(MO) || (MRI && MRI->isConstantPhysReg(Reg))))
           continue;
         if (PI->modifiesRegister(Reg, TRI))
           return true;
@@ -776,6 +777,12 @@ bool MachineSinking::runOnMachineFunction(MachineFunction &MF) {
     MRI->clearKillFlags(I);
   RegsToClearKillFlags.clear();
 
+  // In some tests there is a new basic block; need to recompute
+  DT->calculate(MF);
+  CI->clear();
+  CI->compute(MF);
+  TII->fixTemporalDivergence(MF, DT, CI);
+
   return EverMadeChange;
 }
 
@@ -1245,24 +1252,16 @@ MachineSinking::FindSuccToSinkTo(MachineInstr &MI, MachineBasicBlock *MBB,
   if (MBB == SuccToSinkTo)
     return nullptr;
 
-  if (!SuccToSinkTo)
-    return nullptr;
-
   // It's not safe to sink instructions to EH landing pad. Control flow into
   // landing pad is implicitly defined.
-  if (SuccToSinkTo->isEHPad())
+  if (SuccToSinkTo && SuccToSinkTo->isEHPad())
     return nullptr;
 
   // It ought to be okay to sink instructions into an INLINEASM_BR target, but
   // only if we make sure that MI occurs _before_ an INLINEASM_BR instruction in
   // the source block (which this code does not yet do). So for now, forbid
   // doing so.
-  if (SuccToSinkTo->isInlineAsmBrIndirectTarget())
-    return nullptr;
-
-  MachineBasicBlock::const_iterator InsertPos =
-      SuccToSinkTo->SkipPHIsAndLabels(SuccToSinkTo->begin());
-  if (blockPrologueInterferes(SuccToSinkTo, InsertPos, MI, TRI, TII, MRI))
+  if (SuccToSinkTo && SuccToSinkTo->isInlineAsmBrIndirectTarget())
     return nullptr;
 
   return SuccToSinkTo;
diff --git a/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp b/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp
index 3e0fe2b1ba087fe..5bd87d069285857 100644
--- a/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp
+++ b/llvm/lib/CodeGen/MachineUniformityAnalysis.cpp
@@ -117,11 +117,16 @@ void llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::
     if (!Op.getReg().isVirtual())
       continue;
     auto Reg = Op.getReg();
-    if (isDivergent(Reg))
-      continue;
+
     for (MachineInstr &UserInstr : RegInfo.use_instructions(Reg)) {
       if (DefCycle.contains(UserInstr.getParent()))
         continue;
+
+      recordUseOutsideCycleWithDivergentExit(Reg, &UserInstr, DefCycle);
+
+      if (isDivergent(Reg))
+        continue;
+
       markDivergent(UserInstr);
     }
   }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2799a3e78b04d22..5db140bb0fc6c94 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
 #include "llvm/CodeGen/RegisterScavenging.h"
 #include "llvm/CodeGen/ScheduleDAG.h"
 #include "llvm/IR/DiagnosticInfo.h"
@@ -171,6 +172,65 @@ bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
          isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent());
 }
 
+// Get pointers to build instruction after MI (skips PHIs if needed)
+static std::pair<MachineBasicBlock *, MachineBasicBlock::iterator>
+getInsertAfterPtrs(MachineInstr *MI) {
+  MachineBasicBlock *InsertMBB = MI->getParent();
+  return std::make_pair(
+      InsertMBB, InsertMBB->SkipPHIsAndLabels(std::next(MI->getIterator())));
+}
+
+static void replaceUseRegisterWith(const MachineInstr *MI, Register Reg,
+                                   Register Newreg) {
+  for (unsigned i = 0; i < MI->getNumOperands(); ++i) {
+    const MachineOperand &Op = MI->getOperand(i);
+    if (Op.isReg() && Op.getReg() == Reg) {
+      const_cast<MachineInstr *>(MI)->getOperand(i).setReg(Newreg);
+    }
+  }
+}
+
+void SIInstrInfo::fixTemporalDivergence(MachineFunction &MF,
+                                        MachineDominatorTree *DT,
+                                        MachineCycleInfo *CI) const {
+  MachineUniformityInfo MUI =
+      computeMachineUniformityInfo(MF, *CI, DT->getBase(), true);
+
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const GCNSubtarget &Subtarget = MF.getSubtarget<GCNSubtarget>();
+  const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
+  const SIRegisterInfo &TRI = *Subtarget.getRegisterInfo();
+
+  // Temporal divergence lowering is required for uniform UniformSourceReg and
+  // divergent UserInstr. UserInstr is uniform only when cycle has uniform exit.
+  for (auto [SrcReg, UserInstr, CycleExitBlocks] :
+       MUI.uses_outside_cycles_with_divergent_exit()) {
+    if (!MUI.isUniform(SrcReg) || !MUI.isDivergent(UserInstr))
+      continue;
+
+    MachineInstr *UniformSourceInstr = MRI.getVRegDef(SrcReg);
+
+    // FixMe: SrcReg is lane mask in this case. Find a better way to detect it.
+    if (UniformSourceInstr->getOpcode() == AMDGPU::SI_IF_BREAK ||
+        UserInstr->getOpcode() == AMDGPU::SI_IF)
+      continue;
+
+    unsigned Size = TRI.getRegSizeInBits(*MRI.getRegClassOrNull(SrcReg));
+    Register VgprDst =
+        MRI.createVirtualRegister(TRI.getVGPRClassForBitWidth(Size));
+
+    auto [MBB, AfterUniformSourceReg] = getInsertAfterPtrs(UniformSourceInstr);
+    BuildMI(*MBB, AfterUniformSourceReg, {}, TII.get(AMDGPU::COPY))
+        .addDef(VgprDst)
+        .addReg(SrcReg)
+        .addReg(AMDGPU::EXEC, RegState::Implicit);
+
+    replaceUseRegisterWith(UserInstr, SrcReg, VgprDst);
+  }
+
+  return;
+}
+
 bool SIInstrInfo::areLoadsFromSameBasePtr(SDNode *Load0, SDNode *Load1,
                                           int64_t &Offset0,
                                           int64_t &Offset1) const {
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e85917a4c0f3296..4446a962d7cca44 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -222,6 +222,9 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
 
   bool isIgnorableUse(const MachineOperand &MO) const override;
 
+  void fixTemporalDivergence(MachineFunction &MF, MachineDominatorTree *DT,
+                             MachineCycleInfo *CI) const override;
+
   bool areLoadsFromSameBasePtr(SDNode *Load0, SDNode *Load1, int64_t &Offset0,
                                int64_t &Offset1) const override;
 
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
index e2456b74f7ef1fa..b8e74bc7db09a1a 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll
@@ -21,6 +21,7 @@ define void @machinesink_loop_variable_out_of_divergent_loop(i32 %arg, i1 %cmp49
 ; CHECK-NEXT:  .LBB0_1: ; %Flow
 ; CHECK-NEXT:    ; in Loop: Header=BB0_3 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s8
+; CHECK-NEXT:    v_add_nc_u32_e32 v4, -4, v4
 ; CHECK-NEXT:  .LBB0_2: ; %Flow1
 ; CHECK-NEXT:    ; in Loop: Header=BB0_3 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
@@ -53,7 +54,6 @@ define void @machinesink_loop_variable_out_of_divergent_loop(i32 %arg, i1 %cmp49
 ; CHECK-NEXT:    ;;#ASMEND
 ; CHECK-NEXT:    v_add_nc_u32_e32 v4, s9, v2
 ; CHECK-NEXT:    v_cmp_ge_u32_e64 s4, v4, v0
-; CHECK-NEXT:    v_add_nc_u32_e32 v4, -4, v4
 ; CHECK-NEXT:    s_or_b32 s8, s4, s8
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s8
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_1
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
index cc14b4a80d58a7d..037a285794120da 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
@@ -42,7 +42,6 @@ body:             |
   ; CHECK-NEXT:   successors: %bb.2(0x40000000), %bb.5(0x40000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   INLINEASM &"", 1 /* sideeffect attdialect */
-  ; CHECK-NEXT:   [[V_ADD_U32_e64_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[COPY]], [[COPY1]], 0, implicit $exec
   ; CHECK-NEXT:   [[SI_IF_BREAK:%[0-9]+]]:sreg_32 = SI_IF_BREAK killed [[SI_IF1]], [[SI_IF]], implicit-def dead $scc
   ; CHECK-NEXT:   SI_LOOP [[SI_IF_BREAK]], %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
   ; CHECK-NEXT:   S_BRANCH %bb.5
@@ -52,6 +51,7 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[PHI:%[0-9]+]]:vgpr_32 = PHI [[COPY]], %bb.4
   ; CHECK-NEXT:   SI_END_CF [[SI_IF_BREAK]], implicit-def dead $exec, implicit-def dead $scc, implicit $exec
+  ; CHECK-NEXT:   [[V_ADD_U32_e64_:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[COPY]], [[COPY1]], 0, implicit $exec
   ; CHECK-NEXT:   INLINEASM &"", 1 /* sideeffect attdialect */, implicit [[V_ADD_U32_e64_]]
   ; CHECK-NEXT:   S_BRANCH %bb.2
   ; CHECK-NEXT: {{  $}}
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
new file mode 100644
index 000000000000000..0c631bdcdd374a8
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
@@ -0,0 +1,1094 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1031 < %s | FileCheck %s
+
+; ModuleID = 'kernel_round1_passing.bc'
+source_filename = "/tmp/comgr-295d04/input/CompileSource"
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8"
+target triple = "amdgcn-amd-amdhsa"
+
+@kernel_round1.first_words_data = external hidden unnamed_addr addrspace(3) global [896 x i8], align 1
+@kernel_round1.collisionsData = external hidden unnamed_addr addrspace(3) global [3840 x i32], align 4
+@kernel_round1.collisionsNum = external hidden addrspace(3) global i32, align 4
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i64 @_Z13get_global_idj(i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent nounwind
+declare hidden i32 @_Z10atomic_addPU3AS1Vjj(ptr addrspace(1) noundef, i32 noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent nounwind
+declare hidden i32 @_Z10atomic_subPU3AS1Vjj(ptr addrspace(1) noundef, i32 noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i64 @_Z12get_local_idj(i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent nounwind
+declare hidden void @_Z7barrierj(i32 noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i32 @_Z3minjj(i32 noundef, i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent nounwind
+declare hidden i32 @_Z10atomic_incPU3AS3Vj(ptr addrspace(3) noundef) local_unnamed_addr #1
+
+; Function Attrs: convergent mustprogress nofree nounwind willreturn memory(none)
+declare hidden i64 @_Z14get_local_sizej(i32 noundef) local_unnamed_addr #0
+
+; Function Attrs: convergent norecurse nounwind
+define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture noundef readonly align 1 %0, ptr addrspace(1) nocapture noundef writeonly align 1 %1, ptr addrspace(1) nocapture noundef readonly align 4 %2, ptr addrspace(1) noundef align 4 %3, ptr addrspace(1) nocapture noundef readnone align 4 %4) local_unnamed_addr #2 !kernel_arg_addr_space !5 !kernel_arg_access_qual !6 !kernel_arg_type !7 !kernel_arg_base_type !7 !kernel_arg_type_qual !8 !kernel_arg_name !9 !reqd_work_group_size !10 {
+; CHECK-LABEL: kernel_round1:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_add_u32 s10, s10, s15
+; CHECK-NEXT:    s_mov_b32 s32, 0
+; CHECK-NEXT:    s_addc_u32 s11, s11, 0
+; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s10
+; CHECK-NEXT:    s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s11
+; CHECK-NEXT:    s_load_dwordx8 s[44:51], s[6:7], 0x0
+; CHECK-NEXT:    s_add_u32 s0, s0, s15
+; CHECK-NEXT:    s_mov_b64 s[34:35], s[6:7]
+; CHECK-NEXT:    s_addc_u32 s1, s1, 0
+; CHECK-NEXT:    v_mov_b32_e32 v41, v0
+; CHECK-NEXT:    s_add_u32 s42, s34, 40
+; CHECK-NEXT:    v_mov_b32_e32 v31, v0
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_mov_b64 s[36:37], s[8:9]
+; CHECK-NEXT:    s_addc_u32 s43, s35, 0
+; CHECK-NEXT:    s_mov_b64 s[10:11], s[36:37]
+; CHECK-NEXT:    s_mov_b64 s[8:9], s[42:43]
+; CHECK-NEXT:    s_mov_b32 s33, s14
+; CHECK-NEXT:    s_mov_b32 s40, s13
+; CHECK-NEXT:    s_mov_b32 s41, s12
+; CHECK-NEXT:    s_mov_b64 s[38:39], s[4:5]
+; CHECK-NEXT:    s_getpc_b64 s[6:7]
+; CHECK-NEXT:    s_add_u32 s6, ...
[truncated]

Comment on lines 863 to 866
; CHECK-NEXT: [[S_ADD_I32_2:%[0-9]+]]:sreg_32 = S_ADD_I32 %108, 1, implicit-def dead $scc
; CHECK-NEXT: [[S_ADD_I32_3:%[0-9]+]]:sreg_32 = S_ADD_I32 [[S_ADD_I32_2]], 2, implicit-def dead $scc
; CHECK-NEXT: [[V_ADD_U32_e64_3:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[V_OR3_B32_e64_]], [[S_ADD_I32_3]], 0, implicit $exec
; CHECK-NEXT: [[V_ADD_U32_e64_3:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 %150, 1, 0, implicit $exec
; CHECK-NEXT: [[V_ADD_U32_e64_4:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[V_ADD_U32_e64_3]], 2, 0, implicit $exec
; CHECK-NEXT: [[V_ADD_U32_e64_5:%[0-9]+]]:vgpr_32 = V_ADD_U32_e64 [[V_OR3_B32_e64_]], [[V_ADD_U32_e64_4]], 0, implicit $exec
Collaborator Author

moveToVALU

@ssahasra
Collaborator

What is the relation between #67456 (this PR) and #67033? Does this PR supersede the other one?

@petar-avramovic
Collaborator Author

What is the relation between #67456 (this PR) and #67033? Does this PR supersede the other one?

Yes, this PR is an updated version of #67033, revised per all the comments; the main difference is that this PR uses a target hook in machine-sink to fix temporal divergence, while #67033 uses a separate machine function pass.

Collaborator

@nhaehnle left a comment

Thanks for the rework. I still think this needs to be improved to be done right.

The test case still feels way too large. It will suffer from random unrelated changes. It ought to be possible to write a more focused MIR test that really captures the essence of what's happening: A uniform value inside of a cycle is defined in an SGPR vreg, which is then used by a VALU instruction that would ordinarily be a candidate for sinking.

The solution itself also doesn't feel like it's From The Book. Consider:

  • Recomputing DomTree, Cycle Info, and Machine Uniformity during the fixup feels like a lot of effort
  • Machine Uniformity on wave-level MIR seems like a fundamentally dubious proposition in the first place
  • The resulting code isn't great. In the new test case, it would almost certainly be better to not sink the v_add and leave it inside the loop.

Let's have some ambition to do better than this.

The old logic that prevented sinking of instructions was a convoluted way of preventing the VALU instruction from being moved past an EXEC change. But that didn't really capture the essence of what the bug was.

What I think is closer to the essence of the bug is that the VALU instruction reads from an SGPR, and sinking it changes the value that is being read.

Take a look at FindSuccToSinkTo. There's a piece of logic there that will refuse to sink an instruction that uses a physical register, unless that physical register is constant. That feels very similar to the kind of problem from the SWDEV-407790 test. Unfortunately, it's difficult to fit an analogous test for the SGPR condition into the same loop over operands, because we need to know the sink target in order to perform the check.

Idea: Let's have a TII->isSafeToSinkTo callback which, in AMDGPU, checks if there are SGPR uses. If so, and the sink target is outside of a cycle that contains the SGPR def, then deny the sinking.
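
A rough sketch of that idea in C++ (hook name, signature, and helpers here are assumptions based on this discussion, not the final patch):

// Sketch only: deny sinking a VALU instruction out of a cycle that
// contains the def of one of its SGPR (uniform) sources.
bool SIInstrInfo::isSafeToSink(MachineInstr &MI,
                               MachineBasicBlock *SuccToSinkTo,
                               MachineCycleInfo *CI) const {
  const MachineRegisterInfo &MRI = MI.getMF()->getRegInfo();
  for (const MachineOperand &MO : MI.operands()) {
    if (!MO.isReg() || !MO.isUse() || !MO.getReg().isVirtual())
      continue;
    if (!RI.isSGPRReg(MRI, MO.getReg()))
      continue; // only SGPR sources matter here
    MachineInstr *SgprDef = MRI.getVRegDef(MO.getReg());
    MachineCycle *FromCycle = CI->getCycle(SgprDef->getParent());
    // SGPR def is inside a cycle and the sink target is outside it: deny.
    if (FromCycle && !FromCycle->contains(SuccToSinkTo))
      return false;
  }
  return true;
}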

This is a good first start:

  • It will still allow the sinking where we want it to be allowed:
    • Sinking in sink-after-control-flow.mir is allowed because there are no cycles involved
    • Sinking in machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir and .ll are allowed because the relevant SGPR def is outside of the cycle
  • By contrast, sinking in machine-sink-temporal-divergence-swdev407790.ll is prevented
    • This is good for correctness
    • It is better for performance than adding the extra v_mov

It is not perfect, because there are potentially cases where sinking would have been beneficial:

  • If adding an extra v_mov would allow multiple instructions to be sunk.
  • If the sink candidate is more expensive than a v_mov (e.g., transcendental)

Now in general, we should not be relying on machine sinking that much anyway. For the most part, LLVM IR passes should clean that up. Machine sinking should only help cleanup "debris" from instruction selection. It seems unlikely to me that such debris would show up for the two cases above, so while the initial idea could perhaps be refined, I'd lean towards just not worrying about it for now.

A final thought: All of blockPrologueInterferes becomes wasteful as far as I can tell. Block prologue instructions are instructions that define EXEC. If we skip them because EXEC is an ignorable use, then the check isn't doing us any good. Let's just remove it.

With that in mind, can you please:

  1. Change the main commit in this PR so that it adds the TII->isSafeToSinkTo callback and checks it at the end of FindSuccToSinkTo.
  2. Add a final commit which removes all traces of blockPrologueInterferes. (Make it a separate commit so that it's easier to revert if it turns out I was wrong.)

@petar-avramovic
Collaborator Author

Updated with a version that does not allow the sink. I tried to implement lightweight temporal divergence detection, in case machine uniformity analysis looks too expensive. TODO: remove blockPrologueInterferes.

Collaborator

@nhaehnle left a comment

You didn't yet update the tests, did you?

Anyway, this already looks much better, thanks! Still some issues with the logic and compile time, please see inline.

// SgprDef defined inside cycle
MachineCycle *FromCycle = CI->getCycle(SgprDef->getParent());
if (FromCycle == nullptr)
  return true;
Collaborator

Should be continue (also below) since this needs to look at all operands.

                               MachineBasicBlock *SuccToSinkTo,
                               MachineCycleInfo *CI) const {
  CI->clear();
  CI->compute(*MI.getMF());
Collaborator

Don't recompute this each time; it leads to at least quadratic compile time. It should be the responsibility of the caller to ensure that the CI is up to date.


// Allow sinking if MI edits lane mask (divergent i1 in sgpr).
if (MI.getOpcode() == AMDGPU::SI_IF_BREAK)
  return true;
Collaborator

The opcode check is independent of the operands, so it should be done once up-front.

return true;

// SuccToSinkTo is not in the cycle.
if (FromCycle != CI->getCycle(SuccToSinkTo)) {
Collaborator

This is a good start, but it doesn't yet quite capture the essence of what is happening. Consider the possible ways in which cycles may be nested. We can think of it as sinking MI successfully out of ever more cycles containing the SGPR def. It may be helpful to draw some diagrams on a piece of paper.

Suggestion (a code sketch follows the list):

  • Get the cycle of SuccToSinkTo --> ToCycle.
  • While FromCycle does not contain ToCycle:
    • If FromCycle has a divergent exit, return false: we've shown that sinking is unsafe
    • Otherwise, it is safe to sink out of FromCycle, but we need to continue checking: replace FromCycle by the parent cycle and repeat
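
A sketch of that walk, continuing the isSafeToSink sketch above (hasDivergentExit stands in for whatever divergent-exit check is used; it is an assumed helper, not existing API):

const MachineCycle *ToCycle = CI->getCycle(SuccToSinkTo);
const MachineCycle *FromCycle = CI->getCycle(SgprDef->getParent());
// Walk outward from the cycle containing the SGPR def until reaching a
// cycle that also contains the sink target (or running out of cycles).
while (FromCycle && !FromCycle->contains(ToCycle)) {
  // Sinking across a divergent exit changes the value that is read.
  if (hasDivergentExit(FromCycle))
    return false;
  // Safe to sink out of FromCycle itself; keep checking enclosing cycles.
  FromCycle = FromCycle->getParentCycle();
}
return true;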

Collaborator Author

Added a small targeted MIR test.

@petar-avramovic
Collaborator Author

blockPrologueInterferes in tryToSinkCopy seems to be required.
For example, see _amdgpu_ps_main2 from llvm/test/CodeGen/AMDGPU/sink-after-control-flow-postra.mir:
blockPrologueInterferes stops the sink of
renamable $sgpr0_sgpr1 = COPY $sgpr6_sgpr7
past its use $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec

Collaborator

@nhaehnle left a comment

Thanks, this looks pretty close to landing IMO.

blockPrologueInterferes in tryToSinkCopy seems to be required,
For example see _amdgpu_ps_main2 from llvm/test/CodeGen/AMDGPU/sink-after-control-flow-postra.mir
blockPrologueInterferes stops sink of
renamable $sgpr0_sgpr1 = COPY $sgpr6_sgpr7
past its use $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

@@ -734,6 +735,8 @@ bool MachineSinking::runOnMachineFunction(MachineFunction &MF) {
 
       MadeChange = true;
       ++NumSplit;
+      CI->clear();
+      CI->compute(MF);
Collaborator

Thanks. Please take a note to improve compile-time further as a separate follow-up change. Basically, it should be possible to update cycle-info when splitting a critical edge without recomputing the whole thing (adding an onEdgeSplit method like for MBFI).

I haven't thought about it too deeply, but I think what it boils down to is that the new basic block is a member of the smallest cycle containing both the source and the destination of the critical edge.
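
A rough sketch of that incremental update (onEdgeSplit, getSmallestCommonCycle, and addBlock are assumed names for illustration, not existing LLVM API):

// Called when the critical edge Src -> Dst is split by inserting NewBB.
void MachineCycleInfo::onEdgeSplit(MachineBasicBlock &Src,
                                   MachineBasicBlock &Dst,
                                   MachineBasicBlock &NewBB) {
  // NewBB belongs to the smallest cycle containing both edge endpoints,
  // and transitively to every cycle enclosing that one.
  MachineCycle *C = getSmallestCommonCycle(getCycle(&Src), getCycle(&Dst));
  for (; C; C = C->getParentCycle())
    C->addBlock(NewBB); // assumed mutator on the cycle's block set
}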

-  MachineBasicBlock::const_iterator InsertPos =
-      SuccToSinkTo->SkipPHIsAndLabels(SuccToSinkTo->begin());
-  if (blockPrologueInterferes(SuccToSinkTo, InsertPos, MI, TRI, TII, MRI))
+  if (!TII->isSafeToSink(MI, SuccToSinkTo, CI))
Collaborator

So isSafeToSink can be called with SuccToSinkTo == nullptr? That seems like a bad idea.

MachineCycle *ToCycle = CI->getCycle(SuccToSinkTo);
// Check if there is a FromCycle that contains SgprDef's basic block but
// does not contain SuccToSinkTo and also has divergent exit condition.
while (FromCycle != ToCycle) {
Collaborator

This needs to be FromCycle && !FromCycle->contains(ToCycle).

To see why, consider the case where there are two cycles right after each other:

  |
  v/-\
  A  |
  |\-/
  |
  |
  v/-\
  B  |
  |\-/
  |

It could be the case that SgprDef is in A and SuccToSinkTo is in B.

Introduced by 5b657f5, which moved LICM after AMDGPUCodeGenPrepare. Some instructions are no longer
sunk during IR optimizations but in machine-sinking instead.
If a VGPR instruction that uses an SGPR defined inside the cycle is sunk outside
of the cycle, we end up with an unhandled case of temporal divergence.

Add a test for the theoretical case when a SALU instruction (representing a
uniform value) is sunk outside of the cycle.
Add a test where a SALU instruction can be sunk because it edits a lane mask.

Temporal divergence that was present in the input or introduced by IR
transforms, like code-sinking or LICM, is handled in SIFixSGPRCopies
by changing the SGPR source instruction to a VGPR instruction.
After 5b657f5, which moved LICM after AMDGPUCodeGenPrepare,
machine-sinking can introduce temporal divergence by sinking
instructions outside of the cycle.
Add an isSafeToSink callback in TargetInstrInfo.
@petar-avramovic
Collaborator Author

blockPrologueInterferes in tryToSinkCopy seems to be required,
For example see _amdgpu_ps_main2 from llvm/test/CodeGen/AMDGPU/sink-after-control-flow-postra.mir
blockPrologueInterferes stops sink of
renamable $sgpr0_sgpr1 = COPY $sgpr6_sgpr7
past its use $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

See https://reviews.llvm.org/D121277. Target prologue instructions are not checked for "a def can't be sunk past its use" but are skipped as part of SkipPHIsAndLabels. They have to be checked somewhere. Maybe we could remove some checks from blockPrologueInterferes?

@arsenm
Contributor

arsenm commented Oct 5, 2023

I think we need to stop adding these extremely specific hooks that just identify problematic situations and hack around them in individual transforms. I think we need to move towards fully modeling the uniform and divergent CFGs in MachineBasicBlock, and use dedicated restricted pseudo-copies where there's a potential temporal divergence issue

@nhaehnle
Collaborator

nhaehnle commented Oct 6, 2023

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

https://reviews.llvm.org/D121277. Target-prologue instructions are not checked for "def can't be sunk past its use" but skipped as part of SkipPHIsAndLabels. They have to be checked somewhere. Maybe we could remove some checks from blockPrologueInterferes?

It looks like this new change totally supersedes what D121277 was trying to do. Perhaps it can be reverted in its entirety?

I think we need to stop adding these extremely specific hooks that just identify problematic situations and hack around them in individual transforms. I think we need to move towards fully modeling the uniform and divergent CFGs in MachineBasicBlock, and use dedicated restricted pseudo-copies where there's a potential temporal divergence issue

I agree. Do we actually have a proposal for how this should look? I've seen some ideas and had some myself, but haven't seen anything fully coherent yet.

@petar-avramovic
Collaborator Author

Does that make sense? Whatever PostRAMachineSink is trying to do, surely it shouldn't need target-specific block prologue logic to understand that a def can't be sunk past its use...

https://reviews.llvm.org/D121277. Target-prologue instructions are not checked for "def can't be sunk past its use" but skipped as part of SkipPHIsAndLabels. They have to be checked somewhere. Maybe we could remove some checks from blockPrologueInterferes?

It looks like this new change totally supersedes what D121277 was trying to do. Perhaps it can be reverted in its entirety?

This change is independent of D121277. It supersedes and reverts D155343, which also used blockPrologueInterferes.

I think we need to move towards fully modeling the uniform and divergent CFGs in MachineBasicBlock,

What do we need for this? We already have pseudo branches for divergent control flow. The only other thing I could think of would be a dedicated SGPR register class used only for lane masks.

and use dedicated restricted pseudo-copies where there's a potential temporal divergence issue

Where/how would we insert them? I am convinced that we can detect temporally divergent uses at any point in the pipeline.

Collaborator

@nhaehnle left a comment

Okay, I see now that we do need an explicit check for the block prologue for post-RA sinking. So I think this change should be good to go.

petar-avramovic added a commit that referenced this pull request Oct 6, 2023
Add isSafeToSink callback in TargetInstrInfo.
@petar-avramovic
Collaborator Author

Thanks for the review; closing this PR for now. I pushed the commits separately to keep them distinct.
Please let me know if you have any suggestions for how to handle temporal divergence issues; I can work on a patch for that.

@arsenm
Contributor

arsenm commented Oct 6, 2023

I agree. Do we actually have a proposal for how this should look? I've seen some ideas and had some myself, but haven't seen anything fully coherent yet.

No, not really. I was most recently thinking of splitting isConvergent into some notion of read-convergent and write-convergent, and then adding a corresponding flag to basic blocks that acts sort of like IsEHPad. You could then compose read/write convergence to approximately represent vector vs. scalar vs. WWM ops with different convergence types.

@petar-avramovic mentioned this pull request Dec 1, 2023
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Dec 5, 2023
Add isSafeToSink callback in TargetInstrInfo.