
[AMDGPU][SIInsertWaitcnts] Do not add s_waitcnt when the counters are known to be 0 already #72830

Merged · 1 commit · Dec 15, 2023

Conversation

Pierre-vh (Contributor):

See #72829 to precommit test changes.

Original patch by @jmmartinez : #65735

@llvmbot (Collaborator) commented Nov 20, 2023

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

See #72829 to precommit test changes.

Original patch by @jmmartinez : #65735


Patch is 3.52 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72830.diff

67 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp (+5-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+75-61)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+5)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+25)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+7-5)
  • (modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll (+56-281)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll (+58-299)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll (+2-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+7-38)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+1272-482)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (-120)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+4-252)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+1-14)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (-3)
  • (modified) llvm/test/CodeGen/AMDGPU/back-off-barrier-subtarget-feature.ll (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/fence-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fence-lds-read2-write2.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics.ll (-423)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (-297)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64.ll (-246)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (-297)
  • (modified) llvm/test/CodeGen/AMDGPU/force-store-sc0-sc1.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx940.ll (+2-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+7-38)
  • (modified) llvm/test/CodeGen/AMDGPU/gds-allocation.ll (+4-9)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp-wrong-subtarget.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp.ll (+12-53)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (-160)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics.ll (+4-474)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-store.ll (+16-56)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+8-390)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+36-315)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64.ll (+3-324)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+36-315)
  • (modified) llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll (-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.gws.barrier.ll (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.set.inactive.chain.arg.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll (-42)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-atomic-insert-end.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (-316)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+1-841)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+1-841)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (-492)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+16-885)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+16-825)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+8-566)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (-390)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (-390)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (-390)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-multiple-mem-operands-atomics.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-atomicrmw-system.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-atomicrmw.ll (-2)
  • (added) llvm/test/CodeGen/AMDGPU/preserve-user-waitcnt.ll (+53)
  • (modified) llvm/test/CodeGen/AMDGPU/release-vgprs.mir (+56-47)
  • (modified) llvm/test/CodeGen/AMDGPU/shl_add_ptr_global.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-descriptor-waterfall-loop-idom-update.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/waitcnt-no-redundant.mir (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/waitcnt-preexisting-vscnt.mir (+17-16)
  • (modified) llvm/test/CodeGen/AMDGPU/waitcnt-vscnt.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/waitcnt.mir (+2-2)
diff --git a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
index bf65be3fe9035e7..c8ce1903d31537c 100644
--- a/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
+++ b/llvm/lib/Target/AMDGPU/MCA/AMDGPUCustomBehaviour.cpp
@@ -25,10 +25,12 @@ void AMDGPUInstrPostProcess::postProcessInstruction(
     std::unique_ptr<Instruction> &Inst, const MCInst &MCI) {
   switch (MCI.getOpcode()) {
   case AMDGPU::S_WAITCNT:
+  case AMDGPU::S_WAITCNT_soft:
   case AMDGPU::S_WAITCNT_EXPCNT:
   case AMDGPU::S_WAITCNT_LGKMCNT:
   case AMDGPU::S_WAITCNT_VMCNT:
   case AMDGPU::S_WAITCNT_VSCNT:
+  case AMDGPU::S_WAITCNT_VSCNT_soft:
   case AMDGPU::S_WAITCNT_EXPCNT_gfx10:
   case AMDGPU::S_WAITCNT_LGKMCNT_gfx10:
   case AMDGPU::S_WAITCNT_VMCNT_gfx10:
@@ -77,10 +79,12 @@ unsigned AMDGPUCustomBehaviour::checkCustomHazard(ArrayRef<InstRef> IssuedInst,
   default:
     return 0;
   case AMDGPU::S_WAITCNT: // This instruction
+  case AMDGPU::S_WAITCNT_soft:
   case AMDGPU::S_WAITCNT_EXPCNT:
   case AMDGPU::S_WAITCNT_LGKMCNT:
   case AMDGPU::S_WAITCNT_VMCNT:
-  case AMDGPU::S_WAITCNT_VSCNT: // to this instruction are all pseudo.
+  case AMDGPU::S_WAITCNT_VSCNT:
+  case AMDGPU::S_WAITCNT_VSCNT_soft: // to this instruction are all pseudo.
   case AMDGPU::S_WAITCNT_EXPCNT_gfx10:
   case AMDGPU::S_WAITCNT_LGKMCNT_gfx10:
   case AMDGPU::S_WAITCNT_VMCNT_gfx10:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index ede4841b8a5fd7d..7048aee3099d166 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -292,6 +292,13 @@ class WaitcntBrackets {
     VgprVmemTypes[GprNo] = 0;
   }
 
+  void setNonKernelFunctionInitialState() {
+    for (InstCounterType Counter : inst_counter_types()) {
+      setScoreUB(Counter, getWaitCountMax(Counter));
+      PendingEvents |= WaitEventMaskForInst[Counter];
+    }
+  }
+
   void print(raw_ostream &);
   void dump() { print(dbgs()); }
 
@@ -364,7 +371,6 @@ class SIInsertWaitcnts : public MachineFunctionPass {
   const MachineRegisterInfo *MRI = nullptr;
   AMDGPU::IsaVersion IV;
 
-  DenseSet<MachineInstr *> TrackedWaitcntSet;
   DenseMap<const Value *, MachineBasicBlock *> SLoadAddresses;
   DenseMap<MachineBasicBlock *, bool> PreheadersToFlush;
   MachineLoopInfo *MLI;
@@ -477,7 +483,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
   bool generateWaitcnt(AMDGPU::Waitcnt Wait,
                        MachineBasicBlock::instr_iterator It,
                        MachineBasicBlock &Block, WaitcntBrackets &ScoreBrackets,
-                       MachineInstr *OldWaitcntInstr);
+                       MachineInstr *OldWaitcntInstr) const;
   void updateEventWaitcntAfter(MachineInstr &Inst,
                                WaitcntBrackets *ScoreBrackets);
   bool insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block,
@@ -486,6 +492,7 @@ class SIInsertWaitcnts : public MachineFunctionPass {
                                MachineInstr &OldWaitcntInstr,
                                AMDGPU::Waitcnt &Wait,
                                MachineBasicBlock::instr_iterator It) const;
+  bool updateWaitcntIfSoft(MachineInstr *Waitcnt) const;
 };
 
 } // end anonymous namespace
@@ -870,6 +877,15 @@ static bool updateOperandIfDifferent(MachineInstr &MI, uint16_t OpName,
   return true;
 }
 
+bool SIInsertWaitcnts::updateWaitcntIfSoft(MachineInstr *Waitcnt) const {
+  unsigned Opcode = Waitcnt->getOpcode();
+  if (!SIInstrInfo::isSoftWaitcnt(Opcode))
+    return false;
+
+  Waitcnt->setDesc(TII->get(SIInstrInfo::getNonSoftWaitcntOpcode(Opcode)));
+  return true;
+}
+
 /// Combine consecutive waitcnt instructions that precede \p It and follow
 /// \p OldWaitcntInstr and apply any extra wait from waitcnt that were added
 /// by previous passes. Currently this pass conservatively assumes that these
@@ -886,18 +902,22 @@ bool SIInsertWaitcnts::applyPreexistingWaitcnt(
     if (II.isMetaInstruction())
       continue;
 
-    if (II.getOpcode() == AMDGPU::S_WAITCNT) {
+    unsigned Opcode = II.getOpcode();
+    bool CanFullyDiscardWaitcntSequence = SIInstrInfo::isSoftWaitcnt(Opcode);
+
+    if (SIInstrInfo::isWaitcnt(Opcode)) {
       // Conservatively update required wait if this waitcnt was added in an
       // earlier pass. In this case it will not exist in the tracked waitcnt
       // set.
-      if (!TrackedWaitcntSet.count(&II)) {
-        unsigned IEnc = II.getOperand(0).getImm();
-        AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
-        Wait = Wait.combined(OldWait);
-      }
+      unsigned IEnc = II.getOperand(0).getImm();
+      AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
+      if (CanFullyDiscardWaitcntSequence)
+        ScoreBrackets.simplifyWaitcnt(OldWait);
+      Wait = Wait.combined(OldWait);
 
       // Merge consecutive waitcnt of the same type by erasing multiples.
-      if (!WaitcntInstr) {
+      if (!WaitcntInstr &&
+          (Wait.hasWaitExceptVsCnt() || !CanFullyDiscardWaitcntSequence)) {
         WaitcntInstr = &II;
       } else {
         II.eraseFromParent();
@@ -905,15 +925,17 @@ bool SIInsertWaitcnts::applyPreexistingWaitcnt(
       }
 
     } else {
-      assert(II.getOpcode() == AMDGPU::S_WAITCNT_VSCNT);
+      assert(SIInstrInfo::isWaitcntVsCnt(Opcode));
       assert(II.getOperand(0).getReg() == AMDGPU::SGPR_NULL);
-      if (!TrackedWaitcntSet.count(&II)) {
-        unsigned OldVSCnt =
-            TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
-        Wait.VsCnt = std::min(Wait.VsCnt, OldVSCnt);
-      }
 
-      if (!WaitcntVsCntInstr) {
+      unsigned OldVSCnt =
+          TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
+      if (CanFullyDiscardWaitcntSequence)
+        ScoreBrackets.simplifyWaitcnt(InstCounterType::VS_CNT, OldVSCnt);
+      Wait.VsCnt = std::min(Wait.VsCnt, OldVSCnt);
+
+      if (!WaitcntVsCntInstr &&
+          (Wait.hasWaitVsCnt() || !CanFullyDiscardWaitcntSequence)) {
         WaitcntVsCntInstr = &II;
       } else {
         II.eraseFromParent();
@@ -924,48 +946,38 @@ bool SIInsertWaitcnts::applyPreexistingWaitcnt(
 
   // Updated encoding of merged waitcnt with the required wait.
   if (WaitcntInstr) {
-    if (Wait.hasWaitExceptVsCnt()) {
-      Modified |=
-          updateOperandIfDifferent(*WaitcntInstr, AMDGPU::OpName::simm16,
-                                   AMDGPU::encodeWaitcnt(IV, Wait));
-      ScoreBrackets.applyWaitcnt(Wait);
-      Wait.VmCnt = ~0u;
-      Wait.LgkmCnt = ~0u;
-      Wait.ExpCnt = ~0u;
-
-      LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
-                     ? dbgs() << "applyPreexistingWaitcnt\n"
-                              << "New Instr at block end: " << *WaitcntInstr
-                              << '\n'
-                     : dbgs() << "applyPreexistingWaitcnt\n"
-                              << "Old Instr: " << *It
-                              << "New Instr: " << *WaitcntInstr << '\n');
+    Modified |= updateOperandIfDifferent(*WaitcntInstr, AMDGPU::OpName::simm16,
+                                         AMDGPU::encodeWaitcnt(IV, Wait));
+    Modified |= updateWaitcntIfSoft(WaitcntInstr);
 
-    } else {
-      WaitcntInstr->eraseFromParent();
-      Modified = true;
-    }
+    ScoreBrackets.applyWaitcnt(Wait);
+    Wait.VmCnt = ~0u;
+    Wait.LgkmCnt = ~0u;
+    Wait.ExpCnt = ~0u;
+
+    LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
+                   ? dbgs()
+                         << "applyPreexistingWaitcnt\n"
+                         << "New Instr at block end: " << *WaitcntInstr << '\n'
+                   : dbgs() << "applyPreexistingWaitcnt\n"
+                            << "Old Instr: " << *It
+                            << "New Instr: " << *WaitcntInstr << '\n');
   }
 
   if (WaitcntVsCntInstr) {
-    if (Wait.hasWaitVsCnt()) {
-      assert(ST->hasVscnt());
-      Modified |= updateOperandIfDifferent(*WaitcntVsCntInstr,
-                                           AMDGPU::OpName::simm16, Wait.VsCnt);
-      ScoreBrackets.applyWaitcnt(Wait);
-      Wait.VsCnt = ~0u;
-
-      LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
-                     ? dbgs() << "applyPreexistingWaitcnt\n"
-                              << "New Instr at block end: "
-                              << *WaitcntVsCntInstr << '\n'
-                     : dbgs() << "applyPreexistingWaitcnt\n"
-                              << "Old Instr: " << *It
-                              << "New Instr: " << *WaitcntVsCntInstr << '\n');
-    } else {
-      WaitcntVsCntInstr->eraseFromParent();
-      Modified = true;
-    }
+    Modified |= updateOperandIfDifferent(*WaitcntVsCntInstr,
+                                         AMDGPU::OpName::simm16, Wait.VsCnt);
+    Modified |= updateWaitcntIfSoft(WaitcntVsCntInstr);
+    ScoreBrackets.applyWaitcnt(Wait);
+    Wait.VsCnt = ~0u;
+
+    LLVM_DEBUG(It == OldWaitcntInstr.getParent()->end()
+                   ? dbgs() << "applyPreexistingWaitcnt\n"
+                            << "New Instr at block end: " << *WaitcntVsCntInstr
+                            << '\n'
+                   : dbgs() << "applyPreexistingWaitcnt\n"
+                            << "Old Instr: " << *It
+                            << "New Instr: " << *WaitcntVsCntInstr << '\n');
   }
 
   return Modified;
@@ -1284,7 +1296,7 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
                                        MachineBasicBlock::instr_iterator It,
                                        MachineBasicBlock &Block,
                                        WaitcntBrackets &ScoreBrackets,
-                                       MachineInstr *OldWaitcntInstr) {
+                                       MachineInstr *OldWaitcntInstr) const {
   bool Modified = false;
   const DebugLoc &DL = Block.findDebugLoc(It);
 
@@ -1317,7 +1329,6 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
     unsigned Enc = AMDGPU::encodeWaitcnt(IV, Wait);
     auto SWaitInst =
         BuildMI(Block, It, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(Enc);
-    TrackedWaitcntSet.insert(SWaitInst);
     Modified = true;
 
     LLVM_DEBUG(dbgs() << "generateWaitcnt\n";
@@ -1331,7 +1342,6 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
     auto SWaitInst = BuildMI(Block, It, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT))
                          .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
                          .addImm(Wait.VsCnt);
-    TrackedWaitcntSet.insert(SWaitInst);
     Modified = true;
 
     LLVM_DEBUG(dbgs() << "generateWaitcnt\n";
@@ -1574,9 +1584,9 @@ bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {
 }
 
 static bool isWaitInstr(MachineInstr &Inst) {
-  return Inst.getOpcode() == AMDGPU::S_WAITCNT ||
-         (Inst.getOpcode() == AMDGPU::S_WAITCNT_VSCNT &&
-          Inst.getOperand(0).isReg() &&
+  auto Opcode = Inst.getOpcode();
+  return SIInstrInfo::isWaitcnt(Opcode) ||
+         (SIInstrInfo::isWaitcntVsCnt(Opcode) && Inst.getOperand(0).isReg() &&
           Inst.getOperand(0).getReg() == AMDGPU::SGPR_NULL);
 }
 
@@ -1845,7 +1855,6 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
       TRI->getEncodingValue(AMDGPU::SGPR0) & AMDGPU::HWEncoding::REG_IDX_MASK;
   Encoding.SGPRL = Encoding.SGPR0 + NumSGPRsMax - 1;
 
-  TrackedWaitcntSet.clear();
   BlockInfos.clear();
   bool Modified = false;
 
@@ -1863,6 +1872,11 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
       ;
     BuildMI(EntryBB, I, DebugLoc(), TII->get(AMDGPU::S_WAITCNT)).addImm(0);
 
+    auto NonKernelInitialState =
+        std::make_unique<WaitcntBrackets>(ST, Limits, Encoding);
+    NonKernelInitialState->setNonKernelFunctionInitialState();
+    BlockInfos[&EntryBB].Incoming = std::move(NonKernelInitialState);
+
     Modified = true;
   }
 
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c4baabcd9232b56..7ccaea823a2295c 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8774,6 +8774,11 @@ bool SIInstrInfo::isAsmOnlyOpcode(int MCOp) const {
 }
 
 int SIInstrInfo::pseudoToMCOpcode(int Opcode) const {
+
+  // FIXME: move to the right place
+  if (SIInstrInfo::isSoftWaitcnt(Opcode))
+    Opcode = SIInstrInfo::getNonSoftWaitcntOpcode(Opcode);
+
   unsigned Gen = subtargetEncodingFamily(ST);
 
   if ((get(Opcode).TSFlags & SIInstrFlags::renamedInGFX9) != 0 &&
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index de2820e5c013ee3..fb11d30dbca6dca 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -859,6 +859,31 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
     return get(Opcode).TSFlags & SIInstrFlags::TiedSourceNotRead;
   }
 
+  static unsigned getNonSoftWaitcntOpcode(unsigned Opcode) {
+    if (isWaitcnt(Opcode))
+      return AMDGPU::S_WAITCNT;
+
+    if (isWaitcntVsCnt(Opcode))
+      return AMDGPU::S_WAITCNT_VSCNT;
+
+    llvm_unreachable("Expected opcode S_WAITCNT/S_WAITCNT_VSCNT");
+  }
+
+  static bool isWaitcnt(unsigned Opcode) {
+    return Opcode == AMDGPU::S_WAITCNT || Opcode == AMDGPU::S_WAITCNT_soft;
+  }
+
+  static bool isWaitcntVsCnt(unsigned Opcode) {
+    return Opcode == AMDGPU::S_WAITCNT_VSCNT ||
+           Opcode == AMDGPU::S_WAITCNT_VSCNT_soft;
+  }
+
+  // soft waitcnt instructions can be relaxed/optimized out by SIInsertWaitcnts
+  static bool isSoftWaitcnt(unsigned Opcode) {
+    return Opcode == AMDGPU::S_WAITCNT_soft ||
+           Opcode == AMDGPU::S_WAITCNT_VSCNT_soft;
+  }
+
   bool isVGPRCopy(const MachineInstr &MI) const {
     assert(isCopyInstr(MI));
     Register Dest = MI.getOperand(0).getReg();
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index bc48f7b76c6d787..10ec54d3317fdf1 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -1055,7 +1055,8 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                             VMCnt ? 0 : getVmcntBitMask(IV),
                             getExpcntBitMask(IV),
                             LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(WaitCntImmediate);
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+        .addImm(WaitCntImmediate);
     Changed = true;
   }
 
@@ -1963,14 +1964,15 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                             VMCnt ? 0 : getVmcntBitMask(IV),
                             getExpcntBitMask(IV),
                             LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT)).addImm(WaitCntImmediate);
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+        .addImm(WaitCntImmediate);
     Changed = true;
   }
 
   if (VSCnt) {
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT))
-      .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
-      .addImm(0);
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
+        .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
+        .addImm(0);
     Changed = true;
   }
 
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 90056e6ca281e78..83b325a148a168a 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1464,6 +1464,21 @@ def S_WAKEUP : SOPP_Pseudo <"s_wakeup", (ins) > {
 
 def S_WAITCNT : SOPP_Pseudo <"s_waitcnt" , (ins SWaitCnt:$simm16), "$simm16",
     [(int_amdgcn_s_waitcnt timm:$simm16)]>;
+
+// "_soft" waitcnts are waitcnts that are either relaxed into their non-soft
+// counterpart, or completely removed.
+//
+// These are inserted by the memory legalizer to resolve memory dependencies
+// and are later optimized by SIInsertWaitcnts.
+// For example, an S_WAITCNT_soft 0 can be completely removed in a function
+// that doesn't access memory.
+def S_WAITCNT_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;
+def S_WAITCNT_VSCNT_soft : SOPP_Pseudo<"s_soft_waitcnt_vscnt", (ins SReg_32:$sdst, s16imm:$simm16), "$sdst, $simm16"> {
+  let mayLoad = 1;
+  let mayStore = 1;
+  let has_sdst = 1;
+}
+
 def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
     [(int_amdgcn_s_sethalt timm:$simm16)]>;
 def S_SETKILL : SOPP_Pseudo <"s_setkill" , (ins i16imm:$simm16), "$simm16">;
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
index feb65a5210d59d2..25cee87244975e5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
@@ -22,7 +22,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr add
 ; CI-NEXT:    s_mov_b32 m0, -1
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    v_mov_b32_e32 v1, s2
-; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    ds_dec_rtn_u32 v2, v1, v0
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    v_mov_b32_e32 v0, s0
@@ -38,7 +37,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr add
 ; VI-NEXT:    s_mov_b32 m0, -1
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v1, s2
-; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    ds_dec_rtn_u32 v2, v1, v0
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s0
@@ -53,7 +51,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr add
 ; GFX9-NEXT:    v_mov_b32_e32 v1, 42
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    ds_dec_rtn_u32 v0, v0, v1
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    v_mov_b32_e32 v1, 0
@@ -67,8 +64,7 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr add
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX10-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
-; GFX10-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    ds_dec_rtn_u32 v0, v0, v1
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    buffer_gl0_inv
@@ -83,8 +79,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr add
 ; GFX11-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    v_dual_mov_b32 v1, 42 :: v_dual_mov_b32 v0, s2
-; GFX11-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    ds_dec_rtn_u32 v0, v0, v1
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    buffer_gl0_inv
@@ -107,7 +101,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32_offset(ptr addrspace(1) %out,
 ; CI-NEXT:    s_mov_b32 m0, -1
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    v_mov_b32_e32 v1, s2
-; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    ds_dec_rtn_u32 v2, v1, v0 offset:16
 ; CI-NEXT:    s_waitcnt lgkmcnt(0)
 ; CI-NEXT:    v_mov_b32_e32 v0, s0
@@ -123,7 +116,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32_offset(ptr addrspace(1) %out,
 ; VI-NEXT:    s_mov_b32 m0, -1
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v1, s2
-; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    ds_dec_rtn_u32 v2, v1, v0 offset:16
 ; VI-NEXT:    s_waitcnt lgkmcnt(0)
 ; VI-NEXT:    v_mov_b32_e32 v0, s0
@@ -138,7 +130,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32_offset(ptr addrspace(1) %out,
 ; GFX9-NEXT:    v_mov_b32_e32 v0, 42
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    v_mov_b32_e32 v1, s2
-; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    ds_dec_rtn_u32 v0, v1, v0 offset:16
 ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT:    v_mov_b32_e32 v1, 0
@@ -152,8 +143,7 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32_offset(ptr addrspace(1) %out,
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    v_mov_b32_e32 v1, s0
 ; GFX10-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
-; GFX10-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; ...
[truncated]

@rovka (Collaborator) left a comment:

Drive-by nits.

llvm/lib/Target/AMDGPU/SOPInstructions.td (outdated review thread, resolved)
llvm/lib/Target/AMDGPU/SOPInstructions.td (outdated review thread, resolved)
@@ -8774,6 +8774,11 @@ bool SIInstrInfo::isAsmOnlyOpcode(int MCOp) const {
}

int SIInstrInfo::pseudoToMCOpcode(int Opcode) const {

// FIXME: move to the right place
Collaborator:

Where's the right place? :) This comment should be a bit more specific

Contributor Author:

I don't know (just picked up the patch) - looks like the right place to me

llvm/lib/Target/AMDGPU/SOPInstructions.td (outdated review thread, resolved)
unsigned Opcode = II.getOpcode();
bool CanFullyDiscardWaitcntSequence = SIInstrInfo::isSoftWaitcnt(Opcode);

if (SIInstrInfo::isWaitcnt(Opcode)) {
// Conservatively update required wait if this waitcnt was added in an
// earlier pass. In this case it will not exist in the tracked waitcnt
// set.
Collaborator:

Should update this comment.

Contributor Author:

What should it say instead?

Collaborator:

Well, there's no more tracked waitcnt set, so we should rephrase it in terms of soft waitcnts instead.

Something along the lines of "Update required wait. If this waitcnt was added in an earlier pass, but is no longer needed, it may be removed."

@jayfoad (Contributor) commented Nov 20, 2023

Original patch by @jmmartinez : #65735

I had a bunch of outstanding comments on that PR.

@Pierre-vh (Contributor Author):

Original patch by @jmmartinez : #65735

I had a bunch of outstanding comments on that PR.

Aren't they all addressed, with the exception of the added const in SIInsertWaitCnt (which I'm going to remove) and the extra waitcnt mystery, which I am looking into right now?

@jayfoad (Contributor) commented Nov 20, 2023

Original patch by @jmmartinez : #65735

I had a bunch of outstanding comments on that PR.

Aren't they all addressed, with the exception of the added const in SIInsertWaitCnt (which i'm going to remove) and the extra waitcnt mystery which I am looking into right now?

I don't know because I can't easily see what has changed. I'll try to take a fresh look soon.

@@ -45,6 +45,7 @@ define void @back_off_barrier_no_fence(ptr %in, ptr %out) #0 {
; GFX11-BACKOFF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-BACKOFF-NEXT: flat_load_b32 v0, v[0:1]
; GFX11-BACKOFF-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-BACKOFF-NEXT: s_waitcnt_vscnt null, 0x0
Contributor Author:

This extra waitcnt is due to setNonKernelFunctionInitialState
Without this patch the counters start at

*** Block0 ***
    VM_CNT(0): 
    LGKM_CNT(0): 
    EXP_CNT(0): 
    VS_CNT(0): 

With the patch:

*** Block0 ***
    VM_CNT(63): 
    LGKM_CNT(63): 
    EXP_CNT(7): 
    VS_CNT(63): 

I didn't follow the full discussion around the patch so I don't know the context behind this change. I will try to dive deeper this week to provide more meaningful feedback.

Contributor:

I am also confused by setNonKernelFunctionInitialState. It does not seem like it should be necessary for this patch.

Contributor Author:

Without it there's quite a few more changes, e.g.

  • MSG_DEALLOC_VGPRS is back in llvm.amdgcn.set.inactive.chain.arg.ll
  • atomicrmw-expand.ll has a few missing s_waitcnt_vscnt null, 0x0 which I'm not sure if it's good or bad

Given that the PSDB passes with the changes, I assume they're here for a reason, but Juan cannot tell us why right now, so I don't know what the reasoning is.

Contributor:

I think the effect of setNonKernelFunctionInitialState is, roughly, to say that all the wait counters are in an unknown state, instead of known to be 0. It seems very odd to do this immediately after emitting s_waitcnt 0.

Contributor:

Actually, thinking about this some more, it is probably a good idea to set vscnt (only) to "unknown", since we do not insert a wait for vscnt on function entry. I guess that would undo most or all of the changes you saw.

Contributor Author:

Is this fine for now then?

Contributor

In this patch I would prefer that setNonKernelFunctionInitialState only sets vscnt to unknown, and leaves the other counters as known to be zero. Does that change affect the tests?

Also, I think it might need to do Brackets.setPendingEvent(SCRATCH_WRITE_ACCESS) (like in #73122) to indicate that there might be outstanding scratch writes. Does that affect the tests? I'm not sure whether it also needs to set the other pending event bits.

Contributor

(Or if you prefer we could try to get #73122 finished first, and then revisit the current patch.)

Contributor Author

In this patch I would prefer that setNonKernelFunctionInitialState only sets vscnt to unknown, and leaves the other counters as known to be zero. Does that change affect the tests?

With

    setScoreUB(VS_CNT, getWaitCountMax(VS_CNT));
    PendingEvents |= WaitEventMaskForInst[VS_CNT];

Only CodeGen/AMDGPU/vgpr-descriptor-waterfall-loop-idom-update.ll changes; everything else stays the same.

I have no strong preference on whether to finish this patch or the other one first; I just picked it up and I'm still learning about InsertWaitCnt myself. Though, if this patch has a lot more improvements than regressions, I think we should land it first and add a TODO for the remaining bad cases.


Hello!

In an initial version of the patch, before we introduced the soft waitcnts, setting the counters to 0 in the initial state would end up removing the "s_waitcnt 0" placed at non-kernel function entry. This is not needed anymore since we make the distinction between soft/strong waitcnts. However, we still have to do it for vscnt, as Jay said.

@Pierre-vh
Contributor Author

Gentle ping, it'd be nice to land this before the end of the year :)

@arsenm (Contributor) left a comment

Is there any interaction with #73122, given the change to no longer auto-adding 0 on entry?

@jayfoad

jayfoad commented Dec 6, 2023

Is there any interaction with #73122, given the change to no longer auto-adding 0 on entry?

Yes there is definitely some overlap, but given that #73122 is proving harder to complete than I thought it would be, it doesn't seem fair to make it a prerequisite for this patch.

@Pierre-vh
Contributor Author

gentle ping
If possible I'd like to land this before I'm OOO (next 2 weeks); if it's not possible to review this today or tomorrow, let me know :)

Resolved review threads:

  • llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
  • llvm/lib/Target/AMDGPU/SIInstrInfo.h (outdated)
  • llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (three outdated threads)
if (!SIInstrInfo::isSoftWaitcnt(Opcode))
return false;

Waitcnt->setDesc(TII->get(SIInstrInfo::getNonSoftWaitcntOpcode(Opcode)));
Contributor

Why do we need this here as well as in pseudoToMCOpcode?

Note that SIInsertWaitcnts is iterative, so it may process the same block several times. If that happens, then by the time it runs the second time on a block, all remaining soft waitcnts will have been converted to hard ones by the first pass. Is that desirable for some reason?

Contributor Author

Why do we need this here as well as in pseudoToMCOpcode?

So soft waitcnts can be lowered into MCInsts.
When I picked up this patch, IIRC, I tried to remove the one in pseudoToMCOpcode (and make soft waitcnts illegal to lower to MCInst), but it didn't work. I can try again if you want.

Note that SIInsertWaitcnts is iterative, so it may process the same block several times. If that happens, then by the time it runs the second time on a block, all remaining soft waitcnts will have been converted to hard ones by the first pass. Is that desirable for some reason?

Not sure, I'll think about it

Contributor Author

Note that SIInsertWaitcnts is iterative, so it may process the same block several times. If that happens, then by the time it runs the second time on a block, all remaining soft waitcnts will have been converted to hard ones by the first pass. Is that desirable for some reason?

So my assumption is that it's a normal process. The first iteration will either eliminate the unneeded waitcnts, or make them non-soft. Further iterations on the block should act like they did before when there were no soft waitcnts.

I've renamed this function to also make it clearer because I think its current name was confusing. It's only called when we update a waitcnt (so when we know the waitcnt is needed) in order to "promote" the soft waitcnt into a normal one.

Contributor

I guess it's OK, but it still seems odd to me that we promote soft waitcnts both here and in pseudoToMCOpcode.

SIInsertWaitcnts is required for correctness (it is not just an optimization), so why is the pseudoToMCOpcode part required?

Alternatively, could we stop promoting soft waitcnts here? Or would that somehow change the behaviour of the second (or subsequent) visit to a basic block?

@Pierre-vh
Contributor Author

(Apologies for the noise but I rebased again and squashed commits as well. There are now 2 commits in this PR, one has a separate review.)

Pierre-vh added a commit that referenced this pull request Dec 15, 2023
@jayfoad (Contributor) left a comment

I think this is OK to commit now.

I'm still a little confused about why we promote soft waitcnts in two places, but at worst one of them is redundant and everything should still work correctly.

@Pierre-vh Pierre-vh merged commit ef067f5 into llvm:main Dec 15, 2023
4 checks passed
@Pierre-vh Pierre-vh deleted the swaitcnt-zero branch December 15, 2023 11:33
@perlfu

perlfu commented Dec 17, 2023

@Pierre-vh I am seeing some regressions from this change causing generation of extra unnecessary s_waitcnt instructions.

See attached test: llc -march=amdgcn -start-before=si-insert-waitcnts -mcpu=gfx1030 -verify-machineinstrs -mattr=+wavefrontsize32,-wavefrontsize64 -o - test.mir

	s_waitcnt vmcnt(0)
	v_mov_b32_e32 v0, 0
	s_waitcnt vmcnt(1)                   <-- new extra s_waitcnt
	v_readfirstlane_b32 s8, v7
	v_readfirstlane_b32 s9, v8
	v_readfirstlane_b32 s10, v9
	v_readfirstlane_b32 s11, v10
	v_readfirstlane_b32 s12, v11
	v_readfirstlane_b32 s13, v12
	v_readfirstlane_b32 s14, v13
	v_readfirstlane_b32 s15, v14
	s_waitcnt vmcnt(0)                   <-- new extra s_waitcnt
	v_readfirstlane_b32 s16, v15
	v_readfirstlane_b32 s17, v16
	v_readfirstlane_b32 s18, v17
	v_readfirstlane_b32 s19, v18
	v_mov_b32_e32 v1, v0

test.mir.gz

@jayfoad

jayfoad commented Dec 18, 2023

Thanks for the test case! This is a great example of why we should not promote soft waitcnts in applyPreexistingWaitcnt.

The full codegen looks like this:

; %bb.0:
        s_mov_b32 s4, exec_lo
        s_wqm_b32 exec_lo, exec_lo
        v_mov_b32_e32 v7, v2
        v_ashrrev_i32_e32 v4, 31, v3
        v_mov_b32_e32 v6, v1
        v_mov_b32_e32 v5, v0
        v_ashrrev_i32_e32 v8, 31, v7
        v_lshlrev_b64 v[2:3], 4, v[3:4]
        v_lshlrev_b64 v[0:1], 5, v[7:8]
        v_add_co_u32 v0, vcc_lo, s0, v0
        v_add_co_ci_u32_e32 v1, vcc_lo, s1, v1, vcc_lo
        v_add_co_u32 v2, vcc_lo, s2, v2
        v_add_co_ci_u32_e32 v3, vcc_lo, s3, v3, vcc_lo
        s_clause 0x1
        global_load_dwordx4 v[11:14], v[0:1], off offset:16
        global_load_dwordx4 v[7:10], v[0:1], off
        global_load_dwordx4 v[15:18], v[2:3], off
        s_mov_b32 s0, exec_lo
.LBB0_1:                                ; =>This Inner Loop Header: Depth=1
        v_readfirstlane_b32 s1, v5
        v_readfirstlane_b32 s2, v6
        v_cmp_eq_u32_e64 s1, s1, v5
        v_cmp_eq_u32_e64 s2, s2, v6
        s_and_b32 s1, s1, s2
        s_and_saveexec_b32 s1, s1
        s_waitcnt vmcnt(0)
        v_mov_b32_e32 v0, 0
        s_waitcnt vmcnt(1)
        v_readfirstlane_b32 s8, v7
        v_readfirstlane_b32 s9, v8
        v_readfirstlane_b32 s10, v9
        v_readfirstlane_b32 s11, v10
        v_readfirstlane_b32 s12, v11
        v_readfirstlane_b32 s13, v12
        v_readfirstlane_b32 s14, v13
        v_readfirstlane_b32 s15, v14
        s_waitcnt vmcnt(0)
        v_readfirstlane_b32 s16, v15
        v_readfirstlane_b32 s17, v16
        v_readfirstlane_b32 s18, v17
        v_readfirstlane_b32 s19, v18
        v_mov_b32_e32 v1, v0
                                        ; implicit-def: $vgpr5
                                        ; implicit-def: $vgpr6
                                        ; implicit-def: $vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14
                                        ; implicit-def: $vgpr15_vgpr16_vgpr17_vgpr18
        image_sample v[0:3], v[0:1], s[8:15], s[16:19] dmask:0xf dim:SQ_RSRC_IMG_2D
        s_xor_b32 exec_lo, exec_lo, s1
        s_cbranch_execnz .LBB0_1
; %bb.2:
        s_mov_b32 exec_lo, s0
        s_and_b32 exec_lo, exec_lo, s4
        s_waitcnt vmcnt(0)
        ; return to shader part epilog

The first time we visit LBB0_1, we have only seen the bb.0 predecessor so we only know about waits due to the global_load_dwordx4, and we insert the "s_waitcnt vmcnt(1)" and "s_waitcnt vmcnt(0)" that you highlighted as redundant.

The second time we visit LBB0_1, we have seen the image_sample instruction in the LBB0_1 predecessor, and we insert the "s_waitcnt vmcnt(0)" before the "v_mov_b32_e32 v0, 0". This renders the other two waitcnts redundant, but we can't remove them because we have already promoted them to "hard" waitcnts.

@jayfoad

jayfoad commented Dec 18, 2023

See #75785 for a fix.

jayfoad added a commit to jayfoad/llvm-project that referenced this pull request Apr 30, 2024
Since llvm#72830 the memory legalizer tests have not shown s_waitcnt
instructions inserted by SIMemoryLegalizer because they have mostly been
removed by SIInsertWaitcnts.

Checking the MIR immediately after SIMemoryLegalizer runs fixes this so
you can see exactly what the pass has inserted.