
AMDGPU/GlobalISel: lane masks merging #73337

Merged
merged 1 commit on Feb 29, 2024

Conversation

petar-avramovic
Collaborator

@petar-avramovic petar-avramovic commented Nov 24, 2023

Basic implementation of lane mask merging for GlobalISel.
Lane masks in GlobalISel are registers with an SGPR register class
and S1 LLT, as required by machine uniformity analysis.
This implements the equivalent of lowerPhis from SILowerI1Copies.cpp in:
patch 1: #75340
patch 2: #75349
patch 3: #80003
patch 4: #78431
patch 5: is in this commit:

AMDGPU/GlobalISelDivergenceLowering: constrain incoming registers

Previously, in PHIs that represent lane masks, incoming registers
taken as-is were not selected as lane masks. Such registers are not
merged with another lane mask and most often only have an S1 LLT.
Implement constrainAsLaneMask by constraining incoming registers
taken as-is with lane mask attributes, essentially transforming them
into lane masks. This is the final step in making PHI instructions
created in this pass fully instruction-selected.
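
For reference, the merge that lowerPhis builds for each divergent i1 phi (buildMergeLaneMasks in the diff below) keeps the previous value of the mask in inactive lanes and takes the new value in active lanes. A minimal wave32 sketch, with illustrative register names:

  %prev_masked = S_ANDN2_B32 %prev_copy, $exec_lo
  %cur_masked  = S_AND_B32   $exec_lo, %cur_copy
  %dst         = S_OR_B32    %prev_masked, %cur_masked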

@llvmbot
Collaborator

llvmbot commented Nov 24, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Petar Avramovic (petar-avramovic)

Changes

Basic implementation of lane mask merging for GlobalISel.
Lane masks in GlobalISel are registers with an SGPR register class and S1 LLT, as required by machine uniformity analysis.
This PR correctly handles all cases involving lane mask phis. There are some improvements to be made in code size.
There is a notable improvement in testing with GlobalISel.
I will open separate PRs for all patches.
The last three patches were meant to go together, but I separated them here to show their effects in MIR tests (the last two are additions to how SDAG lowers phis and do not affect phi lowering in SDAG).


Patch is 279.74 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73337.diff

21 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineSSAUpdater.h (+2-3)
  • (modified) llvm/lib/CodeGen/MachineSSAUpdater.cpp (+19-25)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+4)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp (+213)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+32)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp (+169-150)
  • (added) llvm/lib/Target/AMDGPU/SILowerI1Copies.h (+79)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+345)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+640)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+571)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+1085)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+514)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+1167)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+188)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+358)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+37)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (+70)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+25-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll (+1)
diff --git a/llvm/include/llvm/CodeGen/MachineSSAUpdater.h b/llvm/include/llvm/CodeGen/MachineSSAUpdater.h
index bbd09d7d151ba07..765cabdb313097b 100644
--- a/llvm/include/llvm/CodeGen/MachineSSAUpdater.h
+++ b/llvm/include/llvm/CodeGen/MachineSSAUpdater.h
@@ -40,8 +40,8 @@ class MachineSSAUpdater {
   //typedef DenseMap<MachineBasicBlock*, Register> AvailableValsTy;
   void *AV = nullptr;
 
-  /// VRC - Register class of the current virtual register.
-  const TargetRegisterClass *VRC = nullptr;
+  /// RegAttrs - current virtual register, new registers copy its attributes.
+  Register RegAttrs;
 
   /// InsertedPHIs - If this is non-null, the MachineSSAUpdater adds all PHI
   /// nodes that it creates to the vector.
@@ -62,7 +62,6 @@ class MachineSSAUpdater {
   /// Initialize - Reset this object to get ready for a new set of SSA
   /// updates.
   void Initialize(Register V);
-  void Initialize(const TargetRegisterClass *RC);
 
   /// AddAvailableValue - Indicate that a rewritten value is available at the
   /// end of the specified block with the specified value.
diff --git a/llvm/lib/CodeGen/MachineSSAUpdater.cpp b/llvm/lib/CodeGen/MachineSSAUpdater.cpp
index 48076663ddf5382..48537057e2031a8 100644
--- a/llvm/lib/CodeGen/MachineSSAUpdater.cpp
+++ b/llvm/lib/CodeGen/MachineSSAUpdater.cpp
@@ -51,17 +51,13 @@ MachineSSAUpdater::~MachineSSAUpdater() {
 
 /// Initialize - Reset this object to get ready for a new set of SSA
 /// updates.
-void MachineSSAUpdater::Initialize(const TargetRegisterClass *RC) {
+void MachineSSAUpdater::Initialize(Register V) {
   if (!AV)
     AV = new AvailableValsTy();
   else
     getAvailableVals(AV).clear();
 
-  VRC = RC;
-}
-
-void MachineSSAUpdater::Initialize(Register V) {
-  Initialize(MRI->getRegClass(V));
+  RegAttrs = V;
 }
 
 /// HasValueForBlock - Return true if the MachineSSAUpdater already has a value for
@@ -115,13 +111,12 @@ Register LookForIdenticalPHI(MachineBasicBlock *BB,
 /// InsertNewDef - Insert an empty PHI or IMPLICIT_DEF instruction which define
 /// a value of the given register class at the start of the specified basic
 /// block. It returns the virtual register defined by the instruction.
-static
-MachineInstrBuilder InsertNewDef(unsigned Opcode,
-                           MachineBasicBlock *BB, MachineBasicBlock::iterator I,
-                           const TargetRegisterClass *RC,
-                           MachineRegisterInfo *MRI,
-                           const TargetInstrInfo *TII) {
-  Register NewVR = MRI->createVirtualRegister(RC);
+static MachineInstrBuilder InsertNewDef(unsigned Opcode, MachineBasicBlock *BB,
+                                        MachineBasicBlock::iterator I,
+                                        Register RegAttrs,
+                                        MachineRegisterInfo *MRI,
+                                        const TargetInstrInfo *TII) {
+  Register NewVR = MRI->cloneVirtualRegister(RegAttrs);
   return BuildMI(*BB, I, DebugLoc(), TII->get(Opcode), NewVR);
 }
 
@@ -158,9 +153,9 @@ Register MachineSSAUpdater::GetValueInMiddleOfBlock(MachineBasicBlock *BB,
     if (ExistingValueOnly)
       return Register();
     // Insert an implicit_def to represent an undef value.
-    MachineInstr *NewDef = InsertNewDef(TargetOpcode::IMPLICIT_DEF,
-                                        BB, BB->getFirstTerminator(),
-                                        VRC, MRI, TII);
+    MachineInstr *NewDef =
+        InsertNewDef(TargetOpcode::IMPLICIT_DEF, BB, BB->getFirstTerminator(),
+                     RegAttrs, MRI, TII);
     return NewDef->getOperand(0).getReg();
   }
 
@@ -197,8 +192,8 @@ Register MachineSSAUpdater::GetValueInMiddleOfBlock(MachineBasicBlock *BB,
 
   // Otherwise, we do need a PHI: insert one now.
   MachineBasicBlock::iterator Loc = BB->empty() ? BB->end() : BB->begin();
-  MachineInstrBuilder InsertedPHI = InsertNewDef(TargetOpcode::PHI, BB,
-                                                 Loc, VRC, MRI, TII);
+  MachineInstrBuilder InsertedPHI =
+      InsertNewDef(TargetOpcode::PHI, BB, Loc, RegAttrs, MRI, TII);
 
   // Fill in all the predecessors of the PHI.
   for (unsigned i = 0, e = PredValues.size(); i != e; ++i)
@@ -300,10 +295,9 @@ class SSAUpdaterTraits<MachineSSAUpdater> {
   static Register GetUndefVal(MachineBasicBlock *BB,
                               MachineSSAUpdater *Updater) {
     // Insert an implicit_def to represent an undef value.
-    MachineInstr *NewDef = InsertNewDef(TargetOpcode::IMPLICIT_DEF,
-                                        BB, BB->getFirstNonPHI(),
-                                        Updater->VRC, Updater->MRI,
-                                        Updater->TII);
+    MachineInstr *NewDef =
+        InsertNewDef(TargetOpcode::IMPLICIT_DEF, BB, BB->getFirstNonPHI(),
+                     Updater->RegAttrs, Updater->MRI, Updater->TII);
     return NewDef->getOperand(0).getReg();
   }
 
@@ -312,9 +306,9 @@ class SSAUpdaterTraits<MachineSSAUpdater> {
   static Register CreateEmptyPHI(MachineBasicBlock *BB, unsigned NumPreds,
                                  MachineSSAUpdater *Updater) {
     MachineBasicBlock::iterator Loc = BB->empty() ? BB->end() : BB->begin();
-    MachineInstr *PHI = InsertNewDef(TargetOpcode::PHI, BB, Loc,
-                                     Updater->VRC, Updater->MRI,
-                                     Updater->TII);
+    MachineInstr *PHI =
+        InsertNewDef(TargetOpcode::PHI, BB, Loc, Updater->RegAttrs,
+                     Updater->MRI, Updater->TII);
     return PHI->getOperand(0).getReg();
   }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 323560a46f31de2..007d64944244a09 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -36,6 +36,7 @@ FunctionPass *createSIAnnotateControlFlowPass();
 FunctionPass *createSIFoldOperandsPass();
 FunctionPass *createSIPeepholeSDWAPass();
 FunctionPass *createSILowerI1CopiesPass();
+FunctionPass *createAMDGPUGlobalISelDivergenceLoweringPass();
 FunctionPass *createSIShrinkInstructionsPass();
 FunctionPass *createSILoadStoreOptimizerPass();
 FunctionPass *createSIWholeQuadModePass();
@@ -162,6 +163,9 @@ extern char &SILowerWWMCopiesID;
 void initializeSILowerI1CopiesPass(PassRegistry &);
 extern char &SILowerI1CopiesID;
 
+void initializeAMDGPUGlobalISelDivergenceLoweringPass(PassRegistry &);
+extern char &AMDGPUGlobalISelDivergenceLoweringID;
+
 void initializeSILowerSGPRSpillsPass(PassRegistry &);
 extern char &SILowerSGPRSpillsID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp
new file mode 100644
index 000000000000000..a955a08b0a3776e
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp
@@ -0,0 +1,213 @@
+//===-- AMDGPUGlobalISelDivergenceLowering.cpp ----------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// GlobalISel pass that selects divergent i1 phis as lane mask phis.
+// Lane mask merging uses same algorithm as SDAG in SILowerI1Copies.
+// Handles all cases of temporal divergence.
+//
+// For divergent non-phi i1 and uniform i1 uses outside of the cycle this pass
+// currently depends on LCSSA to insert phis with one incoming.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "SILowerI1Copies.h"
+#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
+#include "llvm/InitializePasses.h"
+
+#define DEBUG_TYPE "global-isel-divergence-lowering"
+
+using namespace llvm;
+
+namespace {
+
+class AMDGPUGlobalISelDivergenceLowering : public MachineFunctionPass {
+public:
+  static char ID;
+
+public:
+  AMDGPUGlobalISelDivergenceLowering() : MachineFunctionPass(ID) {
+    initializeAMDGPUGlobalISelDivergenceLoweringPass(
+        *PassRegistry::getPassRegistry());
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override;
+
+  StringRef getPassName() const override {
+    return "GlobalISel divergence lowering";
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+    AU.addRequired<MachineCycleInfoWrapperPass>();
+    AU.addRequired<MachineDominatorTree>();
+    AU.addRequired<MachinePostDominatorTree>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+
+class DivergenceLoweringHelper : public PhiLoweringHelper {
+public:
+  DivergenceLoweringHelper(MachineFunction *MF, MachineDominatorTree *DT,
+                           MachinePostDominatorTree *PDT,
+                           MachineUniformityInfo *MUI);
+
+private:
+  MachineUniformityInfo *MUI = nullptr;
+
+public:
+  void markAsLaneMask(Register DstReg) const override;
+  void getCandidatesForLowering(
+      SmallVectorImpl<MachineInstr *> &Vreg1Phis) const override;
+  void collectIncomingValuesFromPhi(
+      const MachineInstr *MI,
+      SmallVectorImpl<Incoming> &Incomings) const override;
+  void replaceDstReg(Register NewReg, Register OldReg,
+                     MachineBasicBlock *MBB) override;
+  void buildMergeLaneMasks(MachineBasicBlock &MBB,
+                           MachineBasicBlock::iterator I, const DebugLoc &DL,
+                           Register DstReg, Register PrevReg,
+                           Register CurReg) override;
+  void constrainIncomingRegisterTakenAsIs(Incoming &In) override;
+};
+
+DivergenceLoweringHelper::DivergenceLoweringHelper(
+    MachineFunction *MF, MachineDominatorTree *DT,
+    MachinePostDominatorTree *PDT, MachineUniformityInfo *MUI)
+    : PhiLoweringHelper(MF, DT, PDT), MUI(MUI) {}
+
+// _(s1) -> SReg_32/64(s1)
+void DivergenceLoweringHelper::markAsLaneMask(Register DstReg) const {
+  assert(MRI->getType(DstReg) == LLT::scalar(1));
+
+  // Can't blindly set a register class on phi, users could have reg class
+  // constraints (e.g. sreg_32/64..._xexec classes for control flow intrinsics).
+  if (MRI->getRegClassOrNull(DstReg))
+    return;
+
+  MRI->setRegClass(DstReg, ST->getBoolRC());
+  return;
+}
+
+void DivergenceLoweringHelper::getCandidatesForLowering(
+    SmallVectorImpl<MachineInstr *> &Vreg1Phis) const {
+  LLT S1 = LLT::scalar(1);
+
+  // Add divergent i1 phis to the list
+  for (MachineBasicBlock &MBB : *MF) {
+    for (MachineInstr &MI : MBB.phis()) {
+      Register Dst = MI.getOperand(0).getReg();
+      if (MRI->getType(Dst) == S1 && MUI->isDivergent(Dst))
+        Vreg1Phis.push_back(&MI);
+    }
+  }
+
+  return;
+}
+
+void DivergenceLoweringHelper::collectIncomingValuesFromPhi(
+    const MachineInstr *MI, SmallVectorImpl<Incoming> &Incomings) const {
+  for (unsigned i = 1; i < MI->getNumOperands(); i += 2) {
+    Incomings.emplace_back(MI->getOperand(i).getReg(),
+                           MI->getOperand(i + 1).getMBB(), Register());
+  }
+}
+
+void DivergenceLoweringHelper::replaceDstReg(Register NewReg, Register OldReg,
+                                             MachineBasicBlock *MBB) {
+  BuildMI(*MBB, MBB->getFirstNonPHI(), {}, TII->get(AMDGPU::COPY), OldReg)
+      .addReg(NewReg);
+}
+
+// Get pointers to build instruction just after MI (skips phis if needed)
+static std::pair<MachineBasicBlock *, MachineBasicBlock::iterator>
+getInsertAfterPtrs(MachineInstr *MI) {
+  MachineBasicBlock *InsertMBB = MI->getParent();
+  return std::make_pair(
+      InsertMBB, InsertMBB->SkipPHIsAndLabels(std::next(MI->getIterator())));
+}
+
+void DivergenceLoweringHelper::buildMergeLaneMasks(
+    MachineBasicBlock &MBB, MachineBasicBlock::iterator I, const DebugLoc &DL,
+    Register DstReg, Register PrevReg, Register CurReg) {
+  // TODO: check if inputs are constants or results of a compare.
+
+  Register PrevRegCopy = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  auto [PrevMBB, AfterPrevReg] = getInsertAfterPtrs(MRI->getVRegDef(PrevReg));
+  BuildMI(*PrevMBB, AfterPrevReg, DL, TII->get(AMDGPU::COPY), PrevRegCopy)
+      .addReg(PrevReg);
+  Register PrevMaskedReg = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  BuildMI(MBB, I, DL, TII->get(AndN2Op), PrevMaskedReg)
+      .addReg(PrevRegCopy)
+      .addReg(ExecReg);
+
+  Register CurRegCopy = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  auto [CurMBB, AfterCurReg] = getInsertAfterPtrs(MRI->getVRegDef(CurReg));
+  BuildMI(*CurMBB, AfterCurReg, DL, TII->get(AMDGPU::COPY), CurRegCopy)
+      .addReg(CurReg);
+  Register CurMaskedReg = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  BuildMI(MBB, I, DL, TII->get(AndOp), CurMaskedReg)
+      .addReg(ExecReg)
+      .addReg(CurRegCopy);
+
+  BuildMI(MBB, I, DL, TII->get(OrOp), DstReg)
+      .addReg(PrevMaskedReg)
+      .addReg(CurMaskedReg);
+
+  return;
+}
+
+// GlobalISel has to constrain S1 incoming taken as-is with lane mask register
+// class. Insert a copy of Incoming.Reg to new lane mask inside Incoming.Block,
+// Incoming.Reg becomes that new lane mask.
+void DivergenceLoweringHelper::constrainIncomingRegisterTakenAsIs(
+    Incoming &In) {
+  MachineIRBuilder B(*MF);
+  B.setInsertPt(*In.Block, In.Block->getFirstTerminator());
+
+  auto Copy = B.buildCopy(LLT::scalar(1), In.Reg);
+  MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
+  In.Reg = Copy.getReg(0);
+
+  return;
+}
+
+} // End anonymous namespace.
+
+INITIALIZE_PASS_BEGIN(AMDGPUGlobalISelDivergenceLowering, DEBUG_TYPE,
+                      "GlobalISel divergence lowering", false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineCycleInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
+INITIALIZE_PASS_DEPENDENCY(MachinePostDominatorTree)
+INITIALIZE_PASS_END(AMDGPUGlobalISelDivergenceLowering, DEBUG_TYPE,
+                    "GlobalISel divergence lowering", false, false)
+
+char AMDGPUGlobalISelDivergenceLowering::ID = 0;
+
+char &llvm::AMDGPUGlobalISelDivergenceLoweringID =
+    AMDGPUGlobalISelDivergenceLowering::ID;
+
+FunctionPass *llvm::createAMDGPUGlobalISelDivergenceLoweringPass() {
+  return new AMDGPUGlobalISelDivergenceLowering();
+}
+
+bool AMDGPUGlobalISelDivergenceLowering::runOnMachineFunction(
+    MachineFunction &MF) {
+  MachineCycleInfo &CycleInfo =
+      getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
+  MachineDominatorTree &DomTree = getAnalysis<MachineDominatorTree>();
+
+  MachineUniformityInfo MUI =
+      computeMachineUniformityInfo(MF, CycleInfo, DomTree.getBase(), true);
+
+  DivergenceLoweringHelper Helper(
+      &MF, &DomTree, &getAnalysis<MachinePostDominatorTree>(), &MUI);
+
+  Helper.lowerPhis();
+  return true;
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index b772efe04c7141d..3146f7f5a218566 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -207,7 +207,39 @@ bool AMDGPUInstructionSelector::selectCOPY(MachineInstr &I) const {
   return true;
 }
 
+bool isLaneMask(Register Reg, MachineRegisterInfo *MRI,
+                const SIRegisterInfo &TRI) {
+  if (MRI->getType(Reg) != LLT::scalar(1))
+    return false;
+  const TargetRegisterClass *RC = MRI->getRegClassOrNull(Reg);
+  if (!RC || !TRI.isSGPRClass(RC))
+    return false;
+
+  return true;
+}
+
+// PHI where all register operands are sgpr(register class) with S1 LLT.
+bool isLaneMaskPhi(MachineInstr &I, MachineRegisterInfo *MRI,
+                   const SIRegisterInfo &TRI) {
+  if (I.getOpcode() != AMDGPU::PHI)
+    return false;
+
+  if (!isLaneMask(I.getOperand(0).getReg(), MRI, TRI))
+    return false;
+
+  for (unsigned i = 1, e = I.getNumOperands(); i != e; i += 2) {
+    if (!isLaneMask(I.getOperand(i).getReg(), MRI, TRI))
+      return false;
+  }
+
+  return true;
+}
+
 bool AMDGPUInstructionSelector::selectPHI(MachineInstr &I) const {
+  // Already selected in divergence lowering pass
+  if (isLaneMaskPhi(I, MRI, TRI))
+    return true;
+
   const Register DefReg = I.getOperand(0).getReg();
   const LLT DefTy = MRI->getType(DefReg);
   if (DefTy == LLT::scalar(1)) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 0c38fa32c6f33a8..1d0be8984604da2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -375,6 +375,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUDAGToDAGISelPass(*PR);
   initializeGCNDPPCombinePass(*PR);
   initializeSILowerI1CopiesPass(*PR);
+  initializeAMDGPUGlobalISelDivergenceLoweringPass(*PR);
   initializeSILowerWWMCopiesPass(*PR);
   initializeSILowerSGPRSpillsPass(*PR);
   initializeSIFixSGPRCopiesPass(*PR);
@@ -1255,6 +1256,7 @@ bool GCNPassConfig::addLegalizeMachineIR() {
 void GCNPassConfig::addPreRegBankSelect() {
   bool IsOptNone = getOptLevel() == CodeGenOptLevel::None;
   addPass(createAMDGPUPostLegalizeCombiner(IsOptNone));
+  addPass(createAMDGPUGlobalISelDivergenceLoweringPass());
 }
 
 bool GCNPassConfig::addRegBankSelect() {
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 53a33f8210d2a84..2c92e7a07388553 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -55,6 +55,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUCtorDtorLowering.cpp
   AMDGPUExportClustering.cpp
   AMDGPUFrameLowering.cpp
+  AMDGPUGlobalISelDivergenceLowering.cpp
   AMDGPUGlobalISelUtils.cpp
   AMDGPUHSAMetadataStreamer.cpp
   AMDGPUInsertDelayAlu.cpp
diff --git a/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp b/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp
index 68c8f4024e73007..59c05bab531f4b5 100644
--- a/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp
@@ -21,57 +21,26 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "SILowerI1Copies.h"
 #include "AMDGPU.h"
-#include "GCNSubtarget.h"
-#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
-#include "llvm/CodeGen/MachineDominators.h"
-#include "llvm/CodeGen/MachineFunctionPass.h"
-#include "llvm/CodeGen/MachinePostDominators.h"
 #include "llvm/CodeGen/MachineSSAUpdater.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/Target/CGPassBuilderOption.h"
 
 #define DEBUG_TYPE "si-i1-copies"
 
 using namespace llvm;
 
-static unsigned createLaneMaskReg(MachineFunction &MF);
-static unsigned insertUndefLaneMask(MachineBasicBlock &MBB);
+static Register insertUndefLaneMask(MachineBasicBlock *MBB,
+                                    MachineRegisterInfo *MRI,
+                                    Register *LaneMaskRegAttrs);
 
 namespace {
 
-struct Incoming {
-  Register Reg;
-  MachineBasicBlock *Block;
-  Register UpdatedReg;
-
-  Incoming(Register Reg, MachineBasicBlock *Block, Register UpdatedReg)
-      : Reg(Reg), Block(Block), UpdatedReg(UpdatedReg) {}
-};
-
 class SILowerI1Copies : public MachineFunctionPass {
 public:
   static char ID;
 
-private:
-  bool IsWave32 = false;
-  MachineFunction *MF = nullptr;
-  MachineDominatorTree *DT = nullptr;
-  MachinePostDominatorTree *PDT = nullptr;
-  MachineRegisterInfo *MRI = nullptr;
-  const GCNSubtarget *ST = nullptr;
-  const SIInstrInfo *TII = nullptr;
-
-  unsigned ExecReg;
-  unsigned MovOp;
-  unsigned AndOp;
-  unsigned OrOp;
-  unsigned XorOp;
-  unsigned AndN2Op;
-  unsigned OrN2Op;
-
-  DenseSet<unsigned> ConstrainRegs;
-
-public:
   SILowerI1Copies() : MachineFunctionPass(ID) {
     initializeSILowerI1CopiesPass(*PassRegistry::getPassRegistry());
   }
@@ -86,29 +55,53 @@ class SILowerI1Copies : public MachineFunctionPass {
     AU.addRequired<MachinePostDominatorTree>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
+};
+
+class Vreg1LoweringHelper : public PhiLoweringHelper {
+public:
+  Vreg1LoweringHelper(MachineFunction *MF, MachineDominatorTree *DT,
+                      MachinePostDominatorTree *PDT);
 
 private:
-  bool lowerCopiesFromI1();
-  bool lowerPhis();
-  bool lowerCopiesToI1();
-  bool isConstantLaneMask(Register Reg, bool &Val) const;
+  DenseSet<Register> ConstrainRegs;
+
+public:
+  void markAsLaneMask(Register DstReg) const...
[truncated]

@llvmbot
Collaborator

llvmbot commented Nov 24, 2023

@llvm/pr-subscribers-llvm-globalisel


#include "AMDGPU.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

#define DEBUG_TYPE "global-isel-divergence-lowering"
Collaborator Author

Do you have better suggestions for the pass name?

Contributor

Should get an amdgpu- prefix

Contributor

amdgpu-gi-divergence-lowering to keep it short?
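
For illustration, that suggestion as a one-line change:

  #define DEBUG_TYPE "amdgpu-gi-divergence-lowering"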

// Incoming.Reg becomes that new lane mask.
void DivergenceLoweringHelper::constrainIncomingRegisterTakenAsIs(
Incoming &In) {
MachineIRBuilder B(*MF);
Contributor

Don't construct a fresh MIR builder, use the same one throughout the pass
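
A sketch of that suggestion (assumed shape, not the code in this PR): keep one builder as a helper member, constructed once, and only move its insertion point per use:

  // Hypothetical member of DivergenceLoweringHelper, set up once with
  // B.setMF(*MF); each use then only repositions the insertion point.
  B.setInsertPt(*In.Block, In.Block->getFirstTerminator());
  auto Copy = B.buildCopy(LLT::scalar(1), In.Reg);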

// PHI where all register operands are sgpr(register class) with S1 LLT.
bool isLaneMaskPhi(MachineInstr &I, MachineRegisterInfo *MRI,
const SIRegisterInfo &TRI) {
if (I.getOpcode() != AMDGPU::PHI)
Contributor

G_PHI. Not sure we should bother handling non-G_* phi at all, but isPHI also works

Collaborator Author

This was meant to verify that lane mask PHIs were correctly selected if they had already been selected before instruction selection. The idea was not to split selection of lane mask PHIs across multiple passes.

Collaborator

So the point is that if we see a PHI here (as opposed to a G_PHI), that's because it has already been selected.

Having this as a separate helper is confusing IMHO. Better to inline a check against PHI into selectPHI, and add an assert there that the defined register is a lane mask in that case.
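
A sketch of that inlined form (hypothetical code, not from this PR):

  // In selectPHI: a PHI (as opposed to a G_PHI) can only appear here because
  // divergence lowering already selected it; assert it really is a lane mask.
  if (I.getOpcode() == AMDGPU::PHI) {
    assert(isLaneMask(I.getOperand(0).getReg(), MRI, TRI));
    return true;
  }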

Collaborator Author

So the point is that if we see a PHI here (as opposed to a G_PHI), that's because it has already been selected.

Yes, at the moment, but I did not want to be that restrictive and add an assert.
It was meant to report a selected lane mask PHI as already selected.
If it was not properly instruction-selected, leave it (ideally for selection to fail).
There will be changes to how we select phis later, so maybe we can add an assert then.
Can we inline it for now with a return?


; There is a divergent, according to machine uniformity info, g_brcond branch
; here, not lowered to si_if because of "amdgpu-flat-work-group-size"="1,1".
define dllexport amdgpu_cs void @_amdgpu_cs_main( i32 inreg noundef %.userdata0, <3 x i32> inreg noundef %.WorkgroupId, <3 x i32> noundef %.LocalInvocationId) #0 {
Contributor

Don't need all the dllexports

Contributor

@Pierre-vh Pierre-vh left a comment

Some more docs in AMDGPUGlobalISelDivergenceLowering.cpp would be nice.
Seems like it's going to be a core part of the GlobalISel pipeline so IMO it should be well documented so it's not too hard to understand and contribute to.

bool runOnMachineFunction(MachineFunction &MF) override;

StringRef getPassName() const override {
return "GlobalISel divergence lowering";
Contributor

Needs AMDGPU prefix too

#include "AMDGPU.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

#define DEBUG_TYPE "global-isel-divergence-lowering"
Contributor

amdgpu-gi-divergence-lowering to keep it short?

Comment on lines 8 to 13
// GlobalISel pass that selects divergent i1 phis as lane mask phis.
// Lane mask merging uses same algorithm as SDAG in SILowerI1Copies.
// Handles all cases of temporal divergence.
//
// For divergent non-phi i1 and uniform i1 uses outside of the cycle this pass
// currently depends on LCSSA to insert phis with one incoming.
Contributor

really small nit: Use three slashes + start with \file, first line should be /// \file
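
For illustration, the first doc lines of the file would then read:

  /// \file
  /// GlobalISel pass that selects divergent i1 phis as lane mask phis.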

return;

MRI->setRegClass(DstReg, ST->getBoolRC());
return;
Contributor

extra return


void DivergenceLoweringHelper::collectIncomingValuesFromPhi(
const MachineInstr *MI, SmallVectorImpl<Incoming> &Incomings) const {
for (unsigned i = 1; i < MI->getNumOperands(); i += 2) {
Contributor

i -> I (uppercase)

}
}

return;
Contributor

extra return

MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
In.Reg = Copy.getReg(0);

return;
Contributor

extra return

Collaborator

This extra return in constrainIncomingRegisterTakenAsIs should still be removed.

getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
MachineDominatorTree &DomTree = getAnalysis<MachineDominatorTree>();

MachineUniformityInfo MUI =
Contributor

Can't you depend on MachineUniformityAnalysis instead?
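
A sketch of that alternative, assuming the legacy MachineUniformityAnalysisPass wrapper (not the code in this PR):

  AU.addRequired<MachineUniformityAnalysisPass>();
  // ...
  MachineUniformityInfo &MUI =
      getAnalysis<MachineUniformityAnalysisPass>().getUniformityInfo();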

#include "llvm/CodeGen/MachinePostDominators.h"
#include "llvm/CodeGen/MachineSSAUpdater.h"

using namespace llvm;
Contributor

Avoid using namespace in headers


github-actions bot commented Dec 4, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@petar-avramovic petar-avramovic force-pushed the global-isel-lane-masks branch 2 times, most recently from fb97bf5 to e14b035 on December 4, 2023 16:03
@petar-avramovic
Collaborator Author

Ping. Updated for the comments about code quality. Unchanged for the comments about functional changes:
Divergent i1 phis (G_PHIs, to be precise) are fully instruction-selected into PHIs (all registers created in this pass are lane masks with the correct register class and S1 LLT). Check the regression test changes (the last two patches in particular) to get a better idea of what is happening.




define amdgpu_ps void @divergent_i1_phi_uniform_branch(i32 addrspace(1)* %out, i32 %tid, i32 inreg %cond, i32 addrspace(1)* %dummyaddr) {
Contributor

Test should be updated to use opaque pointers
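
With opaque pointers that signature would become (mechanical rewrite, for illustration):

  define amdgpu_ps void @divergent_i1_phi_uniform_branch(ptr addrspace(1) %out, i32 %tid, i32 inreg %cond, ptr addrspace(1) %dummyaddr) {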


; There is a divergent, according to machine uniformity info, g_brcond branch
; here, not lowered to si_if because of "amdgpu-flat-work-group-size"="1,1".
define amdgpu_cs void @_amdgpu_cs_main( i32 inreg noundef %.userdata0, <3 x i32> inreg noundef %.WorkgroupId, <3 x i32> noundef %.LocalInvocationId) #0 {
Contributor

Name the function to indicate it's the 1,1 case

return MRI.createVirtualRegister(ST.isWave32() ? &AMDGPU::SReg_32RegClass
: &AMDGPU::SReg_64RegClass);
Register llvm::createLaneMaskReg(MachineRegisterInfo *MRI,
Register *LaneMaskRegAttrs) {
Contributor

What's the point of these parameters being pointers?

Collaborator Author

The general idea was to crash if Register *LaneMaskRegAttrs was not initialized; it might not be that obvious if an uninitialized Register(0) were used and bad code was generated.

Contributor

A Register() used in just about anything will fail the verifier so that's fine
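For illustration, one pointer-free alternative (a sketch with a hypothetical signature) is to clone every attribute from an existing lane-mask register; MachineRegisterInfo::cloneVirtualRegister copies the register class or bank together with the LLT:

  Register createLaneMaskReg(MachineRegisterInfo &MRI, Register LaneMask) {
    // New lane masks inherit class/bank and LLT from an existing one.
    return MRI.cloneVirtualRegister(LaneMask);
  }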

@@ -895,3 +910,7 @@ void SILowerI1Copies::buildMergeLaneMasks(MachineBasicBlock &MBB,
.addReg(CurMaskedReg ? CurMaskedReg : ExecReg);
}
}

void Vreg1LoweringHelper::constrainIncomingRegisterTakenAsIs(Incoming &In) {
Contributor

What's the point of this function? Don't really understand the name

; GFX10-NEXT: [[S_OR_B32_:%[0-9]+]]:sreg_32(s1) = S_OR_B32 [[S_ANDN2_B32_]](s1), [[S_AND_B32_]](s1), implicit-def $scc
; GFX10-NEXT: {{ $}}
; GFX10-NEXT: bb.2:
; GFX10-NEXT: [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[ICMP]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
; GFX10-NEXT: [[COPY6:%[0-9]+]]:sreg_32(s1) = COPY [[PHI]](s1)
; GFX10-NEXT: [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[COPY4]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
Collaborator Author

What's the point of this function? Don't really understand the name constrainIncomingRegisterTakenAsIs

But here the register operand named COPY4 has the lane-mask register class, so after this patch the lane mask PHI is fully selected. I wanted to avoid having to split selection of lane mask phis across multiple passes.

  ; GFX10-NEXT:   [[ICMP:%[0-9]+]]:_(s1) = G_ICMP intpred(uge), [[COPY2]](s32), [[C]]
  ; GFX10-NEXT:   [[COPY4:%[0-9]+]]:sreg_32(s1) = COPY [[ICMP]](s1)
  ; GFX10-NEXT:   [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[COPY4]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1

; GFX10-NEXT: [[S_OR_B32_:%[0-9]+]]:sreg_32(s1) = S_OR_B32 [[S_ANDN2_B32_]](s1), [[S_AND_B32_]](s1), implicit-def $scc
; GFX10-NEXT: {{ $}}
; GFX10-NEXT: bb.2:
; GFX10-NEXT: [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[ICMP]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
Collaborator Author

What's the point of this function? Don't really understand the name constrainIncomingRegisterTakenAsIs

That Incoming is from the block that is visited first in RPOT, so it is taken as-is. But it does not have a register class (the register operand named ICMP), so the PHI instruction is only partially selected:

  ; GFX10-NEXT:   [[ICMP:%[0-9]+]]:_(s1) = G_ICMP intpred(uge), [[COPY2]](s32), [[C]]
  ; GFX10-NEXT:   [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[ICMP]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
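Assembled from the diff fragments quoted in this thread, the constraining step looks roughly like this (the builder setup and insertion point are assumptions; the setRegClass and In.Reg lines are from the patch):

  void DivergenceLoweringHelper::constrainAsLaneMask(Incoming &In) {
    // Copy the incoming S1 value and constrain the copy to the wave-size
    // boolean register class, so the PHI that uses it is fully selected.
    B.setInsertPt(*In.Block, In.Block->getFirstTerminator());
    auto Copy = B.buildCopy(LLT::scalar(1), In.Reg);
    MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
    In.Reg = Copy.getReg(0);
  }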

@petar-avramovic
Collaborator Author

Ping. Nicolai and Jay, do you have any comments?

Collaborator

@nhaehnle nhaehnle left a comment

It's probably for the best to land this PR soon. I have a few detail comments still inline, but hopefully they're not too complex.

I am somewhat surprised about the order in which the changes appear. Though you did update the tests, so did you actually run a check (e.g. using git rebase -x "ninja -C build check-llvm-codegen-amdgpu" ...) that passes after each individual commit? If yes, then great. That would make me feel a bit more confident about the structure of this.

} // End anonymous namespace.

INITIALIZE_PASS_BEGIN(AMDGPUGlobalISelDivergenceLowering, DEBUG_TYPE,
"GlobalISel divergence lowering", false, false)
Collaborator

This comment was addressed in the wrong commit. This makes it harder to review the PR :(

// PHI where all register operands are sgpr(register class) with S1 LLT.
bool isLaneMaskPhi(MachineInstr &I, MachineRegisterInfo *MRI,
const SIRegisterInfo &TRI) {
if (I.getOpcode() != AMDGPU::PHI)
Collaborator

So the point is that if we see a PHI here (as opposed to a G_PHI), that's because it has already been selected.

Having this as a separate helper is confusing IMHO. Better to inline a check against PHI into selectPHI, and add an assert there that the defined register is a lane mask in that case.

MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
In.Reg = Copy.getReg(0);

return;
Collaborator

This extra return in constrainIncomingRegisterTakenAsIs should still be removed.

@petar-avramovic
Collaborator Author

I am somewhat surprised about the order in which the changes appear. Though you did update the tests, so did you actually run a check (e.g. using git rebase -x "ninja -C build check-llvm-codegen-amdgpu" ...) that passes after each individual commit? If yes, then great. That would make me feel a bit more confident about the structure of this.

Each individual patch passes check-llvm; the IR is correct after the last patch, but the MIR updates are relevant and give a good overview of the changes (PHIs go from partially selected to fully selected). I wanted to open a separate PR for each of the patches if they are OK together (I don't know if there is a better way).

Collaborator

@nhaehnle nhaehnle left a comment

Thanks, LGTM except for one nitty nit.

bool AMDGPUInstructionSelector::selectPHI(MachineInstr &I) const {
const Register DefReg = I.getOperand(0).getReg();
const LLT DefTy = MRI->getType(DefReg);
// Lane mask PHIs, PHI where all register operands have sgpr register class
// with S1 LLT, are already selected in divergence lowering pass.
if (I.getOpcode() == AMDGPU::PHI && isLaneMask(DefReg, MRI, TRI)) {
Collaborator

Perhaps I should be clearer. I don't think we ever have a PHI here other than for lane masks. So it would be better for understanding the overall context to say something like:

  if (I.getOpcode() == AMDGPU::PHI) {
    assert(isLaneMask(DefReg, MRI, TRI));
    return true;
  }

Collaborator Author

There is one regression test with a PHI, phi_s32_ss_sbranch in GlobalISel/inst-select-phi.mir. It can be disabled. Other than that, the assert is fine.
I wanted to check isLaneMask for all operands; can I bring the helper function back then? Without the helper it would look like this:

  if (I.getOpcode() == AMDGPU::PHI) {
    assert(isLaneMask(DefReg, MRI, TRI));
    for (unsigned i = 1, e = I.getNumOperands(); i != e; i += 2)
      assert(isLaneMask(I.getOperand(i).getReg(), MRI, TRI));
    return true;
  }

Collaborator

Huh, I wonder if @arsenm knows where that test comes from, whether that's on purpose or perhaps a mistake and the test should be removed.

If my understanding is wrong and other PHIs are possible, it would be good to understand why. If they aren't, then the test should be removed and it's not so important to check all the operands IMHO.

Collaborator Author

Okay then, we can merge the first two patches in the meantime.

petar-avramovic added a commit that referenced this pull request Dec 13, 2023
Add empty AMDGPUGlobalISelDivergenceLowering pass. This pass will
implement
- selection of divergent i1 phis as lane mask phis, requires lane mask
merging in some cases
- lower uses of divergent i1 values outside of the cycle using lane mask
merging
- lowering of all cases of temporal divergence:
- lower uses of uniform i1 values outside of the cycle using lane mask
merging
- lower uses of uniform non-i1 values outside of the cycle using a copy
to vgpr inside of the cycle

Add very detailed set of regression tests for cases mentioned above.

patch 1 from: #73337
petar-avramovic added a commit that referenced this pull request Dec 15, 2023
Make PhiLoweringHelper an abstract class and expose it for use in the
GlobalISel path.
SILowerI1Copies implements PhiLoweringHelper as Vreg1LoweringHelper,
which is equivalent to the original SILowerI1Copies behavior.
A notable change is that createLaneMaskReg now clones attributes from a
register that has lane mask attributes instead of creating a register
with the lane mask register class, because lane masks have different
(more) attributes in GlobalISel.

patch 2 from: #73337
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 17, 2024
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select them
as lane mask phis, in the same way SILowerI1Copies selects VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA, and all cases
of uses outside of a cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.

TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.

patch 3 from: llvm#73337
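As a sketch of the step described above (variable names such as Vreg1Phis are assumed; MUI is the machine uniformity info), the pass collects divergent i1 phis roughly like so:

  // A G_PHI is a candidate when its result is S1 and the machine
  // uniformity analysis reports that result as divergent.
  for (MachineBasicBlock &MBB : *MF) {
    for (MachineInstr &MI : MBB.phis()) {
      Register Dst = MI.getOperand(0).getReg();
      if (MRI->getType(Dst) == LLT::scalar(1) && MUI->isDivergent(Dst))
        Vreg1Phis.push_back(&MI);
    }
  }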
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 18, 2024
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 23, 2024
petar-avramovic added a commit that referenced this pull request Jan 24, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 30, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 1, 2024
petar-avramovic added a commit that referenced this pull request Feb 5, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 5, 2024
GlobalISel works with registers that can have a register class,
a register bank, and an LLT as attributes.
When initializing MachineSSAUpdater, save all attributes of the register
and create new registers with the same attributes instead of using only
the register class.

patch 4 from: llvm#73337
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 5, 2024
When initializing MachineSSAUpdater, save all attributes of the current
virtual register and create new virtual registers with the same
attributes. New virtual registers now have both the same register class
or bank and the same LLT. Previously, new virtual registers had the same
register class, but the LLT was not set (it was left as the
default/empty LLT).
This is required by GlobalISel for AMDGPU: new 'lane mask' virtual
registers created by MachineSSAUpdater need to have both a register
class and an LLT.

patch 4 from: llvm#73337
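A sketch of that MachineSSAUpdater change (assuming the VRegAttrs helpers on MachineRegisterInfo that this series adds):

  void MachineSSAUpdater::Initialize(Register V) {
    if (!AV)
      AV = new AvailableValsTy();
    else
      getAvailableVals(AV).clear();
    // Remember the class-or-bank *and* the LLT, not just the register class.
    RegAttrs = MRI->getVRegAttrs(V);
  }

  // New virtual registers then inherit every attribute of the original:
  Register NewVR = MRI->createVirtualRegister(RegAttrs);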
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 21, 2024
petar-avramovic added a commit that referenced this pull request Feb 26, 2024
@arsenm
Contributor

arsenm commented Feb 29, 2024

Needs update to resolve conflict

@petar-avramovic petar-avramovic changed the title GlobalISel lane masks merging AMDGPU/GlobalISel: lane masks merging Feb 29, 2024
@petar-avramovic
Collaborator Author

Updated with the final patch from the original PR.

Contributor

@arsenm arsenm left a comment

Can the amdgpu-global-isel-risky-select flag be deleted?

@petar-avramovic
Collaborator Author

Can the amdgpu-global-isel-risky-select flag be deleted?

Yes, I will do it in another patch. From the upcoming patches, S1 G_PHIs should not be selected in instruction-select; instead:

  • a divergent S1 G_PHI is pre-selected in AMDGPUGlobalISelDivergenceLowering (into a fully instruction-selected PHI)
  • a uniform S1 G_PHI is lowered into an S32 G_PHI in AMDGPURegBankSelect.cpp (TODO)

@petar-avramovic petar-avramovic merged commit 6c2eec5 into llvm:main Feb 29, 2024
3 of 4 checks passed
qihangkong pushed a commit to rvgpu/rvgpu-llvm that referenced this pull request Apr 23, 2024