
AMDGPU/GlobalISel: lane masks merging #73337

Merged
merged 1 commit on Feb 29, 2024

Conversation

petar-avramovic
Collaborator

@petar-avramovic petar-avramovic commented Nov 24, 2023

Basic implementation of lane mask merging for GlobalISel.
Lane masks in GlobalISel are registers with an SGPR register class
and S1 LLT, as required by machine uniformity analysis.
This implements the equivalent of lowerPhis from SILowerI1Copies.cpp in:
patch 1: #75340
patch 2: #75349
patch 3: #80003
patch 4: #78431
patch 5: is in this commit:

AMDGPU/GlobalISelDivergenceLowering: constrain incoming registers

Previously, in PHIs that represent lane masks, incoming registers
taken as-is were not selected as lane masks. Such registers are not
merged with another lane mask and most often only have an S1 LLT.
Implement constrainAsLaneMask by constraining incoming registers
taken as-is with lane mask attributes, essentially transforming them
into lane masks. This is the final step in making PHI instructions
created in this pass fully instruction-selected.
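
For reference, the merge that lowerPhis builds for each divergent i1 phi (buildMergeLaneMasks in the diff below) keeps the previous value of the mask in inactive lanes and takes the new value in active lanes. A minimal wave32 sketch, with illustrative register names:

  %prev_masked = S_ANDN2_B32 %prev_copy, $exec_lo
  %cur_masked  = S_AND_B32   $exec_lo, %cur_copy
  %dst         = S_OR_B32    %prev_masked, %cur_masked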

@llvmbot
Collaborator

llvmbot commented Nov 24, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Petar Avramovic (petar-avramovic)

Changes

Basic implementation of lane mask merging for GlobalISel.
Lane masks in GlobalISel are registers with an SGPR register class and S1 LLT, as required by machine uniformity analysis.
This PR correctly handles all cases involving lane mask phis. There are some improvements to be made in code size.
There is a notable improvement in testing with GlobalISel.
I will open separate PRs for all patches.
The last three patches were meant to go together, but I separated them here to show their effects in MIR tests (the last two are additions to how SDAG lowers phis and do not affect phi lowering in SDAG).


Patch is 279.74 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73337.diff

21 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineSSAUpdater.h (+2-3)
  • (modified) llvm/lib/CodeGen/MachineSSAUpdater.cpp (+19-25)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+4)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp (+213)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+32)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp (+169-150)
  • (added) llvm/lib/Target/AMDGPU/SILowerI1Copies.h (+79)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+345)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+640)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+571)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+1085)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+514)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+1167)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+188)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+358)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+37)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (+70)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+25-17)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll (+1)
diff --git a/llvm/include/llvm/CodeGen/MachineSSAUpdater.h b/llvm/include/llvm/CodeGen/MachineSSAUpdater.h
index bbd09d7d151ba07..765cabdb313097b 100644
--- a/llvm/include/llvm/CodeGen/MachineSSAUpdater.h
+++ b/llvm/include/llvm/CodeGen/MachineSSAUpdater.h
@@ -40,8 +40,8 @@ class MachineSSAUpdater {
   //typedef DenseMap<MachineBasicBlock*, Register> AvailableValsTy;
   void *AV = nullptr;
 
-  /// VRC - Register class of the current virtual register.
-  const TargetRegisterClass *VRC = nullptr;
+  /// RegAttrs - current virtual register, new registers copy its attributes.
+  Register RegAttrs;
 
   /// InsertedPHIs - If this is non-null, the MachineSSAUpdater adds all PHI
   /// nodes that it creates to the vector.
@@ -62,7 +62,6 @@ class MachineSSAUpdater {
   /// Initialize - Reset this object to get ready for a new set of SSA
   /// updates.
   void Initialize(Register V);
-  void Initialize(const TargetRegisterClass *RC);
 
   /// AddAvailableValue - Indicate that a rewritten value is available at the
   /// end of the specified block with the specified value.
diff --git a/llvm/lib/CodeGen/MachineSSAUpdater.cpp b/llvm/lib/CodeGen/MachineSSAUpdater.cpp
index 48076663ddf5382..48537057e2031a8 100644
--- a/llvm/lib/CodeGen/MachineSSAUpdater.cpp
+++ b/llvm/lib/CodeGen/MachineSSAUpdater.cpp
@@ -51,17 +51,13 @@ MachineSSAUpdater::~MachineSSAUpdater() {
 
 /// Initialize - Reset this object to get ready for a new set of SSA
 /// updates.
-void MachineSSAUpdater::Initialize(const TargetRegisterClass *RC) {
+void MachineSSAUpdater::Initialize(Register V) {
   if (!AV)
     AV = new AvailableValsTy();
   else
     getAvailableVals(AV).clear();
 
-  VRC = RC;
-}
-
-void MachineSSAUpdater::Initialize(Register V) {
-  Initialize(MRI->getRegClass(V));
+  RegAttrs = V;
 }
 
 /// HasValueForBlock - Return true if the MachineSSAUpdater already has a value for
@@ -115,13 +111,12 @@ Register LookForIdenticalPHI(MachineBasicBlock *BB,
 /// InsertNewDef - Insert an empty PHI or IMPLICIT_DEF instruction which define
 /// a value of the given register class at the start of the specified basic
 /// block. It returns the virtual register defined by the instruction.
-static
-MachineInstrBuilder InsertNewDef(unsigned Opcode,
-                           MachineBasicBlock *BB, MachineBasicBlock::iterator I,
-                           const TargetRegisterClass *RC,
-                           MachineRegisterInfo *MRI,
-                           const TargetInstrInfo *TII) {
-  Register NewVR = MRI->createVirtualRegister(RC);
+static MachineInstrBuilder InsertNewDef(unsigned Opcode, MachineBasicBlock *BB,
+                                        MachineBasicBlock::iterator I,
+                                        Register RegAttrs,
+                                        MachineRegisterInfo *MRI,
+                                        const TargetInstrInfo *TII) {
+  Register NewVR = MRI->cloneVirtualRegister(RegAttrs);
   return BuildMI(*BB, I, DebugLoc(), TII->get(Opcode), NewVR);
 }
 
@@ -158,9 +153,9 @@ Register MachineSSAUpdater::GetValueInMiddleOfBlock(MachineBasicBlock *BB,
     if (ExistingValueOnly)
       return Register();
     // Insert an implicit_def to represent an undef value.
-    MachineInstr *NewDef = InsertNewDef(TargetOpcode::IMPLICIT_DEF,
-                                        BB, BB->getFirstTerminator(),
-                                        VRC, MRI, TII);
+    MachineInstr *NewDef =
+        InsertNewDef(TargetOpcode::IMPLICIT_DEF, BB, BB->getFirstTerminator(),
+                     RegAttrs, MRI, TII);
     return NewDef->getOperand(0).getReg();
   }
 
@@ -197,8 +192,8 @@ Register MachineSSAUpdater::GetValueInMiddleOfBlock(MachineBasicBlock *BB,
 
   // Otherwise, we do need a PHI: insert one now.
   MachineBasicBlock::iterator Loc = BB->empty() ? BB->end() : BB->begin();
-  MachineInstrBuilder InsertedPHI = InsertNewDef(TargetOpcode::PHI, BB,
-                                                 Loc, VRC, MRI, TII);
+  MachineInstrBuilder InsertedPHI =
+      InsertNewDef(TargetOpcode::PHI, BB, Loc, RegAttrs, MRI, TII);
 
   // Fill in all the predecessors of the PHI.
   for (unsigned i = 0, e = PredValues.size(); i != e; ++i)
@@ -300,10 +295,9 @@ class SSAUpdaterTraits<MachineSSAUpdater> {
   static Register GetUndefVal(MachineBasicBlock *BB,
                               MachineSSAUpdater *Updater) {
     // Insert an implicit_def to represent an undef value.
-    MachineInstr *NewDef = InsertNewDef(TargetOpcode::IMPLICIT_DEF,
-                                        BB, BB->getFirstNonPHI(),
-                                        Updater->VRC, Updater->MRI,
-                                        Updater->TII);
+    MachineInstr *NewDef =
+        InsertNewDef(TargetOpcode::IMPLICIT_DEF, BB, BB->getFirstNonPHI(),
+                     Updater->RegAttrs, Updater->MRI, Updater->TII);
     return NewDef->getOperand(0).getReg();
   }
 
@@ -312,9 +306,9 @@ class SSAUpdaterTraits<MachineSSAUpdater> {
   static Register CreateEmptyPHI(MachineBasicBlock *BB, unsigned NumPreds,
                                  MachineSSAUpdater *Updater) {
     MachineBasicBlock::iterator Loc = BB->empty() ? BB->end() : BB->begin();
-    MachineInstr *PHI = InsertNewDef(TargetOpcode::PHI, BB, Loc,
-                                     Updater->VRC, Updater->MRI,
-                                     Updater->TII);
+    MachineInstr *PHI =
+        InsertNewDef(TargetOpcode::PHI, BB, Loc, Updater->RegAttrs,
+                     Updater->MRI, Updater->TII);
     return PHI->getOperand(0).getReg();
   }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 323560a46f31de2..007d64944244a09 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -36,6 +36,7 @@ FunctionPass *createSIAnnotateControlFlowPass();
 FunctionPass *createSIFoldOperandsPass();
 FunctionPass *createSIPeepholeSDWAPass();
 FunctionPass *createSILowerI1CopiesPass();
+FunctionPass *createAMDGPUGlobalISelDivergenceLoweringPass();
 FunctionPass *createSIShrinkInstructionsPass();
 FunctionPass *createSILoadStoreOptimizerPass();
 FunctionPass *createSIWholeQuadModePass();
@@ -162,6 +163,9 @@ extern char &SILowerWWMCopiesID;
 void initializeSILowerI1CopiesPass(PassRegistry &);
 extern char &SILowerI1CopiesID;
 
+void initializeAMDGPUGlobalISelDivergenceLoweringPass(PassRegistry &);
+extern char &AMDGPUGlobalISelDivergenceLoweringID;
+
 void initializeSILowerSGPRSpillsPass(PassRegistry &);
 extern char &SILowerSGPRSpillsID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp
new file mode 100644
index 000000000000000..a955a08b0a3776e
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUGlobalISelDivergenceLowering.cpp
@@ -0,0 +1,213 @@
+//===-- AMDGPUGlobalISelDivergenceLowering.cpp ----------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// GlobalISel pass that selects divergent i1 phis as lane mask phis.
+// Lane mask merging uses same algorithm as SDAG in SILowerI1Copies.
+// Handles all cases of temporal divergence.
+//
+// For divergent non-phi i1 and uniform i1 uses outside of the cycle this pass
+// currently depends on LCSSA to insert phis with one incoming.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "SILowerI1Copies.h"
+#include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineUniformityAnalysis.h"
+#include "llvm/InitializePasses.h"
+
+#define DEBUG_TYPE "global-isel-divergence-lowering"
+
+using namespace llvm;
+
+namespace {
+
+class AMDGPUGlobalISelDivergenceLowering : public MachineFunctionPass {
+public:
+  static char ID;
+
+public:
+  AMDGPUGlobalISelDivergenceLowering() : MachineFunctionPass(ID) {
+    initializeAMDGPUGlobalISelDivergenceLoweringPass(
+        *PassRegistry::getPassRegistry());
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override;
+
+  StringRef getPassName() const override {
+    return "GlobalISel divergence lowering";
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+    AU.addRequired<MachineCycleInfoWrapperPass>();
+    AU.addRequired<MachineDominatorTree>();
+    AU.addRequired<MachinePostDominatorTree>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+
+class DivergenceLoweringHelper : public PhiLoweringHelper {
+public:
+  DivergenceLoweringHelper(MachineFunction *MF, MachineDominatorTree *DT,
+                           MachinePostDominatorTree *PDT,
+                           MachineUniformityInfo *MUI);
+
+private:
+  MachineUniformityInfo *MUI = nullptr;
+
+public:
+  void markAsLaneMask(Register DstReg) const override;
+  void getCandidatesForLowering(
+      SmallVectorImpl<MachineInstr *> &Vreg1Phis) const override;
+  void collectIncomingValuesFromPhi(
+      const MachineInstr *MI,
+      SmallVectorImpl<Incoming> &Incomings) const override;
+  void replaceDstReg(Register NewReg, Register OldReg,
+                     MachineBasicBlock *MBB) override;
+  void buildMergeLaneMasks(MachineBasicBlock &MBB,
+                           MachineBasicBlock::iterator I, const DebugLoc &DL,
+                           Register DstReg, Register PrevReg,
+                           Register CurReg) override;
+  void constrainIncomingRegisterTakenAsIs(Incoming &In) override;
+};
+
+DivergenceLoweringHelper::DivergenceLoweringHelper(
+    MachineFunction *MF, MachineDominatorTree *DT,
+    MachinePostDominatorTree *PDT, MachineUniformityInfo *MUI)
+    : PhiLoweringHelper(MF, DT, PDT), MUI(MUI) {}
+
+// _(s1) -> SReg_32/64(s1)
+void DivergenceLoweringHelper::markAsLaneMask(Register DstReg) const {
+  assert(MRI->getType(DstReg) == LLT::scalar(1));
+
+  // Can't blindly set a register class on phi, users could have reg class
+  // constraints (e.g. sreg_32/64..._xexec classes for control flow intrinsics).
+  if (MRI->getRegClassOrNull(DstReg))
+    return;
+
+  MRI->setRegClass(DstReg, ST->getBoolRC());
+  return;
+}
+
+void DivergenceLoweringHelper::getCandidatesForLowering(
+    SmallVectorImpl<MachineInstr *> &Vreg1Phis) const {
+  LLT S1 = LLT::scalar(1);
+
+  // Add divergent i1 phis to the list
+  for (MachineBasicBlock &MBB : *MF) {
+    for (MachineInstr &MI : MBB.phis()) {
+      Register Dst = MI.getOperand(0).getReg();
+      if (MRI->getType(Dst) == S1 && MUI->isDivergent(Dst))
+        Vreg1Phis.push_back(&MI);
+    }
+  }
+
+  return;
+}
+
+void DivergenceLoweringHelper::collectIncomingValuesFromPhi(
+    const MachineInstr *MI, SmallVectorImpl<Incoming> &Incomings) const {
+  for (unsigned i = 1; i < MI->getNumOperands(); i += 2) {
+    Incomings.emplace_back(MI->getOperand(i).getReg(),
+                           MI->getOperand(i + 1).getMBB(), Register());
+  }
+}
+
+void DivergenceLoweringHelper::replaceDstReg(Register NewReg, Register OldReg,
+                                             MachineBasicBlock *MBB) {
+  BuildMI(*MBB, MBB->getFirstNonPHI(), {}, TII->get(AMDGPU::COPY), OldReg)
+      .addReg(NewReg);
+}
+
+// Get pointers to build instruction just after MI (skips phis if needed)
+static std::pair<MachineBasicBlock *, MachineBasicBlock::iterator>
+getInsertAfterPtrs(MachineInstr *MI) {
+  MachineBasicBlock *InsertMBB = MI->getParent();
+  return std::make_pair(
+      InsertMBB, InsertMBB->SkipPHIsAndLabels(std::next(MI->getIterator())));
+}
+
+void DivergenceLoweringHelper::buildMergeLaneMasks(
+    MachineBasicBlock &MBB, MachineBasicBlock::iterator I, const DebugLoc &DL,
+    Register DstReg, Register PrevReg, Register CurReg) {
+  // TODO: check if inputs are constants or results of a compare.
+
+  Register PrevRegCopy = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  auto [PrevMBB, AfterPrevReg] = getInsertAfterPtrs(MRI->getVRegDef(PrevReg));
+  BuildMI(*PrevMBB, AfterPrevReg, DL, TII->get(AMDGPU::COPY), PrevRegCopy)
+      .addReg(PrevReg);
+  Register PrevMaskedReg = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  BuildMI(MBB, I, DL, TII->get(AndN2Op), PrevMaskedReg)
+      .addReg(PrevRegCopy)
+      .addReg(ExecReg);
+
+  Register CurRegCopy = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  auto [CurMBB, AfterCurReg] = getInsertAfterPtrs(MRI->getVRegDef(CurReg));
+  BuildMI(*CurMBB, AfterCurReg, DL, TII->get(AMDGPU::COPY), CurRegCopy)
+      .addReg(CurReg);
+  Register CurMaskedReg = createLaneMaskReg(MRI, LaneMaskRegAttrs);
+  BuildMI(MBB, I, DL, TII->get(AndOp), CurMaskedReg)
+      .addReg(ExecReg)
+      .addReg(CurRegCopy);
+
+  BuildMI(MBB, I, DL, TII->get(OrOp), DstReg)
+      .addReg(PrevMaskedReg)
+      .addReg(CurMaskedReg);
+
+  return;
+}
+
+// GlobalISel has to constrain S1 incoming taken as-is with lane mask register
+// class. Insert a copy of Incoming.Reg to new lane mask inside Incoming.Block,
+// Incoming.Reg becomes that new lane mask.
+void DivergenceLoweringHelper::constrainIncomingRegisterTakenAsIs(
+    Incoming &In) {
+  MachineIRBuilder B(*MF);
+  B.setInsertPt(*In.Block, In.Block->getFirstTerminator());
+
+  auto Copy = B.buildCopy(LLT::scalar(1), In.Reg);
+  MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
+  In.Reg = Copy.getReg(0);
+
+  return;
+}
+
+} // End anonymous namespace.
+
+INITIALIZE_PASS_BEGIN(AMDGPUGlobalISelDivergenceLowering, DEBUG_TYPE,
+                      "GlobalISel divergence lowering", false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineCycleInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
+INITIALIZE_PASS_DEPENDENCY(MachinePostDominatorTree)
+INITIALIZE_PASS_END(AMDGPUGlobalISelDivergenceLowering, DEBUG_TYPE,
+                    "GlobalISel divergence lowering", false, false)
+
+char AMDGPUGlobalISelDivergenceLowering::ID = 0;
+
+char &llvm::AMDGPUGlobalISelDivergenceLoweringID =
+    AMDGPUGlobalISelDivergenceLowering::ID;
+
+FunctionPass *llvm::createAMDGPUGlobalISelDivergenceLoweringPass() {
+  return new AMDGPUGlobalISelDivergenceLowering();
+}
+
+bool AMDGPUGlobalISelDivergenceLowering::runOnMachineFunction(
+    MachineFunction &MF) {
+  MachineCycleInfo &CycleInfo =
+      getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
+  MachineDominatorTree &DomTree = getAnalysis<MachineDominatorTree>();
+
+  MachineUniformityInfo MUI =
+      computeMachineUniformityInfo(MF, CycleInfo, DomTree.getBase(), true);
+
+  DivergenceLoweringHelper Helper(
+      &MF, &DomTree, &getAnalysis<MachinePostDominatorTree>(), &MUI);
+
+  Helper.lowerPhis();
+  return true;
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index b772efe04c7141d..3146f7f5a218566 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -207,7 +207,39 @@ bool AMDGPUInstructionSelector::selectCOPY(MachineInstr &I) const {
   return true;
 }
 
+bool isLaneMask(Register Reg, MachineRegisterInfo *MRI,
+                const SIRegisterInfo &TRI) {
+  if (MRI->getType(Reg) != LLT::scalar(1))
+    return false;
+  const TargetRegisterClass *RC = MRI->getRegClassOrNull(Reg);
+  if (!RC || !TRI.isSGPRClass(RC))
+    return false;
+
+  return true;
+}
+
+// PHI where all register operands are sgpr(register class) with S1 LLT.
+bool isLaneMaskPhi(MachineInstr &I, MachineRegisterInfo *MRI,
+                   const SIRegisterInfo &TRI) {
+  if (I.getOpcode() != AMDGPU::PHI)
+    return false;
+
+  if (!isLaneMask(I.getOperand(0).getReg(), MRI, TRI))
+    return false;
+
+  for (unsigned i = 1, e = I.getNumOperands(); i != e; i += 2) {
+    if (!isLaneMask(I.getOperand(i).getReg(), MRI, TRI))
+      return false;
+  }
+
+  return true;
+}
+
 bool AMDGPUInstructionSelector::selectPHI(MachineInstr &I) const {
+  // Already selected in divergence lowering pass
+  if (isLaneMaskPhi(I, MRI, TRI))
+    return true;
+
   const Register DefReg = I.getOperand(0).getReg();
   const LLT DefTy = MRI->getType(DefReg);
   if (DefTy == LLT::scalar(1)) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 0c38fa32c6f33a8..1d0be8984604da2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -375,6 +375,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUDAGToDAGISelPass(*PR);
   initializeGCNDPPCombinePass(*PR);
   initializeSILowerI1CopiesPass(*PR);
+  initializeAMDGPUGlobalISelDivergenceLoweringPass(*PR);
   initializeSILowerWWMCopiesPass(*PR);
   initializeSILowerSGPRSpillsPass(*PR);
   initializeSIFixSGPRCopiesPass(*PR);
@@ -1255,6 +1256,7 @@ bool GCNPassConfig::addLegalizeMachineIR() {
 void GCNPassConfig::addPreRegBankSelect() {
   bool IsOptNone = getOptLevel() == CodeGenOptLevel::None;
   addPass(createAMDGPUPostLegalizeCombiner(IsOptNone));
+  addPass(createAMDGPUGlobalISelDivergenceLoweringPass());
 }
 
 bool GCNPassConfig::addRegBankSelect() {
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 53a33f8210d2a84..2c92e7a07388553 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -55,6 +55,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUCtorDtorLowering.cpp
   AMDGPUExportClustering.cpp
   AMDGPUFrameLowering.cpp
+  AMDGPUGlobalISelDivergenceLowering.cpp
   AMDGPUGlobalISelUtils.cpp
   AMDGPUHSAMetadataStreamer.cpp
   AMDGPUInsertDelayAlu.cpp
diff --git a/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp b/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp
index 68c8f4024e73007..59c05bab531f4b5 100644
--- a/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp
@@ -21,57 +21,26 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "SILowerI1Copies.h"
 #include "AMDGPU.h"
-#include "GCNSubtarget.h"
-#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
-#include "llvm/CodeGen/MachineDominators.h"
-#include "llvm/CodeGen/MachineFunctionPass.h"
-#include "llvm/CodeGen/MachinePostDominators.h"
 #include "llvm/CodeGen/MachineSSAUpdater.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/Target/CGPassBuilderOption.h"
 
 #define DEBUG_TYPE "si-i1-copies"
 
 using namespace llvm;
 
-static unsigned createLaneMaskReg(MachineFunction &MF);
-static unsigned insertUndefLaneMask(MachineBasicBlock &MBB);
+static Register insertUndefLaneMask(MachineBasicBlock *MBB,
+                                    MachineRegisterInfo *MRI,
+                                    Register *LaneMaskRegAttrs);
 
 namespace {
 
-struct Incoming {
-  Register Reg;
-  MachineBasicBlock *Block;
-  Register UpdatedReg;
-
-  Incoming(Register Reg, MachineBasicBlock *Block, Register UpdatedReg)
-      : Reg(Reg), Block(Block), UpdatedReg(UpdatedReg) {}
-};
-
 class SILowerI1Copies : public MachineFunctionPass {
 public:
   static char ID;
 
-private:
-  bool IsWave32 = false;
-  MachineFunction *MF = nullptr;
-  MachineDominatorTree *DT = nullptr;
-  MachinePostDominatorTree *PDT = nullptr;
-  MachineRegisterInfo *MRI = nullptr;
-  const GCNSubtarget *ST = nullptr;
-  const SIInstrInfo *TII = nullptr;
-
-  unsigned ExecReg;
-  unsigned MovOp;
-  unsigned AndOp;
-  unsigned OrOp;
-  unsigned XorOp;
-  unsigned AndN2Op;
-  unsigned OrN2Op;
-
-  DenseSet<unsigned> ConstrainRegs;
-
-public:
   SILowerI1Copies() : MachineFunctionPass(ID) {
     initializeSILowerI1CopiesPass(*PassRegistry::getPassRegistry());
   }
@@ -86,29 +55,53 @@ class SILowerI1Copies : public MachineFunctionPass {
     AU.addRequired<MachinePostDominatorTree>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
+};
+
+class Vreg1LoweringHelper : public PhiLoweringHelper {
+public:
+  Vreg1LoweringHelper(MachineFunction *MF, MachineDominatorTree *DT,
+                      MachinePostDominatorTree *PDT);
 
 private:
-  bool lowerCopiesFromI1();
-  bool lowerPhis();
-  bool lowerCopiesToI1();
-  bool isConstantLaneMask(Register Reg, bool &Val) const;
+  DenseSet<Register> ConstrainRegs;
+
+public:
+  void markAsLaneMask(Register DstReg) const...
[truncated]

@llvmbot
Collaborator

llvmbot commented Nov 24, 2023

@llvm/pr-subscribers-llvm-globalisel


#include "AMDGPU.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

#define DEBUG_TYPE "global-isel-divergence-lowering"
Collaborator Author

Do you have better suggestions for the pass name?

Contributor

Should get an amdgpu- prefix

Contributor

amdgpu-gi-divergence-lowering to keep it short?
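
For illustration, that suggestion as a one-line change:

  #define DEBUG_TYPE "amdgpu-gi-divergence-lowering"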

// Incoming.Reg becomes that new lane mask.
void DivergenceLoweringHelper::constrainIncomingRegisterTakenAsIs(
Incoming &In) {
MachineIRBuilder B(*MF);
Contributor

Don't construct a fresh MIR builder, use the same one throughout the pass
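
A sketch of that suggestion (assumed shape, not the code in this PR): keep one builder as a helper member, constructed once, and only move its insertion point per use:

  // Hypothetical member of DivergenceLoweringHelper, set up once with
  // B.setMF(*MF); each use then only repositions the insertion point.
  B.setInsertPt(*In.Block, In.Block->getFirstTerminator());
  auto Copy = B.buildCopy(LLT::scalar(1), In.Reg);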

// PHI where all register operands are sgpr(register class) with S1 LLT.
bool isLaneMaskPhi(MachineInstr &I, MachineRegisterInfo *MRI,
const SIRegisterInfo &TRI) {
if (I.getOpcode() != AMDGPU::PHI)
Contributor

G_PHI. Not sure we should bother handling non-G_* phi at all, but isPHI also works

Collaborator Author

This was meant to verify that lane mask PHIs were correctly selected if they had already been selected before instruction selection. The idea was not to split selection of lane mask PHIs across multiple passes.

Collaborator

So the point is that if we see a PHI here (as opposed to a G_PHI), that's because it has already been selected.

Having this as a separate helper is confusing IMHO. Better to inline a check against PHI into selectPHI, and add an assert there that the defined register is a lane mask in that case.
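
A sketch of that inlined form (hypothetical code, not from this PR):

  // In selectPHI: a PHI (as opposed to a G_PHI) can only appear here because
  // divergence lowering already selected it; assert it really is a lane mask.
  if (I.getOpcode() == AMDGPU::PHI) {
    assert(isLaneMask(I.getOperand(0).getReg(), MRI, TRI));
    return true;
  }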

Collaborator Author

So the point is that if we see a PHI here (as opposed to a G_PHI), that's because it has already been selected.

Yes, at the moment, but I did not want to be that restrictive and add an assert.
It was meant to report a selected lane mask PHI as already selected.
If it was not properly instruction-selected, leave it (ideally for selection to fail).
There will be changes to how we select phis later, so maybe we can add an assert then.
Can we inline it for now with a return?


; There is a divergent, according to machine uniformity info, g_brcond branch
; here, not lowered to si_if because of "amdgpu-flat-work-group-size"="1,1".
define dllexport amdgpu_cs void @_amdgpu_cs_main( i32 inreg noundef %.userdata0, <3 x i32> inreg noundef %.WorkgroupId, <3 x i32> noundef %.LocalInvocationId) #0 {
Contributor

Don't need all the dllexports

Contributor

@Pierre-vh Pierre-vh left a comment

Some more docs in AMDGPUGlobalISelDivergenceLowering.cpp would be nice.
Seems like it's going to be a core part of the GlobalISel pipeline so IMO it should be well documented so it's not too hard to understand and contribute to.

bool runOnMachineFunction(MachineFunction &MF) override;

StringRef getPassName() const override {
return "GlobalISel divergence lowering";
Contributor

Needs AMDGPU prefix too

#include "AMDGPU.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

#define DEBUG_TYPE "global-isel-divergence-lowering"
Contributor

amdgpu-gi-divergence-lowering to keep it short?

Comment on lines 8 to 13
// GlobalISel pass that selects divergent i1 phis as lane mask phis.
// Lane mask merging uses same algorithm as SDAG in SILowerI1Copies.
// Handles all cases of temporal divergence.
//
// For divergent non-phi i1 and uniform i1 uses outside of the cycle this pass
// currently depends on LCSSA to insert phis with one incoming.
Contributor

really small nit: Use three slashes + start with \file, first line should be /// \file
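
For illustration, the first doc lines of the file would then read:

  /// \file
  /// GlobalISel pass that selects divergent i1 phis as lane mask phis.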

return;

MRI->setRegClass(DstReg, ST->getBoolRC());
return;
Contributor

extra return


void DivergenceLoweringHelper::collectIncomingValuesFromPhi(
const MachineInstr *MI, SmallVectorImpl<Incoming> &Incomings) const {
for (unsigned i = 1; i < MI->getNumOperands(); i += 2) {
Contributor

i -> I (uppercase)

}
}

return;
Contributor

extra return

MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
In.Reg = Copy.getReg(0);

return;
Contributor

extra return

Collaborator

This extra return in constrainIncomingRegisterTakenAsIs should still be removed.

getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
MachineDominatorTree &DomTree = getAnalysis<MachineDominatorTree>();

MachineUniformityInfo MUI =
Contributor

Can't you depend on MachineUniformityAnalysis instead?
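
A sketch of that alternative, assuming the legacy MachineUniformityAnalysisPass wrapper (not the code in this PR):

  AU.addRequired<MachineUniformityAnalysisPass>();
  // ...
  MachineUniformityInfo &MUI =
      getAnalysis<MachineUniformityAnalysisPass>().getUniformityInfo();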

#include "llvm/CodeGen/MachinePostDominators.h"
#include "llvm/CodeGen/MachineSSAUpdater.h"

using namespace llvm;
Contributor

Avoid using namespace in headers


github-actions bot commented Dec 4, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@petar-avramovic petar-avramovic force-pushed the global-isel-lane-masks branch 2 times, most recently from fb97bf5 to e14b035 on December 4, 2023 16:03
@petar-avramovic
Collaborator Author

Ping. Updated for the comments about code quality. Unchanged for the comments about functional changes:
Divergent i1 phis (G_PHIs, to be precise) are fully instruction-selected into PHIs (all registers created in this pass are lane masks with the correct register class and S1 LLT). Check the regression test changes (the last two patches in particular) to get a better idea of what is happening.




define amdgpu_ps void @divergent_i1_phi_uniform_branch(i32 addrspace(1)* %out, i32 %tid, i32 inreg %cond, i32 addrspace(1)* %dummyaddr) {
Contributor

Test should be updated to use opaque pointers
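
With opaque pointers that signature would become (mechanical rewrite, for illustration):

  define amdgpu_ps void @divergent_i1_phi_uniform_branch(ptr addrspace(1) %out, i32 %tid, i32 inreg %cond, ptr addrspace(1) %dummyaddr) {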


; There is a divergent, according to machine uniformity info, g_brcond branch
; here, not lowered to si_if because of "amdgpu-flat-work-group-size"="1,1".
define amdgpu_cs void @_amdgpu_cs_main( i32 inreg noundef %.userdata0, <3 x i32> inreg noundef %.WorkgroupId, <3 x i32> noundef %.LocalInvocationId) #0 {
Contributor

Name the function to indicate it's the 1,1 case

return MRI.createVirtualRegister(ST.isWave32() ? &AMDGPU::SReg_32RegClass
: &AMDGPU::SReg_64RegClass);
Register llvm::createLaneMaskReg(MachineRegisterInfo *MRI,
Register *LaneMaskRegAttrs) {
Contributor

What's the point of these parameters being pointers?

Collaborator Author

The general idea was to crash if Register *LaneMaskRegAttrs was not initialized; it might not be that obvious if an uninitialized Register(0) were used and bad code was generated.

Contributor

A Register() used in just about anything will fail the verifier so that's fine
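For illustration, one pointer-free alternative (a sketch with a hypothetical signature) is to clone every attribute from an existing lane-mask register; MachineRegisterInfo::cloneVirtualRegister copies the register class or bank together with the LLT:

  Register createLaneMaskReg(MachineRegisterInfo &MRI, Register LaneMask) {
    // New lane masks inherit class/bank and LLT from an existing one.
    return MRI.cloneVirtualRegister(LaneMask);
  }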

@@ -895,3 +910,7 @@ void SILowerI1Copies::buildMergeLaneMasks(MachineBasicBlock &MBB,
.addReg(CurMaskedReg ? CurMaskedReg : ExecReg);
}
}

void Vreg1LoweringHelper::constrainIncomingRegisterTakenAsIs(Incoming &In) {
Contributor

What's the point of this function? Don't really understand the name

; GFX10-NEXT: [[S_OR_B32_:%[0-9]+]]:sreg_32(s1) = S_OR_B32 [[S_ANDN2_B32_]](s1), [[S_AND_B32_]](s1), implicit-def $scc
; GFX10-NEXT: {{ $}}
; GFX10-NEXT: bb.2:
; GFX10-NEXT: [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[ICMP]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
; GFX10-NEXT: [[COPY6:%[0-9]+]]:sreg_32(s1) = COPY [[PHI]](s1)
; GFX10-NEXT: [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[COPY4]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
Collaborator Author

What's the point of this function? Don't really understand the name constrainIncomingRegisterTakenAsIs

But here the register operand named COPY4 has the lane-mask register class, so after this patch the lane mask PHI is fully selected. I wanted to avoid having to split selection of lane mask phis across multiple passes.

  ; GFX10-NEXT:   [[ICMP:%[0-9]+]]:_(s1) = G_ICMP intpred(uge), [[COPY2]](s32), [[C]]
  ; GFX10-NEXT:   [[COPY4:%[0-9]+]]:sreg_32(s1) = COPY [[ICMP]](s1)
  ; GFX10-NEXT:   [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[COPY4]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1

; GFX10-NEXT: [[S_OR_B32_:%[0-9]+]]:sreg_32(s1) = S_OR_B32 [[S_ANDN2_B32_]](s1), [[S_AND_B32_]](s1), implicit-def $scc
; GFX10-NEXT: {{ $}}
; GFX10-NEXT: bb.2:
; GFX10-NEXT: [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[ICMP]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
Collaborator Author

What's the point of this function? Don't really understand the name constrainIncomingRegisterTakenAsIs

That Incoming is from the block that is visited first in RPOT, so it is taken as-is. But it does not have a register class (the register operand named ICMP), so the PHI instruction is only partially selected:

  ; GFX10-NEXT:   [[ICMP:%[0-9]+]]:_(s1) = G_ICMP intpred(uge), [[COPY2]](s32), [[C]]
  ; GFX10-NEXT:   [[PHI:%[0-9]+]]:sreg_32(s1) = PHI [[ICMP]](s1), %bb.0, [[S_OR_B32_]](s1), %bb.1
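Assembled from the diff fragments quoted in this thread, the constraining step looks roughly like this (the builder setup and insertion point are assumptions; the setRegClass and In.Reg lines are from the patch):

  void DivergenceLoweringHelper::constrainAsLaneMask(Incoming &In) {
    // Copy the incoming S1 value and constrain the copy to the wave-size
    // boolean register class, so the PHI that uses it is fully selected.
    B.setInsertPt(*In.Block, In.Block->getFirstTerminator());
    auto Copy = B.buildCopy(LLT::scalar(1), In.Reg);
    MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
    In.Reg = Copy.getReg(0);
  }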

@petar-avramovic
Collaborator Author

Ping. Nicolai and Jay, do you have any comments?

Collaborator

@nhaehnle nhaehnle left a comment

It's probably for the best to land this PR soon. I have a few detail comments still inline, but hopefully they're not too complex.

I am somewhat surprised about the order in which the changes appear. Though you did update the tests, so did you actually run a check (e.g. using git rebase -x "ninja -C build check-llvm-codegen-amdgpu" ...) that passes after each individual commit? If yes, then great. That would make me feel a bit more confident about the structure of this.

} // End anonymous namespace.

INITIALIZE_PASS_BEGIN(AMDGPUGlobalISelDivergenceLowering, DEBUG_TYPE,
"GlobalISel divergence lowering", false, false)
Collaborator

This comment was addressed in the wrong commit. This makes it harder to review the PR :(

// PHI where all register operands are sgpr(register class) with S1 LLT.
bool isLaneMaskPhi(MachineInstr &I, MachineRegisterInfo *MRI,
const SIRegisterInfo &TRI) {
if (I.getOpcode() != AMDGPU::PHI)
Collaborator

So the point is that if we see a PHI here (as opposed to a G_PHI), that's because it has already been selected.

Having this as a separate helper is confusing IMHO. Better to inline a check against PHI into selectPHI, and add an assert there that the defined register is a lane mask in that case.

MRI->setRegClass(Copy.getReg(0), ST->getBoolRC());
In.Reg = Copy.getReg(0);

return;
Collaborator

This extra return in constrainIncomingRegisterTakenAsIs should still be removed.

@petar-avramovic
Collaborator Author

I am somewhat surprised about the order in which the changes appear. Though you did update the tests, so did you actually run a check (e.g. using git rebase -x "ninja -C build check-llvm-codegen-amdgpu" ...) that passes after each individual commit? If yes, then great. That would make me feel a bit more confident about the structure of this.

Each individual patch passes check-llvm; the IR is correct after the last patch, but the MIR updates are relevant and give a good overview of the changes (PHIs go from partially selected to fully selected). I wanted to open a separate PR for each of the patches if they are OK together (I don't know if there is a better way).

Collaborator

@nhaehnle nhaehnle left a comment

Thanks, LGTM except for one nitty nit.

bool AMDGPUInstructionSelector::selectPHI(MachineInstr &I) const {
const Register DefReg = I.getOperand(0).getReg();
const LLT DefTy = MRI->getType(DefReg);
// Lane mask PHIs, PHI where all register operands have sgpr register class
// with S1 LLT, are already selected in divergence lowering pass.
if (I.getOpcode() == AMDGPU::PHI && isLaneMask(DefReg, MRI, TRI)) {
Collaborator

Perhaps I should be clearer. I don't think we ever have a PHI here other than for lane masks. So it would be better for understanding the overall context to say something like:

  if (I.getOpcode() == AMDGPU::PHI) {
    assert(isLaneMask(DefReg, MRI, TRI));
    return true;
  }

Collaborator Author

There is one regression test with a PHI, phi_s32_ss_sbranch in GlobalISel/inst-select-phi.mir. It can be disabled. Other than that, the assert is fine.
I wanted to check isLaneMask for all operands; can I bring the helper function back then? Without the helper it would look like this:

  if (I.getOpcode() == AMDGPU::PHI) {
    assert(isLaneMask(DefReg, MRI, TRI));
    for (unsigned i = 1, e = I.getNumOperands(); i != e; i += 2)
      assert(isLaneMask(I.getOperand(i).getReg(), MRI, TRI));
    return true;
  }

Collaborator

Huh, I wonder if @arsenm knows where that test comes from, whether that's on purpose or perhaps a mistake and the test should be removed.

If my understanding is wrong and other PHIs are possible, it would be good to understand why. If they aren't, then the test should be removed and it's not so important to check all the operands IMHO.

Collaborator Author

Okay then, we can merge the first two patches in the meantime.

petar-avramovic added a commit that referenced this pull request Dec 13, 2023
Add empty AMDGPUGlobalISelDivergenceLowering pass. This pass will
implement
- selection of divergent i1 phis as lane mask phis, requires lane mask
merging in some cases
- lower uses of divergent i1 values outside of the cycle using lane mask
merging
- lowering of all cases of temporal divergence:
- lower uses of uniform i1 values outside of the cycle using lane mask
merging
- lower uses of uniform non-i1 values outside of the cycle using a copy
to vgpr inside of the cycle

Add very detailed set of regression tests for cases mentioned above.

patch 1 from: #73337
petar-avramovic added a commit that referenced this pull request Dec 15, 2023
Make PhiLoweringHelper an abstract class and expose it for use in the
GlobalISel path.
SILowerI1Copies implements PhiLoweringHelper as Vreg1LoweringHelper,
which is equivalent to the original SILowerI1Copies behavior.
A notable change is that createLaneMaskReg now clones attributes from a
register that has lane mask attributes instead of creating a register
with the lane mask register class, because lane masks have different
(more) attributes in GlobalISel.

patch 2 from: #73337
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 17, 2024
Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper.
Use machine uniformity analysis to find divergent i1 phis and select them
as lane mask phis, in the same way SILowerI1Copies selects VReg_1 phis.
Note that divergent i1 phis include phis created by LCSSA, and all cases
of uses outside of a cycle are actually covered by "lowering LCSSA phis".
GlobalISel lane masks are registers with sgpr register class and S1 LLT.

TODO: General goal is that instructions created in this pass are fully
instruction-selected so that selection of lane mask phis is not split
across multiple passes.

patch 3 from: llvm#73337
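As a sketch of the step described above (variable names such as Vreg1Phis are assumed; MUI is the machine uniformity info), the pass collects divergent i1 phis roughly like so:

  // A G_PHI is a candidate when its result is S1 and the machine
  // uniformity analysis reports that result as divergent.
  for (MachineBasicBlock &MBB : *MF) {
    for (MachineInstr &MI : MBB.phis()) {
      Register Dst = MI.getOperand(0).getReg();
      if (MRI->getType(Dst) == LLT::scalar(1) && MUI->isDivergent(Dst))
        Vreg1Phis.push_back(&MI);
    }
  }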
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 18, 2024
ampandey-1995 pushed a commit to ampandey-1995/llvm-project that referenced this pull request Jan 19, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 23, 2024
petar-avramovic added a commit that referenced this pull request Jan 24, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Jan 30, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 1, 2024
petar-avramovic added a commit that referenced this pull request Feb 5, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 5, 2024
GlobalISel works with registers that can have a register class,
a register bank, and an LLT as attributes.
When initializing MachineSSAUpdater, save all attributes of the register
and create new registers with the same attributes instead of using only
the register class.

patch 4 from: llvm#73337
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 5, 2024
When initializing MachineSSAUpdater, save all attributes of the current
virtual register and create new virtual registers with the same
attributes. New virtual registers now have both the same register class
or bank and the same LLT. Previously, new virtual registers had the same
register class, but the LLT was not set (it was left as the
default/empty LLT).
This is required by GlobalISel for AMDGPU: new 'lane mask' virtual
registers created by MachineSSAUpdater need to have both a register
class and an LLT.

patch 4 from: llvm#73337
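A sketch of that MachineSSAUpdater change (assuming the VRegAttrs helpers on MachineRegisterInfo that this series adds):

  void MachineSSAUpdater::Initialize(Register V) {
    if (!AV)
      AV = new AvailableValsTy();
    else
      getAvailableVals(AV).clear();
    // Remember the class-or-bank *and* the LLT, not just the register class.
    RegAttrs = MRI->getVRegAttrs(V);
  }

  // New virtual registers then inherit every attribute of the original:
  Register NewVR = MRI->createVirtualRegister(RegAttrs);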
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
petar-avramovic added a commit to petar-avramovic/llvm-project that referenced this pull request Feb 21, 2024
petar-avramovic added a commit that referenced this pull request Feb 26, 2024
@arsenm
Contributor

arsenm commented Feb 29, 2024

Needs update to resolve conflict

@petar-avramovic petar-avramovic changed the title GlobalISel lane masks merging AMDGPU/GlobalISel: lane masks merging Feb 29, 2024
@petar-avramovic
Collaborator Author

Updated with the final patch from the original PR.

Contributor

@arsenm arsenm left a comment

Can the amdgpu-global-isel-risky-select flag be deleted?

@petar-avramovic
Collaborator Author

Can the amdgpu-global-isel-risky-select flag be deleted?

Yes, I will do it in another patch. From the upcoming patches, S1 G_PHIs should not be selected in instruction-select; instead:

  • a divergent S1 G_PHI is pre-selected in AMDGPUGlobalISelDivergenceLowering (into a fully instruction-selected PHI)
  • a uniform S1 G_PHI is lowered into an S32 G_PHI in AMDGPURegBankSelect.cpp (TODO)

@petar-avramovic petar-avramovic merged commit 6c2eec5 into llvm:main Feb 29, 2024
3 of 4 checks passed
qihangkong pushed a commit to rvgpu/rvgpu-llvm that referenced this pull request Apr 23, 2024