
[RISCV] Support postRA vsetvl insertion pass #70549

Merged
BeMg merged 3 commits into llvm:main from riscv-split-regalloc-and-vsetvlpass-in-between on May 21, 2024

Conversation

BeMg
Contributor

@BeMg BeMg commented Oct 28, 2023

This patch tries to get rid of the vsetvl implicit vl/vtype def-use chain and improve register allocation quality by moving the vsetvl insertion pass after RVV register allocation.

It enables the following optimizations:

  1. Unblocking scheduler constraints by removing the vl/vtype def-use chain
  2. RVV re-materialization
  3. Partial spills

This patch adds a new option -riscv-vsetvl-after-rvv-regalloc=<1|0> to control this feature; it is disabled by default.

@github-actions

github-actions bot commented Oct 28, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@llvmbot
Collaborator

llvmbot commented Oct 28, 2023

@llvm/pr-subscribers-debuginfo

@llvm/pr-subscribers-backend-risc-v

Author: Piyou Chen (BeMg)

Changes

This patch tries to get rid of the vsetvl implicit vl/vtype def-use chain and improve register allocation quality by

  1. moving the vsetvl insertion pass after RVV register allocation
  2. splitting register allocation into RVV and other registers

It enables the following optimizations:

  1. Unblocking scheduler constraints by removing the vl/vtype def-use chain
  2. RVV re-materialization
  3. Partial spills

This patch adds a new option -riscv-split-RA=<1|0> to control this feature; it is disabled by default.

It can compile the local test suite and some related large benchmarks, but it still generates worse code in some cases (LLVM regression tests).


Note

Register allocation

  • The allocator is split by passing a filter function that controls which virtual registers should be allocated.

VSETVL insertion pass

  • Because the function is no longer in SSA form, several places need to be updated.
  • LiveIntervals must be kept correct because this pass inserts and removes instructions, as shown in the sketch below.
  • LiveDebugVariables also needs to be preserved, otherwise an assertion failure is triggered.
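
For reference, a minimal sketch of the kind of LiveIntervals maintenance this implies, assuming the helpers introduced further down in this diff (the function name here is illustrative; the patch calls it fixupModifyVRegLIFromVSETVL):

#include "llvm/CodeGen/LiveIntervals.h"

// After building a new vsetvli with BuildMI, register it in the slot-index
// maps and recompute the live interval of every virtual register it touches.
static void fixupLiveIntervalsForNewMI(MachineInstr &MI, LiveIntervals *LIS) {
  if (!LIS)
    return;
  if (LIS->isNotInMIMap(MI))
    LIS->InsertMachineInstrInMaps(MI);
  for (const MachineOperand &MO : MI.operands()) {
    if (!MO.isReg() || !MO.getReg().isVirtual())
      continue;
    Register Reg = MO.getReg();
    if (LIS->hasInterval(Reg))
      LIS->removeInterval(Reg);
    LIS->createAndComputeVirtRegInterval(Reg);
  }
}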

Patch is 31.97 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/70549.diff

9 Files Affected:

  • (renamed) llvm/include/llvm/CodeGen/LiveDebugVariables.h (+1-1)
  • (modified) llvm/lib/CodeGen/LiveDebugVariables.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/RegAllocBasic.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/RegAllocGreedy.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/VirtRegMap.cpp (+1-1)
  • (modified) llvm/lib/Target/RISCV/RISCV.h (+1-1)
  • (modified) llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp (+208-44)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+110-1)
  • (added) llvm/test/CodeGen/RISCV/rvv/splitRA-vsetvl.ll (+116)
diff --git a/llvm/lib/CodeGen/LiveDebugVariables.h b/llvm/include/llvm/CodeGen/LiveDebugVariables.h
similarity index 96%
rename from llvm/lib/CodeGen/LiveDebugVariables.h
rename to llvm/include/llvm/CodeGen/LiveDebugVariables.h
index 9998ce9e8dad861..3643a2cd0981f19 100644
--- a/llvm/lib/CodeGen/LiveDebugVariables.h
+++ b/llvm/include/llvm/CodeGen/LiveDebugVariables.h
@@ -29,7 +29,7 @@ template <typename T> class ArrayRef;
 class LiveIntervals;
 class VirtRegMap;
 
-class LLVM_LIBRARY_VISIBILITY LiveDebugVariables : public MachineFunctionPass {
+class LiveDebugVariables : public MachineFunctionPass {
   void *pImpl = nullptr;
 
 public:
diff --git a/llvm/lib/CodeGen/LiveDebugVariables.cpp b/llvm/lib/CodeGen/LiveDebugVariables.cpp
index 9603c1f01e08569..5f5a858c7bade6e 100644
--- a/llvm/lib/CodeGen/LiveDebugVariables.cpp
+++ b/llvm/lib/CodeGen/LiveDebugVariables.cpp
@@ -18,7 +18,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/IntervalMap.h"
diff --git a/llvm/lib/CodeGen/RegAllocBasic.cpp b/llvm/lib/CodeGen/RegAllocBasic.cpp
index 6661991396302ec..a4d4a218fea5584 100644
--- a/llvm/lib/CodeGen/RegAllocBasic.cpp
+++ b/llvm/lib/CodeGen/RegAllocBasic.cpp
@@ -12,7 +12,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "AllocationOrder.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "RegAllocBase.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/CalcSpillWeights.h"
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.cpp b/llvm/lib/CodeGen/RegAllocGreedy.cpp
index 349d8b0975f3a10..4c25fcfeba89073 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.cpp
+++ b/llvm/lib/CodeGen/RegAllocGreedy.cpp
@@ -14,7 +14,7 @@
 #include "RegAllocGreedy.h"
 #include "AllocationOrder.h"
 #include "InterferenceCache.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "RegAllocBase.h"
 #include "RegAllocEvictionAdvisor.h"
 #include "RegAllocPriorityAdvisor.h"
diff --git a/llvm/lib/CodeGen/VirtRegMap.cpp b/llvm/lib/CodeGen/VirtRegMap.cpp
index 48f4ee29fbe95d4..a2b046a59c9d773 100644
--- a/llvm/lib/CodeGen/VirtRegMap.cpp
+++ b/llvm/lib/CodeGen/VirtRegMap.cpp
@@ -16,7 +16,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/CodeGen/VirtRegMap.h"
-#include "LiveDebugVariables.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/CodeGen/LiveInterval.h"
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 3d8e33dc716ea44..9c9906ff0bc6430 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -60,7 +60,7 @@ void initializeRISCVPreRAExpandPseudoPass(PassRegistry &);
 FunctionPass *createRISCVExpandAtomicPseudoPass();
 void initializeRISCVExpandAtomicPseudoPass(PassRegistry &);
 
-FunctionPass *createRISCVInsertVSETVLIPass();
+FunctionPass *createRISCVInsertVSETVLIPass(bool IsSplitRA = false);
 void initializeRISCVInsertVSETVLIPass(PassRegistry &);
 
 FunctionPass *createRISCVPostRAExpandPseudoPass();
diff --git a/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp b/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
index 9d2b2c3b3f59264..03c8d72a32dc714 100644
--- a/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
@@ -26,6 +26,7 @@
 
 #include "RISCV.h"
 #include "RISCVSubtarget.h"
+#include "llvm/CodeGen/LiveDebugVariables.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include <queue>
@@ -44,6 +45,86 @@ static cl::opt<bool> UseStrictAsserts(
 
 namespace {
 
+// Return true when CandidateDefMI can arrive UserMI without reach other defMI
+// thought CFG. Otherwise, if it:
+//    1. Encounter the DefMI again
+//    2. Seen the all BasicBlock can reach
+// return false
+static bool isReachingDef(const MachineInstr &UserMI,
+                          const MachineInstr &CandidateDefMI, Register Reg,
+                          const MachineRegisterInfo *MRI) {
+  std::queue<std::pair<const MachineBasicBlock *, bool>> MBBQ;
+  llvm::DenseMap<int, bool> SeenMBB;
+
+  MBBQ.push(std::make_pair(CandidateDefMI.getParent(), false));
+  while (!MBBQ.empty()) {
+
+    const MachineBasicBlock *CurrMBB = MBBQ.front().first;
+    bool SeenDef = MBBQ.front().second;
+    MBBQ.pop();
+    SeenMBB[CurrMBB->getNumber()] = true;
+    bool NeedSkip = false;
+    for (auto &MI : *CurrMBB) {
+      // If we encounter DefMI more than once, this CandidateDefMI is not
+      // reaching definition for UserMI.
+      if (SeenDef && llvm::any_of(MRI->def_instructions(Reg),
+                                  [&MI](const MachineInstr &DefMI) {
+                                    return DefMI.isIdenticalTo(MI);
+                                  })) {
+        NeedSkip = true;
+        break;
+      }
+
+      if (MI.isIdenticalTo(CandidateDefMI))
+        SeenDef = true;
+
+      if (SeenDef && MI.isIdenticalTo(UserMI))
+        return true;
+    }
+
+    if (NeedSkip)
+      continue;
+
+    for (auto *Succ : CurrMBB->successors()) {
+      if (SeenMBB[Succ->getNumber()])
+        continue;
+      MBBQ.push(std::make_pair(Succ, SeenDef));
+    }
+  }
+
+  return false;
+}
+
+// Return MachineInstr* when we found the unique reaching definition.
+// Otherwise, return nullptr.
+// FIXME: Could it detect the PHI node situation?
+static MachineInstr *findReachingDef(Register Reg, const MachineInstr *MI,
+                                     const MachineRegisterInfo *MRI) {
+
+  MachineInstr *ReachingDefMI = nullptr;
+  for (auto &DefMI : MRI->def_instructions(Reg)) {
+    if (isReachingDef(*MI, DefMI, Reg, MRI)) {
+      if (!ReachingDefMI)
+        ReachingDefMI = &DefMI;
+      else
+        return nullptr;
+    }
+  }
+
+  return ReachingDefMI;
+}
+
+// For the SSA form, we could just use the getVRegDef to take Reaching
+// definition. For the non-SSA, we create this pass own reaching definition
+// searching function.
+static MachineInstr *getReachingDefMI(Register VReg, const MachineInstr *MI,
+                                      const MachineRegisterInfo *MRI) {
+  if (MRI->isSSA())
+    return MRI->getVRegDef(VReg);
+
+  return findReachingDef(VReg, MI, MRI);
+}
+
 static unsigned getVLOpNum(const MachineInstr &MI) {
   return RISCVII::getVLOpNum(MI.getDesc());
 }
@@ -187,13 +268,17 @@ static bool hasUndefinedMergeOp(const MachineInstr &MI,
   if (UseMO.getReg() == RISCV::NoRegister)
     return true;
 
-  if (MachineInstr *UseMI = MRI.getVRegDef(UseMO.getReg())) {
+  if (!MRI.isSSA() && UseMO.isUndef())
+    return true;
+
+  if (MachineInstr *UseMI = getReachingDefMI(UseMO.getReg(), &MI, &MRI)) {
     if (UseMI->isImplicitDef())
       return true;
 
     if (UseMI->isRegSequence()) {
       for (unsigned i = 1, e = UseMI->getNumOperands(); i < e; i += 2) {
-        MachineInstr *SourceMI = MRI.getVRegDef(UseMI->getOperand(i).getReg());
+        MachineInstr *SourceMI =
+            getReachingDefMI(UseMI->getOperand(i).getReg(), UseMI, &MRI);
         if (!SourceMI || !SourceMI->isImplicitDef())
           return false;
       }
@@ -494,9 +579,9 @@ class VSETVLIInfo {
     if (hasAVLReg()) {
       if (getAVLReg() == RISCV::X0)
         return true;
-      if (MachineInstr *MI = MRI.getVRegDef(getAVLReg());
-          MI && MI->getOpcode() == RISCV::ADDI &&
-          MI->getOperand(1).isReg() && MI->getOperand(2).isImm() &&
+      if (MachineInstr *MI = MRI.getUniqueVRegDef(getAVLReg());
+          MI && MI->getOpcode() == RISCV::ADDI && MI->getOperand(1).isReg() &&
+          MI->getOperand(2).isImm() &&
           MI->getOperand(1).getReg() == RISCV::X0 &&
           MI->getOperand(2).getImm() != 0)
         return true;
@@ -727,20 +812,32 @@ class RISCVInsertVSETVLI : public MachineFunctionPass {
   const RISCVSubtarget *ST;
   const TargetInstrInfo *TII;
   MachineRegisterInfo *MRI;
+  LiveIntervals *LIS = nullptr;
 
   std::vector<BlockData> BlockInfo;
   std::queue<const MachineBasicBlock *> WorkList;
+  bool IsSplitRA = false;
 
 public:
   static char ID;
 
-  RISCVInsertVSETVLI() : MachineFunctionPass(ID) {
+  RISCVInsertVSETVLI(bool IsSplitRA = false)
+      : MachineFunctionPass(ID), IsSplitRA(IsSplitRA) {
     initializeRISCVInsertVSETVLIPass(*PassRegistry::getPassRegistry());
   }
   bool runOnMachineFunction(MachineFunction &MF) override;
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.setPreservesCFG();
+
+    if (IsSplitRA) {
+      AU.addRequired<LiveIntervals>();
+      AU.addPreserved<LiveIntervals>();
+      AU.addRequired<SlotIndexes>();
+      AU.addPreserved<SlotIndexes>();
+      AU.addPreserved<LiveDebugVariables>();
+    }
+
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -864,6 +961,30 @@ static VSETVLIInfo getInfoForVSETVLI(const MachineInstr &MI) {
   return NewInfo;
 }
 
+static void fixupModifyVRegLI(Register VReg, LiveIntervals *LIS) {
+  if (!LIS)
+    return;
+
+  if (LIS->hasInterval(VReg))
+    LIS->removeInterval(VReg);
+  LIS->createAndComputeVirtRegInterval(VReg);
+}
+
+static void fixupModifyVRegLIFromVSETVL(MachineInstr *MI, LiveIntervals *LIS) {
+
+  if (!LIS)
+    return;
+
+  if (LIS->isNotInMIMap(*MI))
+    LIS->InsertMachineInstrInMaps(*MI);
+  for (auto &MO : MI->operands()) {
+    if (!MO.isReg() || MO.getReg() == 0 || !MO.getReg().isVirtual())
+      continue;
+    Register Reg = MO.getReg();
+    fixupModifyVRegLI(Reg, LIS);
+  }
+}
+
 void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
                      MachineBasicBlock::iterator InsertPt, DebugLoc DL,
                      const VSETVLIInfo &Info, const VSETVLIInfo &PrevInfo) {
@@ -872,11 +993,13 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     // Use X0, X0 form if the AVL is the same and the SEW+LMUL gives the same
     // VLMAX.
     if (Info.hasSameAVL(PrevInfo) && Info.hasSameVLMAX(PrevInfo)) {
-      BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
-          .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-          .addReg(RISCV::X0, RegState::Kill)
-          .addImm(Info.encodeVTYPE())
-          .addReg(RISCV::VL, RegState::Implicit);
+      auto NeedFixupMI =
+          BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+              .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+              .addReg(RISCV::X0, RegState::Kill)
+              .addImm(Info.encodeVTYPE())
+              .addReg(RISCV::VL, RegState::Implicit);
+      fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
       return;
     }
 
@@ -885,15 +1008,18 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     // same, we can use the X0, X0 form.
     if (Info.hasSameVLMAX(PrevInfo) && Info.hasAVLReg() &&
         Info.getAVLReg().isVirtual()) {
-      if (MachineInstr *DefMI = MRI->getVRegDef(Info.getAVLReg())) {
+      if (MachineInstr *DefMI =
+              getReachingDefMI(Info.getAVLReg(), &(*InsertPt), MRI)) {
         if (isVectorConfigInstr(*DefMI)) {
           VSETVLIInfo DefInfo = getInfoForVSETVLI(*DefMI);
           if (DefInfo.hasSameAVL(PrevInfo) && DefInfo.hasSameVLMAX(PrevInfo)) {
-            BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
-                .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-                .addReg(RISCV::X0, RegState::Kill)
-                .addImm(Info.encodeVTYPE())
-                .addReg(RISCV::VL, RegState::Implicit);
+            auto NeedFixupMI =
+                BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+                    .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+                    .addReg(RISCV::X0, RegState::Kill)
+                    .addImm(Info.encodeVTYPE())
+                    .addReg(RISCV::VL, RegState::Implicit);
+            fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
             return;
           }
         }
@@ -902,10 +1028,12 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
   }
 
   if (Info.hasAVLImm()) {
-    BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
-        .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-        .addImm(Info.getAVLImm())
-        .addImm(Info.encodeVTYPE());
+    auto NeedFixupMI =
+        BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
+            .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+            .addImm(Info.getAVLImm())
+            .addImm(Info.encodeVTYPE());
+    fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
     return;
   }
 
@@ -915,18 +1043,22 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     // the previous vl to become invalid.
     if (PrevInfo.isValid() && !PrevInfo.isUnknown() &&
         Info.hasSameVLMAX(PrevInfo)) {
-      BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
-          .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-          .addReg(RISCV::X0, RegState::Kill)
-          .addImm(Info.encodeVTYPE())
-          .addReg(RISCV::VL, RegState::Implicit);
+      auto NeedFixupMI =
+          BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETVLIX0))
+              .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+              .addReg(RISCV::X0, RegState::Kill)
+              .addImm(Info.encodeVTYPE())
+              .addReg(RISCV::VL, RegState::Implicit);
+      fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
       return;
     }
     // Otherwise use an AVL of 1 to avoid depending on previous vl.
-    BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
-        .addReg(RISCV::X0, RegState::Define | RegState::Dead)
-        .addImm(1)
-        .addImm(Info.encodeVTYPE());
+    auto NeedFixupMI =
+        BuildMI(MBB, InsertPt, DL, TII->get(RISCV::PseudoVSETIVLI))
+            .addReg(RISCV::X0, RegState::Define | RegState::Dead)
+            .addImm(1)
+            .addImm(Info.encodeVTYPE());
+    fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
     return;
   }
 
@@ -942,10 +1074,13 @@ void RISCVInsertVSETVLI::insertVSETVLI(MachineBasicBlock &MBB,
     DestReg = MRI->createVirtualRegister(&RISCV::GPRRegClass);
     Opcode = RISCV::PseudoVSETVLIX0;
   }
-  BuildMI(MBB, InsertPt, DL, TII->get(Opcode))
-      .addReg(DestReg, RegState::Define | RegState::Dead)
-      .addReg(AVLReg)
-      .addImm(Info.encodeVTYPE());
+  auto NeedFixupMI =
+      BuildMI(MBB, InsertPt, DL, TII->get(Opcode))
+          .addReg(DestReg, RegState::Define | RegState::Dead)
+          .addReg(AVLReg,
+                  (IsSplitRA && MRI->def_empty(AVLReg)) ? RegState::Undef : 0)
+          .addImm(Info.encodeVTYPE());
+  fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
 }
 
 static bool isLMUL1OrSmaller(RISCVII::VLMUL LMUL) {
@@ -1007,7 +1142,7 @@ bool RISCVInsertVSETVLI::needVSETVLI(const MachineInstr &MI,
   // VSETVLI here.
   if (Require.hasAVLReg() && Require.getAVLReg().isVirtual() &&
       CurInfo.hasCompatibleVTYPE(Used, Require)) {
-    if (MachineInstr *DefMI = MRI->getVRegDef(Require.getAVLReg())) {
+    if (MachineInstr *DefMI = getReachingDefMI(Require.getAVLReg(), &MI, MRI)) {
       if (isVectorConfigInstr(*DefMI)) {
         VSETVLIInfo DefInfo = getInfoForVSETVLI(*DefMI);
         if (DefInfo.hasSameAVL(CurInfo) && DefInfo.hasSameVLMAX(CurInfo))
@@ -1062,7 +1197,7 @@ void RISCVInsertVSETVLI::transferBefore(VSETVLIInfo &Info,
   // without being sure we can kill the original source reg entirely.
   if (!Info.hasAVLReg() || !Info.getAVLReg().isVirtual())
     return;
-  MachineInstr *DefMI = MRI->getVRegDef(Info.getAVLReg());
+  const MachineInstr *DefMI = getReachingDefMI(Info.getAVLReg(), &MI, MRI);
   if (!DefMI || !isVectorConfigInstr(*DefMI))
     return;
 
@@ -1187,7 +1322,7 @@ bool RISCVInsertVSETVLI::needVSETVLIPHI(const VSETVLIInfo &Require,
     return true;
 
   // We need the AVL to be produce by a PHI node in this basic block.
-  MachineInstr *PHI = MRI->getVRegDef(AVLReg);
+  MachineInstr *PHI = MRI->getUniqueVRegDef(AVLReg);
   if (!PHI || PHI->getOpcode() != RISCV::PHI || PHI->getParent() != &MBB)
     return true;
 
@@ -1202,7 +1337,7 @@ bool RISCVInsertVSETVLI::needVSETVLIPHI(const VSETVLIInfo &Require,
       return true;
 
     // We need the PHI input to the be the output of a VSET(I)VLI.
-    MachineInstr *DefMI = MRI->getVRegDef(InReg);
+    MachineInstr *DefMI = MRI->getUniqueVRegDef(InReg);
     if (!DefMI || !isVectorConfigInstr(*DefMI))
       return true;
 
@@ -1258,8 +1393,12 @@ void RISCVInsertVSETVLI::emitVSETVLIs(MachineBasicBlock &MBB) {
         MachineOperand &VLOp = MI.getOperand(getVLOpNum(MI));
         if (VLOp.isReg()) {
           // Erase the AVL operand from the instruction.
+          Register VLOpReg = VLOp.getReg();
+          bool IsVirtVLOpReg = VLOp.getReg().isVirtual();
           VLOp.setReg(RISCV::NoRegister);
           VLOp.setIsKill(false);
+          if (IsVirtVLOpReg)
+            fixupModifyVRegLI(VLOpReg, LIS);
         }
         MI.addOperand(MachineOperand::CreateReg(RISCV::VL, /*isDef*/ false,
                                                 /*isImp*/ true));
@@ -1509,8 +1648,23 @@ void RISCVInsertVSETVLI::doLocalPostpass(MachineBasicBlock &MBB) {
     Used = getDemanded(MI, MRI, ST);
   }
 
-  for (auto *MI : ToDelete)
+  std::vector<Register> NeedFixup;
+
+  for (auto *MI : ToDelete) {
+    for (auto &MO : MI->operands()) {
+      if (!MO.isReg() || MO.getReg() == 0 || !MO.getReg().isVirtual())
+        continue;
+      Register Reg = MO.getReg();
+      NeedFixup.push_back(Reg);
+    }
     MI->eraseFromParent();
+    if (IsSplitRA)
+      LIS->RemoveMachineInstrFromMaps(*MI);
+  }
+
+  for (auto Reg : NeedFixup) {
+    fixupModifyVRegLI(Reg, LIS);
+  }
 }
 
 void RISCVInsertVSETVLI::insertReadVL(MachineBasicBlock &MBB) {
@@ -1518,11 +1672,16 @@ void RISCVInsertVSETVLI::insertReadVL(MachineBasicBlock &MBB) {
     MachineInstr &MI = *I++;
     if (RISCV::isFaultFirstLoad(MI)) {
       Register VLOutput = MI.getOperand(1).getReg();
-      if (!MRI->use_nodbg_empty(VLOutput))
-        BuildMI(MBB, I, MI.getDebugLoc(), TII->get(RISCV::PseudoReadVL),
-                VLOutput);
+      bool IsVirtual = MI.getOperand(1).getReg().isVirtual();
+      if (!MRI->use_nodbg_empty(VLOutput)) {
+        auto NeedFixupMI = BuildMI(MBB, I, MI.getDebugLoc(),
+                                   TII->get(RISCV::PseudoReadVL), VLOutput);
+        fixupModifyVRegLIFromVSETVL(NeedFixupMI, LIS);
+      }
       // We don't use the vl output of the VLEFF/VLSEGFF anymore.
       MI.getOperand(1).setReg(RISCV::X0);
+      if (IsVirtual)
+        fixupModifyVRegLI(VLOutput, LIS);
     }
   }
 }
@@ -1538,6 +1697,11 @@ bool RISCVInsertVSETVLI::runOnMachineFunction(MachineFunction &MF) {
   TII = ST->getInstrInfo();
   MRI = &MF.getRegInfo();
 
+  if (IsSplitRA)
+    LIS = &getAnalysis<LiveIntervals>();
+  else
+    LIS = nullptr;
+
   assert(BlockInfo.empty() && "Expect empty block infos");
   BlockInfo.resize(MF.getNumBlockIDs());
 
@@ -1604,6 +1768,6 @@ bool RISCVInsertVSETVLI::runOnMachineFunction(MachineFunction &MF) {
 }
 
 /// Returns an instance of the Insert VSETVLI pass.
-FunctionPass *llvm::createRISCVInsertVSETVLIPass() {
-  return new RISCVInsertVSETVLI();
+FunctionPass *llvm::createRISCVInsertVSETVLIPass(bool IsSplitRA) {
+  return new RISCVInsertVSETVLI(IsSplitRA);
 }
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 953ac097b915044..a3d5def678f745d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -27,6 +27,7 @@
 #include "llvm/CodeGen/MIRParser/MIParser.h"
 #include "llvm/CodeGen/MIRYamlMapping.h"
 #include "llvm/CodeGen/Passes.h"
+#include "llvm/CodeGen/RegAllocRegistry.h"
 #include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/InitializePasses.h"
@@ -83,6 +84,10 @@ static cl::opt<bool>
                    cl::desc("Enable sinking and folding of instruction copies"),
                    cl::init(false), cl::Hidden);
 
+static cl::opt<bool> EnableSplitRA("riscv-split-RA", cl::Hidden,
+                              ...
[truncated]

@BeMg BeMg marked this pull request as ready for review October 31, 2023 06:04
@rofirrim
Collaborator

Oh, very interesting! Thanks for working on this. I didn't know we could do partial register allocation (RA)!

I have some high-level questions:

  1. unblock scheduler's constraints by removing vl/vtype def-use chain

Just for me to understand, this would not require RA of RVV registers by itself. We would still need to run the InsertVSETVL pass late, like you're doing now. Is that correct?

However splitting the passes unlocks the two following cases:

  1. Support RVV re-materialization

I understand that here you will be able to do that because you have only RA'd RVV registers, and the GPR virtual registers that we use as carriers of the VL value make this a bit easier. Is my intuition correct here?

  1. Support partial spill

Similar to the above, I imagine you plan to emit a PseudoVSPILL_VL / PseudoVRELOAD_VL instruction or something like that, right?

Did you consider partial copies as well?

I *think* some of those might be hindered by the tail-undisturbed policy, which may still require spilling/copying the whole register. Do you think it does, or would it already be handled after RA of RVV registers?

Looking forward to progress here, this looks promising!

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (outdated review thread, resolved)

static llvm::once_flag InitializeDefaultRVVRegisterAllocatorFlag;

/// -riscv-splitRA-rvv-regalloc=... command line option.
Contributor

Is riscv-splitRA-rvv-regalloc redundant? Why not riscv-rvv-regalloc? It looks like that is what AMDGPU does with sgpr-regalloc and vgpr-regalloc.

Contributor Author

Done

@@ -427,7 +535,8 @@ void RISCVPassConfig::addPreRegAlloc() {
addPass(createRISCVPreRAExpandPseudoPass());
if (TM->getOptLevel() != CodeGenOptLevel::None)
addPass(createRISCVMergeBaseOffsetOptPass());
addPass(createRISCVInsertVSETVLIPass());
if (!EnableSplitRA)
Contributor

IIUC, there is nothing stopping us from being able to do split register allocation with the vsetvli insertion pass remaining where it was.

I think we should have an option for enabling split register allocation, and another option for enabling late vsetvli insertion. That way we can have the following functionality:

  1. Split register allocation && early vsetvli insertion
  2. Split register allocation && late vsetvli insertion
  3. Non-split register allocation && early vsetvli insertion
  4. Non-split register allocation && late vsetvli insertion

This flexibility may allow us to pinpoint exactly where the shortcomings are (do they come from split regalloc, late vsetvli insertion, or the combination of both)?
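
As a purely illustrative sketch of that split in RISCVTargetMachine.cpp (the flag names and where the late run would be scheduled are assumptions for exposition, not what the patch ends up using):

// Hypothetical sketch: two independent flags, one for split RVV/scalar
// register allocation and one for late (post-RVV-regalloc) vsetvli insertion.
static cl::opt<bool> EnableSplitRegAlloc(
    "riscv-split-regalloc", cl::Hidden, cl::init(false),
    cl::desc("Allocate RVV registers separately from scalar registers"));

static cl::opt<bool> EnableLateVSETVLInsertion(
    "riscv-late-vsetvli-insertion", cl::Hidden, cl::init(false),
    cl::desc("Insert vsetvli instructions after RVV register allocation"));

void RISCVPassConfig::addPreRegAlloc() {
  // ... existing pre-RA passes ...
  // Early insertion (the current behaviour) stays before register allocation;
  // with EnableLateVSETVLInsertion the pass would instead run between the RVV
  // and scalar register allocation runs.
  if (!EnableLateVSETVLInsertion)
    addPass(createRISCVInsertVSETVLIPass());
}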

@BeMg
Copy link
Contributor Author

BeMg commented Nov 2, 2023

Oh, very interesting! Thanks for working on this. I didn't know we could do partial register allocation (RA)!

I have some high-level questions:

  1. unblock scheduler's constraints by removing vl/vtype def-use chain

Just for me to understand, this would not require RA of RVV registers by itself. We would still need to run the InsertVSETVL pass late, like you're doing now. Is that correct?

Yes. For the scheduler, it actually doesn't require changing the register allocator.

However splitting the passes unlocks the two following cases:

  1. Support RVV re-materialization

I understand that here you will be able to do that because you have only RA'd RVV registers, and the GPR virtual registers that we use as carriers of the VL value make this a bit easier. Is my intuition correct here?

For now I only consider the trivial case for remat. Removing the implicit vl/vtype uses not only unblocks the scheduler but also makes default remat (isReallyTriviallyReMaterializable) possible for RVV instructions (for example, vid and vmv).

bool TargetInstrInfo::isReallyTriviallyReMaterializable(

The more complicated cases (involving a def-use chain) could be future work, either by defining a RISC-V-specific remat hook or by adding new pseudos that are suitable for default remat.
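
A hedged sketch of what the trivial remat case could look like once the implicit vl/vtype reads are gone (the opcodes and the whole override are illustrative; this is not code from this patch):

// Illustrative only: without implicit $vl/$vtype uses, a pseudo like vid.v or
// vmv.v.i depends only on its explicit operands, so it becomes a candidate
// for the generic trivial-rematerialization check.
bool RISCVInstrInfo::isReallyTriviallyReMaterializable(
    const MachineInstr &MI) const {
  switch (MI.getOpcode()) {
  case RISCV::PseudoVID_V_M1:      // opcode list is illustrative, not complete
  case RISCV::PseudoVMV_V_I_M1:
    // Assumes the AVL operand is an immediate (or otherwise available at the
    // remat point) and the passthru operand is undef.
    return true;
  default:
    return TargetInstrInfo::isReallyTriviallyReMaterializable(MI);
  }
}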

  1. Support partial spill

Similar to the above, I imagine you plan to emit a PseudoVSPILL_VL / PseudoVRELOAD_VL instruction or something like that, right?

Yes. We plan to add concrete size information to each MachineInstr. With this information, we can update RISCVInstrInfo::storeRegToStackSlot and RISCVInstrInfo::loadRegFromStackSlot to emit the pseudo based on the concrete size.
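
A hypothetical sketch of that direction (getConcreteSpillElementCount and the VL-limited pseudo are made-up names; PseudoVSPILL_VL is the placeholder name used earlier in this thread):

// Illustrative only: once each spilled value carries a concrete element
// count, the spiller could emit a VL-limited store instead of spilling the
// whole register group.
void RISCVInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,
                                         MachineBasicBlock::iterator MBBI,
                                         Register SrcReg, bool IsKill,
                                         int FrameIndex,
                                         const TargetRegisterClass *RC,
                                         const TargetRegisterInfo *TRI,
                                         Register VReg) const {
  // Hypothetical query: how many elements of the spilled value are live here?
  if (std::optional<unsigned> Elts = getConcreteSpillElementCount(VReg)) {
    // Emit a PseudoVSPILL_VL-style store covering only *Elts elements.
    // ...
    return;
  }
  // Otherwise fall back to the existing whole-register spill sequence.
  // ...
}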

Did you consider partial copies as well?

No, but it would work in the same flow.

I * think * some of those might be hindered by the tail undisturbed policy which may still require to spill/copy the whole register. Do you think it does or it would already be handled after RA of RVV registers?

This is why we need the concrete size information: to make sure partial spills/reloads don't break correctness.

Looking forward progress here, this looks promising!

Thanks for the comments.

Collaborator

@preames preames left a comment

As discussed at the dev meeting, I think this is the right overall direction.

In terms of staging, I think it makes sense to first split regalloc in its own review, and then add the vsetvli variant in between.

I'm concerned by the regressions you mentioned. Have you investigated them? In particular, are there any that are caused by splitting RA on its own? If they're "only" vsetvli regressions, those are likely less scary and a bit easier to fix.

MI && MI->getOpcode() == RISCV::ADDI &&
MI->getOperand(1).isReg() && MI->getOperand(2).isImm() &&
if (MachineInstr *MI = MRI.getUniqueVRegDef(getAVLReg());
MI && MI->getOpcode() == RISCV::ADDI && MI->getOperand(1).isReg() &&
Collaborator

This looks to be a stray whitespace change. Remove.


public:
static char ID;

RISCVInsertVSETVLI() : MachineFunctionPass(ID) {
RISCVInsertVSETVLI(bool IsSplitRA = false)
Collaborator

In terms of this flag, maybe a better name would be isPostRVVRegAlloc?

@BeMg
Contributor Author

BeMg commented Nov 8, 2023

Thanks for reviewing.

As discussed at the dev meeting, I think this is the right overall direction.

In terms of staging, I think it makes sense to first split regalloc in its own review, and then add the vsetvli variant in between.

Sure.

I'm concerned by the regressions you mentioned. Have you investigated them? In particular, are there any that are caused by splitting RA on its own? If they're "only" vsetvli regressions, those are likely less scary and a bit easier to fix.

I pushed some commits to show how many LLVM regression tests are affected when we enable splitting RA only, splitting RA with the new vsetvl insertion pass, and the new vsetvl insertion pass only. In summary, splitting RA by itself doesn't change much, but the new vsetvl insertion pass changes a lot, even on its own.

if (MRI->isSSA())
return MRI->getVRegDef(VReg);

return findReachingDef(VReg, MI, MRI);
Collaborator

Can we use VNInfo, LiveQueryResult, getInstructionIndex, etc. to get the definition?

Contributor Author

Now it uses LiveIntervals/VNInfo to get the definition.

@lukel97
Contributor

lukel97 commented Nov 13, 2023

I tried out this patch and I agree that the register allocation split should be done in a separate PR, with the split just enabled by default: there were only four lit test diffs, and none of them were regressions. I also ran this on the llvm-test-suite for rv64gv and couldn't find any regressions, just some shuffling about of registers; here's the diff:
regalloc-split.patch

@lukel97
Contributor

lukel97 commented Nov 13, 2023

One small regression I noticed with the post-ra insertvsetvli pass is this function from test/CodeGen/RISCV/rvv/fixed-vectors-mask-buildvec.ll:

define <4 x i1> @buildvec_mask_nonconst_v4i1(i1 %x, i1 %y) {
  %1 = insertelement <4 x i1> poison, i1 %x, i32 0
  %2 = insertelement <4 x i1> %1,  i1 %x, i32 1
  %3 = insertelement <4 x i1> %2,  i1 %y, i32 2
  %4 = insertelement <4 x i1> %3,  i1 %y, i32 3
  ret <4 x i1> %4
}

With pre-RA vsetvli insertion we have:

buildvec_mask_nonconst_v4i1:            # @buildvec_mask_nonconst_v4i1
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 4, e8, mf4, ta, ma
	vmv.v.i	v0, 3
	vmv.v.x	v8, a1
	vmerge.vxm	v8, v8, a0, v0
	vand.vi	v8, v8, 1
	vmsne.vi	v0, v8, 0
	ret

But post-RA insertion results in an extra vsetvli:

buildvec_mask_nonconst_v4i1:            # @buildvec_mask_nonconst_v4i1
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 1, e8, mf8, ta, ma
	vmv.v.i	v0, 3
	vsetivli	zero, 4, e8, mf4, ta, ma
	vmv.v.x	v8, a1
	vmerge.vxm	v8, v8, a0, v0
	vand.vi	v8, v8, 1
	vmsne.vi	v0, v8, 0
	ret

From what I can tell this is due to how the vmv.v.i gets scheduled before the vmv.v.x, e.g. the beginning of the BB goes from this:

bb.0 (%ir-block.0):
  liveins: $x10, $x11
  %1:gpr = COPY $x11
  %0:gpr = COPY $x10
  %2:vr = PseudoVMV_V_X_MF4 $noreg(tied-def 0), %1:gpr, 4, 3, 0
  %3:vr = PseudoVMV_V_I_MF8 $noreg(tied-def 0), 3, 1, 3, 0

to this:

0B	bb.0 (%ir-block.0):
	  liveins: $x10, $x11
16B	  %1:gpr = COPY $x11
32B	  %0:gpr = COPY $x10
64B	  renamable $v0 = PseudoVMV_V_I_MF8 undef renamable $v0(tied-def 0), 3, 1, 3, 0
80B	  renamable $v8 = PseudoVMV_V_X_MF4 undef renamable $v8(tied-def 0), %1:gpr, 4, 3, 0

We end up with the extra vsetvli because we have an optimisation where we avoid inserting a vsetvli if we know we have a vmv.v.i with VL=1 in needVSETVLI, which is in turn called after the first instruction in transferBefore when going from the vmv.v.x -> vmv.v.i. But because the vmv.v.i instruction is now first, we don't do the check going between the vmv.v.i -> vmv.v.x.

We could probably recover this case by teaching the backwards local postpass about this.

@BeMg
Contributor Author

BeMg commented Nov 13, 2023

I have opened the separate patch for splitting regalloc only. #72096

@BeMg BeMg changed the title [RISCV] RISC-V split register allocation and move vsetvl pass in between [RISCV] Support postRA vsetvl insertion pass Nov 13, 2023
@lukel97
Contributor

lukel97 commented Nov 13, 2023

With -riscv-vsetvli-after-rvv-regalloc=true, we seem to lose stack slot coloring, e.g. with

declare <vscale x 8 x i64> @llvm.bitreverse.nxv8i64(<vscale x 8 x i64> %va)
define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
  %a = call <vscale x 8 x i64> @llvm.bitreverse.nxv8i64(<vscale x 8 x i64> %va)
  ret <vscale x 8 x i64> %a
}

We no longer mark one of the slots as dead:

diff --git a/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
index 3514fa66f588..47f097087601 100644
--- a/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll
@@ -1134,9 +1134,9 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    addi sp, sp, -16
 ; RV32-NEXT:    .cfi_def_cfa_offset 16
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    slli a0, a0, 4
 ; RV32-NEXT:    sub sp, sp, a0
-; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
+; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
 ; RV32-NEXT:    sw zero, 12(sp)
 ; RV32-NEXT:    lui a0, 1044480
 ; RV32-NEXT:    sw a0, 8(sp)
@@ -1162,6 +1162,10 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    addi a4, sp, 16
 ; RV32-NEXT:    vl8r.v v0, (a4) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v24, v24, v0
+; RV32-NEXT:    csrr a4, vlenb
+; RV32-NEXT:    slli a4, a4, 3
+; RV32-NEXT:    add a4, sp, a4
+; RV32-NEXT:    addi a4, a4, 16
 ; RV32-NEXT:    vs8r.v v24, (a4) # Unknown-size Folded Spill
 ; RV32-NEXT:    vand.vx v0, v8, a2
 ; RV32-NEXT:    vsll.vx v0, v0, a1
@@ -1173,7 +1177,10 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    vsll.vi v16, v16, 8
 ; RV32-NEXT:    vor.vv v8, v8, v16
 ; RV32-NEXT:    vor.vv v8, v24, v8
-; RV32-NEXT:    addi a0, sp, 16
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v8, v8, v16
 ; RV32-NEXT:    vsrl.vi v16, v8, 4
@@ -1207,7 +1214,7 @@ define <vscale x 8 x i64> @bitreverse_nxv8i64(<vscale x 8 x i64> %va) {
 ; RV32-NEXT:    vadd.vv v8, v8, v8
 ; RV32-NEXT:    vor.vv v8, v16, v8
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    slli a0, a0, 4
 ; RV32-NEXT:    add sp, sp, a0
 ; RV32-NEXT:    addi sp, sp, 16
 ; RV32-NEXT:    ret

It seems like it's because LiveStacks is now empty when we get to the stack slot colouring pass. Do we need to mark it as preserved?
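
If LiveStacks preservation does turn out to be the missing piece, a minimal sketch of the fix in the pass's getAnalysisUsage (assuming the standard addPreserved mechanism is sufficient here; whether it is, is exactly the open question above):

#include "llvm/CodeGen/LiveStacks.h"

void RISCVInsertVSETVLI::getAnalysisUsage(AnalysisUsage &AU) const {
  AU.setPreservesCFG();
  // Keep the stack-slot liveness computed during register allocation valid,
  // so StackSlotColoring can still merge and mark slots after this pass.
  AU.addPreserved<LiveStacks>();
  MachineFunctionPass::getAnalysisUsage(AU);
}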

@preames
Collaborator

preames commented Nov 16, 2023

Now that #72096 has landed, the next step is a change which enables the split RA by default. Once that's in, this change can be rebased on tip of tree, and we can focus review attention on the changes to InsertVSETVLI and the resulting test changes.

@BeMg BeMg requested review from lukel97 and topperc November 17, 2023 10:30
@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from e84993b to c92c6af Compare May 15, 2024 05:17
@BeMg
Contributor Author

BeMg commented May 15, 2024

Rebase with main

@lukel97
Contributor

lukel97 commented May 15, 2024

I ran this on SPEC CPU 2017 (-O3 -march=rv64gv) and collected the dynamic instruction count (total insns), dynamic vsetvl instruction count (vset), dynamic non vsetvl x0,x0 instruction count (vl set) and number of spills inserted.

The changes are all less than ±1%, with the exception of 619.lbm_s/519.lbm_r, where there are 40% more vsetvls executed but a 23% improvement in dynamic instruction count. The improvement seems to be due to us spilling fewer vectors in a hot loop.

On leela, there's 0.9% more vsetvls executed and 0.9% more non-x0,x0s. So it's likely that the new vsetvls aren't VL-preserving.
On x264, there's more non-vsetvl x0,x0s executed, but fewer vsetvls executed overall.

So overall I think this looks good.

Program | vset prera | vset postra | vset diff | total insns prera | total insns postra | total insns diff | regalloc.NumSpills prera | regalloc.NumSpills postra | regalloc.NumSpills diff | vl set prera | vl set postra | vl set diff
test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 3.93984e+09 5.49984e+09 0.395955 9.34441e+11 7.23061e+11 -0.22621 208 179 -0.139423 1662 1666 0.00240674
test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 4.95512e+08 6.90512e+08 0.393532 1.17465e+11 9.08479e+10 -0.226599 208 180 -0.134615 1142 1146 0.00350263
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 4.22353e+08 4.26072e+08 0.00880622 4.674e+11 4.67404e+11 7.95723e-06 358 358 0 3.91329e+08 3.94923e+08 0.00918232
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 4.22353e+08 4.26072e+08 0.00880622 4.674e+11 4.67404e+11 7.95723e-06 358 358 0 3.91329e+08 3.94923e+08 0.00918232
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 1.88329e+09 1.88349e+09 0.000103542 7.20803e+11 7.20803e+11 3.96615e-07 13887 13888 7.20098e-05 1.4395e+09 1.43956e+09 4.23063e-05
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 1.73542e+07 1.73559e+07 9.89384e-05 1.08127e+10 1.08127e+10 1.69088e-06 13720 13728 0.00058309 1.69636e+07 1.69654e+07 0.000108586
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 1.73542e+07 1.73559e+07 9.89384e-05 1.08124e+10 1.08124e+10 1.69092e-06 13720 13728 0.00058309 1.69636e+07 1.69654e+07 0.000108586
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 6.54343e+07 6.54382e+07 5.89141e-05 3.15461e+10 3.15416e+10 -0.000142572 1916 1902 -0.00730689 5.5646e+07 5.56494e+07 6.2089e-05
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1.7041e+09 1.70412e+09 1.18502e-05 1.50246e+11 1.50246e+11 -4.98108e-07 44229 44232 6.78288e-05 1.69819e+09 1.69819e+09 6.17128e-07
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 2.26807e+08 2.26807e+08 7.93626e-07 1.41909e+11 1.4195e+11 0.000287282 694 698 0.00576369 2.26807e+08 2.26807e+08 7.93626e-07
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 2.26807e+08 2.26807e+08 7.93626e-07 1.41903e+11 1.41943e+11 0.000287296 694 698 0.00576369 2.26807e+08 2.26807e+08 7.93626e-07
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1.86518e+10 1.86518e+10 1.0814e-07 2.47888e+11 2.47888e+11 1.17916e-08 4634 4670 0.00776867 2.62581e+07 2.62609e+07 0.000105491
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1.86518e+10 1.86518e+10 1.0814e-07 2.47888e+11 2.47888e+11 1.17916e-08 4634 4670 0.00776867 2.62581e+07 2.62609e+07 0.000105491
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 218721 218721 0 1.79594e+09 1.79594e+09 0 4291 4291 0 218718 218718 0
test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test 6.19906e+07 6.19906e+07 0 4.72968e+10 4.72968e+10 0 301 301 0 5.57433e+07 5.57433e+07 0
test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test 2.36169e+07 2.36169e+07 0 1.53744e+11 1.53744e+11 0 123 123 0 2.36169e+07 2.36169e+07 0
test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test 6.19906e+07 6.19906e+07 0 4.72968e+10 4.72968e+10 0 301 301 0 5.57433e+07 5.57433e+07 0
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 2.23429e+09 2.23429e+09 0 4.08002e+11 4.08002e+11 0 749 749 0 2.23429e+09 2.23429e+09 0
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 218721 218721 0 1.79594e+09 1.79594e+09 0 4291 4291 0 218718 218718 0
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 2.23429e+09 2.23429e+09 0 4.08002e+11 4.08002e+11 0 749 749 0 2.23429e+09 2.23429e+09 0
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 9.48185e+07 9.48185e+07 0 2.1198e+11 2.11983e+11 1.13134e-05 6743 6744 0.000148302 9.46089e+07 9.46089e+07 0
test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test 2.36169e+07 2.36169e+07 0 1.53744e+11 1.53744e+11 0 123 123 0 2.36169e+07 2.36169e+07 0
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 4.64859e+08 4.64859e+08 -4.30238e-09 2.89851e+11 2.89851e+11 2.99464e-09 1604 1604 0 4.51635e+08 4.51635e+08 -4.42836e-09
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 4.64859e+08 4.64859e+08 -4.30238e-09 2.89851e+11 2.89851e+11 3.40174e-09 1604 1604 0 4.51635e+08 4.51635e+08 -4.42836e-09
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 6.96396e+08 6.96396e+08 -1.37853e-07 4.42779e+11 4.42818e+11 8.90763e-05 359 358 -0.00278552 6.37384e+08 6.37384e+08 0
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 6.60366e+08 6.60366e+08 -1.45374e-07 4.10743e+11 4.10779e+11 8.81044e-05 359 358 -0.00278552 6.04725e+08 6.04725e+08 0
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 5.04246e+09 5.02065e+09 -0.00432564 2.19801e+11 2.19701e+11 -0.000456991 1894 1894 0 1.16667e+09 1.1786e+09 0.0102294
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 5.04246e+09 5.02065e+09 -0.00432564 2.19801e+11 2.19701e+11 -0.000456991 1894 1894 0 1.16667e+09 1.1786e+09 0.0102294
Geomean difference 0.0243876 -0.0181787 -0.0099226 0.00161097

Contributor

@lukel97 lukel97 left a comment

LGTM. I didn't look at all the test diffs, but in the ones I did I didn't spot anything interesting in them.

Some spills are removed, some new ones are added, but after looking at the benchmark results I don't think they're an issue. Same too goes for the fluctuation in copies.

This is a big change though so probably best to wait for a second LGTM too!

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (two outdated review threads, resolved)
Collaborator

@preames preames left a comment

Did a sweep through the test diffs - mostly ignored vp tests, and don't claim to have looked at everything. We could do another round of diff reduction here, but I am not currently asking for it. I'm going to apply this downstream and look at a couple workloads, another comment will follow once I've done so.

I will note that my comments flag at least one real regression caused/triggered by this patch. I'm not yet sure that regression matters but it does appear to be real.

; CHECK-NEXT: vmv1r.v v0, v24
; CHECK-NEXT: csrr a0, vlenb
Collaborator

This delta is interesting and surprising - in a good way.

I hadn't considered that this change would allow reuse of the vlenb CSR read.

Though there's also a missed opportunity here, both before and after: a0 is vlenb. With etype=e64 and emul=m8, that means the VL here is VLMAX.

Contributor Author

This delta is interesting and surprising - in a good way.

I hadn't considered that this change would allow reuse of the vlenb CSR read.

I think it is a coincidence that PseudoReadVLENB chooses $x11 as the destination register ($x10 is still used by vsetvl) during Prologue/Epilogue Insertion & Frame Finalization, allowing the Machine Late Instructions Cleanup Pass to reuse definitions from predecessors.

; CHECK-NEXT: lui a1, %hi(.LCPI10_0)
; CHECK-NEXT: flh fa5, %lo(.LCPI10_0)(a1)
; CHECK-NEXT: vmv1r.v v16, v0
Collaborator

This is another (unrelated) missed opportunity. We're moving v0 into another register, but nothing between this and the restore appears to actually write to v0. Do we have an imprecise clobber specification on some instruction?

Contributor Author

During instruction selection, the compiler will emit $v0 = COPY %1:vr for each RVV mask instruction. These COPYs will be eliminated by the Machine Copy Propagation pass, but $v16 = COPY $v0 can't be eliminated due to the PseudoVMFLT_VFPR16_M8_MASK early-clobber flag.

Could PseudoVMFLT_VFPR16_M8_MASK accept $v0 as the destination register, or is there another chance to remove that COPY?

; After Instruction Selection
bb.0 (%ir-block.0):
  liveins: $v8m8, $v0, $x10
  %2:gprnox0 = COPY $x10
  %1:vr = COPY $v0
  %0:vrm8 = COPY $v8m8
  %3:vrm8 = COPY %0:vrm8
  $v0 = COPY %1:vr
  %4:vrm8nov0 = PseudoVFSGNJX_VV_M8_E16_MASK $noreg(tied-def 0), %3:vrm8, %3:vrm8, $v0, %2:gprnox0, 4, 3
  %5:gpr = LUI target-flags(riscv-hi) %const.0
  %6:fpr16 = FLH killed %5:gpr, target-flags(riscv-lo) %const.0 :: (load (s16) from constant-pool)
  $v0 = COPY %1:vr
  early-clobber %7:vr = nofpexcept PseudoVMFLT_VFPR16_M8_MASK %1:vr(tied-def 0), killed %4:vrm8nov0, killed %6:fpr16, $v0, %2:gprnox0, 4
...
; After Machine Copy Propagation Pass
bb.0 (%ir-block.0):
  liveins: $v0, $x10, $v8m8
  renamable $x11 = LUI target-flags(riscv-hi) %const.0
  renamable $f15_h = FLH killed renamable $x11, target-flags(riscv-lo) %const.0 :: (load (s16) from constant-pool)
  renamable $v16 = COPY $v0
  dead $x0 = PseudoVSETVLI killed renamable $x10, 203, implicit-def $vl, implicit-def $vtype
  renamable $v24m8 = PseudoVFSGNJX_VV_M8_E16_MASK undef renamable $v24m8(tied-def 0), renamable $v8m8, renamable $v8m8, $v0, $noreg, 4, 3, implicit $vl, implicit $vtype
  dead $x0 = PseudoVSETVLIX0 killed $x0, 75, implicit-def $vl, implicit-def $vtype, implicit $vl
  early-clobber renamable $v16 = nofpexcept PseudoVMFLT_VFPR16_M8_MASK killed renamable $v16(tied-def 0), killed renamable $v24m8, killed renamable $f15_h, $v0, $noreg, 4, implicit $vl, implicit $vtype
...

; RV32-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
; RV32-NEXT: vslideup.vi v10, v9, 4
; RV32-NEXT: vslideup.vi v9, v10, 4
Collaborator

Another (unrelated) opportunity. We're inserting a subvec into a zero vector just to compute a sum reduction where the zero elements are nops. We could have computed the sum on the sub-vec directly.

@@ -197,28 +197,51 @@ entry:
define void @test_compresstore_v256i8(ptr %p, <256 x i1> %mask, <256 x i8> %data) {
; RV64-LABEL: test_compresstore_v256i8:
; RV64: # %bb.0: # %entry
; RV64-NEXT: vmv1r.v v7, v8
Collaborator

This change is a bit concerning, we have no actual reordering happening here, but we chose a strictly less optimal register allocation solution. I don't see any obvious reason for this one, does anyone else?

Contributor Author

Looking at the register allocation process, the main reason for the less optimal regalloc is that the register allocator makes different eviction decisions. These decisions differ due to different LiveInterval spill weights (it evicts the one with the smaller weight). One factor in the weight calculation is the LiveInterval's size (and it changed between the pre-RA and post-RA vsetvl flows).


In this test case, %12 is going to evict another interval.

selectOrSplit VRM8:%12 [240e,288r:0) 0@240e  weight:4.454343e-03 w=4.454343e-03

The pre-RA flow chooses to evict %2:

VR:%2 [48r,336r:0) 0@48r  weight:4.404070e-03 w=4.404070e-03
VRM8:%3 [32r,240r:0) 0@32r  weight:3.322368e-03 w=3.322368e-03

...

unassigning %2 from $v8: V8
assigning %12 to $v8m8: V8 [240e,288r:0) 0@240e V9 [240e,288r:0) 0@240e V10 [240e,288r:0) 0@240e V11 [240e,288r:0) 0@240e V12 [240e,288r:0) 0@240e V13 [240e,288r:0) 0@240e V14 [240e,288r:0) 0@240e V15 [240e,288r:0) 0@240e

The post-RA flow chooses to evict %3 because %2's spill weight is heavier than %12's:

VR:%2 [48r,256r:0) 0@48r  weight:4.983553e-03 w=4.983553e-03
VRM8:%3 [32r,192r:0) 0@32r  weight:3.607143e-03 w=3.607143e-03

...

unassigning %3 from $v16m8: V16 V17 V18 V19 V20 V21 V22 V23
assigning %12 to $v16m8: V16 [192e,224r:0) 0@192e V17 [192e,224r:0) 0@192e V18 [192e,224r:0) 0@192e V19 [192e,224r:0) 0@192e V20 [192e,224r:0) 0@192e V21 [192e,224r:0) 

The second-round assignment for %3 causes the spill code insertion.
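
For context, the weight being compared is roughly the interval's use/def frequency normalized by its length. A sketch of the normalization helper (reproduced from memory from llvm/CodeGen/CalcSpillWeights.h, so treat the exact constant as approximate):

// Shorter intervals with the same use/def frequency get larger weights, which
// is why the changed interval sizes above flip the eviction decision.
static inline float normalizeSpillWeight(float UseDefFreq, unsigned Size,
                                         unsigned NumInstr) {
  return UseDefFreq / (Size + 25 * SlotIndex::InstrDist * NumInstr);
}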

@@ -35,10 +35,10 @@ define <512 x i8> @single_source(<512 x i8> %a) {
; CHECK-NEXT: vslidedown.vi v16, v16, 4
; CHECK-NEXT: li a0, 466
; CHECK-NEXT: li a1, 465
; CHECK-NEXT: lbu a2, 1012(sp)
Collaborator

This is a slight regression; it's the same push-back heuristic I raised previously. Non-blocking, but this would be good to address.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-mask-buildvec.ll (outdated review thread, resolved)
; CHECK-NEXT: lui a0, 11
; CHECK-NEXT: addi a0, a0, -1366
; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
; CHECK-NEXT: vsetivli zero, 1, e16, m1, ta, ma
Collaborator

Again, real (if minor) regression.

Contributor

@lukel97 lukel97 May 17, 2024

It looks like we're actually removing a vsetvli overall here. But we go from 1 VL toggle to 2.

Although we could go a step further and teach the doLocalPostpass to change the AVL and LMUL in this case to avoid a toggle:

vsetivli zero, 16, e16, m2, ta, ma
vmv.s.x v0, a0
vsetvli zero, zero, e8, m1, ta, ma
vslidedown.vi v8, v8, 1

This isn't something that it was able to do before, since doLocalPostpass will only mutate vsetvlis if it can remove a vsetvli entirely.

@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from 601d59c to 757d1d1 Compare May 17, 2024 07:12
@preames
Collaborator

preames commented May 20, 2024

Following up on my prior comment, thank you @BeMg and @lukel97 for the analysis on each of the cases I flagged. For my purposes, the understanding demonstrated on the one remaining regression is sufficient. Knowing that it's an artifact of scheduling combined with an existing missed optimization gives me confidence we'll be able to address this.

Given that, LGTM!

I want to compliment both of you for working through a long and challenging process of splitting this up into reviewable pieces, landing and rebasing many, many times, and driving this to conclusion. Thank you!

@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from 757d1d1 to 918f004 Compare May 21, 2024 03:07
lukel97 added a commit to lukel97/llvm-project that referenced this pull request May 21, 2024
…CVInsertVSETVLI

We have two rules in needVSETVLI where we can relax the demanded fields for slides and splats when VL=1.

However these aren't present in getDemanded which prevents us from coalescing some vsetvlis around slides and splats in the backwards pass.

The reasoning as to why they weren't in getDemanded is that these require us to check the value of the AVL operand, which may be stale in the backwards pass: the actual VL or VTYPE value may differ from what was precisely requested in the pseudo's operands.

Using the original operands should actually be fine though, as we only care about what was originally demanded by the instruction. The current value of VL or VTYPE shouldn't influence this.

This addresses some of the regressions we are seeing in llvm#70549 from splats and slides getting reordered.
lukel97 added a commit that referenced this pull request May 21, 2024
…CVInsertVSETVLI (#92860)

We have two rules in needVSETVLI where we can relax the demanded fields
for slides and splats when VL=1.

However these aren't present in getDemanded which prevents us from
coalescing some vsetvlis around slides and splats in the backwards pass.

The reasoning as to why they weren't in getDemanded is that these
require us to check the value of the AVL operand, which may be stale in
the backwards pass: the actual VL or VTYPE value may differ from what
was precisely requested in the pseudo's operands.

Using the original operands should actually be fine though, as we only
care about what was originally demanded by the instruction. The current
value of VL or VTYPE shouldn't influence this.

This addresses some of the regressions we are seeing in #70549 from
splats and slides getting reordered.
@BeMg BeMg force-pushed the riscv-split-regalloc-and-vsetvlpass-in-between branch from 918f004 to f2a60bd Compare May 21, 2024 06:17
@BeMg BeMg merged commit 675e7bd into llvm:main May 21, 2024
3 of 4 checks passed
@lukel97
Contributor

lukel97 commented May 21, 2024

Nice, great work :)

Now that rematerialization and partial spilling are unblocked, were you planning to take a look at either of them?
I have half a proof-of-concept patch lying around for rematerialization that I could pick up again, but only if you haven't already started that work.

lukel97 added a commit to lukel97/llvm-project that referenced this pull request May 21, 2024
We no longer need to separate the passes now that llvm#70549 is landed and this will unblock llvm#89089.

It's not strictly NFC because it will move coalescing before register allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this makes it closer to the original behaviour.
@BeMg
Contributor Author

BeMg commented May 21, 2024

@lukel97 Thanks for collaborating on these patches. The reviewers' comments also made this patch possible.

Nice, great work :)

Now that rematerialization and partial spilling are unblocked, were you planning to take a look at either of them? I have half a proof-of-concept patch lying around for rematerialization that I could pick up again, but only if you haven't already started that work.

Feel free to pick up on rematerialization. I haven't started on it yet. :)

I will explore whether we could do something in the scheduler to control vsetvl insertion.

@topperc
Collaborator

topperc commented May 22, 2024

Thank you @lukel97 and @BeMg for pushing this through.

@michalt
Contributor

michalt commented May 23, 2024

Yeah, this is pretty exciting, thanks for all the work everyone! 🎉

lukel97 added a commit to lukel97/llvm-project that referenced this pull request May 29, 2024
We no longer need to separate the passes now that llvm#70549 is landed and this will unblock llvm#89089.

It's not strictly NFC because it will move coalescing before register allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this makes it closer to the original behaviour.
lukel97 added a commit that referenced this pull request May 29, 2024
We no longer need to separate the passes now that #70549 is landed and
this will unblock #89089.

It's not strictly NFC because it will move coalescing before register
allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this
makes it closer to the original behaviour.