[ModuloSchedule] Implement modulo variable expansion for pipelining #65609

Open · wants to merge 13 commits into main

Conversation

ytmukai
Contributor

@ytmukai ytmukai commented Sep 7, 2023

Modulo variable expansion is a technique that resolves overlapping variable lifetimes by unrolling. The existing implementation resolves the overlap by copying values with move instructions on processors with ordinary registers, such as Arm and x86. This method may result in a very large number of move instructions, which can cause performance problems.
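
For intuition, here is a minimal hypothetical example (not taken from the patch) of the overlap problem MVE addresses:

// t is defined in an early stage and consumed in a later stage, so after
// pipelining its lifetime can span more than one initiation interval and
// overlap the next iteration's definition of t.
void kernel(double *c, const double *a, const double *b, int n) {
  for (int i = 1; i < n; ++i) {
    double t = a[i] * b[i]; // produced in an early stage
    c[i] = t + c[i - 1];    // consumed in a later stage
  }
}
// The existing expander breaks such an overlap by inserting a register copy
// in the kernel each iteration; MVE instead unrolls the kernel so that each
// unrolled copy writes its own register (t0, t1, ...), avoiding the copies.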

Modulo variable expansion is enabled by specifying -pipeliner-mve-cg. A backend must implement some newly defined interfaces in PipelinerLoopInfo.

Discourse thread: https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner

@ytmukai
Contributor Author

ytmukai commented Sep 7, 2023

I also created an implementation for AArch64 at https://github.com/ytmukai/llvm-project/tree/swpl-aarch64. I will create a PR for it after this PR is merged, as it depends on this PR.

I measured the effect of modulo variable expansion (MVE) with llvm-test-suite. I used an Ampere Altra Max, an AArch64 processor with Neoverse N1 cores.
The performance compared to the non-pipelined results is as follows:

MVE                      >3% improvement cases  >3% degradation cases
Disabled (conventional)  50                     222
Enabled                  71                     115

The total number of test cases is 811, selected by enabling TEST_SUITE_BENCHMARKING_ONLY.

The number of cases in which the performance improvement is more than 3% has increased by 40%.
There are still many cases showing degradation. I think this could be improved by adding a process to limit register pressure.

The optimization flags used are as follows:

  • non-pipelined: -O3
  • MVE disabled: non-pipelined flags + -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=1000 -mllvm --pipeliner-max-mii=1000 -mllvm -enable-misched=0 -mllvm -enable-post-misched=0 -mtune=neoverse-n1 -mllvm -pipeliner-enable-copytophi=0
  • MVE enabled: MVE disabled flags + -mllvm -pipeliner-mve-cg=1

@ytmukai
Contributor Author

ytmukai commented Sep 20, 2023

ping

/// Check if ModuloScheduleExpanderMVE can be applied to L
bool ModuloScheduleExpanderMVE::canApply(MachineLoop &L) {
if (!L.getExitBlock()) {
LLVM_DEBUG(dbgs() << "Can not apply MVE expander\n";);
Contributor

Would be a good debug enhancement to add small explanations to each failure so users know what in particular causes this not to apply.
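
For example, the messages might look something like this (hypothetical wording, matching the two checks visible in this function):

LLVM_DEBUG(dbgs() << "Can not apply MVE expander: no single exit block\n");
LLVM_DEBUG(dbgs() << "Can not apply MVE expander: def is used outside the loop or by a PHI\n");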

Contributor Author

Thanks, updated as 85175fa

for (MachineOperand &MO : MI.defs())
if (MO.isReg())
for (MachineInstr &Ref : MRI.use_instructions(MO.getReg()))
if (Ref.getParent() != BB || Ref.isPHI()) {
Contributor

Side note: I'm curious how frequently and why LLVM generates PHI -> PHI within the same loop. I had first seen it happen only in the default ModuloScheduleExpander, which uses same-loop PHI->PHI lifetimes to represent values being passed through the pipeline stages.

What I'm asking I suppose is, was there an error case where you saw this occur?

Contributor Author

It seems that such a phi may be generated as a result of eliminating a mov, as in the following example:
https://llvm.godbolt.org/z/3sMqc5hTK

Member

You'd always have a phi-phi if there is a loop-carried value with an input distance > 1, in which case, during software pipelining, the producer can move into a higher stage than the consumer (not sure if LLVM's upstream modulo scheduler can support this; we can in our downstream version). During SWP, you'd always have a phi-phi if producer stage - consumer stage > 1.
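
A hypothetical C example of such a distance-2 recurrence (not from the patch); once the copy is eliminated, the phi for prev2 takes the prev1 phi in the same block as its loop input:

void recur(double *out, const double *in, int n) {
  double prev1 = 0.0, prev2 = 0.0;
  for (int i = 0; i < n; ++i) {
    out[i] = prev2 * 0.5; // uses the value produced two iterations ago
    prev2 = prev1;        // this copy is usually folded away, leaving phi->phi
    prev1 = in[i] + 1.0;
  }
}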

@@ -748,6 +748,20 @@ class TargetInstrInfo : public MCInstrInfo {
createTripCountGreaterCondition(int TC, MachineBasicBlock &MBB,
SmallVectorImpl<MachineOperand> &Cond) = 0;

/// Create a condtion to determine if the remaining trip count represented
Contributor

I would prefer that a different method name be used for this new method. The reason is that the assumptions of this variant of createTripCountGreaterCondition are quite different from those of the original function.

The original method is trying to create a comparison against the original trip counter coming into the loop and can assume that there is a reaching definition of that original trip counter at the end of MBB. Some targets (PowerPC and ARM) actually ignore the TC in some circumstances because there is a further assumption that this function will be called to add comparisons at the end of prologue stages, so the TC which is wanted will always be "the right one" for the prologue stage. This even permits pipelining of loops which don't really have a trip count. The original method also can return whether the comparison is statically known to be taken or not taken, and the caller must make use of that.

The new method wants something different: it wants a comparison of how many iterations remain to the desired TC value and thus requires that a register containing that remaining iteration count be passed in. It doesn't worry about static comparisons -- in fact, the return value is ignored. Thus it really should have a different name. How about createRemainingIterationsGreaterCondition? (That's a bit long, but it is descriptive).

It would be even nicer if this method could subsume the three that have been added. Since all that is done with getCounterInitReg is to use it immediately as a call argument, and all that is done with getCounterUpdatedReg is to look it up in a map and then use the result as a call argument, how about adding a map parameter to createRemainingIterationsGreaterCondition so that the target's override of the method can do the lookup? If the map is empty, then the initial value can be used. (That convention should then be documented.)
Another nice feature of this approach is that if non-virtual registers contain the loop count (as is the case with some targets), the conditions can still be properly generated.
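
Roughly, the proposed signature might look like this (a sketch of the suggestion, not necessarily the final form):

virtual void createRemainingIterationsGreaterCondition(
    int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
    DenseMap<unsigned, unsigned> RegMap) = 0;

where an empty RegMap would mean "compare against the initial trip count".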

I would also add another function which says whether MVE is supported by the target and call that at the appropriate place; this will prevent erroring out on a MVE-able loop when a target doesn't actually support it.

Finally, the convention has been to provide no default implementation for all the methods (with one exception); while maintaining the convention will require adding a dummy implementation to pipelining targets which won't support this kind of schedule expansion, it is probably better to stick with convention here.

Contributor Author

Thanks for the advice! I fixed all the points in 354a42c: the rename and integration into createRemainingIterationsGreaterCondition(), the addition of isMVESupported(), and not providing a default implementation. I verified that the tests for the existing targets (Hexagon, ARM, and PowerPC) pass.

@llvmbot
Collaborator

llvmbot commented Sep 22, 2023

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-backend-arm

Changes

Modulo variable expansion is a technique that resolves overlapping variable lifetimes by unrolling. The existing implementation resolves the overlap by copying values with move instructions on processors with ordinary registers, such as Arm and x86. This method may result in a very large number of move instructions, which can cause performance problems.

Modulo variable expansion is enabled by specifying -pipeliner-mve-cg. A backend must implement some newly defined interfaces in PipelinerLoopInfo.

Discourse thread: https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner


Patch is 32.38 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65609.diff

7 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ModuloSchedule.h (+65)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+17)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+9)
  • (modified) llvm/lib/CodeGen/ModuloSchedule.cpp (+619)
  • (modified) llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp (+9)
  • (modified) llvm/lib/Target/Hexagon/HexagonInstrInfo.cpp (+9)
  • (modified) llvm/lib/Target/PowerPC/PPCInstrInfo.cpp (+9)
diff --git a/llvm/include/llvm/CodeGen/ModuloSchedule.h b/llvm/include/llvm/CodeGen/ModuloSchedule.h
index d03f7b4959159e4..8aa0a1a81186ad6 100644
--- a/llvm/include/llvm/CodeGen/ModuloSchedule.h
+++ b/llvm/include/llvm/CodeGen/ModuloSchedule.h
@@ -369,6 +369,71 @@ class PeelingModuloScheduleExpander {
   std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo> LoopInfo;
 };
 
+/// Expand the kernel using modulo variable expansion algorithm (MVE).
+/// It unrolls the kernel enough to avoid overlap of register lifetime.
+class ModuloScheduleExpanderMVE {
+private:
+  using ValueMapTy = DenseMap<unsigned, unsigned>;
+  using MBBVectorTy = SmallVectorImpl<MachineBasicBlock *>;
+  using InstrMapTy = DenseMap<MachineInstr *, MachineInstr *>;
+
+  ModuloSchedule &Schedule;
+  MachineFunction &MF;
+  const TargetSubtargetInfo &ST;
+  MachineRegisterInfo &MRI;
+  const TargetInstrInfo *TII = nullptr;
+  LiveIntervals &LIS;
+
+  MachineBasicBlock *OrigKernel = nullptr;
+  MachineBasicBlock *OrigPreheader = nullptr;
+  MachineBasicBlock *OrigExit = nullptr;
+  MachineBasicBlock *Check = nullptr;
+  MachineBasicBlock *Prolog = nullptr;
+  MachineBasicBlock *NewKernel = nullptr;
+  MachineBasicBlock *Epilog = nullptr;
+  MachineBasicBlock *NewPreheader = nullptr;
+  MachineBasicBlock *NewExit = nullptr;
+  std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo> LoopInfo;
+
+  /// The number of unroll required to avoid overlap of live ranges.
+  /// NumUnroll = 1 means no unrolling.
+  int NumUnroll;
+
+  void calcNumUnroll();
+  void generatePipelinedLoop();
+  void generateProlog(SmallVectorImpl<ValueMapTy> &VRMap);
+  void generatePhi(MachineInstr *OrigMI, int UnrollNum,
+                   SmallVectorImpl<ValueMapTy> &PrologVRMap,
+                   SmallVectorImpl<ValueMapTy> &KernelVRMap,
+                   SmallVectorImpl<ValueMapTy> &PhiVRMap);
+  void generateKernel(SmallVectorImpl<ValueMapTy> &PrologVRMap,
+                      SmallVectorImpl<ValueMapTy> &KernelVRMap);
+  void generateEpilog(SmallVectorImpl<ValueMapTy> &KernelVRMap,
+                      SmallVectorImpl<ValueMapTy> &EpilogVRMap);
+  void mergeRegUsesAfterPipeline(Register OrigReg, Register NewReg);
+
+  MachineInstr *cloneInstr(MachineInstr *OldMI);
+
+  void updateInstrDef(MachineInstr *NewMI, ValueMapTy &VRMap, bool LastDef);
+
+  void generateKernelPhi(Register OrigLoopVal, Register NewLoopVal,
+                         unsigned UnrollNum,
+                         SmallVectorImpl<ValueMapTy> &VRMapProlog,
+                         SmallVectorImpl<ValueMapTy> &VRMapPhi);
+  void updateInstrUse(MachineInstr *MI, int StageNum, int PhaseNum,
+                      SmallVectorImpl<ValueMapTy> &CurVRMap,
+                      SmallVectorImpl<ValueMapTy> *PrevVRMap);
+
+public:
+  ModuloScheduleExpanderMVE(MachineFunction &MF, ModuloSchedule &S,
+                            LiveIntervals &LIS)
+      : Schedule(S), MF(MF), ST(MF.getSubtarget()), MRI(MF.getRegInfo()),
+        TII(ST.getInstrInfo()), LIS(LIS) {}
+
+  void expand();
+  static bool canApply(MachineLoop &L);
+};
+
 /// Expander that simply annotates each scheduled instruction with a post-instr
 /// symbol that can be consumed by the ModuloScheduleTest pass.
 ///
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 1c2ca8678346472..5344960b711cdf9 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -748,6 +748,19 @@ class TargetInstrInfo : public MCInstrInfo {
     createTripCountGreaterCondition(int TC, MachineBasicBlock &MBB,
                                     SmallVectorImpl<MachineOperand> &Cond) = 0;
 
+    /// Create a condition to determine if the remaining trip count for a phase
+    /// is greater than TC. Some instructions such as comparisons may be
+    /// inserted at the bottom of MBB. The all instructions expanded for the
+    /// phase must be inserted in MBB before calling this function. RegMap is
+    /// the map from the original registers to the expanded registers for the
+    /// phase.
+    ///
+    /// MBB can also be a predecessor of the prologue block. Then RegMap must be
+    /// empty and the compared value is the initial value of the trip count.
+    virtual void createRemainingIterationsGreaterCondition(
+        int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
+        DenseMap<unsigned, unsigned> RegMap) = 0;
+
     /// Modify the loop such that the trip count is
     /// OriginalTC + TripCountAdjust.
     virtual void adjustTripCount(int TripCountAdjust) = 0;
@@ -761,6 +774,10 @@ class TargetInstrInfo : public MCInstrInfo {
     /// Once this function is called, no other functions on this object are
     /// valid; the loop has been removed.
     virtual void disposed() = 0;
+
+    /// Return true if the target can expand pipelined schedule with modulo
+    /// variable expansion.
+    virtual bool isMVEExpanderSupported() = 0;
   };
 
   /// Analyze loop L, which must be a single-basic-block loop, and if the
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index 788ff5b3b5acdfc..3242dcc855672ca 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -174,6 +174,10 @@ static cl::opt<bool> ExperimentalCodeGen(
     cl::desc(
         "Use the experimental peeling code generator for software pipelining"));
 
+static cl::opt<bool>
+    MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
+               cl::desc("Use the MVE code generator for software pipelining"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -659,6 +663,11 @@ void SwingSchedulerDAG::schedule() {
   if (ExperimentalCodeGen && NewInstrChanges.empty()) {
     PeelingModuloScheduleExpander MSE(MF, MS, &LIS);
     MSE.expand();
+  } else if (MVECodeGen && NewInstrChanges.empty() &&
+             LoopPipelinerInfo->isMVEExpanderSupported() &&
+             ModuloScheduleExpanderMVE::canApply(Loop)) {
+    ModuloScheduleExpanderMVE MSE(MF, MS, LIS);
+    MSE.expand();
   } else {
     ModuloScheduleExpander MSE(MF, MS, LIS, std::move(NewInstrChanges));
     MSE.expand();
diff --git a/llvm/lib/CodeGen/ModuloSchedule.cpp b/llvm/lib/CodeGen/ModuloSchedule.cpp
index 0bef513342ff123..6a2d141e8f0f7cc 100644
--- a/llvm/lib/CodeGen/ModuloSchedule.cpp
+++ b/llvm/lib/CodeGen/ModuloSchedule.cpp
@@ -2096,6 +2096,625 @@ void PeelingModuloScheduleExpander::validateAgainstModuloScheduleExpander() {
   MSE.cleanup();
 }
 
+MachineInstr *ModuloScheduleExpanderMVE::cloneInstr(MachineInstr *OldMI) {
+  MachineInstr *NewMI = MF.CloneMachineInstr(OldMI);
+
+  // TODO: Offset information needs to be corrected.
+  NewMI->dropMemRefs(MF);
+
+  return NewMI;
+}
+
+/// Create a dedicated exit for Loop. Exit is the original exit for Loop.
+/// If it is already dedicated exit, return it. Otherwise, insert a new
+/// block between them and return the new block.
+static MachineBasicBlock *createDedicatedExit(MachineBasicBlock *Loop,
+                                              MachineBasicBlock *Exit) {
+  if (Exit->pred_size() == 1)
+    return Exit;
+
+  MachineFunction *MF = Loop->getParent();
+  const TargetInstrInfo *TII = MF->getSubtarget().getInstrInfo();
+
+  MachineBasicBlock *NewExit =
+      MF->CreateMachineBasicBlock(Loop->getBasicBlock());
+  MF->insert(Loop->getIterator(), NewExit);
+
+  MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
+  SmallVector<MachineOperand, 4> Cond;
+  TII->analyzeBranch(*Loop, TBB, FBB, Cond);
+  if (TBB == Loop)
+    FBB = NewExit;
+  else if (FBB == Loop)
+    TBB = NewExit;
+  else
+    llvm_unreachable("unexpected loop structure");
+  TII->removeBranch(*Loop);
+  TII->insertBranch(*Loop, TBB, FBB, Cond, DebugLoc());
+  Loop->removeSuccessor(Exit);
+  Loop->addSuccessor(NewExit);
+  TII->insertUnconditionalBranch(*NewExit, Exit, DebugLoc());
+  NewExit->addSuccessor(Exit);
+
+  for (MachineInstr &Phi : Exit->phis()) {
+    for (MachineOperand &MO : Phi.operands())
+      if (MO.isMBB() && MO.getMBB() == Loop)
+        MO.setMBB(NewExit);
+  }
+
+  return NewExit;
+}
+
+/// Generate a pipelined loop that is unrolled by using MVE algorithm and any
+/// other necessary blocks. The control flow is modified to execute the
+/// pipelined loop if the trip count satisfies the condition, otherwise the
+/// original loop. The original loop is also used to execute the reminder
+/// iterations which occur due to unrolling.
+void ModuloScheduleExpanderMVE::generatePipelinedLoop() {
+  // The control flow for pipelining with MVE:
+  //
+  // OrigPreheader:
+  //   // The block that is originally the loop preheader
+  //   goto Check
+  //
+  // Check:
+  //   // Check whether the trip count satisfies the requirements to pipeline.
+  //   if (LoopCounter > NumStages + NumUnroll - 2)
+  //     // The minimum number of iterations to pipeline =
+  //     //   iterations executed in prolog/epilog (NumStages-1) +
+  //     //   iterations executed in one kernel run (NumUnroll)
+  //     goto Prolog
+  //   // fallback to the original loop
+  //   goto NewPreheader
+  //
+  // Prolog:
+  //   // All prolog stages. There are no direct branches to the epilogue.
+  //   goto NewKernel
+  //
+  // NewKernel:
+  //   // NumUnroll copies of the kernel
+  //   if (LoopCounter > MVE-1)
+  //     goto NewKernel
+  //   goto Epilog
+  //
+  // Epilog:
+  //   // All epilog stages.
+  //   if (LoopCounter > 0)
+  //     // The remainder is executed in the original loop
+  //     goto NewPreheader
+  //   goto NewExit
+  //
+  // NewPreheader:
+  //   // Newly created preheader for the original loop.
+  //   // The initial values of the phis in the loop are merged from two paths.
+  //   NewInitVal = Phi OrigInitVal, Check, PipelineLastVal, Epilog
+  //   goto OrigKernel
+  //
+  // OrigKernel:
+  //   // The original loop block.
+  //   if (LoopCounter != 0)
+  //     goto OrigKernel
+  //   goto NewExit
+  //
+  // NewExit:
+  //   // Newly created dedicated exit for the original loop.
+  //   // Merge values which are referenced after the loop
+  //   Merged = Phi OrigVal, OrigKernel, PipelineVal, Epilog
+  //   goto OrigExit
+  //
+  // OrigExit:
+  //   // The block that is originally the loop exit.
+  //   // If it is already a dedicated exit, NewExit is not created.
+
+  // An example of where each stage is executed:
+  // Assume #Stages 3, #MVE 4, #Iterations 12
+  // Iter   0 1 2 3 4 5 6 7 8 9 10-11
+  // -------------------------------------------------
+  // Stage  0                          Prolog#0
+  // Stage  1 0                        Prolog#1
+  // Stage  2 1 0                      Kernel Unroll#0 Iter#0
+  // Stage    2 1 0                    Kernel Unroll#1 Iter#0
+  // Stage      2 1 0                  Kernel Unroll#2 Iter#0
+  // Stage        2 1 0                Kernel Unroll#3 Iter#0
+  // Stage          2 1 0              Kernel Unroll#0 Iter#1
+  // Stage            2 1 0            Kernel Unroll#1 Iter#1
+  // Stage              2 1 0          Kernel Unroll#2 Iter#1
+  // Stage                2 1 0        Kernel Unroll#3 Iter#1
+  // Stage                  2 1        Epilog#0
+  // Stage                    2        Epilog#1
+  // Stage                      0-2    OrigKernel
+
+  LoopInfo = TII->analyzeLoopForPipelining(OrigKernel);
+  assert(LoopInfo && "Must be able to analyze loop!");
+
+  calcNumUnroll();
+
+  Check = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  Prolog = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  NewKernel = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  Epilog = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  NewPreheader = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+
+  MF.insert(OrigKernel->getIterator(), Check);
+  MF.insert(OrigKernel->getIterator(), Prolog);
+  MF.insert(OrigKernel->getIterator(), NewKernel);
+  MF.insert(OrigKernel->getIterator(), Epilog);
+  MF.insert(OrigKernel->getIterator(), NewPreheader);
+
+  NewExit = createDedicatedExit(OrigKernel, OrigExit);
+
+  NewPreheader->transferSuccessorsAndUpdatePHIs(OrigPreheader);
+  TII->insertUnconditionalBranch(*NewPreheader, OrigKernel, DebugLoc());
+
+  OrigPreheader->addSuccessor(Check);
+  TII->removeBranch(*OrigPreheader);
+  TII->insertUnconditionalBranch(*OrigPreheader, Check, DebugLoc());
+
+  Check->addSuccessor(Prolog);
+  Check->addSuccessor(NewPreheader);
+
+  Prolog->addSuccessor(NewKernel);
+
+  NewKernel->addSuccessor(NewKernel);
+  NewKernel->addSuccessor(Epilog);
+
+  Epilog->addSuccessor(NewPreheader);
+  Epilog->addSuccessor(NewExit);
+
+  SmallVector<MachineOperand, 4> Cond;
+  LoopInfo->createRemainingIterationsGreaterCondition(
+      Schedule.getNumStages() + NumUnroll - 2, *Check, Cond, ValueMapTy());
+  TII->insertBranch(*Check, Prolog, NewPreheader, Cond, DebugLoc());
+
+  // VRMaps map (prolog/kernel/epilog phase#, original register#) to new
+  // register#
+  SmallVector<ValueMapTy> PrologVRMap, KernelVRMap, EpilogVRMap;
+  generateProlog(PrologVRMap);
+  generateKernel(PrologVRMap, KernelVRMap);
+  generateEpilog(KernelVRMap, EpilogVRMap);
+}
+
+/// Replace MI's use operands according to the maps.
+void ModuloScheduleExpanderMVE::updateInstrUse(
+    MachineInstr *MI, int StageNum, int PhaseNum,
+    SmallVectorImpl<ValueMapTy> &CurVRMap,
+    SmallVectorImpl<ValueMapTy> *PrevVRMap) {
+  // If MI is in the prolog/kernel/epilog block, CurVRMap is
+  // PrologVRMap/KernelVRMap/EpilogVRMap respectively.
+  // PrevVRMap is nullptr/PhiVRMap/KernelVRMap respectively.
+  // Refer to the appropriate map according to the stage difference between
+  // MI and the definition of an operand.
+
+  for (MachineOperand &UseMO : MI->uses()) {
+    if (!UseMO.isReg() || !UseMO.getReg().isVirtual())
+      continue;
+    int DiffStage = 0;
+    Register OrigReg = UseMO.getReg();
+    MachineInstr *DefInst = MRI.getVRegDef(OrigReg);
+    if (!DefInst || DefInst->getParent() != OrigKernel)
+      continue;
+    unsigned InitReg = 0;
+    unsigned DefReg = OrigReg;
+    if (DefInst->isPHI()) {
+      ++DiffStage;
+      unsigned LoopReg;
+      getPhiRegs(*DefInst, OrigKernel, InitReg, LoopReg);
+      // LoopReg is guaranteed to be defined within the loop by canApply()
+      DefReg = LoopReg;
+      DefInst = MRI.getVRegDef(LoopReg);
+    }
+    unsigned DefStageNum = Schedule.getStage(DefInst);
+    DiffStage += StageNum - DefStageNum;
+    Register NewReg;
+    if (PhaseNum >= DiffStage && CurVRMap[PhaseNum - DiffStage].count(DefReg))
+      // NewReg is defined in a previous phase of the same block
+      NewReg = CurVRMap[PhaseNum - DiffStage][DefReg];
+    else if (!PrevVRMap)
+      // Since this is the first iteration, refer the initial register of the
+      // loop
+      NewReg = InitReg;
+    else
+      // Cases where DiffStage is larger than PhaseNum.
+      // If MI is in the kernel block, the value is defined by the previous
+      // iteration and PhiVRMap is referenced. If MI is in the epilog block, the
+      // value is defined in the kernel block and KernelVRMap is referenced.
+      NewReg = (*PrevVRMap)[PrevVRMap->size() - (DiffStage - PhaseNum)][DefReg];
+
+    const TargetRegisterClass *NRC =
+        MRI.constrainRegClass(NewReg, MRI.getRegClass(OrigReg));
+    if (NRC)
+      UseMO.setReg(NewReg);
+    else {
+      Register SplitReg = MRI.createVirtualRegister(MRI.getRegClass(OrigReg));
+      BuildMI(*OrigKernel, MI, MI->getDebugLoc(), TII->get(TargetOpcode::COPY),
+              SplitReg)
+          .addReg(NewReg);
+      UseMO.setReg(SplitReg);
+    }
+  }
+}
+
+/// Return a phi if Reg is referenced by the phi.
+/// canApply() guarantees that at most only one such phi exists.
+static MachineInstr *getLoopPhiUser(Register Reg, MachineBasicBlock *Loop) {
+  for (MachineInstr &Phi : Loop->phis()) {
+    unsigned InitVal, LoopVal;
+    getPhiRegs(Phi, Loop, InitVal, LoopVal);
+    if (LoopVal == Reg)
+      return &Phi;
+  }
+  return nullptr;
+}
+
+/// Generate phis for registers defined by OrigMI.
+void ModuloScheduleExpanderMVE::generatePhi(
+    MachineInstr *OrigMI, int UnrollNum,
+    SmallVectorImpl<ValueMapTy> &PrologVRMap,
+    SmallVectorImpl<ValueMapTy> &KernelVRMap,
+    SmallVectorImpl<ValueMapTy> &PhiVRMap) {
+  int StageNum = Schedule.getStage(OrigMI);
+  bool UsePrologReg;
+  if (Schedule.getNumStages() - NumUnroll + UnrollNum - 1 >= StageNum)
+    UsePrologReg = true;
+  else if (Schedule.getNumStages() - NumUnroll + UnrollNum == StageNum)
+    UsePrologReg = false;
+  else
+    return;
+
+  // Examples that show which stages are merged by phi.
+  // Meaning of the symbol following the stage number:
+  //   a/b: Stages with the same letter are merged (UsePrologReg == true)
+  //   +: Merged with the initial value (UsePrologReg == false)
+  //   *: No phis required
+  //
+  // #Stages 3, #MVE 4
+  // Iter   0 1 2 3 4 5 6 7 8
+  // -----------------------------------------
+  // Stage  0a                 Prolog#0
+  // Stage  1a 0b              Prolog#1
+  // Stage  2* 1* 0*           Kernel Unroll#0
+  // Stage     2* 1* 0+        Kernel Unroll#1
+  // Stage        2* 1+ 0a     Kernel Unroll#2
+  // Stage           2+ 1a 0b  Kernel Unroll#3
+  //
+  // #Stages 3, #MVE 2
+  // Iter   0 1 2 3 4 5 6 7 8
+  // -----------------------------------------
+  // Stage  0a                 Prolog#0
+  // Stage  1a 0b              Prolog#1
+  // Stage  2* 1+ 0a           Kernel Unroll#0
+  // Stage     2+ 1a 0b        Kernel Unroll#1
+  //
+  // #Stages 3, #MVE 1
+  // Iter   0 1 2 3 4 5 6 7 8
+  // -----------------------------------------
+  // Stage  0*                 Prolog#0
+  // Stage  1a 0b              Prolog#1
+  // Stage  2+ 1a 0b           Kernel Unroll#0
+
+  for (MachineOperand &DefMO : OrigMI->defs()) {
+    if (!DefMO.isReg())
+      continue;
+    Register OrigReg = DefMO.getReg();
+    auto NewReg = KernelVRMap[UnrollNum].find(OrigReg);
+    if (NewReg == KernelVRMap[UnrollNum].end())
+      continue;
+    Register CorrespondReg;
+    if (UsePrologReg) {
+      int PrologNum = Schedule.getNumStages() - NumUnroll + UnrollNum - 1;
+      CorrespondReg = PrologVRMap[PrologNum][OrigReg];
+    } else {
+      MachineInstr *Phi = getLoopPhiUser(OrigReg, OrigKernel);
+      if (!Phi)
+        continue;
+      CorrespondReg = getInitPhiReg(*Phi, OrigKernel);
+    }
+
+    assert(CorrespondReg.isValid());
+    Register PhiReg = MRI.createVirtualRegister(MRI.getRegClass(OrigReg));
+    BuildMI(*NewKernel, NewKernel->getFirstNonPHI(), DebugLoc(),
+            TII->get(TargetOpcode::PHI), PhiReg)
+        .addReg(NewReg->second)
+        .addMBB(NewKernel)
+        .addReg(CorrespondReg)
+        .addMBB(Prolog);
+    PhiVRMap[UnrollNum][OrigReg] = PhiReg;
+  }
+}
+
+static void replacePhiSrc(MachineInstr &Phi, Register OrigReg, Register NewReg,
+                          MachineBasicBlock *NewMBB) {
+  for (unsigned Idx = 1; Idx < Phi.getNumOperands(); Idx += 2) {
+    if (Phi.getOperand(Idx).getReg() == OrigReg) {
+      Phi.getOperand(Idx).setReg(NewReg);
+      Phi.getOperand(Idx + 1).setMBB(NewMBB);
+      return;
+    }
+  }
+}
+
+/// Generate phis that merge values from multiple routes
+void ModuloScheduleExpanderMVE::mergeRegUsesAfterPipeline(Register OrigReg,
+                                                          Register NewReg) {
+  SmallVector<MachineOperand *> UsesAfterLoop;
+  SmallVector<MachineInstr *> LoopPhis;
+  for (MachineRegisterInfo::use_iterator I = MRI.use_begin(OrigReg),
+                                         E = MRI.use_end();
+       I != E; ++I) {
+    MachineOperand &O = *I;
+    if (O.getParent()->getParent() != OrigKernel &&
+        O.getParent()->getParent() != Prolog &&
+        O.getParent()->getParent() != NewKernel &&
+        O.getParent()->getParent() != Epilog)
+      UsesAfterLoop.push_back(&O);
+    if (O.getParent()->getParent() == OrigKernel && O.getParent()->isPHI())
+      LoopPhis.push_back(O.getParent());
+  }
+
+  // Merge the route that only execute the pipelined loop (when there are no
+  // remaining iterations) with the route that execute the original loop.
+  if (!UsesAfterLoop.empty()) {
+    Register PhiReg = MRI.createVirtualRegister(MRI.getRegClass(OrigReg));
+    BuildMI(*NewExit, NewExit->getFirstNonPHI(), DebugLoc(),
+            TII...
[truncated]

@hgreving2304
Member

ping

Cool.

Could you describe this a bit more (for example how are low trip counts handled?)

I'm actually not sure if the upstream LLVM swing scheduler stitches the prolog/epilog schedule for low trip counts or whether it versions the loop.

Could you comment on the cost model used to unroll per MVE? It should only be needed if there are copies, or an excessive number of copies.

@ytmukai
Contributor Author

ytmukai commented Sep 26, 2023

Could you describe this a bit more (for example how are low trip counts handled?)

The pipelined code is selected only if the trip count is greater than the number of iterations consumed by the prologue/epilogue plus one run of the unrolled kernel. If it is less than that, the original loop is selected. The original loop is also used to handle the remainder iterations left over from the unrolling done by MVE.
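
As a worked example with the numbers used in the code comments (NumStages = 3, NumUnroll = 4): the Check block requires LoopCounter > 3 + 4 - 2 = 5, i.e. at least 6 iterations, of which NumStages - 1 = 2 are covered by the prolog/epilog stages and NumUnroll = 4 by a single pass through the unrolled kernel; any smaller trip count falls back to the original loop.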

Details of the control flow are described below:

// The control flow for pipelining with MVE:
//
// OrigPreheader:
// // The block that is originally the loop preheader
// goto Check
//
// Check:
// // Check whether the trip count satisfies the requirements to pipeline.
// if (LoopCounter > NumStages + NumUnroll - 2)
// // The minimum number of iterations to pipeline =
// // iterations executed in prolog/epilog (NumStages-1) +
// // iterations executed in one kernel run (NumUnroll)
// goto Prolog
// // fallback to the original loop
// goto NewPreheader
//
// Prolog:
// // All prolog stages. There are no direct branches to the epilogue.
// goto NewKernel
//
// NewKernel:
// // NumUnroll copies of the kernel
// if (LoopCounter > MVE-1)
// goto NewKernel
// goto Epilog
//
// Epilog:
// // All epilog stages.
// if (LoopCounter > 0)
// // The remainder is executed in the original loop
// goto NewPreheader
// goto NewExit
//
// NewPreheader:
// // Newly created preheader for the original loop.
// // The initial values of the phis in the loop are merged from two paths.
// NewInitVal = Phi OrigInitVal, Check, PipelineLastVal, Epilog
// goto OrigKernel
//
// OrigKernel:
// // The original loop block.
// if (LoopCounter != 0)
// goto OrigKernel
// goto NewExit
//
// NewExit:
// // Newly created dedicated exit for the original loop.
// // Merge values which are referenced after the loop
// Merged = Phi OrigVal, OrigKernel, PipelineVal, Epilog
// goto OrigExit
//
// OrigExit:
// // The block that is originally the loop exit.
// // If it is already a dedicated exit, NewExit is not created.
// An example of where each stage is executed:
// Assume #Stages 3, #MVE 4, #Iterations 12
// Iter   0 1 2 3 4 5 6 7 8 9 10-11
// -------------------------------------------------
// Stage  0                          Prolog#0
// Stage  1 0                        Prolog#1
// Stage  2 1 0                      Kernel Unroll#0 Iter#0
// Stage    2 1 0                    Kernel Unroll#1 Iter#0
// Stage      2 1 0                  Kernel Unroll#2 Iter#0
// Stage        2 1 0                Kernel Unroll#3 Iter#0
// Stage          2 1 0              Kernel Unroll#0 Iter#1
// Stage            2 1 0            Kernel Unroll#1 Iter#1
// Stage              2 1 0          Kernel Unroll#2 Iter#1
// Stage                2 1 0        Kernel Unroll#3 Iter#1
// Stage                  2 1        Epilog#0
// Stage                    2        Epilog#1
// Stage                      0-2    OrigKernel

I'm actually not sure if the upstream LLVM swing scheduler stitches the prolog/epilog schedule for low trip counts or whether it versions the loop.

The upstream implementation has no unrolling and also allows branching from the prologue to the epilogue, so that any trip count is handled only by pipelined code.

If we adopted a branch from the prologue to the epilogue, we would not be able to schedule those parts. Therefore, we did not employ that in this implementation. A further reason is that the original loop is needed to handle the remainder of the unroll anyway. (It might be possible to solve this problem by branching out into an epilogue in the middle of the kernel, but I believe it would require a large number of copies or multiple versions of the epilogue.)

Could you comment on the cost model used to unroll per MVE? It should only be needed if there are copies, or an excessive number of copies.

Actually, this implementation unrolls enough times that no copies are required at all. A large number of unrolls has the disadvantage of a higher required trip count and increased code size. Since that disadvantage and the cost of copies are not directly comparable, I feel it is difficult to weigh those tradeoffs.

It may be a good idea to choose the minimum number of unrolls where the ResMII, recalculated considering the resources consumed by the copies, does not exceed the scheduled II. It may also be necessary to check that the latency of the copies does not extend the critical path.
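
A rough sketch of that idea (hypothetical helpers, not code in this patch): choose the smallest unroll count whose remaining copies keep the recalculated ResMII within the scheduled II.

unsigned numCopiesRequired(unsigned U);                      // hypothetical
unsigned resMIIWithExtraCopies(unsigned Copies);             // hypothetical
bool copiesExtendCriticalPath(unsigned Copies, unsigned II); // hypothetical

unsigned chooseUnrollCount(unsigned II, unsigned FullMVECount) {
  for (unsigned U = 1; U < FullMVECount; ++U) {
    unsigned Copies = numCopiesRequired(U);
    if (resMIIWithExtraCopies(Copies) <= II &&
        !copiesExtendCriticalPath(Copies, II))
      return U;
  }
  return FullMVECount; // full expansion needs no copies at all
}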

This patch does not replace the existing process, but adds an entirely new code-generation path that uses MVE. We believe the first step is to choose the appropriate method depending on the processor. For example, Intel processors seem to have an optimization that handles copies of floating-point registers at a lower cost in some cases. For processors without such optimizations, copy instructions often lead to an increase in II, so I think it may be better to unroll sufficiently.

@ytmukai
Contributor Author

ytmukai commented Nov 21, 2023

In order to verify the effect of MVE in more detail, I tested it in a scientific computation loop, for which software pipelining would be useful. The tested code is the following loop from IAMR, one of the ECP proxy applications, with some modifications.
https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584
(The modifications include removing branches and adjusting the number of iterations so that accesses hit the L1 cache to make it easier to check the effects of the optimization, which may change the conditions from the actual application execution. I am working on making the modified code publicly available.)

The modified code was measured on Graviton3 (Neoverse V1) and the results are as follows.

-pipeliner-force-ii=N  cycles (no mve)  instructions (no mve)  cycles (mve)  instructions (mve)  cycles (no swpl)  instructions (no swpl)
-                      -                -                      -             -                   18.0              62.6
11 (MII)               29.3             134.4                  19.6          100.7               -                 -
12                     22.1             119.0                  18.1          95.1                -                 -
13                     20.0             103.8                  17.6          89.8                -                 -
14                     17.8             92.1                   17.1          86.4                -                 -
15                     17.3             85.3                   16.6          80.4                -                 -
16                     18.4             89.1                   16.4          75.0                -                 -
17                     17.5             80.3                   15.6          67.6                -                 -
18                     17.2             76.4                   15.7          65.6                -                 -
19                     16.6             70.4                   15.5          64.1                -                 -
20                     16.5             69.5                   15.7          64.6                -                 -
21                     16.6             69.5                   15.7          64.6                -                 -
22                     15.9             67.6                   15.7          64.6                -                 -
23                     16.5             67.5                   16.1          64.6                -                 -
24                     16.5             67.6                   16.0          64.6                -                 -
25                     16.5             66.5                   16.2          64.6                -                 -

The table compares the number of cycles and instructions per iteration with and without MVE for each scheduled II. The number of cycles is reduced by 3% by MVE when comparing the fastest results for each. The reduction is 14% compared to the case without pipelining.

The number of floating-point instructions per iteration of the loop is 43 (division is reciprocally approximated by -mrecip) and Neoverse V1 has four arithmetic units, so ResMII is 11. With a small II, performance is reduced by spills due to a lack of registers.
This result indicates that the number of registers should also be considered when determining II. My colleague @kasuga-fj is working on this issue (https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725).

I would appreciate any advice for further verification.

Modulo variable expansion is a technique that resolves overlapping
variable lifetimes by unrolling. The existing implementation resolves
the overlap by copying values with move instructions on processors
with ordinary registers, such as Arm and x86. This method may result
in a very large number of move instructions, which can cause
performance problems.

Modulo variable expansion is enabled by specifying -pipeliner-mve-cg.
A backend must implement some newly defined interfaces in
PipelinerLoopInfo.
@ytmukai
Contributor Author

ytmukai commented Apr 12, 2024

I have updated the patch with interface fixes and an AArch64 implementation. Could you please restart the review?

  • Replaced the RegMap parameter with LastStage0Insts in createRemainingIterationsGreaterCondition(). These maps are used to obtain the loop counter value at that point. We found that RegMap is not sufficient when the register is defined by a PHI, so it was changed to a map of instructions.
  • Implemented the MVE interfaces for AArch64. The main part is determining whether the loop ends at the unrolled kernel. It is implemented by generating a comparison for each of the next unroll-count iterations and accumulating the results with CINC instructions (see the sketch after this list). Although the number of instructions is large, it can handle many forms of loops and is not a significant problem for the large kernels where pipelining is effective.
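
A rough conceptual sketch of that check (hypothetical helpers, not the actual AArch64 lowering), shown for the kernel back-branch where TC = NumUnroll - 1: test each of the next NumUnroll iterations and accumulate the results, each step corresponding roughly to a compare plus a CINC (conditional increment), then compare the count against TC.

bool loopConditionHolds(unsigned IterationsAhead); // hypothetical
bool kernelRunsAgain(unsigned NumUnroll) {
  unsigned Remaining = 0;
  for (unsigned U = 1; U <= NumUnroll; ++U)
    Remaining += loopConditionHolds(U) ? 1 : 0; // cmp ...; cinc ...
  return Remaining > NumUnroll - 1;             // all of them must remain
}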

I tested the performance with llvm-test-suite on a Neoverse V1 processor. The following table lists the most effective test cases with pipelining. All results are uploaded as swp.performance.llvm-test-suite.csv.

test case                                                         NOSWP  SWP   SWP+MVE  SWP+faster  speedup by SWP(faster)  speedup by MVE
SingleSource/Benchmarks/Misc/flops-3.test                         0.67   0.82  0.54     0.54        24.9%                   53.2%
MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test  2.59   2.61  2.24     2.24        15.6%                   16.5%
MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test  2.62   2.63  2.28     2.28        14.7%                   15.1%
SingleSource/Benchmarks/Misc/flops-1.test                         0.54   0.68  0.47     0.47        13.6%                   43.1%
SingleSource/Benchmarks/Adobe-C++/loop_unroll.test                0.46   0.41  0.41     0.41        12.7%                   0.0%
SingleSource/Benchmarks/Misc/flops.test                           2.66   2.79  2.41     2.41        10.3%                   15.7%

Optimization flags:

  • NOSWP: -O3 -mcpu=neoverse-v1
  • SWP: NOSWP + -mllvm -aarch64-enable-pipeliner -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0 -mllvm -pipeliner-register-pressure
  • SWP+MVE: SWP + -mllvm -pipeliner-mve-cg
  • SWP+faster is the faster of the two for each case.

In many of the above cases, the effect of MVE is significant, and the CSV shows that it is better in general. I believe that MVE should be the default for AArch64.

Note: Cases with unstable runtimes and micro-benchmarks are excluded. The micro-benchmarks are mostly the same and are not suitable for pipelining because their kernels are very small.

@ytmukai
Contributor Author

ytmukai commented Apr 26, 2024

We have also evaluated other benchmarks and will share the results.

In a hotspot kernel in ExaMiniMD, pipelining with MVE is effective.

ExaMiniMD is a benchmark for molecular dynamics in the ECP Proxy Applications. The kernel calculates the LJ (Lennard-Jones) potential, a kind of pair potential.

It was measured on one Neoverse V1 core. The execution time of the kernel accounts for approximately 50% of the total. The number of cycles of the kernel measured by perf record is as follows (the fastest II is selected for each):

NOSWP   SWP(II=25)  SWP+MVE(II=17)
15.2e9  15.3e9      14.2e9

(Compile flags: -Ofast -mrecip -g -mcpu=neoverse-v1 -mllvm -sve-gather-overhead=1 -mllvm -sve-scatter-overhead=1 -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0 -mllvm -pipeliner-mve-cg=0or1 -mllvm -pipeliner-force-ii=N -fno-unroll-loops)

Pipelining with MVE reduced the kernel execution time by 6.7% (3.4% of the total).

We also evaluated other benchmarks in the ECP Proxy Applications, the CORAL-2 Benchmarks, and SPEC, but no hotspot kernels were found for which pipelining is effective. This was mostly because pipelining was inherently ineffective for those kernels due to a small number of instructions, etc. Some kernels could benefit if pipelining became applicable through improvements to other optimizations: alias analysis, inline expansion of math functions, and template optimizations.

void createRemainingIterationsGreaterCondition(
int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
DenseMap<MachineInstr *, MachineInstr *> LastStage0Insts) override {
llvm_unreachable(
Contributor

You could save some target code by making this unreachable a default implementation in PipelinerLoopInfo, such that any target which enables MVE would by default run into the error case if it did not define the method itself.

This is similar to what's done in, for example, optimizeSelect (which is guarded by analyzeSelect), and replaceBranchWithTailCall (guarded by canMakeTailCallConditional)
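
A sketch of what such a default might look like, mirroring the signature quoted above:

virtual void createRemainingIterationsGreaterCondition(
    int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
    DenseMap<MachineInstr *, MachineInstr *> LastStage0Insts) {
  llvm_unreachable(
      "Target didn't implement createRemainingIterationsGreaterCondition!");
}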

Contributor Author

Fixed to provide a default implementation. I previously considered it a convention not to provide a default implementation, but since implementing this interface is optional, it is more convenient to have a default.

@@ -766,6 +766,22 @@ class TargetInstrInfo : public MCInstrInfo {
createTripCountGreaterCondition(int TC, MachineBasicBlock &MBB,
SmallVectorImpl<MachineOperand> &Cond) = 0;

/// Create a condition to determine if the remaining trip count for a phase
/// is greater than TC. Some instructions such as comparisons may be
/// inserted at the bottom of MBB. The all instructions expanded for the
Contributor

"The all instructions" -> "All instructions"

Contributor Author

Fixed


/// Return true if the target can expand pipelined schedule with modulo
/// variable expansion.
virtual bool isMVEExpanderSupported() = 0;
Contributor

Should be a sane default (false, almost certainly for now) to avoid forcing downstream targets to update their implementation.
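
For instance, something along these lines in PipelinerLoopInfo:

/// Return true if the target can expand a pipelined schedule with modulo
/// variable expansion.
virtual bool isMVEExpanderSupported() { return false; }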

Contributor Author

Fixed to provide a default that returns false.

TII->insertUnconditionalBranch(*NewExit, Exit, DebugLoc());
NewExit->addSuccessor(Exit);

for (MachineInstr &Phi : Exit->phis()) {
Contributor

MachineBasicBlock::replacePhiUsesWith
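
For reference, the suggested call would be roughly:

Exit->replacePhiUsesWith(Loop, NewExit);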

Contributor Author

Fixed

llvm_unreachable("unexpected loop structure");
TII->removeBranch(*Loop);
TII->insertBranch(*Loop, TBB, FBB, Cond, DebugLoc());
Loop->removeSuccessor(Exit);
Contributor

MachineBasicBlock::replaceSuccessor does the remove/add, plus more. Would it make sense to call that instead?
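
i.e., roughly:

Loop->replaceSuccessor(Exit, NewExit);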

Contributor Author

Fixed. Thanks, I didn't know the appropriate interface.

/// Generate a pipelined loop that is unrolled by using MVE algorithm and any
/// other necessary blocks. The control flow is modified to execute the
/// pipelined loop if the trip count satisfies the condition, otherwise the
/// original loop. The original loop is also used to execute the reminder
Contributor

'reminder' -> 'remainder'

Contributor Author

Fixed

//
// Check:
// // Check whether the trip count satisfies the requirements to pipeline.
// if (LoopCounter > NumStages + NumUnroll - 2)
Contributor

Does this branch order make sense for optimal performance? If the 'expected' case is to run the pipelined loop, avoiding a forced branch seems more profitable, and the conditional should fire if and only if the 'exceptional' case of going around the pipelined loop occurs.

This is what the base expander seems to do, but it's confusing...

LoopInfo->createTripCountGreaterCondition(j + 1, *Prolog, Cond);

It gets the trip count condition, and then inserts a branch that uses 'epilog' (branch-around) as the true and 'lastpro' (execute loop) as the false branch. This is counterintuitive but leads to a forced branch only if there's not enough iterations.

Unless I'm missing something, of course... I suppose a target could identify and swap the condition.

Contributor Author

For now, I added a private option to swap the target blocks.

static cl::opt<bool> SwapBranchTargetsMVE(
    "pipeliner-swap-branch-targets-mve", cl::Hidden, cl::init(false),
    cl::desc("Swap target blocks of a conditional branch for MVE expander"));

  if (SwapBranchTargetsMVE) {
    // Set SwapBranchTargetsMVE to true if a target prefers to replace TBB and
    // FBB for optimal performance.
    if (TII->reverseBranchCondition(Cond))
      llvm_unreachable("can not reverse branch condition");
    TII->insertBranch(MBB, &Otherwise, &GreaterThan, Cond, DebugLoc());
  } else {
    TII->insertBranch(MBB, &GreaterThan, &Otherwise, Cond, DebugLoc());
  }

I was not aware of this, but if a processor predicts conditions, I think the number of executed instructions will be smaller if the 'expected' case is the destination of the conditional branch. But for processors that don't, I agree with you.
I would like to later create a framework for setting this up for each target, along with several other private options such as MVE activation.

@DragonDisciple
Contributor

I am by no means an expert in AArch64, so someone with a better handle on it than I should review the implementations and tests.

However, this looks fine, and the results seem generally positive. I'd like to get this upstream so my team can pull it down and play with it. One thing I think we'd like to improve (and share upstream, of course!) is handling loop-carried dependences such as accumulations.

We have a pre-pipeliner unroller that does this, but think that using this method is likely cleaner.

tl;dr: LGTM

@ytmukai
Contributor Author

ytmukai commented May 7, 2024

Thanks for the review! I would be very grateful if you would use and improve this.

I understand the pre-pipeliner unroll approach and the issues with accumulations, as I commented in an earlier discussion. (https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner/71748/7?u=ytmukai)

As for other improvements, we are aware of a problem with large unroll counts, which occurs when the same addressing register is used by a load and a store. Such instructions are often located at the beginning and end of the DDG and cross all stages.

Collaborator

@davemgreen davemgreen left a comment

Sorry for the delay. I looked through the AArch64 code and it looks OK to me.

/// If Reg is an induction variable, return true and set some parameters
static bool getIndVarInfo(Register Reg, const MachineBasicBlock *LoopBB,
MachineInstr *&UpdateInst,
unsigned *UpdateCounterOprNum, Register *InitReg,
Collaborator

Pass UpdateCounterOprNum and the other "outputs" by reference? unsigned &UpdateCounterOprNum

Contributor Author

Fixed. There was no reason to make it a pointer.

@@ -1,80 +0,0 @@
# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 -debug-only=pipeliner 2>&1 | FileCheck %s
Collaborator

Is it the CBZ that is no longer supported? Is that something that we would re-add support for? Should we keep the test with a note about not supporting CBZ?

Contributor Author

I omitted CBZ because of the additional processing required to support it and because we did not see it used in typical loops to pipeline. If that support turns out to be useful, I would like to add it.

The tests were kept and marked as unsupported.

# UNSUPPORTED: target={{.*}}
# Compare and branch instructions are not supported now.

Collaborator

Sounds good.

I would say it might be best to update the CHECK lines below instead of marking it unsupported. We can then check we don't do anything bad with the CBZ. The comment is good to keep, explaining that the instruction isn't supported.

MachineInstr *Comp = nullptr;
unsigned CompCounterOprNum = 0;
for (MachineInstr &MI : reverse(*LoopBB)) {
if (MI.modifiesRegister(AArch64::NZCV, &TRI)) {
Collaborator

Do we need to check for other uses of NZCV?

Contributor Author

@ytmukai ytmukai Jun 6, 2024

All instructions that use NZCV are scheduled in the first stage (not pipelined). This means that their NZCV def-use chains do not cross blocks. Thus, they do not interfere with the use of NZCV for loop control.
That case is tested by https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AArch64/sms-unpipeline-insts1.mir.

In addition, it seems that a case in which NZCV is referenced across loops is considered an undefined reference.
https://godbolt.org/z/PP3fzdd8a

Contributor Author

CALL instructions could cause problems because NZCV use/def is not modeled for them. This is fixed in 0afa899 so that loops are not pipelined when a CALL or an instruction with unmodeled side effects exists.

Looking at the code in MachineScheduler, it seems that there are no other instructions that implicitly use NZCV.

/// Returns true if MI is an instruction we are unable to reason about
/// (like a call or something with unmodeled side effects).
static inline bool isGlobalMemoryObject(MachineInstr *MI) {
  return MI->isCall() || MI->hasUnmodeledSideEffects() ||
         (MI->hasOrderedMemoryRef() && !MI->isDereferenceableInvariantLoad());
}

Collaborator

Thanks.

@ytmukai
Contributor Author

ytmukai commented Jun 6, 2024

Thank you for the review! I think I have fixed all the issues.

Collaborator

@davemgreen davemgreen left a comment

Thanks. I would recommend updating the tests as opposed to marking them as unsupported, but otherwise LGTM.

