[ModuloSchedule] Implement modulo variable expansion for pipelining #65609

Open · wants to merge 13 commits into main

Conversation

ytmukai
Contributor

@ytmukai ytmukai commented Sep 7, 2023

Modulo variable expansion is a technique that resolves overlapping variable lifetimes by unrolling. The existing implementation resolves the overlap by copying values with move instructions on processors with ordinary registers, such as Arm and x86. This method may result in a very large number of move instructions, which can cause performance problems.
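
For intuition, here is a minimal hypothetical example (not taken from the patch) of the overlap problem MVE addresses:

// t is defined in an early stage and consumed in a later stage, so after
// pipelining its lifetime can span more than one initiation interval and
// overlap the next iteration's definition of t.
void kernel(double *c, const double *a, const double *b, int n) {
  for (int i = 1; i < n; ++i) {
    double t = a[i] * b[i]; // produced in an early stage
    c[i] = t + c[i - 1];    // consumed in a later stage
  }
}
// The existing expander breaks such an overlap by inserting a register copy
// in the kernel each iteration; MVE instead unrolls the kernel so that each
// unrolled copy writes its own register (t0, t1, ...), avoiding the copies.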

Modulo variable expansion is enabled by specifying -pipeliner-mve-cg. A backend must implement some newly defined interfaces in PipelinerLoopInfo.

Discourse thread: https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner

@ytmukai
Contributor Author

ytmukai commented Sep 7, 2023

I also created an implementation for AArch64 at https://github.com/ytmukai/llvm-project/tree/swpl-aarch64. I will create a PR for it after this PR is merged, as it depends on this PR.

I measured the effect of modulo variable expansion (MVE) with llvm-test-suite. I used an Ampere Altra Max, an AArch64 processor with Neoverse N1 cores.
The performance compared to the non-pipelined results is as follows:

MVE                      >3% improvement cases  >3% degradation cases
Disabled (conventional)  50                     222
Enabled                  71                     115

The total number of test cases is 811, selected by enabling TEST_SUITE_BENCHMARKING_ONLY.

The number of cases in which the performance improvement is more than 3% has increased by 40%.
There are still many cases showing degradation. I think this could be improved by adding a process to limit register pressure.

The optimization flags used are as follows:

  • non-pipelined: -O3
  • MVE disabled: non-pipelined flags + -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=1000 -mllvm --pipeliner-max-mii=1000 -mllvm -enable-misched=0 -mllvm -enable-post-misched=0 -mtune=neoverse-n1 -mllvm -pipeliner-enable-copytophi=0
  • MVE enabled: MVE disabled flags + -mllvm -pipeliner-mve-cg=1

@ytmukai
Contributor Author

ytmukai commented Sep 20, 2023

ping

/// Check if ModuloScheduleExpanderMVE can be applied to L
bool ModuloScheduleExpanderMVE::canApply(MachineLoop &L) {
if (!L.getExitBlock()) {
LLVM_DEBUG(dbgs() << "Can not apply MVE expander\n";);
Contributor

Would be a good debug enhancement to add small explanations to each failure so users know what in particular causes this not to apply.
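
For example, the messages might look something like this (hypothetical wording, matching the two checks visible in this function):

LLVM_DEBUG(dbgs() << "Can not apply MVE expander: no single exit block\n");
LLVM_DEBUG(dbgs() << "Can not apply MVE expander: def is used outside the loop or by a PHI\n");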

Contributor Author

Thanks, updated as 85175fa

for (MachineOperand &MO : MI.defs())
if (MO.isReg())
for (MachineInstr &Ref : MRI.use_instructions(MO.getReg()))
if (Ref.getParent() != BB || Ref.isPHI()) {
Contributor

Side note: I'm curious how frequently and why LLVM generates PHI -> PHI within the same loop. I had first seen it happen only in the default ModuloScheduleExpander, which uses same-loop PHI->PHI lifetimes to represent values being passed through the pipeline stages.

What I'm asking I suppose is, was there an error case where you saw this occur?

Contributor Author

It seems that such a phi may be generated as a result of eliminating a mov, as in the following example:
https://llvm.godbolt.org/z/3sMqc5hTK

Member

You'd always have a phi-phi if there is a loop-carried value with an input distance > 1, in which case, during software pipelining, the producer can move into a higher stage than the consumer (not sure if LLVM's upstream modulo scheduler can support this; we can in our downstream version). During SWP, you'd always have a phi-phi if producer stage - consumer stage > 1.
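
A hypothetical C example of such a distance-2 recurrence (not from the patch); once the copy is eliminated, the phi for prev2 takes the prev1 phi in the same block as its loop input:

void recur(double *out, const double *in, int n) {
  double prev1 = 0.0, prev2 = 0.0;
  for (int i = 0; i < n; ++i) {
    out[i] = prev2 * 0.5; // uses the value produced two iterations ago
    prev2 = prev1;        // this copy is usually folded away, leaving phi->phi
    prev1 = in[i] + 1.0;
  }
}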

@@ -748,6 +748,20 @@ class TargetInstrInfo : public MCInstrInfo {
createTripCountGreaterCondition(int TC, MachineBasicBlock &MBB,
SmallVectorImpl<MachineOperand> &Cond) = 0;

/// Create a condtion to determine if the remaining trip count represented
Contributor

I would prefer that a different method name be used for this new method. The reason is that the assumptions of this variant of createTripCountGreaterCondition are quite different from those of the original function.

The original method is trying to create a comparison against the original trip counter coming into the loop and can assume that there is a reaching definition of that original trip counter at the end of MBB. Some targets (PowerPC and ARM) actually ignore the TC in some circumstances because there is a further assumption that this function will be called to add comparisons at the end of prologue stages, so the TC which is wanted will always be "the right one" for the prologue stage. This even permits pipelining of loops which don't really have a trip count. The original method also can return whether the comparison is statically known to be taken or not taken, and the caller must make use of that.

The new method wants something different: it wants a comparison of how many iterations remain to the desired TC value and thus requires that a register containing that remaining iteration count be passed in. It doesn't worry about static comparisons -- in fact, the return value is ignored. Thus it really should have a different name. How about createRemainingIterationsGreaterCondition? (That's a bit long, but it is descriptive).

It would be even nicer if this method could subsume the three that have been added. Since all that is done with getCounterInitReg is to use it immediately as a call argument, and all that is done with getCounterUpdatedReg is to look it up in a map and then use the result as a call argument, how about adding a map parameter to createRemainingIterationsGreaterCondition so that the target's override of the method can do the lookup? If the map is empty, then the initial value can be used. (That convention should then be documented.)
Another nice feature of this approach is that if non-virtual registers contain the loop count (as is the case with some targets), the conditions can still be properly generated.
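
Roughly, the proposed signature might look like this (a sketch of the suggestion, not necessarily the final form):

virtual void createRemainingIterationsGreaterCondition(
    int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
    DenseMap<unsigned, unsigned> RegMap) = 0;

where an empty RegMap would mean "compare against the initial trip count".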

I would also add another function which says whether MVE is supported by the target and call that at the appropriate place; this will prevent erroring out on a MVE-able loop when a target doesn't actually support it.

Finally, the convention has been to provide no default implementation for all the methods (with one exception); while maintaining the convention will require adding a dummy implementation to pipelining targets which won't support this kind of schedule expansion, it is probably better to stick with convention here.

Contributor Author

Thanks for the advice! I fixed all the points in 354a42c: the rename and integration into createRemainingIterationsGreaterCondition(), the addition of isMVESupported(), and not providing a default implementation. I verified that the tests for the existing targets (Hexagon, ARM, and PowerPC) pass.

@llvmbot
Collaborator

llvmbot commented Sep 22, 2023

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-backend-arm

Changes

Modulo variable expansion is a technique that resolves overlapping variable lifetimes by unrolling. The existing implementation resolves the overlap by copying values with move instructions on processors with ordinary registers, such as Arm and x86. This method may result in a very large number of move instructions, which can cause performance problems.

Modulo variable expansion is enabled by specifying -pipeliner-mve-cg. A backend must implement some newly defined interfaces in PipelinerLoopInfo.

Discourse thread: https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner


Patch is 32.38 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65609.diff

7 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ModuloSchedule.h (+65)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+17)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+9)
  • (modified) llvm/lib/CodeGen/ModuloSchedule.cpp (+619)
  • (modified) llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp (+9)
  • (modified) llvm/lib/Target/Hexagon/HexagonInstrInfo.cpp (+9)
  • (modified) llvm/lib/Target/PowerPC/PPCInstrInfo.cpp (+9)
diff --git a/llvm/include/llvm/CodeGen/ModuloSchedule.h b/llvm/include/llvm/CodeGen/ModuloSchedule.h
index d03f7b4959159e4..8aa0a1a81186ad6 100644
--- a/llvm/include/llvm/CodeGen/ModuloSchedule.h
+++ b/llvm/include/llvm/CodeGen/ModuloSchedule.h
@@ -369,6 +369,71 @@ class PeelingModuloScheduleExpander {
   std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo> LoopInfo;
 };
 
+/// Expand the kernel using modulo variable expansion algorithm (MVE).
+/// It unrolls the kernel enough to avoid overlap of register lifetime.
+class ModuloScheduleExpanderMVE {
+private:
+  using ValueMapTy = DenseMap<unsigned, unsigned>;
+  using MBBVectorTy = SmallVectorImpl<MachineBasicBlock *>;
+  using InstrMapTy = DenseMap<MachineInstr *, MachineInstr *>;
+
+  ModuloSchedule &Schedule;
+  MachineFunction &MF;
+  const TargetSubtargetInfo &ST;
+  MachineRegisterInfo &MRI;
+  const TargetInstrInfo *TII = nullptr;
+  LiveIntervals &LIS;
+
+  MachineBasicBlock *OrigKernel = nullptr;
+  MachineBasicBlock *OrigPreheader = nullptr;
+  MachineBasicBlock *OrigExit = nullptr;
+  MachineBasicBlock *Check = nullptr;
+  MachineBasicBlock *Prolog = nullptr;
+  MachineBasicBlock *NewKernel = nullptr;
+  MachineBasicBlock *Epilog = nullptr;
+  MachineBasicBlock *NewPreheader = nullptr;
+  MachineBasicBlock *NewExit = nullptr;
+  std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo> LoopInfo;
+
+  /// The number of unroll required to avoid overlap of live ranges.
+  /// NumUnroll = 1 means no unrolling.
+  int NumUnroll;
+
+  void calcNumUnroll();
+  void generatePipelinedLoop();
+  void generateProlog(SmallVectorImpl<ValueMapTy> &VRMap);
+  void generatePhi(MachineInstr *OrigMI, int UnrollNum,
+                   SmallVectorImpl<ValueMapTy> &PrologVRMap,
+                   SmallVectorImpl<ValueMapTy> &KernelVRMap,
+                   SmallVectorImpl<ValueMapTy> &PhiVRMap);
+  void generateKernel(SmallVectorImpl<ValueMapTy> &PrologVRMap,
+                      SmallVectorImpl<ValueMapTy> &KernelVRMap);
+  void generateEpilog(SmallVectorImpl<ValueMapTy> &KernelVRMap,
+                      SmallVectorImpl<ValueMapTy> &EpilogVRMap);
+  void mergeRegUsesAfterPipeline(Register OrigReg, Register NewReg);
+
+  MachineInstr *cloneInstr(MachineInstr *OldMI);
+
+  void updateInstrDef(MachineInstr *NewMI, ValueMapTy &VRMap, bool LastDef);
+
+  void generateKernelPhi(Register OrigLoopVal, Register NewLoopVal,
+                         unsigned UnrollNum,
+                         SmallVectorImpl<ValueMapTy> &VRMapProlog,
+                         SmallVectorImpl<ValueMapTy> &VRMapPhi);
+  void updateInstrUse(MachineInstr *MI, int StageNum, int PhaseNum,
+                      SmallVectorImpl<ValueMapTy> &CurVRMap,
+                      SmallVectorImpl<ValueMapTy> *PrevVRMap);
+
+public:
+  ModuloScheduleExpanderMVE(MachineFunction &MF, ModuloSchedule &S,
+                            LiveIntervals &LIS)
+      : Schedule(S), MF(MF), ST(MF.getSubtarget()), MRI(MF.getRegInfo()),
+        TII(ST.getInstrInfo()), LIS(LIS) {}
+
+  void expand();
+  static bool canApply(MachineLoop &L);
+};
+
 /// Expander that simply annotates each scheduled instruction with a post-instr
 /// symbol that can be consumed by the ModuloScheduleTest pass.
 ///
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 1c2ca8678346472..5344960b711cdf9 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -748,6 +748,19 @@ class TargetInstrInfo : public MCInstrInfo {
     createTripCountGreaterCondition(int TC, MachineBasicBlock &MBB,
                                     SmallVectorImpl<MachineOperand> &Cond) = 0;
 
+    /// Create a condition to determine if the remaining trip count for a phase
+    /// is greater than TC. Some instructions such as comparisons may be
+    /// inserted at the bottom of MBB. The all instructions expanded for the
+    /// phase must be inserted in MBB before calling this function. RegMap is
+    /// the map from the original registers to the expanded registers for the
+    /// phase.
+    ///
+    /// MBB can also be a predecessor of the prologue block. Then RegMap must be
+    /// empty and the compared value is the initial value of the trip count.
+    virtual void createRemainingIterationsGreaterCondition(
+        int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
+        DenseMap<unsigned, unsigned> RegMap) = 0;
+
     /// Modify the loop such that the trip count is
     /// OriginalTC + TripCountAdjust.
     virtual void adjustTripCount(int TripCountAdjust) = 0;
@@ -761,6 +774,10 @@ class TargetInstrInfo : public MCInstrInfo {
     /// Once this function is called, no other functions on this object are
     /// valid; the loop has been removed.
     virtual void disposed() = 0;
+
+    /// Return true if the target can expand pipelined schedule with modulo
+    /// variable expansion.
+    virtual bool isMVEExpanderSupported() = 0;
   };
 
   /// Analyze loop L, which must be a single-basic-block loop, and if the
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index 788ff5b3b5acdfc..3242dcc855672ca 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -174,6 +174,10 @@ static cl::opt<bool> ExperimentalCodeGen(
     cl::desc(
         "Use the experimental peeling code generator for software pipelining"));
 
+static cl::opt<bool>
+    MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
+               cl::desc("Use the MVE code generator for software pipelining"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -659,6 +663,11 @@ void SwingSchedulerDAG::schedule() {
   if (ExperimentalCodeGen && NewInstrChanges.empty()) {
     PeelingModuloScheduleExpander MSE(MF, MS, &LIS);
     MSE.expand();
+  } else if (MVECodeGen && NewInstrChanges.empty() &&
+             LoopPipelinerInfo->isMVEExpanderSupported() &&
+             ModuloScheduleExpanderMVE::canApply(Loop)) {
+    ModuloScheduleExpanderMVE MSE(MF, MS, LIS);
+    MSE.expand();
   } else {
     ModuloScheduleExpander MSE(MF, MS, LIS, std::move(NewInstrChanges));
     MSE.expand();
diff --git a/llvm/lib/CodeGen/ModuloSchedule.cpp b/llvm/lib/CodeGen/ModuloSchedule.cpp
index 0bef513342ff123..6a2d141e8f0f7cc 100644
--- a/llvm/lib/CodeGen/ModuloSchedule.cpp
+++ b/llvm/lib/CodeGen/ModuloSchedule.cpp
@@ -2096,6 +2096,625 @@ void PeelingModuloScheduleExpander::validateAgainstModuloScheduleExpander() {
   MSE.cleanup();
 }
 
+MachineInstr *ModuloScheduleExpanderMVE::cloneInstr(MachineInstr *OldMI) {
+  MachineInstr *NewMI = MF.CloneMachineInstr(OldMI);
+
+  // TODO: Offset information needs to be corrected.
+  NewMI->dropMemRefs(MF);
+
+  return NewMI;
+}
+
+/// Create a dedicated exit for Loop. Exit is the original exit for Loop.
+/// If it is already dedicated exit, return it. Otherwise, insert a new
+/// block between them and return the new block.
+static MachineBasicBlock *createDedicatedExit(MachineBasicBlock *Loop,
+                                              MachineBasicBlock *Exit) {
+  if (Exit->pred_size() == 1)
+    return Exit;
+
+  MachineFunction *MF = Loop->getParent();
+  const TargetInstrInfo *TII = MF->getSubtarget().getInstrInfo();
+
+  MachineBasicBlock *NewExit =
+      MF->CreateMachineBasicBlock(Loop->getBasicBlock());
+  MF->insert(Loop->getIterator(), NewExit);
+
+  MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
+  SmallVector<MachineOperand, 4> Cond;
+  TII->analyzeBranch(*Loop, TBB, FBB, Cond);
+  if (TBB == Loop)
+    FBB = NewExit;
+  else if (FBB == Loop)
+    TBB = NewExit;
+  else
+    llvm_unreachable("unexpected loop structure");
+  TII->removeBranch(*Loop);
+  TII->insertBranch(*Loop, TBB, FBB, Cond, DebugLoc());
+  Loop->removeSuccessor(Exit);
+  Loop->addSuccessor(NewExit);
+  TII->insertUnconditionalBranch(*NewExit, Exit, DebugLoc());
+  NewExit->addSuccessor(Exit);
+
+  for (MachineInstr &Phi : Exit->phis()) {
+    for (MachineOperand &MO : Phi.operands())
+      if (MO.isMBB() && MO.getMBB() == Loop)
+        MO.setMBB(NewExit);
+  }
+
+  return NewExit;
+}
+
+/// Generate a pipelined loop that is unrolled by using MVE algorithm and any
+/// other necessary blocks. The control flow is modified to execute the
+/// pipelined loop if the trip count satisfies the condition, otherwise the
+/// original loop. The original loop is also used to execute the reminder
+/// iterations which occur due to unrolling.
+void ModuloScheduleExpanderMVE::generatePipelinedLoop() {
+  // The control flow for pipelining with MVE:
+  //
+  // OrigPreheader:
+  //   // The block that is originally the loop preheader
+  //   goto Check
+  //
+  // Check:
+  //   // Check whether the trip count satisfies the requirements to pipeline.
+  //   if (LoopCounter > NumStages + NumUnroll - 2)
+  //     // The minimum number of iterations to pipeline =
+  //     //   iterations executed in prolog/epilog (NumStages-1) +
+  //     //   iterations executed in one kernel run (NumUnroll)
+  //     goto Prolog
+  //   // fallback to the original loop
+  //   goto NewPreheader
+  //
+  // Prolog:
+  //   // All prolog stages. There are no direct branches to the epilogue.
+  //   goto NewKernel
+  //
+  // NewKernel:
+  //   // NumUnroll copies of the kernel
+  //   if (LoopCounter > MVE-1)
+  //     goto NewKernel
+  //   goto Epilog
+  //
+  // Epilog:
+  //   // All epilog stages.
+  //   if (LoopCounter > 0)
+  //     // The remainder is executed in the original loop
+  //     goto NewPreheader
+  //   goto NewExit
+  //
+  // NewPreheader:
+  //   // Newly created preheader for the original loop.
+  //   // The initial values of the phis in the loop are merged from two paths.
+  //   NewInitVal = Phi OrigInitVal, Check, PipelineLastVal, Epilog
+  //   goto OrigKernel
+  //
+  // OrigKernel:
+  //   // The original loop block.
+  //   if (LoopCounter != 0)
+  //     goto OrigKernel
+  //   goto NewExit
+  //
+  // NewExit:
+  //   // Newly created dedicated exit for the original loop.
+  //   // Merge values which are referenced after the loop
+  //   Merged = Phi OrigVal, OrigKernel, PipelineVal, Epilog
+  //   goto OrigExit
+  //
+  // OrigExit:
+  //   // The block that is originally the loop exit.
+  //   // If it is already a dedicated exit, NewExit is not created.
+
+  // An example of where each stage is executed:
+  // Assume #Stages 3, #MVE 4, #Iterations 12
+  // Iter   0 1 2 3 4 5 6 7 8 9 10-11
+  // -------------------------------------------------
+  // Stage  0                          Prolog#0
+  // Stage  1 0                        Prolog#1
+  // Stage  2 1 0                      Kernel Unroll#0 Iter#0
+  // Stage    2 1 0                    Kernel Unroll#1 Iter#0
+  // Stage      2 1 0                  Kernel Unroll#2 Iter#0
+  // Stage        2 1 0                Kernel Unroll#3 Iter#0
+  // Stage          2 1 0              Kernel Unroll#0 Iter#1
+  // Stage            2 1 0            Kernel Unroll#1 Iter#1
+  // Stage              2 1 0          Kernel Unroll#2 Iter#1
+  // Stage                2 1 0        Kernel Unroll#3 Iter#1
+  // Stage                  2 1        Epilog#0
+  // Stage                    2        Epilog#1
+  // Stage                      0-2    OrigKernel
+
+  LoopInfo = TII->analyzeLoopForPipelining(OrigKernel);
+  assert(LoopInfo && "Must be able to analyze loop!");
+
+  calcNumUnroll();
+
+  Check = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  Prolog = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  NewKernel = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  Epilog = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+  NewPreheader = MF.CreateMachineBasicBlock(OrigKernel->getBasicBlock());
+
+  MF.insert(OrigKernel->getIterator(), Check);
+  MF.insert(OrigKernel->getIterator(), Prolog);
+  MF.insert(OrigKernel->getIterator(), NewKernel);
+  MF.insert(OrigKernel->getIterator(), Epilog);
+  MF.insert(OrigKernel->getIterator(), NewPreheader);
+
+  NewExit = createDedicatedExit(OrigKernel, OrigExit);
+
+  NewPreheader->transferSuccessorsAndUpdatePHIs(OrigPreheader);
+  TII->insertUnconditionalBranch(*NewPreheader, OrigKernel, DebugLoc());
+
+  OrigPreheader->addSuccessor(Check);
+  TII->removeBranch(*OrigPreheader);
+  TII->insertUnconditionalBranch(*OrigPreheader, Check, DebugLoc());
+
+  Check->addSuccessor(Prolog);
+  Check->addSuccessor(NewPreheader);
+
+  Prolog->addSuccessor(NewKernel);
+
+  NewKernel->addSuccessor(NewKernel);
+  NewKernel->addSuccessor(Epilog);
+
+  Epilog->addSuccessor(NewPreheader);
+  Epilog->addSuccessor(NewExit);
+
+  SmallVector<MachineOperand, 4> Cond;
+  LoopInfo->createRemainingIterationsGreaterCondition(
+      Schedule.getNumStages() + NumUnroll - 2, *Check, Cond, ValueMapTy());
+  TII->insertBranch(*Check, Prolog, NewPreheader, Cond, DebugLoc());
+
+  // VRMaps map (prolog/kernel/epilog phase#, original register#) to new
+  // register#
+  SmallVector<ValueMapTy> PrologVRMap, KernelVRMap, EpilogVRMap;
+  generateProlog(PrologVRMap);
+  generateKernel(PrologVRMap, KernelVRMap);
+  generateEpilog(KernelVRMap, EpilogVRMap);
+}
+
+/// Replace MI's use operands according to the maps.
+void ModuloScheduleExpanderMVE::updateInstrUse(
+    MachineInstr *MI, int StageNum, int PhaseNum,
+    SmallVectorImpl<ValueMapTy> &CurVRMap,
+    SmallVectorImpl<ValueMapTy> *PrevVRMap) {
+  // If MI is in the prolog/kernel/epilog block, CurVRMap is
+  // PrologVRMap/KernelVRMap/EpilogVRMap respectively.
+  // PrevVRMap is nullptr/PhiVRMap/KernelVRMap respectively.
+  // Refer to the appropriate map according to the stage difference between
+  // MI and the definition of an operand.
+
+  for (MachineOperand &UseMO : MI->uses()) {
+    if (!UseMO.isReg() || !UseMO.getReg().isVirtual())
+      continue;
+    int DiffStage = 0;
+    Register OrigReg = UseMO.getReg();
+    MachineInstr *DefInst = MRI.getVRegDef(OrigReg);
+    if (!DefInst || DefInst->getParent() != OrigKernel)
+      continue;
+    unsigned InitReg = 0;
+    unsigned DefReg = OrigReg;
+    if (DefInst->isPHI()) {
+      ++DiffStage;
+      unsigned LoopReg;
+      getPhiRegs(*DefInst, OrigKernel, InitReg, LoopReg);
+      // LoopReg is guaranteed to be defined within the loop by canApply()
+      DefReg = LoopReg;
+      DefInst = MRI.getVRegDef(LoopReg);
+    }
+    unsigned DefStageNum = Schedule.getStage(DefInst);
+    DiffStage += StageNum - DefStageNum;
+    Register NewReg;
+    if (PhaseNum >= DiffStage && CurVRMap[PhaseNum - DiffStage].count(DefReg))
+      // NewReg is defined in a previous phase of the same block
+      NewReg = CurVRMap[PhaseNum - DiffStage][DefReg];
+    else if (!PrevVRMap)
+      // Since this is the first iteration, refer the initial register of the
+      // loop
+      NewReg = InitReg;
+    else
+      // Cases where DiffStage is larger than PhaseNum.
+      // If MI is in the kernel block, the value is defined by the previous
+      // iteration and PhiVRMap is referenced. If MI is in the epilog block, the
+      // value is defined in the kernel block and KernelVRMap is referenced.
+      NewReg = (*PrevVRMap)[PrevVRMap->size() - (DiffStage - PhaseNum)][DefReg];
+
+    const TargetRegisterClass *NRC =
+        MRI.constrainRegClass(NewReg, MRI.getRegClass(OrigReg));
+    if (NRC)
+      UseMO.setReg(NewReg);
+    else {
+      Register SplitReg = MRI.createVirtualRegister(MRI.getRegClass(OrigReg));
+      BuildMI(*OrigKernel, MI, MI->getDebugLoc(), TII->get(TargetOpcode::COPY),
+              SplitReg)
+          .addReg(NewReg);
+      UseMO.setReg(SplitReg);
+    }
+  }
+}
+
+/// Return a phi if Reg is referenced by the phi.
+/// canApply() guarantees that at most only one such phi exists.
+static MachineInstr *getLoopPhiUser(Register Reg, MachineBasicBlock *Loop) {
+  for (MachineInstr &Phi : Loop->phis()) {
+    unsigned InitVal, LoopVal;
+    getPhiRegs(Phi, Loop, InitVal, LoopVal);
+    if (LoopVal == Reg)
+      return &Phi;
+  }
+  return nullptr;
+}
+
+/// Generate phis for registers defined by OrigMI.
+void ModuloScheduleExpanderMVE::generatePhi(
+    MachineInstr *OrigMI, int UnrollNum,
+    SmallVectorImpl<ValueMapTy> &PrologVRMap,
+    SmallVectorImpl<ValueMapTy> &KernelVRMap,
+    SmallVectorImpl<ValueMapTy> &PhiVRMap) {
+  int StageNum = Schedule.getStage(OrigMI);
+  bool UsePrologReg;
+  if (Schedule.getNumStages() - NumUnroll + UnrollNum - 1 >= StageNum)
+    UsePrologReg = true;
+  else if (Schedule.getNumStages() - NumUnroll + UnrollNum == StageNum)
+    UsePrologReg = false;
+  else
+    return;
+
+  // Examples that show which stages are merged by phi.
+  // Meaning of the symbol following the stage number:
+  //   a/b: Stages with the same letter are merged (UsePrologReg == true)
+  //   +: Merged with the initial value (UsePrologReg == false)
+  //   *: No phis required
+  //
+  // #Stages 3, #MVE 4
+  // Iter   0 1 2 3 4 5 6 7 8
+  // -----------------------------------------
+  // Stage  0a                 Prolog#0
+  // Stage  1a 0b              Prolog#1
+  // Stage  2* 1* 0*           Kernel Unroll#0
+  // Stage     2* 1* 0+        Kernel Unroll#1
+  // Stage        2* 1+ 0a     Kernel Unroll#2
+  // Stage           2+ 1a 0b  Kernel Unroll#3
+  //
+  // #Stages 3, #MVE 2
+  // Iter   0 1 2 3 4 5 6 7 8
+  // -----------------------------------------
+  // Stage  0a                 Prolog#0
+  // Stage  1a 0b              Prolog#1
+  // Stage  2* 1+ 0a           Kernel Unroll#0
+  // Stage     2+ 1a 0b        Kernel Unroll#1
+  //
+  // #Stages 3, #MVE 1
+  // Iter   0 1 2 3 4 5 6 7 8
+  // -----------------------------------------
+  // Stage  0*                 Prolog#0
+  // Stage  1a 0b              Prolog#1
+  // Stage  2+ 1a 0b           Kernel Unroll#0
+
+  for (MachineOperand &DefMO : OrigMI->defs()) {
+    if (!DefMO.isReg())
+      continue;
+    Register OrigReg = DefMO.getReg();
+    auto NewReg = KernelVRMap[UnrollNum].find(OrigReg);
+    if (NewReg == KernelVRMap[UnrollNum].end())
+      continue;
+    Register CorrespondReg;
+    if (UsePrologReg) {
+      int PrologNum = Schedule.getNumStages() - NumUnroll + UnrollNum - 1;
+      CorrespondReg = PrologVRMap[PrologNum][OrigReg];
+    } else {
+      MachineInstr *Phi = getLoopPhiUser(OrigReg, OrigKernel);
+      if (!Phi)
+        continue;
+      CorrespondReg = getInitPhiReg(*Phi, OrigKernel);
+    }
+
+    assert(CorrespondReg.isValid());
+    Register PhiReg = MRI.createVirtualRegister(MRI.getRegClass(OrigReg));
+    BuildMI(*NewKernel, NewKernel->getFirstNonPHI(), DebugLoc(),
+            TII->get(TargetOpcode::PHI), PhiReg)
+        .addReg(NewReg->second)
+        .addMBB(NewKernel)
+        .addReg(CorrespondReg)
+        .addMBB(Prolog);
+    PhiVRMap[UnrollNum][OrigReg] = PhiReg;
+  }
+}
+
+static void replacePhiSrc(MachineInstr &Phi, Register OrigReg, Register NewReg,
+                          MachineBasicBlock *NewMBB) {
+  for (unsigned Idx = 1; Idx < Phi.getNumOperands(); Idx += 2) {
+    if (Phi.getOperand(Idx).getReg() == OrigReg) {
+      Phi.getOperand(Idx).setReg(NewReg);
+      Phi.getOperand(Idx + 1).setMBB(NewMBB);
+      return;
+    }
+  }
+}
+
+/// Generate phis that merge values from multiple routes
+void ModuloScheduleExpanderMVE::mergeRegUsesAfterPipeline(Register OrigReg,
+                                                          Register NewReg) {
+  SmallVector<MachineOperand *> UsesAfterLoop;
+  SmallVector<MachineInstr *> LoopPhis;
+  for (MachineRegisterInfo::use_iterator I = MRI.use_begin(OrigReg),
+                                         E = MRI.use_end();
+       I != E; ++I) {
+    MachineOperand &O = *I;
+    if (O.getParent()->getParent() != OrigKernel &&
+        O.getParent()->getParent() != Prolog &&
+        O.getParent()->getParent() != NewKernel &&
+        O.getParent()->getParent() != Epilog)
+      UsesAfterLoop.push_back(&O);
+    if (O.getParent()->getParent() == OrigKernel && O.getParent()->isPHI())
+      LoopPhis.push_back(O.getParent());
+  }
+
+  // Merge the route that only execute the pipelined loop (when there are no
+  // remaining iterations) with the route that execute the original loop.
+  if (!UsesAfterLoop.empty()) {
+    Register PhiReg = MRI.createVirtualRegister(MRI.getRegClass(OrigReg));
+    BuildMI(*NewExit, NewExit->getFirstNonPHI(), DebugLoc(),
+            TII...
[truncated]

@hgreving2304
Member

ping

Cool.

Could you describe this a bit more (for example how are low trip counts handled?)

I'm actually not sure if the upstream LLVM swing scheduler stitches the prolog/epilog schedule for low trip counts or whether it versions the loop.

Could you comment on the cost model used to unroll per MVE? It should only be needed if there are copies, or an excessive number of copies.

@ytmukai
Contributor Author

ytmukai commented Sep 26, 2023

Could you describe this a bit more (for example how are low trip counts handled?)

The pipelined code is selected only if the trip count is greater than the number of iterations consumed by the prologue/epilogue plus one run of the unrolled kernel. If it is less than that, the original loop is selected. The original loop is also used to handle the remainder iterations left over from the unrolling done by MVE.
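
As a worked example with the numbers used in the code comments (NumStages = 3, NumUnroll = 4): the Check block requires LoopCounter > 3 + 4 - 2 = 5, i.e. at least 6 iterations, of which NumStages - 1 = 2 are covered by the prolog/epilog stages and NumUnroll = 4 by a single pass through the unrolled kernel; any smaller trip count falls back to the original loop.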

Details of the control flow are described below:

// The control flow for pipelining with MVE:
//
// OrigPreheader:
// // The block that is originally the loop preheader
// goto Check
//
// Check:
// // Check whether the trip count satisfies the requirements to pipeline.
// if (LoopCounter > NumStages + NumUnroll - 2)
// // The minimum number of iterations to pipeline =
// // iterations executed in prolog/epilog (NumStages-1) +
// // iterations executed in one kernel run (NumUnroll)
// goto Prolog
// // fallback to the original loop
// goto NewPreheader
//
// Prolog:
// // All prolog stages. There are no direct branches to the epilogue.
// goto NewKernel
//
// NewKernel:
// // NumUnroll copies of the kernel
// if (LoopCounter > MVE-1)
// goto NewKernel
// goto Epilog
//
// Epilog:
// // All epilog stages.
// if (LoopCounter > 0)
// // The remainder is executed in the original loop
// goto NewPreheader
// goto NewExit
//
// NewPreheader:
// // Newly created preheader for the original loop.
// // The initial values of the phis in the loop are merged from two paths.
// NewInitVal = Phi OrigInitVal, Check, PipelineLastVal, Epilog
// goto OrigKernel
//
// OrigKernel:
// // The original loop block.
// if (LoopCounter != 0)
// goto OrigKernel
// goto NewExit
//
// NewExit:
// // Newly created dedicated exit for the original loop.
// // Merge values which are referenced after the loop
// Merged = Phi OrigVal, OrigKernel, PipelineVal, Epilog
// goto OrigExit
//
// OrigExit:
// // The block that is originally the loop exit.
// // If it is already a dedicated exit, NewExit is not created.
// An example of where each stage is executed:
// Assume #Stages 3, #MVE 4, #Iterations 12
// Iter   0 1 2 3 4 5 6 7 8 9 10-11
// -------------------------------------------------
// Stage  0                          Prolog#0
// Stage  1 0                        Prolog#1
// Stage  2 1 0                      Kernel Unroll#0 Iter#0
// Stage    2 1 0                    Kernel Unroll#1 Iter#0
// Stage      2 1 0                  Kernel Unroll#2 Iter#0
// Stage        2 1 0                Kernel Unroll#3 Iter#0
// Stage          2 1 0              Kernel Unroll#0 Iter#1
// Stage            2 1 0            Kernel Unroll#1 Iter#1
// Stage              2 1 0          Kernel Unroll#2 Iter#1
// Stage                2 1 0        Kernel Unroll#3 Iter#1
// Stage                  2 1        Epilog#0
// Stage                    2        Epilog#1
// Stage                      0-2    OrigKernel

I'm actually not sure if the upstream LLVM swing scheduler stitches the prolog/epilog schedule for low trip counts or whether it versions the loop.

The upstream implementation has no unrolling and also allows branching from the prologue to the epilogue, so that any trip count is handled only by pipelined code.

If we adopted a branch from the prologue to the epilogue, we would not be able to schedule those parts. Therefore, we did not employ that in this implementation. A further reason is that the original loop is needed to handle the remainder of the unroll anyway. (It might be possible to solve this problem by branching out into an epilogue in the middle of the kernel, but I believe it would require a large number of copies or multiple versions of the epilogue.)

Could you comment on the cost model used to unroll per MVE? It should only be needed if there are copies, or an excessive number of copies.

Actually, this implementation unrolls enough times that no copies are required at all. A large number of unrolls has the disadvantage of a higher required trip count and increased code size. Since that disadvantage and the cost of copies are not directly comparable, I feel it is difficult to weigh those tradeoffs.

It may be a good idea to choose the minimum number of unrolls where the ResMII, recalculated considering the resources consumed by the copies, does not exceed the scheduled II. It may also be necessary to check that the latency of the copies does not extend the critical path.
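
A rough sketch of that idea (hypothetical helpers, not code in this patch): choose the smallest unroll count whose remaining copies keep the recalculated ResMII within the scheduled II.

unsigned numCopiesRequired(unsigned U);                      // hypothetical
unsigned resMIIWithExtraCopies(unsigned Copies);             // hypothetical
bool copiesExtendCriticalPath(unsigned Copies, unsigned II); // hypothetical

unsigned chooseUnrollCount(unsigned II, unsigned FullMVECount) {
  for (unsigned U = 1; U < FullMVECount; ++U) {
    unsigned Copies = numCopiesRequired(U);
    if (resMIIWithExtraCopies(Copies) <= II &&
        !copiesExtendCriticalPath(Copies, II))
      return U;
  }
  return FullMVECount; // full expansion needs no copies at all
}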

This patch does not replace the existing process, but adds an entirely new code-generation path that uses MVE. We believe the first step is to choose the appropriate method depending on the processor. For example, Intel processors seem to have an optimization that handles copies of floating-point registers at a lower cost in some cases. For processors without such optimizations, copy instructions often lead to an increase in II, so I think it may be better to unroll sufficiently.

@ytmukai
Contributor Author

ytmukai commented Nov 21, 2023

In order to verify the effect of MVE in more detail, I tested it in a scientific computation loop, for which software pipelining would be useful. The tested code is the following loop from IAMR, one of the ECP proxy applications, with some modifications.
https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584
(The modifications include removing branches and adjusting the number of iterations so that accesses hit the L1 cache to make it easier to check the effects of the optimization, which may change the conditions from the actual application execution. I am working on making the modified code publicly available.)

The modified code was measured on Graviton3 (Neoverse V1) and the results are as follows.

-pipeliner-force-ii=N  cycles (no mve)  instructions (no mve)  cycles (mve)  instructions (mve)  cycles (no swpl)  instructions (no swpl)
-                      -                -                      -             -                   18.0              62.6
11 (MII)               29.3             134.4                  19.6          100.7               -                 -
12                     22.1             119.0                  18.1          95.1                -                 -
13                     20.0             103.8                  17.6          89.8                -                 -
14                     17.8             92.1                   17.1          86.4                -                 -
15                     17.3             85.3                   16.6          80.4                -                 -
16                     18.4             89.1                   16.4          75.0                -                 -
17                     17.5             80.3                   15.6          67.6                -                 -
18                     17.2             76.4                   15.7          65.6                -                 -
19                     16.6             70.4                   15.5          64.1                -                 -
20                     16.5             69.5                   15.7          64.6                -                 -
21                     16.6             69.5                   15.7          64.6                -                 -
22                     15.9             67.6                   15.7          64.6                -                 -
23                     16.5             67.5                   16.1          64.6                -                 -
24                     16.5             67.6                   16.0          64.6                -                 -
25                     16.5             66.5                   16.2          64.6                -                 -

The table compares the number of cycles and instructions per iteration with and without MVE for each scheduled II. The number of cycles is reduced by 3% by MVE when comparing the fastest results for each. The reduction is 14% compared to the case without pipelining.

The number of floating-point instructions per iteration of the loop is 43 (division is reciprocally approximated by -mrecip) and Neoverse V1 has four arithmetic units, so ResMII is 11. With a small II, performance is reduced by spills due to a lack of registers.
This result indicates that the number of registers should also be considered when determining II. My colleague @kasuga-fj is working on this issue (https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725).

I would appreciate any advice for further verification.

Modulo variable expansion is a technique that resolves overlapping
variable lifetimes by unrolling. The existing implementation resolves
the overlap by copying values with move instructions on processors
with ordinary registers, such as Arm and x86. This method may result
in a very large number of move instructions, which can cause
performance problems.

Modulo variable expansion is enabled by specifying -pipeliner-mve-cg.
A backend must implement some newly defined interfaces in
PipelinerLoopInfo.
@ytmukai
Contributor Author

ytmukai commented Apr 12, 2024

I have updated the patch with interface fixes and an AArch64 implementation. Could you please restart the review?

  • Replaced the RegMap parameter with LastStage0Insts in createRemainingIterationsGreaterCondition(). These maps are used to obtain the loop counter value at that point. We found that RegMap is not sufficient when the register is defined by a PHI, so it was changed to a map of instructions.
  • Implemented the MVE interfaces for AArch64. The main part is determining whether the loop ends at the unrolled kernel. It is implemented by generating a comparison for each of the next unroll-count iterations and accumulating the results with CINC instructions (see the sketch after this list). Although the number of instructions is large, it can handle many forms of loops and is not a significant problem for the large kernels where pipelining is effective.
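
A rough conceptual sketch of that check (hypothetical helpers, not the actual AArch64 lowering), shown for the kernel back-branch where TC = NumUnroll - 1: test each of the next NumUnroll iterations and accumulate the results, each step corresponding roughly to a compare plus a CINC (conditional increment), then compare the count against TC.

bool loopConditionHolds(unsigned IterationsAhead); // hypothetical
bool kernelRunsAgain(unsigned NumUnroll) {
  unsigned Remaining = 0;
  for (unsigned U = 1; U <= NumUnroll; ++U)
    Remaining += loopConditionHolds(U) ? 1 : 0; // cmp ...; cinc ...
  return Remaining > NumUnroll - 1;             // all of them must remain
}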

I tested the performance with llvm-test-suite on a Neoverse V1 processor. The following table lists the most effective test cases with pipelining. All results are uploaded as swp.performance.llvm-test-suite.csv.

test case                                                         NOSWP  SWP   SWP+MVE  SWP+faster  speedup by SWP(faster)  speedup by MVE
SingleSource/Benchmarks/Misc/flops-3.test                         0.67   0.82  0.54     0.54        24.9%                   53.2%
MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test  2.59   2.61  2.24     2.24        15.6%                   16.5%
MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test  2.62   2.63  2.28     2.28        14.7%                   15.1%
SingleSource/Benchmarks/Misc/flops-1.test                         0.54   0.68  0.47     0.47        13.6%                   43.1%
SingleSource/Benchmarks/Adobe-C++/loop_unroll.test                0.46   0.41  0.41     0.41        12.7%                   0.0%
SingleSource/Benchmarks/Misc/flops.test                           2.66   2.79  2.41     2.41        10.3%                   15.7%

Optimization flags:

  • NOSWP: -O3 -mcpu=neoverse-v1
  • SWP: NOSWP + -mllvm -aarch64-enable-pipeliner -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0 -mllvm -pipeliner-register-pressure
  • SWP+MVE: SWP + -mllvm -pipeliner-mve-cg
  • SWP+faster is the faster of the two for each case.

In many of the above cases, the effect of MVE is significant, and the CSV shows that it is better in general. I believe that MVE should be the default for AArch64.

Note: Cases with unstable runtimes and micro-benchmarks are excluded. The micro-benchmarks are mostly the same and are not suitable for pipelining because their kernels are very small.

@ytmukai
Contributor Author

ytmukai commented Apr 26, 2024

We have also evaluated other benchmarks and will share the results.

In a hotspot kernel in ExaMiniMD, pipelining with MVE is effective.

ExaMiniMD is a benchmark for molecular dynamics in the ECP Proxy Applications. The kernel calculates the LJ (Lennard-Jones) potential, a kind of pair potential.

It was measured on one Neoverse V1 core. The execution time of the kernel accounts for approximately 50% of the total. The number of cycles of the kernel measured by perf record is as follows (the fastest II is selected for each):

NOSWP   SWP(II=25)  SWP+MVE(II=17)
15.2e9  15.3e9      14.2e9

(Compile flags: -Ofast -mrecip -g -mcpu=neoverse-v1 -mllvm -sve-gather-overhead=1 -mllvm -sve-scatter-overhead=1 -mllvm -aarch64-enable-pipeliner=1 -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0 -mllvm -pipeliner-mve-cg=0or1 -mllvm -pipeliner-force-ii=N -fno-unroll-loops)

Pipelining with MVE reduced the kernel execution time by 6.7% (3.4% of the total).

We also evaluated other benchmarks in the ECP Proxy Applications, the CORAL-2 Benchmarks, and SPEC, but no hotspot kernels were found for which pipelining is effective. This was mostly because pipelining was inherently ineffective for those kernels due to a small number of instructions, etc. Some kernels could benefit if pipelining became applicable through improvements to other optimizations: alias analysis, inline expansion of math functions, and template optimizations.

void createRemainingIterationsGreaterCondition(
int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
DenseMap<MachineInstr *, MachineInstr *> LastStage0Insts) override {
llvm_unreachable(
Contributor

You could save some target code by making this unreachable a default implementation in PipelinerLoopInfo, such that any target which enables MVE would by default run into the error case if it did not define the method itself.

This is similar to what's done in, for example, optimizeSelect (which is guarded by analyzeSelect), and replaceBranchWithTailCall (guarded by canMakeTailCallConditional)
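
A sketch of what such a default might look like, mirroring the signature quoted above:

virtual void createRemainingIterationsGreaterCondition(
    int TC, MachineBasicBlock &MBB, SmallVectorImpl<MachineOperand> &Cond,
    DenseMap<MachineInstr *, MachineInstr *> LastStage0Insts) {
  llvm_unreachable(
      "Target didn't implement createRemainingIterationsGreaterCondition!");
}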

Contributor Author

Fixed to provide a default implementation. I previously considered it a convention not to provide a default implementation, but since implementing this interface is optional, it is more convenient to have a default.

@@ -766,6 +766,22 @@ class TargetInstrInfo : public MCInstrInfo {
createTripCountGreaterCondition(int TC, MachineBasicBlock &MBB,
SmallVectorImpl<MachineOperand> &Cond) = 0;

/// Create a condition to determine if the remaining trip count for a phase
/// is greater than TC. Some instructions such as comparisons may be
/// inserted at the bottom of MBB. The all instructions expanded for the
Contributor

"The all instructions" -> "All instructions"

Contributor Author

Fixed


/// Return true if the target can expand pipelined schedule with modulo
/// variable expansion.
virtual bool isMVEExpanderSupported() = 0;
Contributor

Should be a sane default (false, almost certainly for now) to avoid forcing downstream targets to update their implementation.
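
For instance, something along these lines in PipelinerLoopInfo:

/// Return true if the target can expand a pipelined schedule with modulo
/// variable expansion.
virtual bool isMVEExpanderSupported() { return false; }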

Contributor Author

Fixed to provide a default that returns false.

TII->insertUnconditionalBranch(*NewExit, Exit, DebugLoc());
NewExit->addSuccessor(Exit);

for (MachineInstr &Phi : Exit->phis()) {
Contributor

MachineBasicBlock::replacePhiUsesWith
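
For reference, the suggested call would be roughly:

Exit->replacePhiUsesWith(Loop, NewExit);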

Contributor Author

Fixed

llvm_unreachable("unexpected loop structure");
TII->removeBranch(*Loop);
TII->insertBranch(*Loop, TBB, FBB, Cond, DebugLoc());
Loop->removeSuccessor(Exit);
Contributor

MachineBasicBlock::replaceSuccessor does the remove/add, plus more. Would it make sense to call that instead?
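
i.e., roughly:

Loop->replaceSuccessor(Exit, NewExit);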

Contributor Author

Fixed. Thanks, I didn't know the appropriate interface.

/// Generate a pipelined loop that is unrolled by using MVE algorithm and any
/// other necessary blocks. The control flow is modified to execute the
/// pipelined loop if the trip count satisfies the condition, otherwise the
/// original loop. The original loop is also used to execute the reminder
Contributor

'reminder' -> 'remainder'

Contributor Author

Fixed

//
// Check:
// // Check whether the trip count satisfies the requirements to pipeline.
// if (LoopCounter > NumStages + NumUnroll - 2)
Contributor

Does this branch order make sense for optimal performance? If the 'expected' case is to run the pipelined loop, avoiding a forced branch seems more profitable, and the conditional should fire if and only if the 'exceptional' case of going around the pipelined loop occurs.

This is what the base expander seems to do, but it's confusing...

LoopInfo->createTripCountGreaterCondition(j + 1, *Prolog, Cond);

It gets the trip count condition, and then inserts a branch that uses 'epilog' (branch-around) as the true and 'lastpro' (execute loop) as the false branch. This is counterintuitive but leads to a forced branch only if there's not enough iterations.

Unless I'm missing something, of course... I suppose a target could identify and swap the condition.

Contributor Author

For now, I added a private option to swap the target blocks.

static cl::opt<bool> SwapBranchTargetsMVE(
    "pipeliner-swap-branch-targets-mve", cl::Hidden, cl::init(false),
    cl::desc("Swap target blocks of a conditional branch for MVE expander"));

  if (SwapBranchTargetsMVE) {
    // Set SwapBranchTargetsMVE to true if a target prefers to replace TBB and
    // FBB for optimal performance.
    if (TII->reverseBranchCondition(Cond))
      llvm_unreachable("can not reverse branch condition");
    TII->insertBranch(MBB, &Otherwise, &GreaterThan, Cond, DebugLoc());
  } else {
    TII->insertBranch(MBB, &GreaterThan, &Otherwise, Cond, DebugLoc());
  }

I was not aware of this, but if a processor predicts conditions, I think the number of executed instructions will be smaller if the 'expected' case is the destination of the conditional branch. But for processors that don't, I agree with you.
I would like to later create a framework for setting this up for each target, along with several other private options such as MVE activation.

@DragonDisciple
Contributor

I am by no means an expert in AArch64, so someone with a better handle on it than I should review the implementations and tests.

However, this looks fine, and the results seem generally positive. I'd like to get this upstream so my team can pull it down and play with it. One thing I think we'd like to improve (and share upstream, of course!) is handling loop-carried dependences such as accumulations.

We have a pre-pipeliner unroller that does this, but think that using this method is likely cleaner.

tl;dr: LGTM

@ytmukai
Contributor Author

ytmukai commented May 7, 2024

Thanks for the review! I would be very grateful if you would use and improve this.

I understand the pre-pipeliner unroll approach and the issues with accumulations, as I commented in an earlier discussion. (https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner/71748/7?u=ytmukai)

As for other improvements, we are aware of a problem with large unroll counts, which occurs when the same addressing register is used by a load and a store. Such instructions are often located at the beginning and end of the DDG and cross all stages.

Collaborator

@davemgreen davemgreen left a comment

Sorry for the delay. I looked through the AArch64 code and it looks OK to me.

/// If Reg is an induction variable, return true and set some parameters
static bool getIndVarInfo(Register Reg, const MachineBasicBlock *LoopBB,
MachineInstr *&UpdateInst,
unsigned *UpdateCounterOprNum, Register *InitReg,
Collaborator

Pass UpdateCounterOprNum and the other "outputs" by reference? unsigned &UpdateCounterOprNum

Contributor Author

Fixed. There was no reason to make it a pointer.

@@ -1,80 +0,0 @@
# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 -debug-only=pipeliner 2>&1 | FileCheck %s
Collaborator

Is it the CBZ that is no longer supported? Is that something that we would re-add support for? Should we keep the test with a note about not supporting CBZ?

Contributor Author

I omitted CBZ because of the additional processing required to support it and because we did not see it used in typical loops to pipeline. If that support turns out to be useful, I would like to add it.

The tests were kept and marked as unsupported.

# UNSUPPORTED: target={{.*}}
# Compare and branch instructions are not supported now.

Collaborator

Sounds good.

I would say it might be best to update the CHECK lines below instead of marking it unsupported. We can then check we don't do anything bad with the CBZ. The comment is good to keep, explaining that the instruction isn't supported.

MachineInstr *Comp = nullptr;
unsigned CompCounterOprNum = 0;
for (MachineInstr &MI : reverse(*LoopBB)) {
if (MI.modifiesRegister(AArch64::NZCV, &TRI)) {
Collaborator

Do we need to check for other uses of NZCV?

Contributor Author

@ytmukai ytmukai Jun 6, 2024

All instructions that use NZCV are scheduled in the first stage (not pipelined). This means that their NZCV def-use chains do not cross blocks. Thus, they do not interfere with the use of NZCV for loop control.
That case is tested by https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AArch64/sms-unpipeline-insts1.mir.

In addition, it seems that a case in which NZCV is referenced across loops is considered an undefined reference.
https://godbolt.org/z/PP3fzdd8a

Contributor Author

CALL instructions could cause problems because NZCV use/def is not modeled for them. This is fixed in 0afa899 so that loops are not pipelined when a CALL or an instruction with unmodeled side effects exists.

Looking at the code in MachineScheduler, it seems that there are no other instructions that implicitly use NZCV.

/// Returns true if MI is an instruction we are unable to reason about
/// (like a call or something with unmodeled side effects).
static inline bool isGlobalMemoryObject(MachineInstr *MI) {
  return MI->isCall() || MI->hasUnmodeledSideEffects() ||
         (MI->hasOrderedMemoryRef() && !MI->isDereferenceableInvariantLoad());
}

Collaborator

Thanks.

@ytmukai
Contributor Author

ytmukai commented Jun 6, 2024

Thank you for the review! I think I have fixed all the issues.

Collaborator

@davemgreen davemgreen left a comment

Thanks. I would recommend updating the tests as opposed to marking them as unsupported, but otherwise LGTM.

