
[AArch64][MachinePipeliner] Add pipeliner support for AArch64 #79589

Merged: 1 commit into llvm:main on Feb 2, 2024

Conversation

ytmukai (Contributor) commented Jan 26, 2024

Add AArch64 implementations for the interfaces of the MachinePipeliner pass. The pass is disabled by default for AArch64; it is enabled by specifying --aarch64-enable-pipeliner.

Five tests in llvm-test-suite show a performance improvement of more than 5% on a Neoverse V1 processor:

| test                                                             | improvement |
| ---------------------------------------------------------------- | -----------:|
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test |         16% |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-flt.test |         16% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test               |         14% |
| SingleSource/Benchmarks/Misc/flops-5.test                        |         13% |
| SingleSource/Benchmarks/BenchmarkGame/spectral-norm.test         |          6% |

(base flags: -mcpu=neoverse-v1 -O3 -mrecip; flags for pipelining: -mllvm -aarch64-enable-pipeliner -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0)

On the other hand, there are cases of significant performance degradation. Algorithm improvements, and an option/pragma to control pipelining, will be needed in the future.
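For a concrete picture of what gets pipelined, the added sms-acceptable-loop*.mir tests all reduce to a simple element-wise loop. A minimal C++ equivalent of that pattern (reconstructed from the test IR; the exact source is not part of this PR, so treat the names as illustrative) is:

```cpp
// Roughly the loop shape in the sms-acceptable-loop*.mir tests: per
// iteration one load, one FP add of a constant, one store, plus a
// decrementing trip counter feeding the loop-control branch.
void func(float *__restrict a, const float *b, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + 1.0f;
}
```

Built with the flags above, software pipelining can overlap the load for iteration i+1 with the add and store of iteration i.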

llvmbot (Collaborator) commented Jan 26, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Yuta Mukai (ytmukai)



Patch is 34.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79589.diff

14 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.cpp (+89)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.h (+4)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+9)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+3)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetMachine.cpp (+2)
  • (modified) llvm/test/CodeGen/AArch64/O3-pipeline.ll (+8-1)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir (+86)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir (+86)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir (+87)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir (+87)
  • (added) llvm/test/CodeGen/AArch64/sms-unacceptable-loop1.mir (+85)
  • (added) llvm/test/CodeGen/AArch64/sms-unacceptable-loop2.mir (+88)
  • (added) llvm/test/CodeGen/AArch64/sms-unpipeline-insts1.mir (+95)
  • (added) llvm/test/CodeGen/AArch64/sms-unpipeline-insts2.mir (+88)
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
index 2e8d8c63d6bec24..9771af127760622 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -9605,6 +9605,95 @@ AArch64InstrInfo::probedStackAlloc(MachineBasicBlock::iterator MBBI,
   return ExitMBB->begin();
 }
 
+namespace {
+class AArch64PipelinerLoopInfo : public TargetInstrInfo::PipelinerLoopInfo {
+  MachineInstr *PredBranch;
+  SmallVector<MachineOperand, 4> Cond;
+
+public:
+  AArch64PipelinerLoopInfo(MachineInstr *PredBranch,
+                           const SmallVectorImpl<MachineOperand> &Cond)
+      : PredBranch(PredBranch), Cond(Cond.begin(), Cond.end()) {}
+
+  bool shouldIgnoreForPipelining(const MachineInstr *MI) const override {
+    // Make the instructions for loop control be placed in stage 0.
+    // The predecessors of PredBranch are considered by the caller.
+    return MI == PredBranch;
+  }
+
+  std::optional<bool> createTripCountGreaterCondition(
+      int TC, MachineBasicBlock &MBB,
+      SmallVectorImpl<MachineOperand> &CondParam) override {
+    // A branch instruction will be inserted as "if (Cond) goto epilogue".
+    // Cond is normalized for such use.
+    // The predecessors of the branch are assumed to have already been inserted.
+    CondParam = Cond;
+    return {};
+  }
+
+  void setPreheader(MachineBasicBlock *NewPreheader) override {}
+
+  void adjustTripCount(int TripCountAdjust) override {}
+
+  void disposed() override {}
+};
+} // namespace
+
+std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo>
+AArch64InstrInfo::analyzeLoopForPipelining(MachineBasicBlock *LoopBB) const {
+  MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
+  SmallVector<MachineOperand, 4> Cond;
+  if (analyzeBranch(*LoopBB, TBB, FBB, Cond))
+    return nullptr;
+
+  // Infinite loops are not supported
+  if (TBB == LoopBB && FBB == LoopBB)
+    return nullptr;
+
+  // Must be conditional branch
+  if (FBB == nullptr)
+    return nullptr;
+
+  assert((TBB == LoopBB || FBB == LoopBB) &&
+         "The Loop must be a single-basic-block loop");
+
+  // Normalization for createTripCountGreaterCondition()
+  if (TBB == LoopBB)
+    reverseBranchCondition(Cond);
+
+  MachineInstr *CondBranch = &*LoopBB->getFirstTerminator();
+  const TargetRegisterInfo &TRI = getRegisterInfo();
+
+  // Find the immediate predecessor of the conditional branch
+  MachineInstr *PredBranch = nullptr;
+  if (CondBranch->getOpcode() == AArch64::Bcc) {
+    for (MachineInstr &MI : reverse(*LoopBB)) {
+      if (MI.modifiesRegister(AArch64::NZCV, &TRI)) {
+        PredBranch = &MI;
+        break;
+      }
+    }
+    if (!PredBranch)
+      return nullptr;
+  } else {
+    // For compare and branch
+    const MachineRegisterInfo &MRI = LoopBB->getParent()->getRegInfo();
+    Register Reg = CondBranch->getOperand(0).getReg();
+    if (!Reg.isVirtual())
+      return nullptr;
+    PredBranch = MRI.getVRegDef(Reg);
+
+    // MachinePipeliner does not expect Phi
+    if (PredBranch->isPHI())
+      return nullptr;
+
+    if (PredBranch->getParent() != LoopBB)
+      return nullptr;
+  }
+
+  return std::make_unique<AArch64PipelinerLoopInfo>(PredBranch, Cond);
+}
+
 #define GET_INSTRINFO_HELPERS
 #define GET_INSTRMAP_INFO
 #include "AArch64GenInstrInfo.inc"
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.h b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
index db24a19fe5f8e3c..c9b159a913907b3 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
@@ -250,6 +250,10 @@ class AArch64InstrInfo final : public AArch64GenInstrInfo {
                         MachineBasicBlock *FBB, ArrayRef<MachineOperand> Cond,
                         const DebugLoc &DL,
                         int *BytesAdded = nullptr) const override;
+
+  std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo>
+  analyzeLoopForPipelining(MachineBasicBlock *LoopBB) const override;
+
   bool
   reverseBranchCondition(SmallVectorImpl<MachineOperand> &Cond) const override;
   bool canInsertSelect(const MachineBasicBlock &, ArrayRef<MachineOperand> Cond,
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index e3a0606331db1c0..299e1dd5cf49e4a 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -81,6 +81,11 @@ static cl::opt<unsigned> AArch64MinimumJumpTableEntries(
     "aarch64-min-jump-table-entries", cl::init(13), cl::Hidden,
     cl::desc("Set minimum number of entries to use a jump table on AArch64"));
 
+static cl::opt<bool>
+    EnableMachinePipeliner("aarch64-enable-pipeliner",
+                           cl::desc("Enable Machine Pipeliner for AArch64"),
+                           cl::init(false), cl::Hidden);
+
 unsigned AArch64Subtarget::getVectorInsertExtractBaseCost() const {
   if (OverrideVectorInsertExtractBaseCost.getNumOccurrences() > 0)
     return OverrideVectorInsertExtractBaseCost;
@@ -540,3 +545,7 @@ AArch64Subtarget::getAuthenticatedLRCheckMethod() const {
   // performance regression or incompatibility with execute-only mappings.
   return AArch64PAuth::AuthCheckMethod::None;
 }
+
+bool AArch64Subtarget::enableMachinePipeliner() const {
+  return getSchedModel().hasInstrSchedModel() && EnableMachinePipeliner;
+}
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index 16864102df59b05..0292c018f1dbc1e 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -201,6 +201,9 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   bool enableMachineScheduler() const override { return true; }
   bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
 
+  bool enableMachinePipeliner() const override;
+  bool useDFAforSMS() const override { return false; }
+
   /// Returns ARM processor family.
   /// Avoid this function! CPU specifics should be kept local to this class
   /// and preferably modeled with SubtargetFeatures or properties in
diff --git a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
index 6fbc13d8904f2e2..ebc06dd219deefc 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
@@ -779,6 +779,8 @@ void AArch64PassConfig::addPreRegAlloc() {
     // be register coalescer friendly.
     addPass(&PeepholeOptimizerID);
   }
+  if (TM->getOptLevel() != CodeGenOptLevel::None)
+    addPass(&MachinePipelinerID);
 }
 
 void AArch64PassConfig::addPostRegAlloc() {
diff --git a/llvm/test/CodeGen/AArch64/O3-pipeline.ll b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
index 638f26298ee26aa..4f3d04afce4bcf9 100644
--- a/llvm/test/CodeGen/AArch64/O3-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
@@ -154,13 +154,20 @@
 ; CHECK-NEXT:       Remove dead machine instructions
 ; CHECK-NEXT:       AArch64 MI Peephole Optimization pass
 ; CHECK-NEXT:       AArch64 Dead register definitions
+; CHECK-NEXT:       MachineDominator Tree Construction
+; CHECK-NEXT:       Slot index numbering
+; CHECK-NEXT:       Live Interval Analysis
+; CHECK-NEXT:       Lazy Machine Block Frequency Analysis
+; CHECK-NEXT:       Machine Optimization Remark Emitter
+; CHECK-NEXT:       Modulo Software Pipelining
 ; CHECK-NEXT:       Detect Dead Lanes
 ; CHECK-NEXT:       Process Implicit Definitions
 ; CHECK-NEXT:       Remove unreachable machine basic blocks
 ; CHECK-NEXT:       Live Variable Analysis
+; CHECK-NEXT:       MachineDominator Tree Construction
+; CHECK-NEXT:       Machine Natural Loop Construction
 ; CHECK-NEXT:       Eliminate PHI nodes for register allocation
 ; CHECK-NEXT:       Two-Address instruction pass
-; CHECK-NEXT:       MachineDominator Tree Construction
 ; CHECK-NEXT:       Slot index numbering
 ; CHECK-NEXT:       Live Interval Analysis
 ; CHECK-NEXT:       Register Coalescer
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir
new file mode 100644
index 000000000000000..2267b7055494896
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir
@@ -0,0 +1,86 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner: TBB == ExitBB, FBB == LoopBB, Branch with NZCV flags
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %cmp6 = icmp sgt i32 %n, 0
+    br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup
+
+  for.body.preheader:                               ; preds = %entry
+    %wide.trip.count = zext nneg i32 %n to i64
+    br label %for.body
+
+  for.cond.cleanup:                                 ; preds = %for.body, %entry
+    ret void
+
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %lsr.iv11 = phi i64 [ %wide.trip.count, %for.body.preheader ], [ %lsr.iv.next, %for.body ]
+    %lsr.iv9 = phi ptr [ %b, %for.body.preheader ], [ %scevgep10, %for.body ]
+    %lsr.iv = phi ptr [ %a, %for.body.preheader ], [ %scevgep, %for.body ]
+    %0 = load float, ptr %lsr.iv9, align 4, !tbaa !6
+    %add = fadd float %0, 1.000000e+00
+    store float %add, ptr %lsr.iv, align 4, !tbaa !6
+    %scevgep = getelementptr i8, ptr %lsr.iv, i64 4
+    %scevgep10 = getelementptr i8, ptr %lsr.iv9, i64 4
+    %lsr.iv.next = add nsw i64 %lsr.iv11, -1
+    %exitcond.not = icmp eq i64 %lsr.iv.next, 0
+    br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10
+  }
+
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%7' }
+  - { reg: '$x1', virtual-reg: '%8' }
+  - { reg: '$w2', virtual-reg: '%9' }
+body:             |
+  bb.0.entry:
+    successors: %bb.1(0x50000000), %bb.2(0x30000000)
+    liveins: $x0, $x1, $w2
+
+    %9:gpr32common = COPY $w2
+    %8:gpr64 = COPY $x1
+    %7:gpr64 = COPY $x0
+    dead $wzr = SUBSWri %9, 1, 0, implicit-def $nzcv
+    Bcc 11, %bb.2, implicit $nzcv
+    B %bb.1
+
+  bb.1.for.body.preheader:
+    %11:gpr32 = ORRWrs $wzr, %9, 0
+    %0:gpr64all = SUBREG_TO_REG 0, killed %11, %subreg.sub_32
+    %14:fpr32 = FMOVSi 112
+    B %bb.3
+
+  bb.2.for.cond.cleanup:
+    RET_ReallyLR
+
+  bb.3.for.body:
+    successors: %bb.2(0x04000000), %bb.3(0x7c000000)
+
+    %1:gpr64sp = PHI %0, %bb.1, %6, %bb.3
+    %2:gpr64sp = PHI %8, %bb.1, %5, %bb.3
+    %3:gpr64sp = PHI %7, %bb.1, %4, %bb.3
+    early-clobber %12:gpr64sp, %13:fpr32 = LDRSpost %2, 4 :: (load (s32) from %ir.lsr.iv9, !tbaa !6)
+    %15:fpr32 = nofpexcept FADDSrr killed %13, %14, implicit $fpcr
+    early-clobber %16:gpr64sp = STRSpost killed %15, %3, 4 :: (store (s32) into %ir.lsr.iv, !tbaa !6)
+    %4:gpr64all = COPY %16
+    %5:gpr64all = COPY %12
+    %17:gpr64 = nsw SUBSXri %1, 1, 0, implicit-def $nzcv
+    %6:gpr64all = COPY %17
+    Bcc 0, %bb.2, implicit $nzcv
+    B %bb.3
+
+...
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir
new file mode 100644
index 000000000000000..aeb9770bcb47fde
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir
@@ -0,0 +1,86 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner: TBB == LoopBB, FBB == ExitBB, Branch with NZCV flags
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %cmp6 = icmp sgt i32 %n, 0
+    br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup
+
+  for.body.preheader:                               ; preds = %entry
+    %wide.trip.count = zext nneg i32 %n to i64
+    br label %for.body
+
+  for.cond.cleanup:                                 ; preds = %for.body, %entry
+    ret void
+
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %lsr.iv11 = phi i64 [ %wide.trip.count, %for.body.preheader ], [ %lsr.iv.next, %for.body ]
+    %lsr.iv9 = phi ptr [ %b, %for.body.preheader ], [ %scevgep10, %for.body ]
+    %lsr.iv = phi ptr [ %a, %for.body.preheader ], [ %scevgep, %for.body ]
+    %0 = load float, ptr %lsr.iv9, align 4, !tbaa !6
+    %add = fadd float %0, 1.000000e+00
+    store float %add, ptr %lsr.iv, align 4, !tbaa !6
+    %scevgep = getelementptr i8, ptr %lsr.iv, i64 4
+    %scevgep10 = getelementptr i8, ptr %lsr.iv9, i64 4
+    %lsr.iv.next = add nsw i64 %lsr.iv11, -1
+    %exitcond.not = icmp eq i64 %lsr.iv.next, 0
+    br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10
+  }
+
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%7' }
+  - { reg: '$x1', virtual-reg: '%8' }
+  - { reg: '$w2', virtual-reg: '%9' }
+body:             |
+  bb.0.entry:
+    successors: %bb.1(0x50000000), %bb.2(0x30000000)
+    liveins: $x0, $x1, $w2
+
+    %9:gpr32common = COPY $w2
+    %8:gpr64 = COPY $x1
+    %7:gpr64 = COPY $x0
+    dead $wzr = SUBSWri %9, 1, 0, implicit-def $nzcv
+    Bcc 11, %bb.2, implicit $nzcv
+    B %bb.1
+
+  bb.1.for.body.preheader:
+    %11:gpr32 = ORRWrs $wzr, %9, 0
+    %0:gpr64all = SUBREG_TO_REG 0, killed %11, %subreg.sub_32
+    %14:fpr32 = FMOVSi 112
+    B %bb.3
+
+  bb.2.for.cond.cleanup:
+    RET_ReallyLR
+
+  bb.3.for.body:
+    successors: %bb.2(0x04000000), %bb.3(0x7c000000)
+
+    %1:gpr64sp = PHI %0, %bb.1, %6, %bb.3
+    %2:gpr64sp = PHI %8, %bb.1, %5, %bb.3
+    %3:gpr64sp = PHI %7, %bb.1, %4, %bb.3
+    early-clobber %12:gpr64sp, %13:fpr32 = LDRSpost %2, 4 :: (load (s32) from %ir.lsr.iv9, !tbaa !6)
+    %15:fpr32 = nofpexcept FADDSrr killed %13, %14, implicit $fpcr
+    early-clobber %16:gpr64sp = STRSpost killed %15, %3, 4 :: (store (s32) into %ir.lsr.iv, !tbaa !6)
+    %4:gpr64all = COPY %16
+    %5:gpr64all = COPY %12
+    %17:gpr64 = nsw SUBSXri %1, 1, 0, implicit-def $nzcv
+    %6:gpr64all = COPY %17
+    Bcc 1, %bb.3, implicit $nzcv
+    B %bb.2
+
+...
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir
new file mode 100644
index 000000000000000..d5928c5e385c5f5
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir
@@ -0,0 +1,87 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner: TBB == ExitBB, FBB == LoopBB, Compare and branch
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %or.cond = icmp ult i32 %n, 2
+    br i1 %or.cond, label %for.end, label %for.body.preheader
+  
+  for.body.preheader:                               ; preds = %entry
+    %i.07 = add i32 %n, -1
+    %0 = sext i32 %i.07 to i64
+    br label %for.body
+  
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %indvars.iv = phi i64 [ %0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+    %1 = shl nsw i64 %indvars.iv, 2
+    %scevgep = getelementptr i8, ptr %b, i64 %1
+    %2 = load float, ptr %scevgep, align 4, !tbaa !6
+    %add = fadd float %2, 1.000000e+00
+    %3 = shl nsw i64 %indvars.iv, 2
+    %scevgep11 = getelementptr i8, ptr %a, i64 %3
+    store float %add, ptr %scevgep11, align 4, !tbaa !6
+    %indvars.iv.next = add nsw i64 %indvars.iv, -1
+    %4 = add i64 %indvars.iv, -1
+    %5 = and i64 %4, 4294967295
+    %tobool.not = icmp eq i64 %5, 0
+    br i1 %tobool.not, label %for.end, label %for.body, !llvm.loop !10
+  
+  for.end:                                          ; preds = %for.body, %entry
+    ret void
+  }
+  
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%3' }
+  - { reg: '$x1', virtual-reg: '%4' }
+  - { reg: '$w2', virtual-reg: '%5' }
+body:             |
+  bb.0.entry:
+    liveins: $x0, $x1, $w2
+  
+    %5:gpr32common = COPY $w2
+    %4:gpr64common = COPY $x1
+    %3:gpr64common = COPY $x0
+    dead $wzr = SUBSWri %5, 2, 0, implicit-def $nzcv
+    Bcc 3, %bb.3, implicit $nzcv
+    B %bb.1
+  
+  bb.1.for.body.preheader:
+    %7:gpr32common = SUBWri %5, 1, 0
+    %9:gpr64all = IMPLICIT_DEF
+    %8:gpr64 = SUBREG_TO_REG 0, killed %7, %subreg.sub_32
+    %10:gpr64 = SBFMXri killed %8, 0, 31
+    %0:gpr64all = COPY %10
+    %12:fpr32 = FMOVSi 112
+  
+  bb.2.for.body:
+    successors: %bb.3(0x04000000), %bb.2(0x7c000000)
+  
+    %1:gpr64common = PHI %0, %bb.1, %2, %bb.2
+    %11:fpr32 = LDRSroX %4, %1, 0, 1 :: (load (s32) from %ir.scevgep, !tbaa !6)
+    %13:fpr32 = nofpexcept FADDSrr killed %11, %12, implicit $fpcr
+    STRSroX killed %13, %3, %1, 0, 1 :: (store (s32) into %ir.scevgep11, !tbaa !6)
+    %14:gpr64common = SUBXri %1, 1, 0
+    %2:gpr64all = COPY %14
+    %15:gpr32 = COPY %14.sub_32
+    CBZW killed %15, %bb.3
+    B %bb.2
+  
+  bb.3.for.end:
+    RET_ReallyLR
+
+...
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir
new file mode 100644
index 000000000000000..ccf262dcb1efc10
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir
@@ -0,0 +1,87 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner TBB == LoopBB, FBB == ExitBB, Compare and branch
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %or.cond = icmp ult i32 %n, 2
+    br i1 %or.cond, label %for.end, label %for.body.preheader
+  
+  for.body.preheader:                               ; preds = %entry
+    %i.07 = add i32 %n, -1
+    %0 = sext i32 %i.07 to i64
+    br label %for.body
+  
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %indvars.iv = phi i64 [ %0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+    %1 = shl nsw i64 %indvars.iv, 2
+    %scevgep = getelementptr i8, ptr %b, i64 %1
+    %2 = load float, ptr %scevgep, align 4, !tbaa !6
+    %add = fadd float %2, 1.000000e+00
+    %3 = shl nsw i64 %indvars.iv, 2
+    %scevgep11 = getelementptr i8, ptr %a, i64 %3
+    store float %add, ptr %scevgep11, align 4, !tbaa !6
+    %indvars.iv.next = add nsw i64 %indvars.iv, -1
+    %4 = add i64 %indvars.iv, -1
+    %5 = and i64 %4, 4294967295
+    %tobool.not = icmp eq i64 %5, 0
+    br i1 %tobool.not, label %for.end, label %for.body, !llvm.loop !10
+  
+  for.end:                                          ; preds = %for.body, %entry
+    ret void
+  }
+  
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%3' }
+  - { reg: '$x1...
[truncated]

ytmukai (Contributor, Author) commented Jan 26, 2024

I have already opened a PR for code expansion with MVE (#65609), but since the existing method (renaming registers with move instructions) also improves performance, I thought it would be good to support it first. I intend to use this as a basis for further improvements.

Note: the following three patches, which fix existing code, are required to run llvm-test-suite; I would like to submit PRs for them after this is accepted, so that AArch64 tests can be included in those patches.
https://github.com/ytmukai/llvm-project/tree/pipeliner-avoid-long-compile-time
https://github.com/ytmukai/llvm-project/tree/pipeliner-fix-invalid-memoperand
https://github.com/ytmukai/llvm-project/tree/pipeliner-fix-incorrect-element-insertion

davemgreen (Collaborator) left a comment:

Hello. I agree that so long as it is disabled by default this would be good to get into trunk. It sounds like there are still a number of issues, both performance and correctness, but having the ability to test and adjust it as we go should be useful.

Can I suggest that, instead of the option controlling whether enableMachinePipeliner returns true, it instead control whether the pass is added to the pass pipeline? That should hopefully remove the changes from the pass pipeline, and mean we don't need to pay for the extra analyses when the pass is disabled.

```
    if (!PredBranch)
      return nullptr;
  } else {
    // For compare and branch
```
davemgreen (Collaborator) commented: It might be best to check the opcodes, in case of other odd instructions, either existing now or added in the future.

ytmukai (Contributor, Author): Fixed.

```
  !8 = !{!"omnipotent char", !9, i64 0}
  !9 = !{!"Simple C/C++ TBAA"}
  !10 = distinct !{!10, !11, !12}
  !11 = !{!"llvm.loop.mustprogress"}
```
davemgreen (Collaborator) commented: You might be able to remove some of this metadata, if it is not important for the tests.

ytmukai (Contributor, Author): I confirmed that the tests still work with all of the metadata removed, so I removed it.

github-actions bot commented Jan 30, 2024: ✅ With the latest revision this PR passed the C/C++ code formatter.

ytmukai (Contributor, Author) commented Jan 30, 2024: @davemgreen Thank you for the review! You are correct; I changed it so that the pass is not added to the pipeline when it is disabled.
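The updated revision is not quoted in this thread, but given the EnableMachinePipeliner option added in AArch64Subtarget.cpp and the addPreRegAlloc() hunk above, the reworked gating presumably looks something like the following sketch (assuming the cl::opt, or an equivalent query, is made visible to the pass config; the exact wiring in the committed patch may differ):

```cpp
// Sketch: gate pass insertion on the command-line option, so that the
// pass pipeline (and the extra analyses it pulls in) is unchanged when
// the pipeliner is disabled.
void AArch64PassConfig::addPreRegAlloc() {
  // ... existing passes ...
  if (EnableMachinePipeliner && TM->getOptLevel() != CodeGenOptLevel::None)
    addPass(&MachinePipelinerID);
}
```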

davemgreen (Collaborator) left a comment: Thanks. LGTM

The merged commit message:

Add AArch64 implementations for the interfaces of MachinePipeliner
pass. The pass is disabled by default for AArch64. It is enabled by
specifying --aarch64-enable-pipeliner.

5 tests in llvm-test-suites show performance improvement by more than
5% on a Neoverse V1 processor.

| test                                                             | improvement |
| ---------------------------------------------------------------- | -----------:|
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test |         16% |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-flt.test |         16% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test               |         14% |
| SingleSource/Benchmarks/Misc/flops-5.test                        |         13% |
| SingleSource/Benchmarks/BenchmarkGame/spectral-norm.test         |          6% |

(base flags: -mcpu=neoverse-v1 -O3 -mrecip, flags for pipelining:
-mllvm -aarch64-enable-pipeliner -mllvm
-pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm
-pipeliner-enable-copytophi=0)

On the other hand, there are cases of significant performance
degradation. Algorithm improvements and adding the option/pragma will
be needed in the future.
ytmukai merged commit 70eab12 into llvm:main on Feb 2, 2024. 4 checks passed.
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request on Feb 5, 2024.
MrLop commented Mar 4, 2024: I have a question about the build options for llvm-test-suite: why set the pipeliner-enable-copytophi flag to 0? In some cases this can lead to a compilation failure, because the compiler may use registers before their definitions.

bcahoon (Contributor) commented Mar 4, 2024:

> I have a question about the build options for llvm-test-suite: why set the pipeliner-enable-copytophi flag to 0? In some cases this can lead to a compilation failure, because the compiler may use registers before their definitions.

The intent of that flag is that it should not affect correctness. Perhaps setting it to 0 exposes a different bug?

ytmukai (Contributor, Author) commented Mar 6, 2024:

> why set the pipeliner-enable-copytophi flag to 0

pipeliner-enable-copytophi often makes the DDG unschedulable.
The following is an example; the debug message shows that the scheduling failed:
https://godbolt.org/z/66avx9vvh
(The original C source is https://godbolt.org/z/Ksb3o45Yv)

In this case, pipeliner-enable-copytophi forces SU(4) (%35:gpr64 = nuw ADDXrr %2:gpr64common) to be scheduled after the nodes that refer to %2 (such as SU(3): ST1D %33:zpr, %31:ppr_3b, %14:gpr64common, %2:gpr64common), so that %2 and %35 are not live at the same time.
However, SU(4) has a real successor, SU(6) (dead $xzr = SUBSXrr %29:gpr64, %35:gpr64), which is always placed in stage 0 because it is part of the loop control.
As a result, scheduling becomes almost impossible.

I think pipeliner-enable-copytophi needs to be disabled depending on the architecture.

> In some cases, this can lead to a compilation failure because the compiler may use registers before register definitions.

I have never seen such a case. What are the conditions under which this failure occurs?
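For reference, the debug output discussed above can be reproduced with an invocation modeled on the RUN lines of the new sms-*.mir tests (the input file name here is illustrative, and -debug-only requires an assertions-enabled build of llc):

```
llc --verify-machineinstrs -mtriple=aarch64 -run-pass pipeliner \
    -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 \
    -debug-only=pipeliner loop.mir -o - 2>&1 | grep 'Schedule Found?'
```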
