
[AArch64][MachinePipeliner] Add pipeliner support for AArch64 #79589

Merged: 1 commit into llvm:main on Feb 2, 2024

Conversation

ytmukai (Contributor) commented Jan 26, 2024

Add AArch64 implementations for the interfaces of the MachinePipeliner pass. The pass is disabled by default for AArch64; it is enabled by specifying --aarch64-enable-pipeliner.

Five tests in llvm-test-suite show a performance improvement of more than 5% on a Neoverse V1 processor:

| test                                                             | improvement |
| ---------------------------------------------------------------- | -----------:|
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test |         16% |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-flt.test |         16% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test               |         14% |
| SingleSource/Benchmarks/Misc/flops-5.test                        |         13% |
| SingleSource/Benchmarks/BenchmarkGame/spectral-norm.test         |          6% |

(base flags: -mcpu=neoverse-v1 -O3 -mrecip; flags for pipelining: -mllvm -aarch64-enable-pipeliner -mllvm -pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0)

On the other hand, there are cases of significant performance degradation. Algorithm improvements, and an option/pragma to control pipelining, will be needed in the future.
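For a concrete picture of what gets pipelined, the added sms-acceptable-loop*.mir tests all reduce to a simple element-wise loop. A minimal C++ equivalent of that pattern (reconstructed from the test IR; the exact source is not part of this PR, so treat the names as illustrative) is:

```cpp
// Roughly the loop shape in the sms-acceptable-loop*.mir tests: per
// iteration one load, one FP add of a constant, one store, plus a
// decrementing trip counter feeding the loop-control branch.
void func(float *__restrict a, const float *b, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + 1.0f;
}
```

Built with the flags above, software pipelining can overlap the load for iteration i+1 with the add and store of iteration i.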

llvmbot (Collaborator) commented Jan 26, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Yuta Mukai (ytmukai)



Patch is 34.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79589.diff

14 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.cpp (+89)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.h (+4)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+9)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+3)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetMachine.cpp (+2)
  • (modified) llvm/test/CodeGen/AArch64/O3-pipeline.ll (+8-1)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir (+86)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir (+86)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir (+87)
  • (added) llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir (+87)
  • (added) llvm/test/CodeGen/AArch64/sms-unacceptable-loop1.mir (+85)
  • (added) llvm/test/CodeGen/AArch64/sms-unacceptable-loop2.mir (+88)
  • (added) llvm/test/CodeGen/AArch64/sms-unpipeline-insts1.mir (+95)
  • (added) llvm/test/CodeGen/AArch64/sms-unpipeline-insts2.mir (+88)
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
index 2e8d8c63d6bec24..9771af127760622 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -9605,6 +9605,95 @@ AArch64InstrInfo::probedStackAlloc(MachineBasicBlock::iterator MBBI,
   return ExitMBB->begin();
 }
 
+namespace {
+class AArch64PipelinerLoopInfo : public TargetInstrInfo::PipelinerLoopInfo {
+  MachineInstr *PredBranch;
+  SmallVector<MachineOperand, 4> Cond;
+
+public:
+  AArch64PipelinerLoopInfo(MachineInstr *PredBranch,
+                           const SmallVectorImpl<MachineOperand> &Cond)
+      : PredBranch(PredBranch), Cond(Cond.begin(), Cond.end()) {}
+
+  bool shouldIgnoreForPipelining(const MachineInstr *MI) const override {
+    // Make the instructions for loop control be placed in stage 0.
+    // The predecessors of PredBranch are considered by the caller.
+    return MI == PredBranch;
+  }
+
+  std::optional<bool> createTripCountGreaterCondition(
+      int TC, MachineBasicBlock &MBB,
+      SmallVectorImpl<MachineOperand> &CondParam) override {
+    // A branch instruction will be inserted as "if (Cond) goto epilogue".
+    // Cond is normalized for such use.
+    // The predecessors of the branch are assumed to have already been inserted.
+    CondParam = Cond;
+    return {};
+  }
+
+  void setPreheader(MachineBasicBlock *NewPreheader) override {}
+
+  void adjustTripCount(int TripCountAdjust) override {}
+
+  void disposed() override {}
+};
+} // namespace
+
+std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo>
+AArch64InstrInfo::analyzeLoopForPipelining(MachineBasicBlock *LoopBB) const {
+  MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
+  SmallVector<MachineOperand, 4> Cond;
+  if (analyzeBranch(*LoopBB, TBB, FBB, Cond))
+    return nullptr;
+
+  // Infinite loops are not supported
+  if (TBB == LoopBB && FBB == LoopBB)
+    return nullptr;
+
+  // Must be conditional branch
+  if (FBB == nullptr)
+    return nullptr;
+
+  assert((TBB == LoopBB || FBB == LoopBB) &&
+         "The Loop must be a single-basic-block loop");
+
+  // Normalization for createTripCountGreaterCondition()
+  if (TBB == LoopBB)
+    reverseBranchCondition(Cond);
+
+  MachineInstr *CondBranch = &*LoopBB->getFirstTerminator();
+  const TargetRegisterInfo &TRI = getRegisterInfo();
+
+  // Find the immediate predecessor of the conditional branch
+  MachineInstr *PredBranch = nullptr;
+  if (CondBranch->getOpcode() == AArch64::Bcc) {
+    for (MachineInstr &MI : reverse(*LoopBB)) {
+      if (MI.modifiesRegister(AArch64::NZCV, &TRI)) {
+        PredBranch = &MI;
+        break;
+      }
+    }
+    if (!PredBranch)
+      return nullptr;
+  } else {
+    // For compare and branch
+    const MachineRegisterInfo &MRI = LoopBB->getParent()->getRegInfo();
+    Register Reg = CondBranch->getOperand(0).getReg();
+    if (!Reg.isVirtual())
+      return nullptr;
+    PredBranch = MRI.getVRegDef(Reg);
+
+    // MachinePipeliner does not expect Phi
+    if (PredBranch->isPHI())
+      return nullptr;
+
+    if (PredBranch->getParent() != LoopBB)
+      return nullptr;
+  }
+
+  return std::make_unique<AArch64PipelinerLoopInfo>(PredBranch, Cond);
+}
+
 #define GET_INSTRINFO_HELPERS
 #define GET_INSTRMAP_INFO
 #include "AArch64GenInstrInfo.inc"
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.h b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
index db24a19fe5f8e3c..c9b159a913907b3 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
@@ -250,6 +250,10 @@ class AArch64InstrInfo final : public AArch64GenInstrInfo {
                         MachineBasicBlock *FBB, ArrayRef<MachineOperand> Cond,
                         const DebugLoc &DL,
                         int *BytesAdded = nullptr) const override;
+
+  std::unique_ptr<TargetInstrInfo::PipelinerLoopInfo>
+  analyzeLoopForPipelining(MachineBasicBlock *LoopBB) const override;
+
   bool
   reverseBranchCondition(SmallVectorImpl<MachineOperand> &Cond) const override;
   bool canInsertSelect(const MachineBasicBlock &, ArrayRef<MachineOperand> Cond,
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index e3a0606331db1c0..299e1dd5cf49e4a 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -81,6 +81,11 @@ static cl::opt<unsigned> AArch64MinimumJumpTableEntries(
     "aarch64-min-jump-table-entries", cl::init(13), cl::Hidden,
     cl::desc("Set minimum number of entries to use a jump table on AArch64"));
 
+static cl::opt<bool>
+    EnableMachinePipeliner("aarch64-enable-pipeliner",
+                           cl::desc("Enable Machine Pipeliner for AArch64"),
+                           cl::init(false), cl::Hidden);
+
 unsigned AArch64Subtarget::getVectorInsertExtractBaseCost() const {
   if (OverrideVectorInsertExtractBaseCost.getNumOccurrences() > 0)
     return OverrideVectorInsertExtractBaseCost;
@@ -540,3 +545,7 @@ AArch64Subtarget::getAuthenticatedLRCheckMethod() const {
   // performance regression or incompatibility with execute-only mappings.
   return AArch64PAuth::AuthCheckMethod::None;
 }
+
+bool AArch64Subtarget::enableMachinePipeliner() const {
+  return getSchedModel().hasInstrSchedModel() && EnableMachinePipeliner;
+}
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index 16864102df59b05..0292c018f1dbc1e 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -201,6 +201,9 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   bool enableMachineScheduler() const override { return true; }
   bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
 
+  bool enableMachinePipeliner() const override;
+  bool useDFAforSMS() const override { return false; }
+
   /// Returns ARM processor family.
   /// Avoid this function! CPU specifics should be kept local to this class
   /// and preferably modeled with SubtargetFeatures or properties in
diff --git a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
index 6fbc13d8904f2e2..ebc06dd219deefc 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
@@ -779,6 +779,8 @@ void AArch64PassConfig::addPreRegAlloc() {
     // be register coalescer friendly.
     addPass(&PeepholeOptimizerID);
   }
+  if (TM->getOptLevel() != CodeGenOptLevel::None)
+    addPass(&MachinePipelinerID);
 }
 
 void AArch64PassConfig::addPostRegAlloc() {
diff --git a/llvm/test/CodeGen/AArch64/O3-pipeline.ll b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
index 638f26298ee26aa..4f3d04afce4bcf9 100644
--- a/llvm/test/CodeGen/AArch64/O3-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
@@ -154,13 +154,20 @@
 ; CHECK-NEXT:       Remove dead machine instructions
 ; CHECK-NEXT:       AArch64 MI Peephole Optimization pass
 ; CHECK-NEXT:       AArch64 Dead register definitions
+; CHECK-NEXT:       MachineDominator Tree Construction
+; CHECK-NEXT:       Slot index numbering
+; CHECK-NEXT:       Live Interval Analysis
+; CHECK-NEXT:       Lazy Machine Block Frequency Analysis
+; CHECK-NEXT:       Machine Optimization Remark Emitter
+; CHECK-NEXT:       Modulo Software Pipelining
 ; CHECK-NEXT:       Detect Dead Lanes
 ; CHECK-NEXT:       Process Implicit Definitions
 ; CHECK-NEXT:       Remove unreachable machine basic blocks
 ; CHECK-NEXT:       Live Variable Analysis
+; CHECK-NEXT:       MachineDominator Tree Construction
+; CHECK-NEXT:       Machine Natural Loop Construction
 ; CHECK-NEXT:       Eliminate PHI nodes for register allocation
 ; CHECK-NEXT:       Two-Address instruction pass
-; CHECK-NEXT:       MachineDominator Tree Construction
 ; CHECK-NEXT:       Slot index numbering
 ; CHECK-NEXT:       Live Interval Analysis
 ; CHECK-NEXT:       Register Coalescer
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir
new file mode 100644
index 000000000000000..2267b7055494896
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop1.mir
@@ -0,0 +1,86 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner: TBB == ExitBB, FBB == LoopBB, Branch with NZCV flags
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %cmp6 = icmp sgt i32 %n, 0
+    br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup
+
+  for.body.preheader:                               ; preds = %entry
+    %wide.trip.count = zext nneg i32 %n to i64
+    br label %for.body
+
+  for.cond.cleanup:                                 ; preds = %for.body, %entry
+    ret void
+
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %lsr.iv11 = phi i64 [ %wide.trip.count, %for.body.preheader ], [ %lsr.iv.next, %for.body ]
+    %lsr.iv9 = phi ptr [ %b, %for.body.preheader ], [ %scevgep10, %for.body ]
+    %lsr.iv = phi ptr [ %a, %for.body.preheader ], [ %scevgep, %for.body ]
+    %0 = load float, ptr %lsr.iv9, align 4, !tbaa !6
+    %add = fadd float %0, 1.000000e+00
+    store float %add, ptr %lsr.iv, align 4, !tbaa !6
+    %scevgep = getelementptr i8, ptr %lsr.iv, i64 4
+    %scevgep10 = getelementptr i8, ptr %lsr.iv9, i64 4
+    %lsr.iv.next = add nsw i64 %lsr.iv11, -1
+    %exitcond.not = icmp eq i64 %lsr.iv.next, 0
+    br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10
+  }
+
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%7' }
+  - { reg: '$x1', virtual-reg: '%8' }
+  - { reg: '$w2', virtual-reg: '%9' }
+body:             |
+  bb.0.entry:
+    successors: %bb.1(0x50000000), %bb.2(0x30000000)
+    liveins: $x0, $x1, $w2
+
+    %9:gpr32common = COPY $w2
+    %8:gpr64 = COPY $x1
+    %7:gpr64 = COPY $x0
+    dead $wzr = SUBSWri %9, 1, 0, implicit-def $nzcv
+    Bcc 11, %bb.2, implicit $nzcv
+    B %bb.1
+
+  bb.1.for.body.preheader:
+    %11:gpr32 = ORRWrs $wzr, %9, 0
+    %0:gpr64all = SUBREG_TO_REG 0, killed %11, %subreg.sub_32
+    %14:fpr32 = FMOVSi 112
+    B %bb.3
+
+  bb.2.for.cond.cleanup:
+    RET_ReallyLR
+
+  bb.3.for.body:
+    successors: %bb.2(0x04000000), %bb.3(0x7c000000)
+
+    %1:gpr64sp = PHI %0, %bb.1, %6, %bb.3
+    %2:gpr64sp = PHI %8, %bb.1, %5, %bb.3
+    %3:gpr64sp = PHI %7, %bb.1, %4, %bb.3
+    early-clobber %12:gpr64sp, %13:fpr32 = LDRSpost %2, 4 :: (load (s32) from %ir.lsr.iv9, !tbaa !6)
+    %15:fpr32 = nofpexcept FADDSrr killed %13, %14, implicit $fpcr
+    early-clobber %16:gpr64sp = STRSpost killed %15, %3, 4 :: (store (s32) into %ir.lsr.iv, !tbaa !6)
+    %4:gpr64all = COPY %16
+    %5:gpr64all = COPY %12
+    %17:gpr64 = nsw SUBSXri %1, 1, 0, implicit-def $nzcv
+    %6:gpr64all = COPY %17
+    Bcc 0, %bb.2, implicit $nzcv
+    B %bb.3
+
+...
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir
new file mode 100644
index 000000000000000..aeb9770bcb47fde
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop2.mir
@@ -0,0 +1,86 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner: TBB == LoopBB, FBB == ExitBB, Branch with NZCV flags
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %cmp6 = icmp sgt i32 %n, 0
+    br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup
+
+  for.body.preheader:                               ; preds = %entry
+    %wide.trip.count = zext nneg i32 %n to i64
+    br label %for.body
+
+  for.cond.cleanup:                                 ; preds = %for.body, %entry
+    ret void
+
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %lsr.iv11 = phi i64 [ %wide.trip.count, %for.body.preheader ], [ %lsr.iv.next, %for.body ]
+    %lsr.iv9 = phi ptr [ %b, %for.body.preheader ], [ %scevgep10, %for.body ]
+    %lsr.iv = phi ptr [ %a, %for.body.preheader ], [ %scevgep, %for.body ]
+    %0 = load float, ptr %lsr.iv9, align 4, !tbaa !6
+    %add = fadd float %0, 1.000000e+00
+    store float %add, ptr %lsr.iv, align 4, !tbaa !6
+    %scevgep = getelementptr i8, ptr %lsr.iv, i64 4
+    %scevgep10 = getelementptr i8, ptr %lsr.iv9, i64 4
+    %lsr.iv.next = add nsw i64 %lsr.iv11, -1
+    %exitcond.not = icmp eq i64 %lsr.iv.next, 0
+    br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10
+  }
+
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%7' }
+  - { reg: '$x1', virtual-reg: '%8' }
+  - { reg: '$w2', virtual-reg: '%9' }
+body:             |
+  bb.0.entry:
+    successors: %bb.1(0x50000000), %bb.2(0x30000000)
+    liveins: $x0, $x1, $w2
+
+    %9:gpr32common = COPY $w2
+    %8:gpr64 = COPY $x1
+    %7:gpr64 = COPY $x0
+    dead $wzr = SUBSWri %9, 1, 0, implicit-def $nzcv
+    Bcc 11, %bb.2, implicit $nzcv
+    B %bb.1
+
+  bb.1.for.body.preheader:
+    %11:gpr32 = ORRWrs $wzr, %9, 0
+    %0:gpr64all = SUBREG_TO_REG 0, killed %11, %subreg.sub_32
+    %14:fpr32 = FMOVSi 112
+    B %bb.3
+
+  bb.2.for.cond.cleanup:
+    RET_ReallyLR
+
+  bb.3.for.body:
+    successors: %bb.2(0x04000000), %bb.3(0x7c000000)
+
+    %1:gpr64sp = PHI %0, %bb.1, %6, %bb.3
+    %2:gpr64sp = PHI %8, %bb.1, %5, %bb.3
+    %3:gpr64sp = PHI %7, %bb.1, %4, %bb.3
+    early-clobber %12:gpr64sp, %13:fpr32 = LDRSpost %2, 4 :: (load (s32) from %ir.lsr.iv9, !tbaa !6)
+    %15:fpr32 = nofpexcept FADDSrr killed %13, %14, implicit $fpcr
+    early-clobber %16:gpr64sp = STRSpost killed %15, %3, 4 :: (store (s32) into %ir.lsr.iv, !tbaa !6)
+    %4:gpr64all = COPY %16
+    %5:gpr64all = COPY %12
+    %17:gpr64 = nsw SUBSXri %1, 1, 0, implicit-def $nzcv
+    %6:gpr64all = COPY %17
+    Bcc 1, %bb.3, implicit $nzcv
+    B %bb.2
+
+...
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir
new file mode 100644
index 000000000000000..d5928c5e385c5f5
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop3.mir
@@ -0,0 +1,87 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner: TBB == ExitBB, FBB == LoopBB, Compare and branch
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %or.cond = icmp ult i32 %n, 2
+    br i1 %or.cond, label %for.end, label %for.body.preheader
+  
+  for.body.preheader:                               ; preds = %entry
+    %i.07 = add i32 %n, -1
+    %0 = sext i32 %i.07 to i64
+    br label %for.body
+  
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %indvars.iv = phi i64 [ %0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+    %1 = shl nsw i64 %indvars.iv, 2
+    %scevgep = getelementptr i8, ptr %b, i64 %1
+    %2 = load float, ptr %scevgep, align 4, !tbaa !6
+    %add = fadd float %2, 1.000000e+00
+    %3 = shl nsw i64 %indvars.iv, 2
+    %scevgep11 = getelementptr i8, ptr %a, i64 %3
+    store float %add, ptr %scevgep11, align 4, !tbaa !6
+    %indvars.iv.next = add nsw i64 %indvars.iv, -1
+    %4 = add i64 %indvars.iv, -1
+    %5 = and i64 %4, 4294967295
+    %tobool.not = icmp eq i64 %5, 0
+    br i1 %tobool.not, label %for.end, label %for.body, !llvm.loop !10
+  
+  for.end:                                          ; preds = %for.body, %entry
+    ret void
+  }
+  
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%3' }
+  - { reg: '$x1', virtual-reg: '%4' }
+  - { reg: '$w2', virtual-reg: '%5' }
+body:             |
+  bb.0.entry:
+    liveins: $x0, $x1, $w2
+  
+    %5:gpr32common = COPY $w2
+    %4:gpr64common = COPY $x1
+    %3:gpr64common = COPY $x0
+    dead $wzr = SUBSWri %5, 2, 0, implicit-def $nzcv
+    Bcc 3, %bb.3, implicit $nzcv
+    B %bb.1
+  
+  bb.1.for.body.preheader:
+    %7:gpr32common = SUBWri %5, 1, 0
+    %9:gpr64all = IMPLICIT_DEF
+    %8:gpr64 = SUBREG_TO_REG 0, killed %7, %subreg.sub_32
+    %10:gpr64 = SBFMXri killed %8, 0, 31
+    %0:gpr64all = COPY %10
+    %12:fpr32 = FMOVSi 112
+  
+  bb.2.for.body:
+    successors: %bb.3(0x04000000), %bb.2(0x7c000000)
+  
+    %1:gpr64common = PHI %0, %bb.1, %2, %bb.2
+    %11:fpr32 = LDRSroX %4, %1, 0, 1 :: (load (s32) from %ir.scevgep, !tbaa !6)
+    %13:fpr32 = nofpexcept FADDSrr killed %11, %12, implicit $fpcr
+    STRSroX killed %13, %3, %1, 0, 1 :: (store (s32) into %ir.scevgep11, !tbaa !6)
+    %14:gpr64common = SUBXri %1, 1, 0
+    %2:gpr64all = COPY %14
+    %15:gpr32 = COPY %14.sub_32
+    CBZW killed %15, %bb.3
+    B %bb.2
+  
+  bb.3.for.end:
+    RET_ReallyLR
+
+...
diff --git a/llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir b/llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir
new file mode 100644
index 000000000000000..ccf262dcb1efc10
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sms-acceptable-loop4.mir
@@ -0,0 +1,87 @@
+# RUN: llc --verify-machineinstrs -mtriple=aarch64 -o - %s -run-pass pipeliner -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 -debug-only=pipeliner 2>&1 | FileCheck %s
+
+# An acceptable loop by pipeliner TBB == LoopBB, FBB == ExitBB, Compare and branch
+# CHECK: Schedule Found? 1
+
+--- |
+  define dso_local void @func(ptr noalias nocapture noundef writeonly %a, ptr nocapture noundef readonly %b, i32 noundef %n) local_unnamed_addr #0 {
+  entry:
+    %or.cond = icmp ult i32 %n, 2
+    br i1 %or.cond, label %for.end, label %for.body.preheader
+  
+  for.body.preheader:                               ; preds = %entry
+    %i.07 = add i32 %n, -1
+    %0 = sext i32 %i.07 to i64
+    br label %for.body
+  
+  for.body:                                         ; preds = %for.body.preheader, %for.body
+    %indvars.iv = phi i64 [ %0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
+    %1 = shl nsw i64 %indvars.iv, 2
+    %scevgep = getelementptr i8, ptr %b, i64 %1
+    %2 = load float, ptr %scevgep, align 4, !tbaa !6
+    %add = fadd float %2, 1.000000e+00
+    %3 = shl nsw i64 %indvars.iv, 2
+    %scevgep11 = getelementptr i8, ptr %a, i64 %3
+    store float %add, ptr %scevgep11, align 4, !tbaa !6
+    %indvars.iv.next = add nsw i64 %indvars.iv, -1
+    %4 = add i64 %indvars.iv, -1
+    %5 = and i64 %4, 4294967295
+    %tobool.not = icmp eq i64 %5, 0
+    br i1 %tobool.not, label %for.end, label %for.body, !llvm.loop !10
+  
+  for.end:                                          ; preds = %for.body, %entry
+    ret void
+  }
+  
+  !6 = !{!7, !7, i64 0}
+  !7 = !{!"float", !8, i64 0}
+  !8 = !{!"omnipotent char", !9, i64 0}
+  !9 = !{!"Simple C/C++ TBAA"}
+  !10 = distinct !{!10, !11, !12}
+  !11 = !{!"llvm.loop.mustprogress"}
+  !12 = !{!"llvm.loop.unroll.disable"}
+
+...
+---
+name:            func
+tracksRegLiveness: true
+liveins:
+  - { reg: '$x0', virtual-reg: '%3' }
+  - { reg: '$x1...
[truncated]

ytmukai (Contributor, Author) commented Jan 26, 2024

I have already opened a PR for code expansion with MVE (#65609), but since the existing method (renaming registers with move instructions) also improves performance, I thought it would be good to support it first. I intend to use this as a basis for further improvements.

Note: the following three patches, which fix existing code, are required to run llvm-test-suite; I would like to submit PRs for them after this is accepted, so that AArch64 tests can be included in those patches.
https://github.com/ytmukai/llvm-project/tree/pipeliner-avoid-long-compile-time
https://github.com/ytmukai/llvm-project/tree/pipeliner-fix-invalid-memoperand
https://github.com/ytmukai/llvm-project/tree/pipeliner-fix-incorrect-element-insertion

davemgreen (Collaborator) left a comment:

Hello. I agree that so long as it is disabled by default this would be good to get into trunk. It sounds like there are still a number of issues, both performance and correctness, but having the ability to test and adjust it as we go should be useful.

Can I suggest that, instead of the option controlling whether enableMachinePipeliner returns true, it instead control whether the pass is added to the pass pipeline? That should hopefully remove the changes from the pass pipeline, and mean we don't need to pay for the extra analyses when the pass is disabled.

```
    if (!PredBranch)
      return nullptr;
  } else {
    // For compare and branch
```
davemgreen (Collaborator) commented: It might be best to check the opcodes, in case of other odd instructions, either existing now or added in the future.

ytmukai (Contributor, Author): Fixed.

```
  !8 = !{!"omnipotent char", !9, i64 0}
  !9 = !{!"Simple C/C++ TBAA"}
  !10 = distinct !{!10, !11, !12}
  !11 = !{!"llvm.loop.mustprogress"}
```
davemgreen (Collaborator) commented: You might be able to remove some of this metadata, if it is not important for the tests.

ytmukai (Contributor, Author): I confirmed that the tests still work with all of the metadata removed, so I removed it.

github-actions bot commented Jan 30, 2024: ✅ With the latest revision this PR passed the C/C++ code formatter.

ytmukai (Contributor, Author) commented Jan 30, 2024: @davemgreen Thank you for the review! You are correct; I changed it so that the pass is not added to the pipeline when it is disabled.
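The updated revision is not quoted in this thread, but given the EnableMachinePipeliner option added in AArch64Subtarget.cpp and the addPreRegAlloc() hunk above, the reworked gating presumably looks something like the following sketch (assuming the cl::opt, or an equivalent query, is made visible to the pass config; the exact wiring in the committed patch may differ):

```cpp
// Sketch: gate pass insertion on the command-line option, so that the
// pass pipeline (and the extra analyses it pulls in) is unchanged when
// the pipeliner is disabled.
void AArch64PassConfig::addPreRegAlloc() {
  // ... existing passes ...
  if (EnableMachinePipeliner && TM->getOptLevel() != CodeGenOptLevel::None)
    addPass(&MachinePipelinerID);
}
```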

davemgreen (Collaborator) left a comment: Thanks. LGTM

The merged commit message:

Add AArch64 implementations for the interfaces of MachinePipeliner
pass. The pass is disabled by default for AArch64. It is enabled by
specifying --aarch64-enable-pipeliner.

5 tests in llvm-test-suites show performance improvement by more than
5% on a Neoverse V1 processor.

| test                                                             | improvement |
| ---------------------------------------------------------------- | -----------:|
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test |         16% |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-flt.test |         16% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test               |         14% |
| SingleSource/Benchmarks/Misc/flops-5.test                        |         13% |
| SingleSource/Benchmarks/BenchmarkGame/spectral-norm.test         |          6% |

(base flags: -mcpu=neoverse-v1 -O3 -mrecip, flags for pipelining:
-mllvm -aarch64-enable-pipeliner -mllvm
-pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm
-pipeliner-enable-copytophi=0)

On the other hand, there are cases of significant performance
degradation. Algorithm improvements and adding the option/pragma will
be needed in the future.
ytmukai merged commit 70eab12 into llvm:main on Feb 2, 2024. 4 checks passed.
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request on Feb 5, 2024.
MrLop commented Mar 4, 2024: I have a question about the build options for llvm-test-suite: why set the pipeliner-enable-copytophi flag to 0? In some cases this can lead to a compilation failure, because the compiler may use registers before their definitions.

bcahoon (Contributor) commented Mar 4, 2024:

> I have a question about the build options for llvm-test-suite: why set the pipeliner-enable-copytophi flag to 0? In some cases this can lead to a compilation failure, because the compiler may use registers before their definitions.

The intent of that flag is that it should not affect correctness. Perhaps setting it to 0 exposes a different bug?

ytmukai (Contributor, Author) commented Mar 6, 2024:

> why set the pipeliner-enable-copytophi flag to 0

pipeliner-enable-copytophi often makes the DDG unschedulable.
The following is an example; the debug message shows that the scheduling failed:
https://godbolt.org/z/66avx9vvh
(The original C source is https://godbolt.org/z/Ksb3o45Yv)

In this case, pipeliner-enable-copytophi forces SU(4) (%35:gpr64 = nuw ADDXrr %2:gpr64common) to be scheduled after the nodes that refer to %2 (such as SU(3): ST1D %33:zpr, %31:ppr_3b, %14:gpr64common, %2:gpr64common), so that %2 and %35 are not live at the same time.
However, SU(4) has a real successor, SU(6) (dead $xzr = SUBSXrr %29:gpr64, %35:gpr64), which is always placed in stage 0 because it is part of the loop control.
As a result, scheduling becomes almost impossible.

I think pipeliner-enable-copytophi needs to be disabled depending on the architecture.

> In some cases, this can lead to a compilation failure because the compiler may use registers before register definitions.

I have never seen such a case. What are the conditions under which this failure occurs?
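For reference, the debug output discussed above can be reproduced with an invocation modeled on the RUN lines of the new sms-*.mir tests (the input file name here is illustrative, and -debug-only requires an assertions-enabled build of llc):

```
llc --verify-machineinstrs -mtriple=aarch64 -run-pass pipeliner \
    -aarch64-enable-pipeliner -pipeliner-enable-copytophi=0 \
    -debug-only=pipeliner loop.mir -o - 2>&1 | grep 'Schedule Found?'
```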
