
[AArch64] Improve scheduling latency into Bundles #86310

Merged
merged 1 commit into llvm:main from gh-a64-bundlesched on Apr 12, 2024

Conversation

davemgreen
Collaborator

By default, scheduling dependencies into a BUNDLE are given a latency of 0, as they operate on the implicit register of the bundle. This patch modifies that for AArch64 so that the latency is instead taken from the relevant instruction inside the bundle. This essentially assumes that the bundled instructions execute in a single cycle, which is probably OK for AArch64 considering bundles are mostly used for MOVPRFX, where this can help create slightly better scheduling, especially for in-order cores.
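For example, in the misched-bundle.mir test updated by this patch, the edges into and out of the MOVPRFX+FMLA bundle (SU(7)) pick up the real operand latencies instead of 0, because they are now computed against the instructions inside the bundle rather than against the BUNDLE pseudo itself:

  SU(5) -> SU(7):  Data Latency=0 Reg=$z5   becomes   Data Latency=3 Reg=$z5
  SU(7) -> SU(9):  Data Latency=0 Reg=$z1   becomes   Data Latency=4 Reg=$z1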

@llvmbot
Collaborator

llvmbot commented Mar 22, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Changes



Patch is 119.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/86310.diff

36 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetSubtargetInfo.h (+3-1)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+2-1)
  • (modified) llvm/lib/CodeGen/ScheduleDAGInstrs.cpp (+5-3)
  • (modified) llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp (+1-1)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+46)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp (+3-2)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+2-1)
  • (modified) llvm/lib/Target/Hexagon/HexagonSubtarget.cpp (+3-3)
  • (modified) llvm/lib/Target/Hexagon/HexagonSubtarget.h (+2-1)
  • (modified) llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/misched-bundle.mir (+18-18)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-minmax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-arith.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-minmax.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll (+8-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll (+48-48)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-shifts.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-fpext-load.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/sve-fptosi-sat.ll (+35-35)
  • (modified) llvm/test/CodeGen/AArch64/sve-fptoui-sat.ll (+61-61)
  • (modified) llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/sve-masked-gather-legalize.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/sve-smulo-sdnode.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-split-fcvt.ll (+17-17)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-extend-trunc.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/sve-umulo-sdnode.ll (+8-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-vecreduce-dot.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/sve2-xar.ll (+1-1)
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index 7f8ed5c5019890..5023e29ce145ac 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -235,7 +235,9 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
   // and UseOpIdx are the indices of the operands in Def and Use, respectively.
   // Otherwise, either may be -1.
   virtual void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
-                                     int UseOpIdx, SDep &Dep) const {}
+                                     int UseOpIdx, SDep &Dep,
+                                     const TargetSchedModel *SchedModel) const {
+  }
 
   // For use with PostRAScheduling: get the anti-dependence breaking that should
   // be performed before post-RA scheduling.
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index eb42a78603d407..a986dc13370568 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -919,7 +919,8 @@ void SwingSchedulerDAG::updatePhiDependences() {
           if (!MI->isPHI()) {
             SDep Dep(SU, SDep::Data, Reg);
             Dep.setLatency(0);
-            ST.adjustSchedDependency(SU, 0, &I, MO.getOperandNo(), Dep);
+            ST.adjustSchedDependency(SU, 0, &I, MO.getOperandNo(), Dep,
+                                     &SchedModel);
             I.addPred(Dep);
           } else {
             HasPhiUse = Reg;
diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 51ede7992af53d..c848ce4af86ccc 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -282,7 +282,7 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
       } else {
         Dep.setLatency(0);
       }
-      ST.adjustSchedDependency(SU, OperIdx, UseSU, UseOpIdx, Dep);
+      ST.adjustSchedDependency(SU, OperIdx, UseSU, UseOpIdx, Dep, &SchedModel);
       UseSU->addPred(Dep);
     }
   }
@@ -323,7 +323,8 @@ void ScheduleDAGInstrs::addPhysRegDeps(SUnit *SU, unsigned OperIdx) {
           Dep.setLatency(
               SchedModel.computeOutputLatency(MI, OperIdx, DefInstr));
         }
-        ST.adjustSchedDependency(SU, OperIdx, DefSU, I->OpIdx, Dep);
+        ST.adjustSchedDependency(SU, OperIdx, DefSU, I->OpIdx, Dep,
+                                 &SchedModel);
         DefSU->addPred(Dep);
       }
     }
@@ -453,7 +454,8 @@ void ScheduleDAGInstrs::addVRegDefDeps(SUnit *SU, unsigned OperIdx) {
         SDep Dep(SU, SDep::Data, Reg);
         Dep.setLatency(SchedModel.computeOperandLatency(MI, OperIdx, Use,
                                                         I->OperandIndex));
-        ST.adjustSchedDependency(SU, OperIdx, UseSU, I->OperandIndex, Dep);
+        ST.adjustSchedDependency(SU, OperIdx, UseSU, I->OperandIndex, Dep,
+                                 &SchedModel);
         UseSU->addPred(Dep);
       }
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp b/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
index c9e2745f00c958..1066b1423ae4e4 100644
--- a/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
@@ -512,7 +512,7 @@ void ScheduleDAGSDNodes::AddSchedEdges() {
         Dep.setLatency(OpLatency);
         if (!isChain && !UnitLatencies) {
           computeOperandLatency(OpN, N, i, Dep);
-          ST.adjustSchedDependency(OpSU, DefIdx, &SU, i, Dep);
+          ST.adjustSchedDependency(OpSU, DefIdx, &SU, i, Dep, nullptr);
         }
 
         if (!SU.addPred(Dep) && !Dep.isCtrl() && OpSU->NumRegDefsLeft > 1) {
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index bb268b2ba926cb..b8c874f132e1a0 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -472,6 +472,52 @@ void AArch64Subtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,
   Policy.DisableLatencyHeuristic = DisableLatencySchedHeuristic;
 }
 
+void AArch64Subtarget::adjustSchedDependency(
+    SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx, SDep &Dep,
+    const TargetSchedModel *SchedModel) const {
+  if (!SchedModel || Dep.getKind() != SDep::Kind::Data || !Dep.getReg() ||
+      !Def->isInstr() || !Use->isInstr() ||
+      (Def->getInstr()->getOpcode() != TargetOpcode::BUNDLE &&
+       Use->getInstr()->getOpcode() != TargetOpcode::BUNDLE))
+    return;
+
+  // If the Def is a BUNDLE, find the last instruction in the bundle that defs
+  // the register.
+  MachineInstr *DefMI = Def->getInstr();
+  if (DefMI->getOpcode() == TargetOpcode::BUNDLE) {
+    Register Reg = DefMI->getOperand(DefOpIdx).getReg();
+    MachineBasicBlock::instr_iterator I = std::next(DefMI->getIterator());
+    while (I != DefMI->getParent()->end() && I->isBundledWithPred()) {
+      int Idx =
+          I->findRegisterDefOperandIdx(Reg, false, false, getRegisterInfo());
+      if (Idx != -1) {
+        DefMI = &*I;
+        DefOpIdx = Idx;
+      }
+      ++I;
+    }
+  }
+
+  // If the Use is a BUNDLE, find the first instruction that uses the Reg.
+  MachineInstr *UseMI = Use->getInstr();
+  if (UseMI->getOpcode() == TargetOpcode::BUNDLE) {
+    Register Reg = UseMI->getOperand(UseOpIdx).getReg();
+    MachineBasicBlock::instr_iterator I = std::next(UseMI->getIterator());
+    while (I != UseMI->getParent()->end() && I->isBundledWithPred()) {
+      int Idx = I->findRegisterUseOperandIdx(Reg, false, getRegisterInfo());
+      if (Idx != -1) {
+        UseMI = &*I;
+        UseOpIdx = Idx;
+        break;
+      }
+      ++I;
+    }
+  }
+
+  Dep.setLatency(
+      SchedModel->computeOperandLatency(DefMI, DefOpIdx, UseMI, UseOpIdx));
+}
+
 bool AArch64Subtarget::enableEarlyIfConversion() const {
   return EnableEarlyIfConvert;
 }
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index 95bef7a76bcab7..c5ebd4c6cc615a 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -354,6 +354,9 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
 
   void overrideSchedPolicy(MachineSchedPolicy &Policy,
                            unsigned NumRegionInstrs) const override;
+  void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx,
+                             SDep &Dep,
+                             const TargetSchedModel *SchedModel) const override;
 
   bool enableEarlyIfConversion() const override;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index fa77b94fc22def..4e35a6e405a5fa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -849,8 +849,9 @@ unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF) const {
   return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU());
 }
 
-void GCNSubtarget::adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
-                                         int UseOpIdx, SDep &Dep) const {
+void GCNSubtarget::adjustSchedDependency(
+    SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx, SDep &Dep,
+    const TargetSchedModel *SchedModel) const {
   if (Dep.getKind() != SDep::Kind::Data || !Dep.getReg() ||
       !Def->isInstr() || !Use->isInstr())
     return;
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index ca51da659c3311..5d43d3aa320fd0 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1495,7 +1495,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   }
 
   void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx,
-                             SDep &Dep) const override;
+                             SDep &Dep,
+                             const TargetSchedModel *SchedModel) const override;
 
   // \returns true if it's beneficial on this subtarget for the scheduler to
   // cluster stores as well as loads.
diff --git a/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp b/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
index 2d320e6b0cad75..da8ab5c4b21bbf 100644
--- a/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
@@ -437,9 +437,9 @@ bool HexagonSubtarget::useAA() const {
 
 /// Perform target specific adjustments to the latency of a schedule
 /// dependency.
-void HexagonSubtarget::adjustSchedDependency(SUnit *Src, int SrcOpIdx,
-                                             SUnit *Dst, int DstOpIdx,
-                                             SDep &Dep) const {
+void HexagonSubtarget::adjustSchedDependency(
+    SUnit *Src, int SrcOpIdx, SUnit *Dst, int DstOpIdx, SDep &Dep,
+    const TargetSchedModel *SchedModel) const {
   if (!Src->isInstr() || !Dst->isInstr())
     return;
 
diff --git a/llvm/lib/Target/Hexagon/HexagonSubtarget.h b/llvm/lib/Target/Hexagon/HexagonSubtarget.h
index e56007ee21e75f..da3f8f56ccc4a7 100644
--- a/llvm/lib/Target/Hexagon/HexagonSubtarget.h
+++ b/llvm/lib/Target/Hexagon/HexagonSubtarget.h
@@ -308,7 +308,8 @@ class HexagonSubtarget : public HexagonGenSubtargetInfo {
   /// Perform target specific adjustments to the latency of a schedule
   /// dependency.
   void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx,
-                             SDep &Dep) const override;
+                             SDep &Dep,
+                             const TargetSchedModel *SchedModel) const override;
 
   unsigned getVectorLength() const {
     assert(useHVXOps());
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
index aac43f7b960fed..1751f99b3e92c8 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
@@ -190,8 +190,8 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    uzp1 z24.d, z2.d, z3.d
 ; CHECK-NEXT:    uzp2 z25.d, z0.d, z1.d
-; CHECK-NEXT:    ptrue p0.d
 ; CHECK-NEXT:    uzp2 z2.d, z2.d, z3.d
+; CHECK-NEXT:    ptrue p0.d
 ; CHECK-NEXT:    uzp1 z0.d, z0.d, z1.d
 ; CHECK-NEXT:    uzp1 z26.d, z6.d, z7.d
 ; CHECK-NEXT:    fmul z1.d, z24.d, z25.d
@@ -199,12 +199,12 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
 ; CHECK-NEXT:    uzp2 z25.d, z4.d, z5.d
 ; CHECK-NEXT:    uzp1 z4.d, z4.d, z5.d
 ; CHECK-NEXT:    uzp2 z5.d, z6.d, z7.d
-; CHECK-NEXT:    fmla z1.d, p0/m, z2.d, z0.d
 ; CHECK-NEXT:    fmla z3.d, p0/m, z26.d, z25.d
+; CHECK-NEXT:    fmla z1.d, p0/m, z2.d, z0.d
 ; CHECK-NEXT:    movprfx z2, z3
 ; CHECK-NEXT:    fmla z2.d, p0/m, z5.d, z4.d
-; CHECK-NEXT:    fnmls z2.d, p0/m, z24.d, z0.d
 ; CHECK-NEXT:    fmla z1.d, p0/m, z26.d, z4.d
+; CHECK-NEXT:    fnmls z2.d, p0/m, z24.d, z0.d
 ; CHECK-NEXT:    fmls z1.d, p0/m, z5.d, z25.d
 ; CHECK-NEXT:    zip1 z0.d, z2.d, z1.d
 ; CHECK-NEXT:    zip2 z1.d, z2.d, z1.d
diff --git a/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll b/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
index f8cba4ce4b63bf..ab15bf564ec425 100644
--- a/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
+++ b/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
@@ -1145,8 +1145,8 @@ define <vscale x 4 x i64> @fshl_rot_illegal_i64(<vscale x 4 x i64> %a, <vscale x
 ; CHECK-NEXT:    subr z3.d, z3.d, #0 // =0x0
 ; CHECK-NEXT:    and z4.d, z4.d, #0x3f
 ; CHECK-NEXT:    and z2.d, z2.d, #0x3f
-; CHECK-NEXT:    and z3.d, z3.d, #0x3f
 ; CHECK-NEXT:    and z5.d, z5.d, #0x3f
+; CHECK-NEXT:    and z3.d, z3.d, #0x3f
 ; CHECK-NEXT:    lslr z4.d, p0/m, z4.d, z0.d
 ; CHECK-NEXT:    lsr z0.d, p0/m, z0.d, z2.d
 ; CHECK-NEXT:    movprfx z2, z1
diff --git a/llvm/test/CodeGen/AArch64/misched-bundle.mir b/llvm/test/CodeGen/AArch64/misched-bundle.mir
index a947c04a4229e6..9adcd2904a250a 100644
--- a/llvm/test/CodeGen/AArch64/misched-bundle.mir
+++ b/llvm/test/CodeGen/AArch64/misched-bundle.mir
@@ -46,9 +46,9 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 3
 # CHECK-NEXT:   Depth              : 0
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 7
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z3
+# CHECK-NEXT:     SU(7): Data Latency=3 Reg=$z3
 # CHECK-NEXT:     SU(9): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
 # CHECK-NEXT:   Single Issue       : false;
@@ -58,9 +58,9 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 3
 # CHECK-NEXT:   Depth              : 0
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 7
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z4
+# CHECK-NEXT:     SU(7): Data Latency=3 Reg=$z4
 # CHECK-NEXT:     SU(9): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
 # CHECK-NEXT:   Single Issue       : false;
@@ -70,9 +70,9 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 3
 # CHECK-NEXT:   Depth              : 0
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 7
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z5
+# CHECK-NEXT:     SU(7): Data Latency=3 Reg=$z5
 # CHECK-NEXT:     SU(9): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
 # CHECK-NEXT:   Single Issue       : false;
@@ -98,15 +98,15 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 1
 # CHECK-NEXT:   Depth              : 3
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 4
 # CHECK-NEXT:   Predecessors:
 # CHECK-NEXT:     SU(6): Anti Latency=0
-# CHECK-NEXT:     SU(5): Data Latency=0 Reg=$z5
-# CHECK-NEXT:     SU(4): Data Latency=0 Reg=$z4
-# CHECK-NEXT:     SU(3): Data Latency=0 Reg=$z3
+# CHECK-NEXT:     SU(5): Data Latency=3 Reg=$z5
+# CHECK-NEXT:     SU(4): Data Latency=3 Reg=$z4
+# CHECK-NEXT:     SU(3): Data Latency=3 Reg=$z3
 # CHECK-NEXT:     SU(1): Out  Latency=1
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(9): Data Latency=0 Reg=$z1
+# CHECK-NEXT:     SU(9): Data Latency=4 Reg=$z1
 # CHECK-NEXT:   Single Issue       : false;
 # CHECK-NEXT: SU(8):   ST1H killed renamable $z0, renamable $p0, renamable $x0, renamable $x10 :: (store unknown-size, align 1)
 # CHECK-NEXT:   # preds left       : 7
@@ -135,7 +135,7 @@
 # CHECK-NEXT:   Height             : 0
 # CHECK-NEXT:   Predecessors:
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z1
+# CHECK-NEXT:     SU(7): Data Latency=4 Reg=$z1
 # CHECK-NEXT:     SU(5): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(4): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(3): Ord  Latency=0 Memory
@@ -159,24 +159,24 @@ body:             |
   bb.0.entry:
     liveins: $p0, $x0, $x1, $x2, $x10, $x11, $x12, $x13
 
+
     ; CHECK-LABEL: name: test
     ; CHECK: liveins: $p0, $x0, $x1, $x2, $x10, $x11, $x12, $x13
     ; CHECK-NEXT: {{  $}}
     ; CHECK-NEXT: renamable $z0 = LD1H renamable $p0, renamable $x1, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z1 = LD1H renamable $p0, renamable $x2, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z2 = LD1H renamable $p0, renamable $x0, renamable $x10 :: (load unknown-size, align 1)
-    ; CHECK-NEXT: $z0 = FMAD_ZPmZZ_H renamable $p0, killed $z0, renamable $z1, killed renamable $z2
     ; CHECK-NEXT: renamable $z3 = LD1H renamable $p0, renamable $x11, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z4 = LD1H renamable $p0, renamable $x12, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z5 = LD1H renamable $p0, renamable $x13, renamable $x10 :: (load unknown-size, align 1)
-    ; CHECK-NEXT: ST1H killed renamable $z0, renamable $p0, renamable $x0, renamable $x10 :: (store unknown-size, align 1)
-    ; CHECK-NEXT: BUNDLE implicit-def $z1, implicit-def $q1, implicit-def $d1, implicit-def $s1, implicit-def $h1, implicit-def $b1, implicit $z5, implicit $p0, implicit $z4, implicit $z3 {
+    ; CHECK-NEXT: $z0 = FMAD_ZPmZZ_H renamable $p0, killed $z0, killed renamable $z1, killed renamable $z2
+    ; CHECK-NEXT: BUNDLE implicit-def $z1, implicit-def $q1, implicit-def $d1, implicit-def $s1, implicit-def $h1, implicit-def $b1, implicit $z5, implicit $p0, implicit killed $z4, implicit killed $z3 {
     ; CHECK-NEXT:   $z1 = MOVPRFX_ZZ $z5
-    ; CHECK-NEXT:   $z1 = FMLA_ZPmZZ_H renamable $p0, internal $z1, renamable $z4, renamable $z3
+    ; CHECK-NEXT:   $z1 = FMLA_ZPmZZ_H renamable $p0, internal killed $z1, killed renamable $z4, killed renamable $z3
     ; CHECK-NEXT: }
-    ; CHECK-NEXT: ST1H renamable $z1, renamable $p0, renamable $x13, renamable $x10 :: (store unknown-size, align 1)
+    ; CHECK-NEXT: ST1H killed renamable $z0, renamable $p0, renamable $x0, renamable $x10 :: (store unknown-size, align 1)
+    ; CHECK-NEXT: ST1H killed renamable $z1, renamable $p0, renamable $x13, renamable $x10 :: (store unknown-size, align 1)
     ; CHECK-NEXT: RET_ReallyLR
-
     renamable $z0 = LD1H renamable $p0, renamable $x1, renamable $x10 :: (load unknown-size)
     renamable $z1 = LD1H renamable $p0, renamable $x2, renamable $x10 :: (load unknown-size)
     renamable $z2 = LD1H renamable $p0, renamable $x0, renamable $x10 :: (load unknown-size)
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll
index 0e3307d1617298..ad482118ec0bbe 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll
@@ -43,7 +43,7 @@ define void @build_vector_minus2_dec32_v4i64(ptr %a) #0 {
 ; VBITS_GE_256-LABEL: build_vector_minus2_dec32_v4i64:
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
-; VBITS_GE_256-NEXT:    mov x8, #-32
+; VBITS_GE_256-NEXT:    mov x8, #-32 // =0xffffffffffffffe0
 ; VBITS_GE_256-NEXT:    index z0.d, #-2, x8
 ; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
index 64c4eea0ee5f36..6fbae7edfec0a3 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
@@ -57,8 +57,8 @@ define void @fadd_v32f16(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    fadd z0.h, p0/m, z0.h, z1.h
 ; VBITS_GE_256-NEXT:    movprfx z1, z2
 ; VBITS_GE_256-NEXT:    fadd z1.h, p0/m, z1.h, z3.h
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x0, x8, lsl #1]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fadd_v32f16:
@@ -156,8 +156,8 @@ define void @fadd_v16f32(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    fadd z0.s, p0/m, z0.s, z1.s
 ; VBITS_GE_256-NEXT:    movprfx z1, z2
 ; VBITS_GE_256-NEXT:    fadd z1.s, p0/m, z1.s, z3.s
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x0, x8, lsl #2]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fadd_v16f32:
@@ -255,8 +255,8 @@ define void @fadd_v8f64(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    fadd z0.d, p0/m, z0.d, z1.d
 ; VBITS_GE_256-NEXT:    movprfx z1, z2
 ; VBITS_GE_256-NEXT:    fadd z1.d, p0/m, z1.d, z3.d
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x0, x8, lsl #3]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fadd_v8f64:
@@ -662,8 +662,8 @@ define void @fma_v32f16(ptr %a, ptr %b, ptr %c) #0 {
 ; VBITS_GE_256-NEXT:    fmad z0.h, p0/m, z1.h, z2.h
 ; VBITS_GE_256-NEXT:    movprfx z1, z5
 ; VBITS_GE_256-NEXT:    fmla z1.h, p0/m, z3.h, z4.h
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x0, x8, lsl #1]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ;...
[truncated]

✅ With the latest revision this PR passed the C/C++ code formatter.

✅ With the latest revision this PR passed the Python code formatter.

@davemgreen
Collaborator Author

Rebase and ping. Thanks

@david-arm (Contributor) left a comment

The new adjustSchedDependency function added for AArch64Subtarget seems sensible and looks like a definite improvement over the previous behaviour. I just had a question about whether this interface change is needed.

// the register.
const MachineInstr *DefMI = Def->getInstr();
if (DefMI->getOpcode() == TargetOpcode::BUNDLE) {
  Register Reg = DefMI->getOperand(DefOpIdx).getReg();

Contributor

This might be a silly question, but is this always guaranteed to be a register?

Collaborator Author

On the Bundle? Yep, I believe they will always be registers, especially for the DefOpIdx. The check above verifies that the dependency is a Reg.
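For reference, this is the guard at the top of the new AArch64 override (quoted from the patch above); it only proceeds for data dependencies that carry a register and where at least one side is a BUNDLE:

  if (!SchedModel || Dep.getKind() != SDep::Kind::Data || !Dep.getReg() ||
      !Def->isInstr() || !Use->isInstr() ||
      (Def->getInstr()->getOpcode() != TargetOpcode::BUNDLE &&
       Use->getInstr()->getOpcode() != TargetOpcode::BUNDLE))
    return;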

@@ -235,7 +235,9 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
   // and UseOpIdx are the indices of the operands in Def and Use, respectively.
   // Otherwise, either may be -1.
   virtual void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
-                                     int UseOpIdx, SDep &Dep) const {}
+                                     int UseOpIdx, SDep &Dep,
+                                     const TargetSchedModel *SchedModel) const {

Contributor

I noticed that AMDGPU overrides this and gets access to the sched model via the subtarget, so the AMDGPU version is possibly now a bit confusing. Does it need an assert that the SchedModel function argument matches the scheduling model it already uses? Alternatively, do we even need to pass in the SchedModel? Can the AArch64Subtarget also just store the scheduling model?

Collaborator Author

It looks like the AMDGPU InstrInfo will create a second TargetSchedModel that it can use from InstrInfo. I think it would be better if there were only one: the one from the MachineScheduler/MachinePipeliner it is being called from.

I'm open to suggestions if the AMD folks want this to work differently (preferably so long as it doesn't involve creating a new TargetSchedModel!).

Contributor

If that's the case then passing in the SchedModel as a parameter definitely seems preferable to creating a second copy in AArch64Subtarget. What I was wondering was - is there a way to stash the SchedModel passed in to adjustSchedDependency directly in the subtarget class itself? Maybe that's not easy due to the way it's set up, but I just thought it might avoid changing the interface. If it's too difficult that's fine and I'll accept the patch!

Collaborator Author

Hi. I'm not sure that holding state in the subtarget is better than passing the model over the interface; there can be use-after-free issues, and we could potentially end up using an incorrect model when it is no longer valid.

Contributor

Fair enough, that seems like a sensible argument to me!
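For illustration only, a hypothetical C++ sketch of the alternative discussed above (stashing the model in the subtarget; not part of this patch). The class and member names here are invented to show the lifetime concern:

  #include "llvm/CodeGen/TargetSchedule.h"

  class ExampleSubtarget { // hypothetical stand-in for AArch64Subtarget
    // Would be set by a scheduler before building the DAG (hypothetical hook).
    mutable const llvm::TargetSchedModel *CachedSchedModel = nullptr;

  public:
    void setSchedModel(const llvm::TargetSchedModel *SM) const {
      CachedSchedModel = SM;
    }
    // adjustSchedDependency would then read CachedSchedModel. Once the
    // MachineScheduler/MachinePipeliner that owns the TargetSchedModel is
    // destroyed, the cached pointer dangles, and a model cached for one
    // function or region may be stale for another: the use-after-free and
    // incorrect-model issues mentioned above.
  };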

@davemgreen requested a review from jayfoad on April 9, 2024 at 15:35
@david-arm (Contributor) left a comment

LGTM!

@davemgreen merged commit b24af43 into llvm:main on Apr 12, 2024
4 checks passed
@davemgreen deleted the gh-a64-bundlesched branch on April 12, 2024 at 09:57