
[AArch64] Improve scheduling latency into Bundles #86310

Merged
merged 1 commit into llvm:main from gh-a64-bundlesched on Apr 12, 2024

Conversation

davemgreen
Collaborator

By default, scheduling dependencies into a BUNDLE are given a latency of 0, as they operate on the implicit register of the bundle. This patch modifies that for AArch64 so that the latency is instead taken from the relevant instruction inside the bundle. This essentially assumes that the bundled instructions execute in a single cycle, which is probably OK for AArch64 considering bundles are mostly used for MOVPRFX, where this can help create slightly better scheduling, especially for in-order cores.
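For example, in the misched-bundle.mir test updated by this patch, the edges into and out of the MOVPRFX+FMLA bundle (SU(7)) pick up the real operand latencies instead of 0, because they are now computed against the instructions inside the bundle rather than against the BUNDLE pseudo itself:

  SU(5) -> SU(7):  Data Latency=0 Reg=$z5   becomes   Data Latency=3 Reg=$z5
  SU(7) -> SU(9):  Data Latency=0 Reg=$z1   becomes   Data Latency=4 Reg=$z1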

@llvmbot
Collaborator

llvmbot commented Mar 22, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Changes



Patch is 119.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/86310.diff

36 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetSubtargetInfo.h (+3-1)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+2-1)
  • (modified) llvm/lib/CodeGen/ScheduleDAGInstrs.cpp (+5-3)
  • (modified) llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp (+1-1)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+46)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp (+3-2)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+2-1)
  • (modified) llvm/lib/Target/Hexagon/HexagonSubtarget.cpp (+3-3)
  • (modified) llvm/lib/Target/Hexagon/HexagonSubtarget.h (+2-1)
  • (modified) llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/misched-bundle.mir (+18-18)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-minmax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-arith.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-minmax.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll (+8-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll (+48-48)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-shifts.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-fpext-load.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/sve-fptosi-sat.ll (+35-35)
  • (modified) llvm/test/CodeGen/AArch64/sve-fptoui-sat.ll (+61-61)
  • (modified) llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/sve-masked-gather-legalize.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/sve-smulo-sdnode.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-split-fcvt.ll (+17-17)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-extend-trunc.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/sve-umulo-sdnode.ll (+8-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-vecreduce-dot.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/sve2-xar.ll (+1-1)
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
index 7f8ed5c5019890..5023e29ce145ac 100644
--- a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -235,7 +235,9 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
   // and UseOpIdx are the indices of the operands in Def and Use, respectively.
   // Otherwise, either may be -1.
   virtual void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
-                                     int UseOpIdx, SDep &Dep) const {}
+                                     int UseOpIdx, SDep &Dep,
+                                     const TargetSchedModel *SchedModel) const {
+  }
 
   // For use with PostRAScheduling: get the anti-dependence breaking that should
   // be performed before post-RA scheduling.
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index eb42a78603d407..a986dc13370568 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -919,7 +919,8 @@ void SwingSchedulerDAG::updatePhiDependences() {
           if (!MI->isPHI()) {
             SDep Dep(SU, SDep::Data, Reg);
             Dep.setLatency(0);
-            ST.adjustSchedDependency(SU, 0, &I, MO.getOperandNo(), Dep);
+            ST.adjustSchedDependency(SU, 0, &I, MO.getOperandNo(), Dep,
+                                     &SchedModel);
             I.addPred(Dep);
           } else {
             HasPhiUse = Reg;
diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 51ede7992af53d..c848ce4af86ccc 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -282,7 +282,7 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
       } else {
         Dep.setLatency(0);
       }
-      ST.adjustSchedDependency(SU, OperIdx, UseSU, UseOpIdx, Dep);
+      ST.adjustSchedDependency(SU, OperIdx, UseSU, UseOpIdx, Dep, &SchedModel);
       UseSU->addPred(Dep);
     }
   }
@@ -323,7 +323,8 @@ void ScheduleDAGInstrs::addPhysRegDeps(SUnit *SU, unsigned OperIdx) {
           Dep.setLatency(
               SchedModel.computeOutputLatency(MI, OperIdx, DefInstr));
         }
-        ST.adjustSchedDependency(SU, OperIdx, DefSU, I->OpIdx, Dep);
+        ST.adjustSchedDependency(SU, OperIdx, DefSU, I->OpIdx, Dep,
+                                 &SchedModel);
         DefSU->addPred(Dep);
       }
     }
@@ -453,7 +454,8 @@ void ScheduleDAGInstrs::addVRegDefDeps(SUnit *SU, unsigned OperIdx) {
         SDep Dep(SU, SDep::Data, Reg);
         Dep.setLatency(SchedModel.computeOperandLatency(MI, OperIdx, Use,
                                                         I->OperandIndex));
-        ST.adjustSchedDependency(SU, OperIdx, UseSU, I->OperandIndex, Dep);
+        ST.adjustSchedDependency(SU, OperIdx, UseSU, I->OperandIndex, Dep,
+                                 &SchedModel);
         UseSU->addPred(Dep);
       }
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp b/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
index c9e2745f00c958..1066b1423ae4e4 100644
--- a/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
@@ -512,7 +512,7 @@ void ScheduleDAGSDNodes::AddSchedEdges() {
         Dep.setLatency(OpLatency);
         if (!isChain && !UnitLatencies) {
           computeOperandLatency(OpN, N, i, Dep);
-          ST.adjustSchedDependency(OpSU, DefIdx, &SU, i, Dep);
+          ST.adjustSchedDependency(OpSU, DefIdx, &SU, i, Dep, nullptr);
         }
 
         if (!SU.addPred(Dep) && !Dep.isCtrl() && OpSU->NumRegDefsLeft > 1) {
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index bb268b2ba926cb..b8c874f132e1a0 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -472,6 +472,52 @@ void AArch64Subtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,
   Policy.DisableLatencyHeuristic = DisableLatencySchedHeuristic;
 }
 
+void AArch64Subtarget::adjustSchedDependency(
+    SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx, SDep &Dep,
+    const TargetSchedModel *SchedModel) const {
+  if (!SchedModel || Dep.getKind() != SDep::Kind::Data || !Dep.getReg() ||
+      !Def->isInstr() || !Use->isInstr() ||
+      (Def->getInstr()->getOpcode() != TargetOpcode::BUNDLE &&
+       Use->getInstr()->getOpcode() != TargetOpcode::BUNDLE))
+    return;
+
+  // If the Def is a BUNDLE, find the last instruction in the bundle that defs
+  // the register.
+  MachineInstr *DefMI = Def->getInstr();
+  if (DefMI->getOpcode() == TargetOpcode::BUNDLE) {
+    Register Reg = DefMI->getOperand(DefOpIdx).getReg();
+    MachineBasicBlock::instr_iterator I = std::next(DefMI->getIterator());
+    while (I != DefMI->getParent()->end() && I->isBundledWithPred()) {
+      int Idx =
+          I->findRegisterDefOperandIdx(Reg, false, false, getRegisterInfo());
+      if (Idx != -1) {
+        DefMI = &*I;
+        DefOpIdx = Idx;
+      }
+      ++I;
+    }
+  }
+
+  // If the Use is a BUNDLE, find the first instruction that uses the Reg.
+  MachineInstr *UseMI = Use->getInstr();
+  if (UseMI->getOpcode() == TargetOpcode::BUNDLE) {
+    Register Reg = UseMI->getOperand(UseOpIdx).getReg();
+    MachineBasicBlock::instr_iterator I = std::next(UseMI->getIterator());
+    while (I != UseMI->getParent()->end() && I->isBundledWithPred()) {
+      int Idx = I->findRegisterUseOperandIdx(Reg, false, getRegisterInfo());
+      if (Idx != -1) {
+        UseMI = &*I;
+        UseOpIdx = Idx;
+        break;
+      }
+      ++I;
+    }
+  }
+
+  Dep.setLatency(
+      SchedModel->computeOperandLatency(DefMI, DefOpIdx, UseMI, UseOpIdx));
+}
+
 bool AArch64Subtarget::enableEarlyIfConversion() const {
   return EnableEarlyIfConvert;
 }
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index 95bef7a76bcab7..c5ebd4c6cc615a 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -354,6 +354,9 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
 
   void overrideSchedPolicy(MachineSchedPolicy &Policy,
                            unsigned NumRegionInstrs) const override;
+  void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx,
+                             SDep &Dep,
+                             const TargetSchedModel *SchedModel) const override;
 
   bool enableEarlyIfConversion() const override;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index fa77b94fc22def..4e35a6e405a5fa 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -849,8 +849,9 @@ unsigned GCNSubtarget::getMaxNumVGPRs(const MachineFunction &MF) const {
   return getBaseMaxNumVGPRs(F, MFI.getWavesPerEU());
 }
 
-void GCNSubtarget::adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
-                                         int UseOpIdx, SDep &Dep) const {
+void GCNSubtarget::adjustSchedDependency(
+    SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx, SDep &Dep,
+    const TargetSchedModel *SchedModel) const {
   if (Dep.getKind() != SDep::Kind::Data || !Dep.getReg() ||
       !Def->isInstr() || !Use->isInstr())
     return;
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index ca51da659c3311..5d43d3aa320fd0 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1495,7 +1495,8 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   }
 
   void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx,
-                             SDep &Dep) const override;
+                             SDep &Dep,
+                             const TargetSchedModel *SchedModel) const override;
 
   // \returns true if it's beneficial on this subtarget for the scheduler to
   // cluster stores as well as loads.
diff --git a/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp b/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
index 2d320e6b0cad75..da8ab5c4b21bbf 100644
--- a/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonSubtarget.cpp
@@ -437,9 +437,9 @@ bool HexagonSubtarget::useAA() const {
 
 /// Perform target specific adjustments to the latency of a schedule
 /// dependency.
-void HexagonSubtarget::adjustSchedDependency(SUnit *Src, int SrcOpIdx,
-                                             SUnit *Dst, int DstOpIdx,
-                                             SDep &Dep) const {
+void HexagonSubtarget::adjustSchedDependency(
+    SUnit *Src, int SrcOpIdx, SUnit *Dst, int DstOpIdx, SDep &Dep,
+    const TargetSchedModel *SchedModel) const {
   if (!Src->isInstr() || !Dst->isInstr())
     return;
 
diff --git a/llvm/lib/Target/Hexagon/HexagonSubtarget.h b/llvm/lib/Target/Hexagon/HexagonSubtarget.h
index e56007ee21e75f..da3f8f56ccc4a7 100644
--- a/llvm/lib/Target/Hexagon/HexagonSubtarget.h
+++ b/llvm/lib/Target/Hexagon/HexagonSubtarget.h
@@ -308,7 +308,8 @@ class HexagonSubtarget : public HexagonGenSubtargetInfo {
   /// Perform target specific adjustments to the latency of a schedule
   /// dependency.
   void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use, int UseOpIdx,
-                             SDep &Dep) const override;
+                             SDep &Dep,
+                             const TargetSchedModel *SchedModel) const override;
 
   unsigned getVectorLength() const {
     assert(useHVXOps());
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
index aac43f7b960fed..1751f99b3e92c8 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
@@ -190,8 +190,8 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    uzp1 z24.d, z2.d, z3.d
 ; CHECK-NEXT:    uzp2 z25.d, z0.d, z1.d
-; CHECK-NEXT:    ptrue p0.d
 ; CHECK-NEXT:    uzp2 z2.d, z2.d, z3.d
+; CHECK-NEXT:    ptrue p0.d
 ; CHECK-NEXT:    uzp1 z0.d, z0.d, z1.d
 ; CHECK-NEXT:    uzp1 z26.d, z6.d, z7.d
 ; CHECK-NEXT:    fmul z1.d, z24.d, z25.d
@@ -199,12 +199,12 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
 ; CHECK-NEXT:    uzp2 z25.d, z4.d, z5.d
 ; CHECK-NEXT:    uzp1 z4.d, z4.d, z5.d
 ; CHECK-NEXT:    uzp2 z5.d, z6.d, z7.d
-; CHECK-NEXT:    fmla z1.d, p0/m, z2.d, z0.d
 ; CHECK-NEXT:    fmla z3.d, p0/m, z26.d, z25.d
+; CHECK-NEXT:    fmla z1.d, p0/m, z2.d, z0.d
 ; CHECK-NEXT:    movprfx z2, z3
 ; CHECK-NEXT:    fmla z2.d, p0/m, z5.d, z4.d
-; CHECK-NEXT:    fnmls z2.d, p0/m, z24.d, z0.d
 ; CHECK-NEXT:    fmla z1.d, p0/m, z26.d, z4.d
+; CHECK-NEXT:    fnmls z2.d, p0/m, z24.d, z0.d
 ; CHECK-NEXT:    fmls z1.d, p0/m, z5.d, z25.d
 ; CHECK-NEXT:    zip1 z0.d, z2.d, z1.d
 ; CHECK-NEXT:    zip2 z1.d, z2.d, z1.d
diff --git a/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll b/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
index f8cba4ce4b63bf..ab15bf564ec425 100644
--- a/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
+++ b/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
@@ -1145,8 +1145,8 @@ define <vscale x 4 x i64> @fshl_rot_illegal_i64(<vscale x 4 x i64> %a, <vscale x
 ; CHECK-NEXT:    subr z3.d, z3.d, #0 // =0x0
 ; CHECK-NEXT:    and z4.d, z4.d, #0x3f
 ; CHECK-NEXT:    and z2.d, z2.d, #0x3f
-; CHECK-NEXT:    and z3.d, z3.d, #0x3f
 ; CHECK-NEXT:    and z5.d, z5.d, #0x3f
+; CHECK-NEXT:    and z3.d, z3.d, #0x3f
 ; CHECK-NEXT:    lslr z4.d, p0/m, z4.d, z0.d
 ; CHECK-NEXT:    lsr z0.d, p0/m, z0.d, z2.d
 ; CHECK-NEXT:    movprfx z2, z1
diff --git a/llvm/test/CodeGen/AArch64/misched-bundle.mir b/llvm/test/CodeGen/AArch64/misched-bundle.mir
index a947c04a4229e6..9adcd2904a250a 100644
--- a/llvm/test/CodeGen/AArch64/misched-bundle.mir
+++ b/llvm/test/CodeGen/AArch64/misched-bundle.mir
@@ -46,9 +46,9 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 3
 # CHECK-NEXT:   Depth              : 0
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 7
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z3
+# CHECK-NEXT:     SU(7): Data Latency=3 Reg=$z3
 # CHECK-NEXT:     SU(9): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
 # CHECK-NEXT:   Single Issue       : false;
@@ -58,9 +58,9 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 3
 # CHECK-NEXT:   Depth              : 0
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 7
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z4
+# CHECK-NEXT:     SU(7): Data Latency=3 Reg=$z4
 # CHECK-NEXT:     SU(9): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
 # CHECK-NEXT:   Single Issue       : false;
@@ -70,9 +70,9 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 3
 # CHECK-NEXT:   Depth              : 0
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 7
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z5
+# CHECK-NEXT:     SU(7): Data Latency=3 Reg=$z5
 # CHECK-NEXT:     SU(9): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
 # CHECK-NEXT:   Single Issue       : false;
@@ -98,15 +98,15 @@
 # CHECK-NEXT:   # rdefs left       : 0
 # CHECK-NEXT:   Latency            : 1
 # CHECK-NEXT:   Depth              : 3
-# CHECK-NEXT:   Height             : 0
+# CHECK-NEXT:   Height             : 4
 # CHECK-NEXT:   Predecessors:
 # CHECK-NEXT:     SU(6): Anti Latency=0
-# CHECK-NEXT:     SU(5): Data Latency=0 Reg=$z5
-# CHECK-NEXT:     SU(4): Data Latency=0 Reg=$z4
-# CHECK-NEXT:     SU(3): Data Latency=0 Reg=$z3
+# CHECK-NEXT:     SU(5): Data Latency=3 Reg=$z5
+# CHECK-NEXT:     SU(4): Data Latency=3 Reg=$z4
+# CHECK-NEXT:     SU(3): Data Latency=3 Reg=$z3
 # CHECK-NEXT:     SU(1): Out  Latency=1
 # CHECK-NEXT:   Successors:
-# CHECK-NEXT:     SU(9): Data Latency=0 Reg=$z1
+# CHECK-NEXT:     SU(9): Data Latency=4 Reg=$z1
 # CHECK-NEXT:   Single Issue       : false;
 # CHECK-NEXT: SU(8):   ST1H killed renamable $z0, renamable $p0, renamable $x0, renamable $x10 :: (store unknown-size, align 1)
 # CHECK-NEXT:   # preds left       : 7
@@ -135,7 +135,7 @@
 # CHECK-NEXT:   Height             : 0
 # CHECK-NEXT:   Predecessors:
 # CHECK-NEXT:     SU(8): Ord  Latency=0 Memory
-# CHECK-NEXT:     SU(7): Data Latency=0 Reg=$z1
+# CHECK-NEXT:     SU(7): Data Latency=4 Reg=$z1
 # CHECK-NEXT:     SU(5): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(4): Ord  Latency=0 Memory
 # CHECK-NEXT:     SU(3): Ord  Latency=0 Memory
@@ -159,24 +159,24 @@ body:             |
   bb.0.entry:
     liveins: $p0, $x0, $x1, $x2, $x10, $x11, $x12, $x13
 
+
     ; CHECK-LABEL: name: test
     ; CHECK: liveins: $p0, $x0, $x1, $x2, $x10, $x11, $x12, $x13
     ; CHECK-NEXT: {{  $}}
     ; CHECK-NEXT: renamable $z0 = LD1H renamable $p0, renamable $x1, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z1 = LD1H renamable $p0, renamable $x2, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z2 = LD1H renamable $p0, renamable $x0, renamable $x10 :: (load unknown-size, align 1)
-    ; CHECK-NEXT: $z0 = FMAD_ZPmZZ_H renamable $p0, killed $z0, renamable $z1, killed renamable $z2
     ; CHECK-NEXT: renamable $z3 = LD1H renamable $p0, renamable $x11, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z4 = LD1H renamable $p0, renamable $x12, renamable $x10 :: (load unknown-size, align 1)
     ; CHECK-NEXT: renamable $z5 = LD1H renamable $p0, renamable $x13, renamable $x10 :: (load unknown-size, align 1)
-    ; CHECK-NEXT: ST1H killed renamable $z0, renamable $p0, renamable $x0, renamable $x10 :: (store unknown-size, align 1)
-    ; CHECK-NEXT: BUNDLE implicit-def $z1, implicit-def $q1, implicit-def $d1, implicit-def $s1, implicit-def $h1, implicit-def $b1, implicit $z5, implicit $p0, implicit $z4, implicit $z3 {
+    ; CHECK-NEXT: $z0 = FMAD_ZPmZZ_H renamable $p0, killed $z0, killed renamable $z1, killed renamable $z2
+    ; CHECK-NEXT: BUNDLE implicit-def $z1, implicit-def $q1, implicit-def $d1, implicit-def $s1, implicit-def $h1, implicit-def $b1, implicit $z5, implicit $p0, implicit killed $z4, implicit killed $z3 {
     ; CHECK-NEXT:   $z1 = MOVPRFX_ZZ $z5
-    ; CHECK-NEXT:   $z1 = FMLA_ZPmZZ_H renamable $p0, internal $z1, renamable $z4, renamable $z3
+    ; CHECK-NEXT:   $z1 = FMLA_ZPmZZ_H renamable $p0, internal killed $z1, killed renamable $z4, killed renamable $z3
     ; CHECK-NEXT: }
-    ; CHECK-NEXT: ST1H renamable $z1, renamable $p0, renamable $x13, renamable $x10 :: (store unknown-size, align 1)
+    ; CHECK-NEXT: ST1H killed renamable $z0, renamable $p0, renamable $x0, renamable $x10 :: (store unknown-size, align 1)
+    ; CHECK-NEXT: ST1H killed renamable $z1, renamable $p0, renamable $x13, renamable $x10 :: (store unknown-size, align 1)
     ; CHECK-NEXT: RET_ReallyLR
-
     renamable $z0 = LD1H renamable $p0, renamable $x1, renamable $x10 :: (load unknown-size)
     renamable $z1 = LD1H renamable $p0, renamable $x2, renamable $x10 :: (load unknown-size)
     renamable $z2 = LD1H renamable $p0, renamable $x0, renamable $x10 :: (load unknown-size)
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll
index 0e3307d1617298..ad482118ec0bbe 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-build-vector.ll
@@ -43,7 +43,7 @@ define void @build_vector_minus2_dec32_v4i64(ptr %a) #0 {
 ; VBITS_GE_256-LABEL: build_vector_minus2_dec32_v4i64:
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
-; VBITS_GE_256-NEXT:    mov x8, #-32
+; VBITS_GE_256-NEXT:    mov x8, #-32 // =0xffffffffffffffe0
 ; VBITS_GE_256-NEXT:    index z0.d, #-2, x8
 ; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
index 64c4eea0ee5f36..6fbae7edfec0a3 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
@@ -57,8 +57,8 @@ define void @fadd_v32f16(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    fadd z0.h, p0/m, z0.h, z1.h
 ; VBITS_GE_256-NEXT:    movprfx z1, z2
 ; VBITS_GE_256-NEXT:    fadd z1.h, p0/m, z1.h, z3.h
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x0, x8, lsl #1]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fadd_v32f16:
@@ -156,8 +156,8 @@ define void @fadd_v16f32(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    fadd z0.s, p0/m, z0.s, z1.s
 ; VBITS_GE_256-NEXT:    movprfx z1, z2
 ; VBITS_GE_256-NEXT:    fadd z1.s, p0/m, z1.s, z3.s
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x0, x8, lsl #2]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fadd_v16f32:
@@ -255,8 +255,8 @@ define void @fadd_v8f64(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    fadd z0.d, p0/m, z0.d, z1.d
 ; VBITS_GE_256-NEXT:    movprfx z1, z2
 ; VBITS_GE_256-NEXT:    fadd z1.d, p0/m, z1.d, z3.d
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x0, x8, lsl #3]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x0]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fadd_v8f64:
@@ -662,8 +662,8 @@ define void @fma_v32f16(ptr %a, ptr %b, ptr %c) #0 {
 ; VBITS_GE_256-NEXT:    fmad z0.h, p0/m, z1.h, z2.h
 ; VBITS_GE_256-NEXT:    movprfx z1, z5
 ; VBITS_GE_256-NEXT:    fmla z1.h, p0/m, z3.h, z4.h
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x0, x8, lsl #1]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x0]
 ;...
[truncated]

✅ With the latest revision this PR passed the C/C++ code formatter.

✅ With the latest revision this PR passed the Python code formatter.

@davemgreen
Collaborator Author

Rebase and ping. Thanks

@david-arm (Contributor) left a comment

The new adjustSchedDependency function added for AArch64Subtarget seems sensible and looks like a definite improvement over the previous behaviour. I just had a question about whether this interface change is needed.

// the register.
const MachineInstr *DefMI = Def->getInstr();
if (DefMI->getOpcode() == TargetOpcode::BUNDLE) {
  Register Reg = DefMI->getOperand(DefOpIdx).getReg();

Contributor

This might be a silly question, but is this always guaranteed to be a register?

Collaborator Author

On the Bundle? Yep, I believe they will always be registers, especially for the DefOpIdx. The check above verifies that the dependency is a Reg.
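For reference, this is the guard at the top of the new AArch64 override (quoted from the patch above); it only proceeds for data dependencies that carry a register and where at least one side is a BUNDLE:

  if (!SchedModel || Dep.getKind() != SDep::Kind::Data || !Dep.getReg() ||
      !Def->isInstr() || !Use->isInstr() ||
      (Def->getInstr()->getOpcode() != TargetOpcode::BUNDLE &&
       Use->getInstr()->getOpcode() != TargetOpcode::BUNDLE))
    return;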

@@ -235,7 +235,9 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
   // and UseOpIdx are the indices of the operands in Def and Use, respectively.
   // Otherwise, either may be -1.
   virtual void adjustSchedDependency(SUnit *Def, int DefOpIdx, SUnit *Use,
-                                     int UseOpIdx, SDep &Dep) const {}
+                                     int UseOpIdx, SDep &Dep,
+                                     const TargetSchedModel *SchedModel) const {

Contributor

I noticed that AMDGPU overrides this and gets access to the sched model via the subtarget, so the AMDGPU version is possibly now a bit confusing. Does it need an assert that the SchedModel function argument matches the scheduling model it already uses? Alternatively, do we even need to pass in the SchedModel? Can the AArch64Subtarget also just store the scheduling model?

Collaborator Author

It looks like the AMDGPU InstrInfo will create a second TargetSchedModel that it can use from InstrInfo. I think it would be better if there were only one: the one from the MachineScheduler/MachinePipeliner it is being called from.

I'm open to suggestions if the AMD folks want this to work differently (preferably so long as it doesn't involve creating a new TargetSchedModel!).

Contributor

If that's the case then passing in the SchedModel as a parameter definitely seems preferable to creating a second copy in AArch64Subtarget. What I was wondering was - is there a way to stash the SchedModel passed in to adjustSchedDependency directly in the subtarget class itself? Maybe that's not easy due to the way it's set up, but I just thought it might avoid changing the interface. If it's too difficult that's fine and I'll accept the patch!

Collaborator Author

Hi. I'm not sure that holding state in the subtarget is better than passing the model over the interface; there can be use-after-free issues, and we could potentially end up using an incorrect model when it is no longer valid.

Contributor

Fair enough, that seems like a sensible argument to me!
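For illustration only, a hypothetical C++ sketch of the alternative discussed above (stashing the model in the subtarget; not part of this patch). The class and member names here are invented to show the lifetime concern:

  #include "llvm/CodeGen/TargetSchedule.h"

  class ExampleSubtarget { // hypothetical stand-in for AArch64Subtarget
    // Would be set by a scheduler before building the DAG (hypothetical hook).
    mutable const llvm::TargetSchedModel *CachedSchedModel = nullptr;

  public:
    void setSchedModel(const llvm::TargetSchedModel *SM) const {
      CachedSchedModel = SM;
    }
    // adjustSchedDependency would then read CachedSchedModel. Once the
    // MachineScheduler/MachinePipeliner that owns the TargetSchedModel is
    // destroyed, the cached pointer dangles, and a model cached for one
    // function or region may be stale for another: the use-after-free and
    // incorrect-model issues mentioned above.
  };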

@davemgreen requested a review from jayfoad on April 9, 2024 at 15:35
@david-arm (Contributor) left a comment

LGTM!

@davemgreen merged commit b24af43 into llvm:main on Apr 12, 2024
4 checks passed
@davemgreen deleted the gh-a64-bundlesched branch on April 12, 2024 at 09:57