[RISCV] Enable load clustering by default #73789

Open: asb wants to merge 3 commits into main from 2023q4-riscv-cluster-memops-2

Conversation

asb (Contributor) commented Nov 29, 2023

Stacks on top of #75338 and #75341

Posting for comment, for people to test on their workloads, and for feedback
on ideas for a tweaked heuristic etc.

llvmbot (Collaborator) commented Nov 29, 2023

@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-aarch64

Author: Alex Bradbury (asb)

Changes

[RISCV] Enable load clustering by default

Also tweaks the heuristic to cluster if operations are within a cache
line of each other (as AMDGPU does in shouldScheduleLoadsNear). X86 does
something similar, but does `((Offset2 - Offset1) / 8 > 64)`. I'm not
sure if that's intentionally set to 512 bytes or if the division is in
error.
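
For concreteness, the two checks compare roughly as follows (a minimal sketch, not the code from either backend; the helper names are made up, and the X86 condition is read as I interpret the quote above):

#include <cstdint>
#include <cstdlib>

// Sketch of the check adopted here: cluster if the two ops are within a
// 64-byte cache line of each other.
static bool clusterWithinCacheLine(int64_t Offset1, int64_t Offset2) {
  return std::abs(Offset1 - Offset2) < 64;
}

// Sketch of the X86-style condition as quoted above: because of the
// division by 8, clustering is only rejected once the offsets are more
// than 8 * 64 = 512 bytes apart.
static bool clusterX86Style(int64_t Offset1, int64_t Offset2) {
  return (Offset2 - Offset1) / 8 <= 64;
}

Both accept clustering for small gaps; the difference is how large a gap still counts as "near".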

Posting for comment, for people to test on their workloads, and for feedback
on ideas for a tweaked heuristic etc.

Stacks on top of #73778.


Patch is 662.06 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73789.diff

45 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+6)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+9-5)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.cpp (+3-2)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.h (+2)
  • (modified) llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp (+4-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+2)
  • (modified) llvm/lib/Target/PowerPC/PPCInstrInfo.cpp (+3-2)
  • (modified) llvm/lib/Target/PowerPC/PPCInstrInfo.h (+2)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfo.cpp (+6-5)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfo.h (+2)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+3-13)
  • (modified) llvm/test/CodeGen/RISCV/add-before-shl.ll (+13-13)
  • (modified) llvm/test/CodeGen/RISCV/callee-saved-fpr32s.ll (+360-360)
  • (modified) llvm/test/CodeGen/RISCV/callee-saved-fpr64s.ll (+248-248)
  • (modified) llvm/test/CodeGen/RISCV/callee-saved-gprs.ll (+522-522)
  • (modified) llvm/test/CodeGen/RISCV/calling-conv-ilp32-ilp32f-ilp32d-common.ll (+44-44)
  • (modified) llvm/test/CodeGen/RISCV/calling-conv-lp64-lp64f-lp64d-common.ll (+22-22)
  • (modified) llvm/test/CodeGen/RISCV/iabs.ll (+52-52)
  • (modified) llvm/test/CodeGen/RISCV/idiv_large.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/intrinsic-cttz-elts.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/legalize-fneg.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/llvm.exp10.ll (+13-13)
  • (modified) llvm/test/CodeGen/RISCV/mul.ll (+30-30)
  • (modified) llvm/test/CodeGen/RISCV/nontemporal.ll (+710-710)
  • (modified) llvm/test/CodeGen/RISCV/push-pop-popret.ll (+1002-1002)
  • (modified) llvm/test/CodeGen/RISCV/reduction-formation.ll (+36-36)
  • (modified) llvm/test/CodeGen/RISCV/rv32zbb.ll (+22-22)
  • (modified) llvm/test/CodeGen/RISCV/rv64i-shift-sext.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/expand-no-v.ll (+22-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll (+17-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access-zve32x.ll (+11-11)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-store-asm.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll (+132-132)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr63596.ll (+6-6)
  • (modified) llvm/test/CodeGen/RISCV/shifts.ll (+259-241)
  • (modified) llvm/test/CodeGen/RISCV/srem-seteq-illegal-types.ll (+91-91)
  • (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+107-107)
  • (modified) llvm/test/CodeGen/RISCV/umulo-128-legalisation-lowering.ll (+49-49)
  • (modified) llvm/test/CodeGen/RISCV/unaligned-load-store.ll (+28-29)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+83-83)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+1073-1065)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+1911-1883)
  • (modified) llvm/test/CodeGen/RISCV/xtheadmempair.ll (+7-7)
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 665b7449ddb820a..4ec46f9bde1adf2 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -1496,12 +1496,18 @@ class TargetInstrInfo : public MCInstrInfo {
   /// to TargetPassConfig::createMachineScheduler() to have an effect.
   ///
   /// \p BaseOps1 and \p BaseOps2 are memory operands of two memory operations.
+  /// \p Offset1 and \p Offset2 are the byte offsets for the memory
+  /// operations.
+  /// \p OffsetIsScalable1 and \p OffsetIsScalable2 indicate if the offset is
+  /// scaled by a runtime quantity.
   /// \p ClusterSize is the number of operations in the resulting load/store
   /// cluster if this hook returns true.
   /// \p NumBytes is the number of bytes that will be loaded from all the
   /// clustered loads if this hook returns true.
   virtual bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
+                                   int64_t Offset1, bool OffsetIsScalable1,
                                    ArrayRef<const MachineOperand *> BaseOps2,
+                                   int64_t Offset2, bool OffsetIsScalable2,
                                    unsigned ClusterSize,
                                    unsigned NumBytes) const {
     llvm_unreachable("target did not implement shouldClusterMemOps()");
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 4add33ba0996af0..cd5fe71ef0c1ad1 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1698,11 +1698,12 @@ class BaseMemOpClusterMutation : public ScheduleDAGMutation {
     SmallVector<const MachineOperand *, 4> BaseOps;
     int64_t Offset;
     unsigned Width;
+    bool OffsetIsScalable;
 
     MemOpInfo(SUnit *SU, ArrayRef<const MachineOperand *> BaseOps,
-              int64_t Offset, unsigned Width)
+              int64_t Offset, bool OffsetIsScalable, unsigned Width)
         : SU(SU), BaseOps(BaseOps.begin(), BaseOps.end()), Offset(Offset),
-          Width(Width) {}
+          OffsetIsScalable(OffsetIsScalable), Width(Width) {}
 
     static bool Compare(const MachineOperand *const &A,
                         const MachineOperand *const &B) {
@@ -1831,8 +1832,10 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
           SUnit2ClusterInfo[MemOpa.SU->NodeNum].second + MemOpb.Width;
     }
 
-    if (!TII->shouldClusterMemOps(MemOpa.BaseOps, MemOpb.BaseOps, ClusterLength,
-                                  CurrentClusterBytes))
+    if (!TII->shouldClusterMemOps(MemOpa.BaseOps, MemOpa.Offset,
+                                  MemOpa.OffsetIsScalable, MemOpb.BaseOps,
+                                  MemOpb.Offset, MemOpb.OffsetIsScalable,
+                                  ClusterLength, CurrentClusterBytes))
       continue;
 
     SUnit *SUa = MemOpa.SU;
@@ -1899,7 +1902,8 @@ void BaseMemOpClusterMutation::collectMemOpRecords(
     unsigned Width;
     if (TII->getMemOperandsWithOffsetWidth(MI, BaseOps, Offset,
                                            OffsetIsScalable, Width, TRI)) {
-      MemOpRecords.push_back(MemOpInfo(&SU, BaseOps, Offset, Width));
+      MemOpRecords.push_back(
+          MemOpInfo(&SU, BaseOps, Offset, OffsetIsScalable, Width));
 
       LLVM_DEBUG(dbgs() << "Num BaseOps: " << BaseOps.size() << ", Offset: "
                         << Offset << ", OffsetIsScalable: " << OffsetIsScalable
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
index e97f17e3f49c587..6b49e17528ada74 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -4230,8 +4230,9 @@ static bool shouldClusterFI(const MachineFrameInfo &MFI, int FI1,
 ///
 /// Only called for LdSt for which getMemOperandWithOffset returns true.
 bool AArch64InstrInfo::shouldClusterMemOps(
-    ArrayRef<const MachineOperand *> BaseOps1,
-    ArrayRef<const MachineOperand *> BaseOps2, unsigned ClusterSize,
+    ArrayRef<const MachineOperand *> BaseOps1, int64_t OpOffset1,
+    bool OffsetIsScalable1, ArrayRef<const MachineOperand *> BaseOps2,
+    int64_t OpOffset2, bool OffsetIsScalable2, unsigned ClusterSize,
     unsigned NumBytes) const {
   assert(BaseOps1.size() == 1 && BaseOps2.size() == 1);
   const MachineOperand &BaseOp1 = *BaseOps1.front();
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.h b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
index cc588cdad6b8e5a..65e5fb49536da24 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.h
@@ -179,7 +179,9 @@ class AArch64InstrInfo final : public AArch64GenInstrInfo {
                            int64_t &MinOffset, int64_t &MaxOffset);
 
   bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
+                           int64_t Offset1, bool OffsetIsScalable1,
                            ArrayRef<const MachineOperand *> BaseOps2,
+                           int64_t Offset2, bool OffsetIsScalable2,
                            unsigned ClusterSize,
                            unsigned NumBytes) const override;
 
diff --git a/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp b/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
index 50f8ad4433c6d5c..442ae4dd7b34fe1 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertHardClauses.cpp
@@ -232,7 +232,10 @@ class SIInsertHardClauses : public MachineFunctionPass {
               // scheduler it limits the size of the cluster to avoid increasing
               // register pressure too much, but this pass runs after register
               // allocation so there is no need for that kind of limit.
-              !SII->shouldClusterMemOps(CI.BaseOps, BaseOps, 2, 2)))) {
+              // We also lie about the Offset and OffsetIsScalable parameters,
+              // as they aren't used in the SIInstrInfo implementation.
+              !SII->shouldClusterMemOps(CI.BaseOps, 0, false, BaseOps, 0, false,
+                                        2, 2)))) {
           // Finish the current clause.
           Changed |= emitClause(CI, SII);
           CI = ClauseInfo();
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index b5b456d6912544f..0a06fa88b6b1025 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -541,7 +541,9 @@ static bool memOpsHaveSameBasePtr(const MachineInstr &MI1,
 }
 
 bool SIInstrInfo::shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
+                                      int64_t Offset1, bool OffsetIsScalable1,
                                       ArrayRef<const MachineOperand *> BaseOps2,
+                                      int64_t Offset2, bool OffsetIsScalable2,
                                       unsigned ClusterSize,
                                       unsigned NumBytes) const {
   // If the mem ops (to be clustered) do not have the same base ptr, then they
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index eba817756e9c58e..6da4f74dfe5f3ea 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -234,7 +234,9 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
       const TargetRegisterInfo *TRI) const final;
 
   bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
+                           int64_t Offset1, bool OffsetIsScalable1,
                            ArrayRef<const MachineOperand *> BaseOps2,
+                           int64_t Offset2, bool OffsetIsScalable2,
                            unsigned ClusterSize,
                            unsigned NumBytes) const override;
 
diff --git a/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp b/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
index 6784049348b1638..0de795d9d5e812e 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
+++ b/llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
@@ -2878,8 +2878,9 @@ static bool isClusterableLdStOpcPair(unsigned FirstOpc, unsigned SecondOpc,
 }
 
 bool PPCInstrInfo::shouldClusterMemOps(
-    ArrayRef<const MachineOperand *> BaseOps1,
-    ArrayRef<const MachineOperand *> BaseOps2, unsigned ClusterSize,
+    ArrayRef<const MachineOperand *> BaseOps1, int64_t OpOffset1,
+    bool OffsetIsScalable1, ArrayRef<const MachineOperand *> BaseOps2,
+    int64_t OpOffset2, bool OffsetIsScalable2, unsigned ClusterSize,
     unsigned NumBytes) const {
 
   assert(BaseOps1.size() == 1 && BaseOps2.size() == 1);
diff --git a/llvm/lib/Target/PowerPC/PPCInstrInfo.h b/llvm/lib/Target/PowerPC/PPCInstrInfo.h
index 31e9859a41739a1..0c9ad607418ecc7 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrInfo.h
+++ b/llvm/lib/Target/PowerPC/PPCInstrInfo.h
@@ -530,7 +530,9 @@ class PPCInstrInfo : public PPCGenInstrInfo {
   /// Returns true if the two given memory operations should be scheduled
   /// adjacent.
   bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
+                           int64_t Offset1, bool OffsetIsScalable1,
                            ArrayRef<const MachineOperand *> BaseOps2,
+                           int64_t Offset2, bool OffsetIsScalable2,
                            unsigned ClusterSize,
                            unsigned NumBytes) const override;
 
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
index 2918e5654db4f9f..cf399f37d255c29 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
@@ -2266,8 +2266,9 @@ static bool memOpsHaveSameBasePtr(const MachineInstr &MI1,
 }
 
 bool RISCVInstrInfo::shouldClusterMemOps(
-    ArrayRef<const MachineOperand *> BaseOps1,
-    ArrayRef<const MachineOperand *> BaseOps2, unsigned ClusterSize,
+    ArrayRef<const MachineOperand *> BaseOps1, int64_t Offset1,
+    bool OffsetIsScalable1, ArrayRef<const MachineOperand *> BaseOps2,
+    int64_t Offset2, bool OffsetIsScalable2, unsigned ClusterSize,
     unsigned NumBytes) const {
   // If the mem ops (to be clustered) do not have the same base ptr, then they
   // should not be clustered
@@ -2281,9 +2282,9 @@ bool RISCVInstrInfo::shouldClusterMemOps(
     return false;
   }
 
-  // TODO: Use a more carefully chosen heuristic, e.g. only cluster if offsets
-  // indicate they likely share a cache line.
-  return ClusterSize <= 4;
+  // A cache line is typically 64 bytes, so cluster if the memory ops are on
+  // the same or a neighbouring cache line.
+  return std::abs(Offset1 - Offset2) < 64;
 }
 
 // Set BaseReg (the base register operand), Offset (the byte offset being
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.h b/llvm/lib/Target/RISCV/RISCVInstrInfo.h
index 0954286a419bdd5..7e1d3f31180650d 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.h
@@ -158,7 +158,9 @@ class RISCVInstrInfo : public RISCVGenInstrInfo {
       const TargetRegisterInfo *TRI) const override;
 
   bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
+                           int64_t Offset1, bool OffsetIsScalable1,
                            ArrayRef<const MachineOperand *> BaseOps2,
+                           int64_t Offset2, bool OffsetIsScalable2,
                            unsigned ClusterSize,
                            unsigned NumBytes) const override;
 
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index b6c194f03f54209..0954fbb8314c6fa 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -95,11 +95,6 @@ static cl::opt<bool>
                         cl::desc("Enable Split RegisterAlloc for RVV"),
                         cl::init(false));
 
-static cl::opt<bool> EnableMISchedLoadClustering(
-    "riscv-misched-load-clustering", cl::Hidden,
-    cl::desc("Enable load clustering in the machine scheduler"),
-    cl::init(false));
-
 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeRISCVTarget() {
   RegisterTargetMachine<RISCVTargetMachine> X(getTheRISCV32Target());
   RegisterTargetMachine<RISCVTargetMachine> Y(getTheRISCV64Target());
@@ -350,15 +345,10 @@ class RISCVPassConfig : public TargetPassConfig {
   ScheduleDAGInstrs *
   createMachineScheduler(MachineSchedContext *C) const override {
     const RISCVSubtarget &ST = C->MF->getSubtarget<RISCVSubtarget>();
-    ScheduleDAGMILive *DAG = nullptr;
-    if (EnableMISchedLoadClustering) {
-      DAG = createGenericSchedLive(C);
-      DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
-    }
-    if (ST.hasMacroFusion()) {
-      DAG = DAG ? DAG : createGenericSchedLive(C);
+    ScheduleDAGMILive *DAG = createGenericSchedLive(C);
+    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
+    if (ST.hasMacroFusion())
       DAG->addMutation(createRISCVMacroFusionDAGMutation());
-    }
     return DAG;
   }
 
diff --git a/llvm/test/CodeGen/RISCV/add-before-shl.ll b/llvm/test/CodeGen/RISCV/add-before-shl.ll
index 274f1cef49aa955..3695a8a7f60862f 100644
--- a/llvm/test/CodeGen/RISCV/add-before-shl.ll
+++ b/llvm/test/CodeGen/RISCV/add-before-shl.ll
@@ -200,25 +200,25 @@ define i128 @add_wide_operand(i128 %a) nounwind {
 ;
 ; RV32C-LABEL: add_wide_operand:
 ; RV32C:       # %bb.0:
-; RV32C-NEXT:    lw a6, 4(a1)
+; RV32C-NEXT:    lw a6, 8(a1)
 ; RV32C-NEXT:    c.lw a3, 12(a1)
-; RV32C-NEXT:    c.lw a4, 0(a1)
-; RV32C-NEXT:    c.lw a1, 8(a1)
+; RV32C-NEXT:    c.lw a2, 4(a1)
+; RV32C-NEXT:    c.lw a1, 0(a1)
 ; RV32C-NEXT:    c.lui a5, 16
 ; RV32C-NEXT:    c.add a3, a5
 ; RV32C-NEXT:    c.slli a3, 3
-; RV32C-NEXT:    srli a5, a1, 29
-; RV32C-NEXT:    c.or a3, a5
-; RV32C-NEXT:    srli a5, a4, 29
-; RV32C-NEXT:    slli a2, a6, 3
-; RV32C-NEXT:    c.or a2, a5
 ; RV32C-NEXT:    srli a5, a6, 29
+; RV32C-NEXT:    c.or a3, a5
+; RV32C-NEXT:    srli a5, a1, 29
+; RV32C-NEXT:    slli a4, a2, 3
+; RV32C-NEXT:    c.or a4, a5
+; RV32C-NEXT:    c.srli a2, 29
+; RV32C-NEXT:    c.slli a6, 3
+; RV32C-NEXT:    or a2, a6, a2
 ; RV32C-NEXT:    c.slli a1, 3
-; RV32C-NEXT:    c.or a1, a5
-; RV32C-NEXT:    c.slli a4, 3
-; RV32C-NEXT:    c.sw a4, 0(a0)
-; RV32C-NEXT:    c.sw a1, 8(a0)
-; RV32C-NEXT:    c.sw a2, 4(a0)
+; RV32C-NEXT:    c.sw a1, 0(a0)
+; RV32C-NEXT:    c.sw a2, 8(a0)
+; RV32C-NEXT:    c.sw a4, 4(a0)
 ; RV32C-NEXT:    c.sw a3, 12(a0)
 ; RV32C-NEXT:    c.jr ra
 ;
diff --git a/llvm/test/CodeGen/RISCV/callee-saved-fpr32s.ll b/llvm/test/CodeGen/RISCV/callee-saved-fpr32s.ll
index 7111316931f19b7..7aeaaab68a208cb 100644
--- a/llvm/test/CodeGen/RISCV/callee-saved-fpr32s.ll
+++ b/llvm/test/CodeGen/RISCV/callee-saved-fpr32s.ll
@@ -25,69 +25,69 @@ define void @callee() nounwind {
 ; ILP32:       # %bb.0:
 ; ILP32-NEXT:    lui a0, %hi(var)
 ; ILP32-NEXT:    flw fa5, %lo(var)(a0)
-; ILP32-NEXT:    flw fa4, %lo(var+4)(a0)
-; ILP32-NEXT:    flw fa3, %lo(var+8)(a0)
-; ILP32-NEXT:    flw fa2, %lo(var+12)(a0)
 ; ILP32-NEXT:    addi a1, a0, %lo(var)
-; ILP32-NEXT:    flw fa1, 16(a1)
-; ILP32-NEXT:    flw fa0, 20(a1)
-; ILP32-NEXT:    flw ft0, 24(a1)
-; ILP32-NEXT:    flw ft1, 28(a1)
-; ILP32-NEXT:    flw ft2, 32(a1)
-; ILP32-NEXT:    flw ft3, 36(a1)
-; ILP32-NEXT:    flw ft4, 40(a1)
-; ILP32-NEXT:    flw ft5, 44(a1)
-; ILP32-NEXT:    flw ft6, 48(a1)
-; ILP32-NEXT:    flw ft7, 52(a1)
-; ILP32-NEXT:    flw fa6, 56(a1)
-; ILP32-NEXT:    flw fa7, 60(a1)
-; ILP32-NEXT:    flw ft8, 64(a1)
-; ILP32-NEXT:    flw ft9, 68(a1)
-; ILP32-NEXT:    flw ft10, 72(a1)
-; ILP32-NEXT:    flw ft11, 76(a1)
-; ILP32-NEXT:    flw fs0, 80(a1)
-; ILP32-NEXT:    flw fs1, 84(a1)
-; ILP32-NEXT:    flw fs2, 88(a1)
-; ILP32-NEXT:    flw fs3, 92(a1)
-; ILP32-NEXT:    flw fs4, 96(a1)
-; ILP32-NEXT:    flw fs5, 100(a1)
-; ILP32-NEXT:    flw fs6, 104(a1)
-; ILP32-NEXT:    flw fs7, 108(a1)
+; ILP32-NEXT:    flw fa4, 16(a1)
+; ILP32-NEXT:    flw fa3, 20(a1)
+; ILP32-NEXT:    flw fa2, 24(a1)
+; ILP32-NEXT:    flw fa1, 28(a1)
+; ILP32-NEXT:    flw fa0, 32(a1)
+; ILP32-NEXT:    flw ft0, 36(a1)
+; ILP32-NEXT:    flw ft1, 40(a1)
+; ILP32-NEXT:    flw ft2, 44(a1)
+; ILP32-NEXT:    flw ft3, 48(a1)
+; ILP32-NEXT:    flw ft4, 52(a1)
+; ILP32-NEXT:    flw ft5, 56(a1)
+; ILP32-NEXT:    flw ft6, 60(a1)
+; ILP32-NEXT:    flw ft7, 64(a1)
+; ILP32-NEXT:    flw fa6, 68(a1)
+; ILP32-NEXT:    flw fa7, 72(a1)
+; ILP32-NEXT:    flw ft8, 76(a1)
+; ILP32-NEXT:    flw ft9, 80(a1)
+; ILP32-NEXT:    flw ft10, 84(a1)
+; ILP32-NEXT:    flw ft11, 88(a1)
+; ILP32-NEXT:    flw fs0, 92(a1)
+; ILP32-NEXT:    flw fs1, 96(a1)
+; ILP32-NEXT:    flw fs2, 100(a1)
+; ILP32-NEXT:    flw fs3, 104(a1)
+; ILP32-NEXT:    flw fs4, 108(a1)
+; ILP32-NEXT:    flw fs5, 112(a1)
+; ILP32-NEXT:    flw fs6, 116(a1)
+; ILP32-NEXT:    flw fs7, 120(a1)
 ; ILP32-NEXT:    flw fs8, 124(a1)
-; ILP32-NEXT:    flw fs9, 120(a1)
-; ILP32-NEXT:    flw fs10, 116(a1)
-; ILP32-NEXT:    flw fs11, 112(a1)
+; ILP32-NEXT:    flw fs9, %lo(var+4)(a0)
+; ILP32-NEXT:    flw fs10, %lo(var+8)(a0)
+; ILP32-NEXT:    flw fs11, %lo(var+12)(a0)
 ; ILP32-NEXT:    fsw fs8, 124(a1)
-; ILP32-NEXT:    fsw fs9, 120(a1)
-; ILP32-NEXT:    fsw fs10, 116(a1)
-; ILP32-NEXT:    fsw fs11, 112(a1)
-; ILP32-NEXT:    fsw fs7, 108(a1)
-; ILP32-NEXT:    fsw fs6, 104(a1)
-; ILP32-NEXT:    fsw fs5, 100(a1)
-; ILP32-NEXT:    fsw fs4, 96(a1)
-; ILP32-NEXT:    fsw fs3, 92(a1)
-; ILP32-NEXT:    fsw fs2, 88(a1)
-; ILP32-NEXT:    fsw fs1, 84(a1)
-; ILP32-NEXT:    fsw fs0, 80(a1)
-; ILP32-NEXT:    fsw ft11, 76(a1)
-; ILP32-NEXT:    fsw ft10, 72(a1)
-; ILP32-NEXT:    fsw ft9, 68(a1)
-; ILP32-NEXT:    fsw ft8, 64(a1)
-; ILP32-NEXT:    fsw fa7, 60(a1)
-; ILP32-NEXT:    fsw fa6, 56(a1)
-; ILP32-NEXT:    fsw ft7, 52(a1)
-; ILP32-NEXT:    fsw ft6, 48(a1)
-; ILP32-NEXT:    fsw ft5, 44(a1)
-; ILP32-NEXT:    fsw ft4, 40(a1)
-; ILP32-NEXT:    fsw ft3, 36(a1)
-; ILP32-NEXT:    fsw ft2, 32(a1)
-; ILP32-NEXT:    fsw ft1, 28(a1)
-; ILP32-NEXT:    fsw ft0, 24(a1)
-; ILP32-NEXT:    fsw fa0, 20(a1)
-; ILP32-NEXT:    fsw fa1, 16(a1)
-; ILP32-NEXT:    fsw fa2, %lo(var+12)(a0)
-; ILP32-NEXT:    fsw fa3, %lo(var+8)(a0)
-; ILP32-NEXT:    fsw fa4, %lo(var+4)(a0)
+; ILP32-NEXT:    fsw fs7, 120(a1)
+; ILP32-NEXT:    fsw fs6, 116(a1)
+; ILP32-NEXT:    fsw fs5, 112(a1)
+; ILP32-NEXT:    fsw fs4, 108(a1)
+; ILP32-NEXT:    fsw fs3, 104(a1)
+; ILP32-NEXT:    fsw fs2, 100(a1)
+; ILP32-NEXT:    fsw fs1, 96(a1)
+; ILP32-NEXT:    fsw fs0, 92(a1)
+; ILP32-NEXT:    fsw ft11, 88(a1)
+; ILP32-NEXT:    fsw ft10, 84(a1)
+; ILP32-NEXT:    fsw ft9, 80(a1)
+; ILP32-NEXT:    fsw ft8, 76(a1)
+; ILP32-NEXT:    fsw fa7, 72(a1)
+; ILP32-NEXT:    fsw fa6, 68(a1)
+; ILP32-NEXT:    fsw ft7, 64(a1)
+; ILP32-NEXT:    fsw ft6, 60(a1)
+; ILP32-NEXT:    fsw ft5, 56(a1)
+; ILP32-NEXT:    fsw ft4, 52(a1)
+; ILP32-NEXT:    fsw ft3, 48(a1)
+; ILP32-NEXT:    fsw ft2, 44(a1)
+; ILP32-NEXT:    fsw ft1, 40(a1)
+; ILP32-NEXT:    fsw ft0, 36(a1)
+; ILP32-NEXT:    fsw fa0, 32(a1)
+; ILP32-NEXT:    fsw fa1, 28(a1)
+; ILP32-NEXT:    fsw fa2, 24(a1)
+; ILP32-NEXT:    fsw fa3, 20(a1)
+; ILP32-NEXT:    fsw fa4, 16(a1)
+; ILP32-NEXT:    fsw fs11, %lo(var+12)(a0)
+; ILP32-NEXT:    fsw fs10, %lo(var+8)(a0)
+; ILP32-NEXT:    fsw fs9, %lo(var+4)(a0)
 ; ILP32-NEXT:    fsw fa5, %lo(var)(a0)
 ; ILP32-NEXT:    ret
 ;
@@ -95,69 +95,69 @@ define void @callee() nounwind {
 ; LP64:       # %bb.0:
 ; LP64-NEXT:    lui a0, %hi(var)
 ; LP64-NEXT:    flw fa5, %lo(var)(a0)
-; LP64-NEXT:    flw fa4, %lo(var+4)(a0)
-; LP64-NEXT:    flw fa3, %lo(var+8)(a0)
-; LP64-NEXT:    flw fa2, %lo(var+12)(a0)
 ; LP64-NEXT:    addi a1, a0, %lo(var)
-; LP64-NEXT:    flw fa1, 16(a1)
-; LP64-NEXT:    flw fa0, 20(a1)
-; LP64-NEXT:    flw ft0, 24(a1)
-; LP64-NEXT:    flw ft1, 28(a1)
-; LP64-NEXT:    flw ft2, 32(a1)
-; LP64-NEXT:    flw ft3, 36(a1)
-; LP64-NEXT:    flw ft4, 40(a1)
-; LP64-NEXT:    flw ft5, 44(a1)
-; LP64-NEXT:    flw ft6, 48(a1)
-; LP64-NEXT:    flw ft7, 52(a1)
-; LP64-NEXT:    flw fa6, 56(a1)
-; LP64-NEXT:    flw fa7, 60(a1)
-; LP64-NEXT:    flw ft8, 64(a1)
-; LP64-NEXT:    flw ft9, 68(a1)
-; LP64-NEXT:    flw ft10, 72(a1)
-; LP64-NEXT:    flw ft11, 76(a1)
-; LP64-NEXT:    flw fs0, 80(a1)
-; LP64-NEXT:    flw fs1, 84(a1)
-; LP64-NEXT:    flw fs2, 88(a1)
-; LP64-NEXT:    flw fs3, 92(a1)
-; LP64-NEXT:    flw fs4, 96(a1)
-; LP64-NEXT:    flw fs5, 100(a1)
-; LP64-NEXT:  ...
[truncated]

preames (Collaborator) left a comment

Looking at the test diffs, two observations:

  1. This appears to be increasing register pressure in some cases. I can't find any case where it causes a spill it didn't before, but we do have cases where more registers are live at the same time.
  2. Clustering doesn't appear to be sorting by offset, and instead appears to be a somewhat arbitrary reordering of already adjacent loads. (When they happen to already be adjacent.)

Does this work if we try doing this post RA? That would cut down on the potential register pressure concern. To be clear, this is just a question, I could easily be convinced that doing this pre-RA is the right answer.

Your heuristic looks reasonable to me. I don't have data one way or another to indicate a preference for it, or anything else.

// TODO: Use a more carefully chosen heuristic, e.g. only cluster if offsets
// indicate they likely share a cache line.
return ClusterSize <= 4;
// A cache line is typically 64 bytes, so cluster if the memory ops are on
wangpc-pp (Contributor) commented:

We can get cache line size via TargetTransformInfo::getCacheLineSize or MCSubtargetInfo::getCacheLineSize (for the default implementation, TargetTransformInfo::getCacheLineSize just calls MCSubtargetInfo::getCacheLineSize). If the return value is not 0, we can use it as a more exact value.
For RISCV, the cache line size info can be specified when defining the processor.
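
For example (a sketch of the lookup only; `STI` stands for whatever subtarget reference the hook has access to):

// STI: the subtarget for the current function (name assumed here).
// getCacheLineSize() returns 0 when the processor definition doesn't set a
// value, so a 64-byte fallback is still needed.
unsigned CacheLineSize = STI.getCacheLineSize();
if (CacheLineSize == 0)
  CacheLineSize = 64;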

asb added a commit to asb/llvm-project that referenced this pull request Dec 5, 2023
…/ getMemOperandWithOffsetWidth

I noted AArch64 happily accepts a FrameIndex operand as well as a
register. This doesn't cause any changes outside of my C++ unit test for
the current state of in-tree, but this will cause additional test
changes if llvm#73789 is rebased on top of it.

Note that the returned Offset doesn't seem at all as meaningful if you
have a FrameIndex base, though the approach taken here follows AArch64
(see D54847). This change won't harm the approach taken in
shouldClusterMemOps because memOpsHaveSameBasePtr will only return true
if the FrameIndex operand is the same for both operations.
asb added a commit that referenced this pull request Dec 5, 2023
…/ getMemOperandWithOffsetWidth (#73802)

I noted AArch64 happily accepts a FrameIndex operand as well as a
register. This doesn't cause any changes outside of my C++ unit test for
the current state of in-tree, but this will cause additional test
changes if #73789 is rebased on top of it.

Note that the returned Offset doesn't seem at all as meaningful if you
have a FrameIndex base, though the approach taken here follows AArch64
(see D54847). This change won't harm the approach taken in
shouldClusterMemOps because memOpsHaveSameBasePtr will only return true
if the FrameIndex operand is the same for both operations.

github-actions bot commented Dec 6, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

asb (Contributor, Author) commented Dec 6, 2023

> Looking at the test diffs, two observations:
>
> 1. This appears to be increasing register pressure in some cases. I can't find any case where it causes a spill it didn't before, but we do have cases where more registers are live at the same time.
>
> 2. Clustering doesn't appear to be sorting by offset, and instead appears to be a somewhat arbitrary reordering of already adjacent loads. (When they happen to already be adjacent.)
>
> Does this work if we try doing this post RA? That would cut down on the potential register pressure concern. To be clear, this is just a question, I could easily be convinced that doing this pre-RA is the right answer.
>
> Your heuristic looks reasonable to me. I don't have data one way or another to indicate a preference for it, or anything else.

Digging into the ordering behaviour, it is indeed a bit confusing. BaseMemOpClusterMutation::apply collects the MemOpInfo records (calling getMemOperandsWithOffsetWidth for this), then they're put into groups (commonly one big group, unless over a certain size threshold), and then the array of MemOpInfo is sorted into ascending order based on base+offset. clusterNeighboringMemOps then performs clustering using that order, except there's logic (added in https://reviews.llvm.org/D72706) that avoids reordering, with the argument that reordering "leads to test case churn" and "goes against the machine scheduler's 'do no harm' philosophy". I'm not sure I follow how the reordering could be harmful, and it does seem unfortunate: the sdag scheduler does do this reordering, so there's even more churn than you'd expect when trying to check the incremental benefit of adding sdag load clustering on top.

I've added a new parameter to the pass that controls this behaviour, and enabled the reordering for RISC-V. If people are happy with this direction, I'd intend to split this out as a patch that this one depends on.
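
For concreteness, the RISC-V scheduler construction would then opt in when adding the mutation, roughly like this (a sketch; the exact parameter name in the split-out patch may differ):

// Sketch of RISCVPassConfig::createMachineScheduler with the proposed
// opt-in to reordering by offset while clustering (flag name illustrative).
ScheduleDAGMILive *DAG = createGenericSchedLive(C);
DAG->addMutation(createLoadClusterDAGMutation(
    DAG->TII, DAG->TRI, /*ReorderWhileClustering=*/true));
if (ST.hasMacroFusion())
  DAG->addMutation(createRISCVMacroFusionDAGMutation());
return DAG;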

Trying to enable this for the post-RA scheduler, it seems to fail to make any changes (i.e. it doesn't work), though it's possible I made an error.

Thanks @wangpc-pp for the suggestion re querying cache line size. I've not added that in this iteration, but will do so in the next rev.

asb force-pushed the 2023q4-riscv-cluster-memops-2 branch 2 times, most recently from 06c5c58 to c9ca21b on December 6, 2023 14:28
asb (Contributor, Author) commented Dec 6, 2023

For reference, #74600 shows the additional diff if you then enable the selectiondag scheduler. Lots of cases are just noise (e.g. a different register choice), but there are a few I want to look at more closely, as ideally I think we'd get by with only enabling load clustering in the MachineScheduler.

/// \p ClusterSize is the number of operations in the resulting load/store
/// cluster if this hook returns true.
/// \p NumBytes is the number of bytes that will be loaded from all the
/// clustered loads if this hook returns true.
virtual bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
int64_t Offset1, bool OffsetIsScalable1,
A reviewer (Contributor) commented:

Do we need a width operand too?

asb (Contributor, Author) replied:

You could imagine trying to use that, but our current heuristic doesn't make use of that information so I don't think it's necessary to expand the parameter list further.

…ustering

Reordering based on the sort order of the MemOpInfo array was disabled
in <https://reviews.llvm.org/D72706>. However, it's not clear this is
desirable for all targets. It also makes it more difficult to compare the
incremental benefit of enabling load clustering in the selectiondag
scheduler as well as the machinescheduler, as the sdag scheduler does
seem to allow this reordering.

This patch adds a parameter that can control the behaviour on a
per-target basis.
Split out from llvm#73789. Clusters if the operations are within a cache
line of each other (as AMDGPU does in shouldScheduleLoadsNear). X86 does
something similar, but does `((Offset2 - Offset1) / 8 > 64)`. I'm not
sure if that's intentionally set to 512 bytes or if the division is in
error.

Adopts the suggestion from @wangpc-pp to query the cache line size and
use it if available.
asb force-pushed the 2023q4-riscv-cluster-memops-2 branch from a387218 to 1a46b84 on December 13, 2023 14:13
asb (Contributor, Author) commented Dec 13, 2023

I've rebased and split out two patches for separate review (#75338 and #75341) in order to hopefully inch this forward. The rebase will also pick up some new test changes due to #73802 landing.

asb added a commit to asb/llvm-project that referenced this pull request Jan 2, 2024
Split out from llvm#73789. Clusters if the operations are within a cache
line of each other (as AMDGPU does in shouldScheduleLoadsNear). X86 does
something similar, but does `((Offset2 - Offset1) / 8 > 64)`. I'm not
sure if that's intentionally set to 512 bytes or if the division is in
error.

Adopts the suggestion from @wangpc-pp to query the cache line size and
use it if available.
asb added a commit that referenced this pull request Jan 2, 2024
Split out from #73789, so as to leave that PR just for flipping load
clustering to on by default. Clusters if the operations are within a
cache line of each other (as AMDGPU does in shouldScheduleLoadsNear).
X86 does something similar, but does `((Offset2 - Offset1) / 8 > 64)`.
I'm not sure if that's intentionally set to 512 bytes or if the division
is in error.

Adopts the suggestion from @wangpc-pp to query the cache line size and
use it if available.

We also cap the maximum cluster size to limit the potential register
pressure impact (which may otherwise lead to additional spills).
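
Roughly, the combined check looks like this (a sketch, not the verbatim landed code; the cap of 4 matches the previous ClusterSize heuristic shown in the diff above):

// Sketch of the combined heuristic in RISCVInstrInfo::shouldClusterMemOps:
// cap the cluster size to limit register pressure, and require the two ops
// to be within a cache line of each other.
unsigned CacheLineSize =
    BaseOps1.front()->getParent()->getMF()->getSubtarget().getCacheLineSize();
// Assume 64-byte cache lines when the subtarget doesn't specify a size.
CacheLineSize = CacheLineSize ? CacheLineSize : 64;
return ClusterSize <= 4 &&
       std::abs(Offset1 - Offset2) < (int64_t)CacheLineSize;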
preames (Collaborator) left a comment

LGTM w/comment addressed. The comment is basically a rework of the patch, but the resulting patch is near trivial so it felt reasonable to do a conditional LGTM here.

static cl::opt<bool> EnableMISchedLoadClustering(
"riscv-misched-load-clustering", cl::Hidden,
cl::desc("Enable load clustering in the machine scheduler"),
cl::init(false));
preames (Collaborator) commented:

In this patch, please just flip the bool default here. If you want to remove the option entirely, do that in a separate change in a couple weeks.
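
I.e. keep the option quoted above and just change its default, something like:

// Same option as above, with only the default flipped.
static cl::opt<bool> EnableMISchedLoadClustering(
    "riscv-misched-load-clustering", cl::Hidden,
    cl::desc("Enable load clustering in the machine scheduler"),
    cl::init(true));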

asb added a commit that referenced this pull request Jan 16, 2024
…ustering (#75338)

Reordering based on the sort order of the MemOpInfo array was disabled
in <https://reviews.llvm.org/D72706>. However, it's not clear this is
desirable for all targets. It also makes it more difficult to compare the
incremental benefit of enabling load clustering in the selectiondag
scheduler as well as the machinescheduler, as the sdag scheduler does
seem to allow this reordering.

This patch adds a parameter that can control the behaviour on a
per-target basis.

Split out from #73789.
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
…ustering (llvm#75338)

Reordering based on the sort order of the MemOpInfo array was disabled
in <https://reviews.llvm.org/D72706>. However, it's not clear this is
desirable for all targets. It also makes it more difficult to compare the
incremental benefit of enabling load clustering in the selectiondag
scheduler as well as the machinescheduler, as the sdag scheduler does
seem to allow this reordering.

This patch adds a parameter that can control the behaviour on a
per-target basis.

Split out from llvm#73789.