
Conversation

chinmaydd
Contributor

@chinmaydd chinmaydd commented Jun 6, 2024

Latency for load operations could be modeled better by using the LoadLatency field in the provided scheduling model for the target architecture. This PR adds a new command line flag use-load-latency that enables this behavior. The flag is off by default.

This attempts to address the comment at LSUnit.h#403. I would be happy to know if there is a better way to approach this.

@llvmbot
Member

llvmbot commented Jun 6, 2024

@llvm/pr-subscribers-tools-llvm-mca

@llvm/pr-subscribers-backend-x86

Author: Chinmay Deshpande (chinmaydd)


Full diff: https://github.com/llvm/llvm-project/pull/94566.diff

6 Files Affected:

  • (modified) llvm/include/llvm/MCA/InstrBuilder.h (+3-1)
  • (modified) llvm/lib/MCA/InstrBuilder.cpp (+13-5)
  • (added) llvm/test/tools/llvm-mca/X86/use-load-latency.s (+58)
  • (modified) llvm/tools/llvm-mca/llvm-mca.cpp (+7-1)
  • (modified) llvm/unittests/tools/llvm-mca/MCATestBase.cpp (+2-1)
  • (modified) llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp (+4-2)
diff --git a/llvm/include/llvm/MCA/InstrBuilder.h b/llvm/include/llvm/MCA/InstrBuilder.h
index 00c7942e4fa16..6880a213d48b5 100644
--- a/llvm/include/llvm/MCA/InstrBuilder.h
+++ b/llvm/include/llvm/MCA/InstrBuilder.h
@@ -79,6 +79,7 @@ class InstrBuilder {
   bool FirstCallInst;
   bool FirstReturnInst;
   unsigned CallLatency;
+  bool UseLoadLatency;
 
   using InstRecycleCallback = std::function<Instruction *(const InstrDesc &)>;
   InstRecycleCallback InstRecycleCB;
@@ -99,7 +100,8 @@ class InstrBuilder {
 public:
   InstrBuilder(const MCSubtargetInfo &STI, const MCInstrInfo &MCII,
                const MCRegisterInfo &RI, const MCInstrAnalysis *IA,
-               const InstrumentManager &IM, unsigned CallLatency);
+               const InstrumentManager &IM, unsigned CallLatency,
+               bool UseLoadLatency);
 
   void clear() {
     Descriptors.clear();
diff --git a/llvm/lib/MCA/InstrBuilder.cpp b/llvm/lib/MCA/InstrBuilder.cpp
index d5cbdc5de0b84..db4eda40de9d0 100644
--- a/llvm/lib/MCA/InstrBuilder.cpp
+++ b/llvm/lib/MCA/InstrBuilder.cpp
@@ -31,9 +31,10 @@ InstrBuilder::InstrBuilder(const llvm::MCSubtargetInfo &sti,
                            const llvm::MCInstrInfo &mcii,
                            const llvm::MCRegisterInfo &mri,
                            const llvm::MCInstrAnalysis *mcia,
-                           const mca::InstrumentManager &im, unsigned cl)
+                           const mca::InstrumentManager &im, unsigned cl,
+                           bool ull)
     : STI(sti), MCII(mcii), MRI(mri), MCIA(mcia), IM(im), FirstCallInst(true),
-      FirstReturnInst(true), CallLatency(cl) {
+      FirstReturnInst(true), CallLatency(cl), UseLoadLatency(ull) {
   const MCSchedModel &SM = STI.getSchedModel();
   ProcResourceMasks.resize(SM.getNumProcResourceKinds());
   computeProcResourceMasks(STI.getSchedModel(), ProcResourceMasks);
@@ -220,8 +221,8 @@ static void initializeUsedResources(InstrDesc &ID,
 
 static void computeMaxLatency(InstrDesc &ID, const MCInstrDesc &MCDesc,
                               const MCSchedClassDesc &SCDesc,
-                              const MCSubtargetInfo &STI,
-                              unsigned CallLatency) {
+                              const MCSubtargetInfo &STI, unsigned CallLatency,
+                              bool UseLoadLatency) {
   if (MCDesc.isCall()) {
     // We cannot estimate how long this call will take.
     // Artificially set an arbitrarily high latency.
@@ -230,6 +231,13 @@ static void computeMaxLatency(InstrDesc &ID, const MCInstrDesc &MCDesc,
   }
 
   int Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);
+
+  // If `UseLoadLatency` is set, we use the value in `MCSchedModel::LoadLatency`
+  // for load instructions.
+  if (MCDesc.mayLoad() && UseLoadLatency) {
+    const auto &SM = STI.getSchedModel();
+    Latency = std::max(int(SM.LoadLatency), Latency);
+  }
   // If latency is unknown, then conservatively assume the MaxLatency set for
   // calls.
   ID.MaxLatency = Latency < 0 ? CallLatency : static_cast<unsigned>(Latency);
@@ -582,7 +590,7 @@ InstrBuilder::createInstrDescImpl(const MCInst &MCI,
   }
 
   initializeUsedResources(*ID, SCDesc, STI, ProcResourceMasks);
-  computeMaxLatency(*ID, MCDesc, SCDesc, STI, CallLatency);
+  computeMaxLatency(*ID, MCDesc, SCDesc, STI, CallLatency, UseLoadLatency);
 
   if (Error Err = verifyOperands(MCDesc, MCI))
     return std::move(Err);
diff --git a/llvm/test/tools/llvm-mca/X86/use-load-latency.s b/llvm/test/tools/llvm-mca/X86/use-load-latency.s
new file mode 100644
index 0000000000000..54c37c5c8699f
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/X86/use-load-latency.s
@@ -0,0 +1,58 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2                   -iterations=1 %s | FileCheck --check-prefixes=ALL,DEFAULT %s
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -use-load-latency -iterations=1 %s | FileCheck --check-prefixes=ALL,CUSTOM %s
+
+movq      (%rsp), %rdi
+
+# ALL:          Iterations:        1
+# ALL-NEXT:     Instructions:      1
+
+# CUSTOM-NEXT:  Total Cycles:      8
+# DEFAULT-NEXT: Total Cycles:      6
+
+# ALL-NEXT:     Total uOps:        1
+
+# ALL:          Dispatch Width:    2
+
+# CUSTOM-NEXT:  uOps Per Cycle:    0.13
+# CUSTOM-NEXT:  IPC:               0.13
+
+# DEFAULT-NEXT: uOps Per Cycle:    0.17
+# DEFAULT-NEXT: IPC:               0.17
+
+# ALL-NEXT:     Block RThroughput: 1.0
+
+# ALL:          Instruction Info:
+# ALL-NEXT:     [1]: #uOps
+# ALL-NEXT:     [2]: Latency
+# ALL-NEXT:     [3]: RThroughput
+# ALL-NEXT:     [4]: MayLoad
+# ALL-NEXT:     [5]: MayStore
+# ALL-NEXT:     [6]: HasSideEffects (U)
+
+# ALL:          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
+# ALL-NEXT:      1      3     1.00    *                   movq	(%rsp), %rdi
+
+# ALL:          Resources:
+# ALL-NEXT:     [0]   - JALU0
+# ALL-NEXT:     [1]   - JALU1
+# ALL-NEXT:     [2]   - JDiv
+# ALL-NEXT:     [3]   - JFPA
+# ALL-NEXT:     [4]   - JFPM
+# ALL-NEXT:     [5]   - JFPU0
+# ALL-NEXT:     [6]   - JFPU1
+# ALL-NEXT:     [7]   - JLAGU
+# ALL-NEXT:     [8]   - JMul
+# ALL-NEXT:     [9]   - JSAGU
+# ALL-NEXT:     [10]  - JSTC
+# ALL-NEXT:     [11]  - JVALU0
+# ALL-NEXT:     [12]  - JVALU1
+# ALL-NEXT:     [13]  - JVIMUL
+
+# ALL:          Resource pressure per iteration:
+# ALL-NEXT:     [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
+# ALL-NEXT:      -      -      -      -      -      -      -     1.00    -      -      -      -      -      -
+
+# ALL:          Resource pressure by instruction:
+# ALL-NEXT:     [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
+# ALL-NEXT:      -      -      -      -      -      -      -     1.00    -      -      -      -      -      -     movq	(%rsp), %rdi
diff --git a/llvm/tools/llvm-mca/llvm-mca.cpp b/llvm/tools/llvm-mca/llvm-mca.cpp
index cc5d4f5fa05de..927b2b97c4d68 100644
--- a/llvm/tools/llvm-mca/llvm-mca.cpp
+++ b/llvm/tools/llvm-mca/llvm-mca.cpp
@@ -140,6 +140,11 @@ static cl::opt<unsigned>
                 cl::desc("Number of cycles to assume for a call instruction"),
                 cl::cat(ToolOptions), cl::init(100U));
 
+static cl::opt<bool> UseLoadLatency(
+    "use-load-latency", cl::Hidden,
+    cl::desc("Use target specific latency for load instructions"),
+    cl::cat(ToolOptions), cl::init(false));
+
 enum class SkipType { NONE, LACK_SCHED, PARSE_FAILURE, ANY_FAILURE };
 
 static cl::opt<enum SkipType> SkipUnsupportedInstructions(
@@ -573,7 +578,8 @@ int main(int argc, char **argv) {
   }
 
   // Create an instruction builder.
-  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, CallLatency);
+  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, CallLatency,
+                       UseLoadLatency);
 
   // Create a context to control ownership of the pipeline hardware.
   mca::Context MCA(*MRI, *STI);
diff --git a/llvm/unittests/tools/llvm-mca/MCATestBase.cpp b/llvm/unittests/tools/llvm-mca/MCATestBase.cpp
index 4a39f5e663f23..515fbe5bd6b69 100644
--- a/llvm/unittests/tools/llvm-mca/MCATestBase.cpp
+++ b/llvm/unittests/tools/llvm-mca/MCATestBase.cpp
@@ -66,7 +66,8 @@ Error MCATestBase::runBaselineMCA(json::Object &Result, ArrayRef<MCInst> Insts,
 
   // Default InstrumentManager
   auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
-  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
+  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
+                       /*UseLoadLatency*/ false);
 
   const SmallVector<mca::Instrument *> Instruments;
   SmallVector<std::unique_ptr<mca::Instruction>> LoweredInsts;
diff --git a/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp b/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp
index ac35dce522ae1..80a2f49ca4aa4 100644
--- a/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp
+++ b/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp
@@ -33,7 +33,8 @@ TEST_F(X86TestBase, TestResumablePipeline) {
   P->addEventListener(SV.get());
 
   auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
-  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
+  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
+                       /*UseLoadLatency*/ false);
 
   const SmallVector<mca::Instrument *> Instruments;
   // Tile size = 7
@@ -124,7 +125,8 @@ TEST_F(X86TestBase, TestInstructionRecycling) {
   // Default InstrumentManager
   auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
 
-  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
+  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
+                       /*UseLoadLatency*/ false);
   IB.setInstRecycleCallback(GetRecycledInst);
 
   const SmallVector<mca::Instrument *> Instruments;

@chinmaydd
Contributor Author

@mshockwave @michaelmaitland - tagging both of you since you helped review my last PR.

// If `UseLoadLatency` is set, we use the value in `MCSchedModel::LoadLatency`
// for load instructions.
if (MCDesc.mayLoad() && UseLoadLatency) {
const auto &SM = STI.getSchedModel();
Member


Please spell out the type (rather than using auto) here.

auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
/*UseLoadLatency*/ false);
Member


Per https://www.llvm.org/docs/CodingStandards.html#comment-formatting :

Suggested change
/*UseLoadLatency*/ false);
/*UseLoadLatency=*/false);

auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
/*UseLoadLatency*/ false);
Member


ditto


mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
/*UseLoadLatency*/ false);
Member


ditto

chinmaydd added 2 commits June 7, 2024 15:26
Latency for load operations could be modeled better by
using the `LoadLatency` field in the provided scheduling
model for the target architecture.
Comment on lines +237 to +240
if (MCDesc.mayLoad() && UseLoadLatency) {
const MCSchedModel &SM = STI.getSchedModel();
Latency = std::max(int(SM.LoadLatency), Latency);
}
Collaborator


Sorry but this doesn't work.

You cannot simply override Latency that way. The latency of instructions with folded memory operands isn't just equivalent to the load latency.

To further complicate matters, it is wrong to assume a single load latency value for all memory operations.
For example, on AMD processors, the load-to-use latency changes depending on whether the opcode is INT or FPU.
That is because there is an extra delay in the data path when loaded values are forwarded to the floating point unit (often implemented as a coprocessor).


// If `UseLoadLatency` is set, we use the value in `MCSchedModel::LoadLatency`
// for load instructions.
if (MCDesc.mayLoad() && UseLoadLatency) {
Contributor


What if LoadLatency is not set (i.e. it is -1) but -use-load-latency is passed? Should a warning be printed if -use-load-latency is set?

return;
}

int Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);
Contributor


IIUC, there are two ways we use LoadLatency in LLVM today:

  1. In TargetSchedModel::computeInstrLatency under the condition !hasInstrSchedModel && !SCDesc.isValid && MI.mayLoad() by making a call to defaultDefLatency. It seems like LoadLatency is really used as a backup in case something is misconfigured.

  2. Scheduler models will use it as a helper variable in computing instruction latencies. I am not 100% sure because I am not an expert on these models, but I think this is modeling the worst-case scenario of checking the cache, having a miss, and having to do the load. Or it is possible that these models are misusing LoadLatency as a BaseLatency and adjusting the latency accordingly. For example:

llvm/lib/Target/X86/X86ScheduleZnver2.td:252:  let Latency = !add(Zn2WriteIMulH.Latency, Znver2Model.LoadLatency);

It seems like you are not trying to use it under either interpretation (1) or (2) of this variable. And here is why we shouldn't use it in those circumstances:

  • In the first case, if we are using llvm-mca with !hasInstrSchedModel, then something is probably wrong and a warning or error should have occurred sooner. This would mean that the entire report is not meaningful. If we are using llvm-mca where !SCDesc.isValid, then again there is a bigger issue here. Using LoadLatency probably isn't the solution. So if either of these conditions is true, it does not seem right to use LoadLatency in the llvm-mca reports. It makes sense to error or warn earlier.
  • In the second case, the LoadLatency should be baked into int Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);. I think this makes the assignment Latency = std::max(int(SM.LoadLatency), Latency); incorrect.

Prior to this patch, it does not seem like we are using LoadLatency to model the case you describe: the number of cycles a load takes to access the cache, assuming there is always a cache hit. It seems to be specified in such a way that it would make sense to use it like that, so I think it is reasonable to have llvm-mca use it as such.

As @adibiagio points out, I don't think this is what the code added in this patch models. We would need to determine the depth of the load pipeline. I think the current interpretation of this patch is to use whichever latency is higher: the latency assuming a cache hit (but ignoring load pipeline depth) or the latency assigned to the instruction. I think in most reasonable models, the second one should always be higher, so I'm not sure how useful this would be.

As @adibiagio pointed out, load-to-use latency changes depending on the opcode type (int, fp, vector). I wonder if LoadLatency is too simple/naive a value to be useful on processors that have multiple types of loads. One solution is to move the LoadLatency field into ProcWriteResources so that it can be set per scheduler class. I would not be opposed if the solution proposed here came first and this improvement was added as a follow-up. This would allow models that do have a single LoadLatency to benefit from this kind of reporting.

As a side note, I wonder if the scheduler could or should make more use of this interpretation in scheduling.

Collaborator

@adibiagio adibiagio Jun 10, 2024


  1. Scheduler models will use it as a helper variable in computing instruction latencies. I am not 100% sure because I am not an expert on these models, but I think this is modeling the worst-case scenario of checking the cache, having a miss, and having to do the load. Or it is possible that these models are misusing LoadLatency as a BaseLatency and adjusting the latency accordingly. For example:
llvm/lib/Target/X86/X86ScheduleZnver2.td:252:  let Latency = !add(Zn2WriteIMulH.Latency, Znver2Model.LoadLatency);

X86 scheduling models (optimistically) define LoadLatency as the L1D load-to-use latency.

Things are a bit more complicated for AMD processors, where load-to-use latency varies depending on whether loaded values are used by the FPU or INT.

According to the official AMD SoG for family 17h: "The L1 data cache has a 4- or 5-cycle integer load-to-use latency, and a 7- or 8-cycle FPU load-to-use latency". That is the reason why the Znver2Model optimistically defines it as 4.

If you look at the scheduling model for ZnVer2, you can see two multiclasses:

  • multiclass Zn2WriteResPair is used by integer instructions with a folded load operand
  • multiclass Zn2WriteResFpuPair is used by floating point/vector instructions with a folded load.

multiclass Zn2WriteResPair sets the default value for param loadLat to 4 (i.e. optimistic L1 data load-to-use INT latency according to the AMD docs).
On the other hand, multiclass Zn2WriteResFpuPair sets that param to 7 (cycles).

As far as I am aware, Intel processors don't need to model multiple values for the load-to-use latency. On X86, only AMD processors change the load latency depending on whether the user is INT or FPU.

@chinmaydd
Contributor Author

@michaelmaitland and @adibiagio - thank you very much for your time and useful comments. I think I understand the fundamental problems with this patch.

It may make sense for a consumer of llvm-mca (looking to more precisely model latency counts for load instructions) to define custom HW units, such as the LSUnit and something like a CacheManager, tuned to their specific processor.

Closing the PR. Thanks again!

@chinmaydd chinmaydd closed this Jun 11, 2024