[llvm-mca] Add command line option -use-load-latency
#94566
Conversation
@llvm/pr-subscribers-tools-llvm-mca @llvm/pr-subscribers-backend-x86

Author: Chinmay Deshpande (chinmaydd)

Changes

Latency for load operations could be modeled better by using the `LoadLatency` field in the provided scheduling model for the target architecture.

This attempts a fix for the comment here - LSUnit.h#370. I would be happy to know if there is a better way to approach this.

Full diff: https://github.com/llvm/llvm-project/pull/94566.diff

6 Files Affected:
diff --git a/llvm/include/llvm/MCA/InstrBuilder.h b/llvm/include/llvm/MCA/InstrBuilder.h
index 00c7942e4fa16..6880a213d48b5 100644
--- a/llvm/include/llvm/MCA/InstrBuilder.h
+++ b/llvm/include/llvm/MCA/InstrBuilder.h
@@ -79,6 +79,7 @@ class InstrBuilder {
bool FirstCallInst;
bool FirstReturnInst;
unsigned CallLatency;
+ bool UseLoadLatency;
using InstRecycleCallback = std::function<Instruction *(const InstrDesc &)>;
InstRecycleCallback InstRecycleCB;
@@ -99,7 +100,8 @@ class InstrBuilder {
public:
InstrBuilder(const MCSubtargetInfo &STI, const MCInstrInfo &MCII,
const MCRegisterInfo &RI, const MCInstrAnalysis *IA,
- const InstrumentManager &IM, unsigned CallLatency);
+ const InstrumentManager &IM, unsigned CallLatency,
+ bool UseLoadLatency);
void clear() {
Descriptors.clear();
diff --git a/llvm/lib/MCA/InstrBuilder.cpp b/llvm/lib/MCA/InstrBuilder.cpp
index d5cbdc5de0b84..db4eda40de9d0 100644
--- a/llvm/lib/MCA/InstrBuilder.cpp
+++ b/llvm/lib/MCA/InstrBuilder.cpp
@@ -31,9 +31,10 @@ InstrBuilder::InstrBuilder(const llvm::MCSubtargetInfo &sti,
const llvm::MCInstrInfo &mcii,
const llvm::MCRegisterInfo &mri,
const llvm::MCInstrAnalysis *mcia,
- const mca::InstrumentManager &im, unsigned cl)
+ const mca::InstrumentManager &im, unsigned cl,
+ bool ull)
: STI(sti), MCII(mcii), MRI(mri), MCIA(mcia), IM(im), FirstCallInst(true),
- FirstReturnInst(true), CallLatency(cl) {
+ FirstReturnInst(true), CallLatency(cl), UseLoadLatency(ull) {
const MCSchedModel &SM = STI.getSchedModel();
ProcResourceMasks.resize(SM.getNumProcResourceKinds());
computeProcResourceMasks(STI.getSchedModel(), ProcResourceMasks);
@@ -220,8 +221,8 @@ static void initializeUsedResources(InstrDesc &ID,
static void computeMaxLatency(InstrDesc &ID, const MCInstrDesc &MCDesc,
const MCSchedClassDesc &SCDesc,
- const MCSubtargetInfo &STI,
- unsigned CallLatency) {
+ const MCSubtargetInfo &STI, unsigned CallLatency,
+ bool UseLoadLatency) {
if (MCDesc.isCall()) {
// We cannot estimate how long this call will take.
// Artificially set an arbitrarily high latency.
@@ -230,6 +231,13 @@ static void computeMaxLatency(InstrDesc &ID, const MCInstrDesc &MCDesc,
}
int Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);
+
+ // If `UseLoadLatency` is set, we use the value in `MCSchedModel::LoadLatency`
+ // for load instructions.
+ if (MCDesc.mayLoad() && UseLoadLatency) {
+ const auto &SM = STI.getSchedModel();
+ Latency = std::max(int(SM.LoadLatency), Latency);
+ }
// If latency is unknown, then conservatively assume the MaxLatency set for
// calls.
ID.MaxLatency = Latency < 0 ? CallLatency : static_cast<unsigned>(Latency);
@@ -582,7 +590,7 @@ InstrBuilder::createInstrDescImpl(const MCInst &MCI,
}
initializeUsedResources(*ID, SCDesc, STI, ProcResourceMasks);
- computeMaxLatency(*ID, MCDesc, SCDesc, STI, CallLatency);
+ computeMaxLatency(*ID, MCDesc, SCDesc, STI, CallLatency, UseLoadLatency);
if (Error Err = verifyOperands(MCDesc, MCI))
return std::move(Err);
diff --git a/llvm/test/tools/llvm-mca/X86/use-load-latency.s b/llvm/test/tools/llvm-mca/X86/use-load-latency.s
new file mode 100644
index 0000000000000..54c37c5c8699f
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/X86/use-load-latency.s
@@ -0,0 +1,58 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=1 %s | FileCheck --check-prefixes=ALL,DEFAULT %s
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -use-load-latency -iterations=1 %s | FileCheck --check-prefixes=ALL,CUSTOM %s
+
+movq (%rsp), %rdi
+
+# ALL: Iterations: 1
+# ALL-NEXT: Instructions: 1
+
+# CUSTOM-NEXT: Total Cycles: 8
+# DEFAULT-NEXT: Total Cycles: 6
+
+# ALL-NEXT: Total uOps: 1
+
+# ALL: Dispatch Width: 2
+
+# CUSTOM-NEXT: uOps Per Cycle: 0.13
+# CUSTOM-NEXT: IPC: 0.13
+
+# DEFAULT-NEXT: uOps Per Cycle: 0.17
+# DEFAULT-NEXT: IPC: 0.17
+
+# ALL-NEXT: Block RThroughput: 1.0
+
+# ALL: Instruction Info:
+# ALL-NEXT: [1]: #uOps
+# ALL-NEXT: [2]: Latency
+# ALL-NEXT: [3]: RThroughput
+# ALL-NEXT: [4]: MayLoad
+# ALL-NEXT: [5]: MayStore
+# ALL-NEXT: [6]: HasSideEffects (U)
+
+# ALL: [1] [2] [3] [4] [5] [6] Instructions:
+# ALL-NEXT: 1 3 1.00 * movq (%rsp), %rdi
+
+# ALL: Resources:
+# ALL-NEXT: [0] - JALU0
+# ALL-NEXT: [1] - JALU1
+# ALL-NEXT: [2] - JDiv
+# ALL-NEXT: [3] - JFPA
+# ALL-NEXT: [4] - JFPM
+# ALL-NEXT: [5] - JFPU0
+# ALL-NEXT: [6] - JFPU1
+# ALL-NEXT: [7] - JLAGU
+# ALL-NEXT: [8] - JMul
+# ALL-NEXT: [9] - JSAGU
+# ALL-NEXT: [10] - JSTC
+# ALL-NEXT: [11] - JVALU0
+# ALL-NEXT: [12] - JVALU1
+# ALL-NEXT: [13] - JVIMUL
+
+# ALL: Resource pressure per iteration:
+# ALL-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
+# ALL-NEXT: - - - - - - - 1.00 - - - - - -
+
+# ALL: Resource pressure by instruction:
+# ALL-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
+# ALL-NEXT: - - - - - - - 1.00 - - - - - - movq (%rsp), %rdi
diff --git a/llvm/tools/llvm-mca/llvm-mca.cpp b/llvm/tools/llvm-mca/llvm-mca.cpp
index cc5d4f5fa05de..927b2b97c4d68 100644
--- a/llvm/tools/llvm-mca/llvm-mca.cpp
+++ b/llvm/tools/llvm-mca/llvm-mca.cpp
@@ -140,6 +140,11 @@ static cl::opt<unsigned>
cl::desc("Number of cycles to assume for a call instruction"),
cl::cat(ToolOptions), cl::init(100U));
+static cl::opt<bool> UseLoadLatency(
+ "use-load-latency", cl::Hidden,
+ cl::desc("Use target specific latency for load instructions"),
+ cl::cat(ToolOptions), cl::init(false));
+
enum class SkipType { NONE, LACK_SCHED, PARSE_FAILURE, ANY_FAILURE };
static cl::opt<enum SkipType> SkipUnsupportedInstructions(
@@ -573,7 +578,8 @@ int main(int argc, char **argv) {
}
// Create an instruction builder.
- mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, CallLatency);
+ mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, CallLatency,
+ UseLoadLatency);
// Create a context to control ownership of the pipeline hardware.
mca::Context MCA(*MRI, *STI);
diff --git a/llvm/unittests/tools/llvm-mca/MCATestBase.cpp b/llvm/unittests/tools/llvm-mca/MCATestBase.cpp
index 4a39f5e663f23..515fbe5bd6b69 100644
--- a/llvm/unittests/tools/llvm-mca/MCATestBase.cpp
+++ b/llvm/unittests/tools/llvm-mca/MCATestBase.cpp
@@ -66,7 +66,8 @@ Error MCATestBase::runBaselineMCA(json::Object &Result, ArrayRef<MCInst> Insts,
// Default InstrumentManager
auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
- mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
+ mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
+ /*UseLoadLatency*/ false);
const SmallVector<mca::Instrument *> Instruments;
SmallVector<std::unique_ptr<mca::Instruction>> LoweredInsts;
diff --git a/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp b/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp
index ac35dce522ae1..80a2f49ca4aa4 100644
--- a/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp
+++ b/llvm/unittests/tools/llvm-mca/X86/TestIncrementalMCA.cpp
@@ -33,7 +33,8 @@ TEST_F(X86TestBase, TestResumablePipeline) {
P->addEventListener(SV.get());
auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
- mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
+ mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
+ /*UseLoadLatency*/ false);
const SmallVector<mca::Instrument *> Instruments;
// Tile size = 7
@@ -124,7 +125,8 @@ TEST_F(X86TestBase, TestInstructionRecycling) {
// Default InstrumentManager
auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
- mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100);
+ mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
+ /*UseLoadLatency*/ false);
IB.setInstRecycleCallback(GetRecycledInst);
const SmallVector<mca::Instrument *> Instruments;
@mshockwave @michaelmaitland - tagging both of you since you helped review my last PR.
llvm/lib/MCA/InstrBuilder.cpp
Outdated
  // If `UseLoadLatency` is set, we use the value in `MCSchedModel::LoadLatency`
  // for load instructions.
  if (MCDesc.mayLoad() && UseLoadLatency) {
    const auto &SM = STI.getSchedModel();
please spell out the type (rather than using auto) here.
  auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
                       /*UseLoadLatency*/ false);
Per https://www.llvm.org/docs/CodingStandards.html#comment-formatting :

Suggested change:
  /*UseLoadLatency*/ false);
to
  /*UseLoadLatency=*/false);
  auto IM = std::make_unique<mca::InstrumentManager>(*STI, *MCII);
  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
                       /*UseLoadLatency*/ false);
ditto
  mca::InstrBuilder IB(*STI, *MCII, *MRI, MCIA.get(), *IM, /*CallLatency=*/100,
                       /*UseLoadLatency*/ false);
ditto
Latency for load operations could be modeled better by using the `LoadLatency` field in the provided scheduling model for the target architecture.

Force-pushed from 553c8b2 to 0f3a127.
  if (MCDesc.mayLoad() && UseLoadLatency) {
    const MCSchedModel &SM = STI.getSchedModel();
    Latency = std::max(int(SM.LoadLatency), Latency);
  }
Sorry but this doesn't work.
You cannot simply override Latency that way. The latency of instructions with folded memory operands isn't just equivalent to the load latency.
To further complicate matters, it is wrong to assume a single load latency value for all memory operations. For example, on AMD processors, the load-to-use latency changes depending on whether the opcode is INT or FPU. That is because there is an extra delay in the data path when loaded values are forwarded to the floating point unit (often implemented as a coprocessor).
  // If `UseLoadLatency` is set, we use the value in `MCSchedModel::LoadLatency`
  // for load instructions.
  if (MCDesc.mayLoad() && UseLoadLatency) {
What if `LoadLatency` is not set (i.e. it is -1) but `-use-load-latency` is passed? Should a warning be printed if `-use-load-latency` is set?
    return;
  }

  int Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);
IIUC, there are two ways we use `LoadLatency` in LLVM today:

1. In `TargetSchedModel::computeInstrLatency`, under the condition `!hasInstrSchedModel && !SCDesc.isValid && MI.mayLoad()`, by making a call to `defaultDefLatency`. It seems like `LoadLatency` is really used as a backup in case something is misconfigured.
2. Scheduler models will use it as a helper variable in computing instruction latencies. I am not 100% sure because I am not an expert on these models, but I think this is modeling the worst case scenario of checking the cache, having a miss, and having to do the load. Or it is possible that these models are misusing `LoadLatency` as a `BaseLatency` and adjusting the latency accordingly. For example:

llvm/lib/Target/X86/X86ScheduleZnver2.td:252: let Latency = !add(Zn2WriteIMulH.Latency, Znver2Model.LoadLatency);
It seems like you are not trying to use it under the (1) or (2) interpretations of this variable. And here is why we shouldn't use it in these circumstances:

- In the first case, if we are using llvm-mca with `!hasInstrSchedModel`, then something is probably wrong and a warning or error should have occurred sooner. This would mean that the entire report is not meaningful. If we are using llvm-mca where `!SCDesc.isValid`, then again there is a bigger issue here, and using `LoadLatency` probably isn't the solution. So if either of these conditions is true, it does not seem right to use `LoadLatency` in the llvm-mca reports; it makes sense to error or warn earlier.
- In the second case, the `LoadLatency` should be baked into `int Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);`. I think this makes the assignment `Latency = std::max(int(SM.LoadLatency), Latency);` incorrect.
Prior to this patch, it does not seem like we are using `LoadLatency` to model the case you describe: the cycles for loads to access the cache, assuming there is always a cache hit. It seems to be specified in such a way that it would make sense to use it this way, so I think it is reasonable to have llvm-mca use it as such.

As @adibiagio points out, I don't think this is what the code added in this patch models. We would need to determine the depth of the load pipeline. I think the current interpretation of this patch is to use whichever latency is higher: the latency assuming a cache hit but ignoring load pipeline depth, or the latency assigned to the instruction. In most reasonable models the second one should always be higher, so I'm not sure how useful this would be.

As @adibiagio pointed out, the load-to-use latency changes depending on the opcode type (int, fp, vector). I wonder if `LoadLatency` is too simple/naive a value to be useful on processors that have multiple types of loads. One solution is to move the `LoadLatency` field into `ProcWriteResources` so that it can be set per scheduler class. I would not be opposed if the solution proposed here came first and this improvement was added as a follow-up. This would allow models that do have a single `LoadLatency` to benefit from this kind of reporting.

As a side note, I wonder if the scheduler could or should make more use of this interpretation in scheduling.
> Scheduler models will use it as a helper variable in computing instruction latencies. I am not 100% sure because I am not an expert on these models, but I think this is modeling the worst case scenario of checking the cache, having a miss, and having to do the load. Or it is possible that these models are misusing `LoadLatency` as a `BaseLatency` and adjusting the latency accordingly. For example: `llvm/lib/Target/X86/X86ScheduleZnver2.td:252: let Latency = !add(Zn2WriteIMulH.Latency, Znver2Model.LoadLatency);`

X86 scheduling models (optimistically) define `LoadLatency` as the L1D load-to-use latency.
Things are a bit more complicated for AMD processors, where load-to-use latency varies depending on whether loaded values are used by the FPU or INT.
According to the official AMD SoG for family 17h: "The L1 data cache has a 4- or 5-cycle integer load-to-use latency, and a 7- or 8-cycle FPU load-to-use latency". That is the reason why the Znver2Model optimistically defines it as 4.
If you look at the scheduling model for ZnVer2, you can see two multiclasses:
- multiclass Zn2WriteResPair is used by integer instructions with a folded load operand
- multiclass Zn2WriteResFpuPair is used by floating point/vector instructions with a folded load.
multiclass Zn2WriteResPair sets the default value for param `loadLat` to 4 (i.e. the optimistic L1 data load-to-use INT latency according to the AMD docs). On the other hand, multiclass Zn2WriteResFpuPair sets that param to 7 (cycles).
As far as I am aware, Intel processors don't need to model multiple values for the load-to-use latency. On X86, only on AMD processors does the load latency change depending on whether the user is INT or FPU.
@michaelmaitland and @adibiagio - thank you very much for your time and useful comments. I think I understand the fundamental problems with this patch. It may make sense for a consumer of llvm-mca (looking to more precisely model latency counts for load instructions) to define custom HW units, such as the LSUnit and something like a CacheManager, tuned to their specific processor. Closing the PR, thanks again!
Latency for load operations could be modeled better by using the `LoadLatency` field in the provided scheduling model for the target architecture. This PR adds a new command line flag `use-load-latency` that enables this behavior. The flag is off by default.

This attempts a fix for the comment here - LSUnit.h#403. I would be happy to know if there is a better way to approach this.