-
Notifications
You must be signed in to change notification settings - Fork 14.8k
[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking #162077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter. |
@llvm/pr-subscribers-backend-amdgpu Author: Pierre van Houtryve (Pierre-vh) ChangesThe pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units. There are no test changes because functionality didn't change, except:
This also changes the tracking to use a DenseMap instead of a massive I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else. Patch is 35.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162077.diff 1 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 76bfce8c0f6f9..90a5cd4e87ae7 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -97,7 +97,27 @@ auto inst_counter_types(InstCounterType MaxCounter = NUM_INST_CNTS) {
return enum_seq(LOAD_CNT, MaxCounter);
}
-using RegInterval = std::pair<int, int>;
+/// Integer IDs used to track vector memory locations we may have to wait on.
+/// Encoded as u16 chunks:
+///
+/// [0, MAX_REGUNITS ): MCRegUnit
+/// [FIRST_LDSDMA, LAST_LDSDMA ): LDS DMA IDs
+using VMEMID = uint32_t;
+
+enum : VMEMID {
+ TRACKINGID_RANGE_LEN = (1 << 16),
+
+ REGUNITS_BEGIN = 0,
+ REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,
+
+ // Note for LDSDMA: LDSDMA_BEGIN corresponds to the "common"
+ // entry, which is updated for all LDS DMA operations encountered.
+ // Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
+ LDSDMA_BEGIN = REGUNITS_END,
+ LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
+
+ NUM_LDSDMA = TRACKINGID_RANGE_LEN
+};
struct HardwareLimits {
unsigned LoadcntMax; // Corresponds to VMcnt prior to gfx12.
@@ -146,30 +166,6 @@ static constexpr StringLiteral WaitEventTypeName[] = {
#undef AMDGPU_EVENT_NAME
// clang-format on
-// The mapping is:
-// 0 .. SQ_MAX_PGM_VGPRS-1 real VGPRs
-// SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1 extra VGPR-like slots
-// NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
-// NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS .. SCC
-// We reserve a fixed number of VGPR slots in the scoring tables for
-// special tokens like SCMEM_LDS (needed for buffer load to LDS).
-enum RegisterMapping {
- SQ_MAX_PGM_VGPRS = 2048, // Maximum programmable VGPRs across all targets.
- AGPR_OFFSET = 512, // Maximum programmable ArchVGPRs across all targets.
- SQ_MAX_PGM_SGPRS = 128, // Maximum programmable SGPRs across all targets.
- // Artificial register slots to track LDS writes into specific LDS locations
- // if a location is known. When slots are exhausted or location is
- // unknown use the first slot. The first slot is also always updated in
- // addition to known location's slot to properly generate waits if dependent
- // instruction's location is unknown.
- FIRST_LDS_VGPR = SQ_MAX_PGM_VGPRS, // Extra slots for LDS stores.
- NUM_LDS_VGPRS = 9, // One more than the stores we track.
- NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_LDS_VGPRS, // Where SGPRs start.
- NUM_ALL_ALLOCATABLE = NUM_ALL_VGPRS + SQ_MAX_PGM_SGPRS,
- // Remaining non-allocatable registers
- SCC = NUM_ALL_ALLOCATABLE
-};
-
// Enumerate different types of result-returning VMEM operations. Although
// s_waitcnt orders them all with a single vmcnt counter, in the absence of
// s_waitcnt only instructions of the same VmemType are guaranteed to write
@@ -616,32 +612,26 @@ class WaitcntBrackets {
return getScoreUB(T) - getScoreLB(T);
}
- unsigned getRegScore(int GprNo, InstCounterType T) const {
- if (GprNo < NUM_ALL_VGPRS)
- return VgprScores[T][GprNo];
-
- if (GprNo < NUM_ALL_ALLOCATABLE)
- return SgprScores[getSgprScoresIdx(T)][GprNo - NUM_ALL_VGPRS];
+ unsigned getSGPRScore(MCRegUnit RU, InstCounterType T) const {
+ auto It = SGPRs.find(RU);
+ return (It != SGPRs.end()) ? It->second.Scores[getSgprScoresIdx(T)] : 0;
+ }
- assert(GprNo == SCC);
- return SCCScore;
+ unsigned getVMemScore(VMEMID TID, InstCounterType T) const {
+ auto It = VMem.find(TID);
+ return (It != VMem.end()) ? It->second.Scores[T] : 0;
}
bool merge(const WaitcntBrackets &Other);
- RegInterval getRegInterval(const MachineInstr *MI,
- const MachineOperand &Op) const;
-
bool counterOutOfOrder(InstCounterType T) const;
void simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
void simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
- void determineWait(InstCounterType T, RegInterval Interval,
- AMDGPU::Waitcnt &Wait) const;
- void determineWait(InstCounterType T, int RegNo,
- AMDGPU::Waitcnt &Wait) const {
- determineWait(T, {RegNo, RegNo + 1}, Wait);
- }
+ void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
+ AMDGPU::Waitcnt &Wait) const;
+ void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
+ AMDGPU::Waitcnt &Wait) const;
void tryClearSCCWriteEvent(MachineInstr *Inst);
void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
@@ -690,19 +680,19 @@ class WaitcntBrackets {
// Return true if there might be pending writes to the vgpr-interval by VMEM
// instructions with types different from V.
- bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- assert(RegNo < NUM_ALL_VGPRS);
- if (VgprVmemTypes[RegNo] & ~(1 << V))
+ bool hasOtherPendingVmemTypes(MCPhysReg Reg, VmemType V) const {
+ for (MCRegUnit RU : regunits(Reg)) {
+ auto It = VMem.find(RU);
+ if (It != VMem.end() && (It->second.VMEMTypes & ~(1 << V)))
return true;
}
return false;
}
- void clearVgprVmemTypes(RegInterval Interval) {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- assert(RegNo < NUM_ALL_VGPRS);
- VgprVmemTypes[RegNo] = 0;
+ void clearVgprVmemTypes(MCPhysReg Reg) {
+ for (MCRegUnit RU : regunits(Reg)) {
+ if (auto It = VMem.find(RU); It != VMem.end())
+ It->second.VMEMTypes = 0;
}
}
@@ -718,7 +708,7 @@ class WaitcntBrackets {
bool hasPointSampleAccel(const MachineInstr &MI) const;
bool hasPointSamplePendingVmemTypes(const MachineInstr &MI,
- RegInterval Interval) const;
+ MCPhysReg RU) const;
void print(raw_ostream &) const;
void dump() const { print(dbgs()); }
@@ -730,9 +720,24 @@ class WaitcntBrackets {
unsigned MyShift;
unsigned OtherShift;
};
+
+ void determineWaitForScore(InstCounterType T, unsigned Score,
+ AMDGPU::Waitcnt &Wait) const;
+
static bool mergeScore(const MergeInfo &M, unsigned &Score,
unsigned OtherScore);
+ iterator_range<MCRegUnitIterator> regunits(MCPhysReg Reg) const {
+ assert(Reg != AMDGPU::SCC && "Shouldn't be used on SCC");
+ const TargetRegisterClass *RC = Context->TRI->getPhysRegBaseClass(Reg);
+ unsigned Size = Context->TRI->getRegSizeInBits(*RC);
+ if (!Context->TRI->isInAllocatableClass(Reg))
+ return {{}, {}};
+ if (Size == 16 && Context->ST->hasD16Writes32BitVgpr())
+ Reg = Context->TRI->get32BitRegister(Reg);
+ return Context->TRI->regunits(Reg);
+ }
+
void setScoreLB(InstCounterType T, unsigned Val) {
assert(T < NUM_INST_CNTS);
ScoreLBs[T] = Val;
@@ -749,15 +754,26 @@ class WaitcntBrackets {
ScoreLBs[EXP_CNT] = ScoreUBs[EXP_CNT] - Context->getWaitCountMax(EXP_CNT);
}
- void setRegScore(int GprNo, InstCounterType T, unsigned Val) {
- setScoreByInterval({GprNo, GprNo + 1}, T, Val);
+ void setRegScore(MCPhysReg Reg, InstCounterType T, unsigned Val) {
+ const SIRegisterInfo *TRI = Context->TRI;
+ if (Reg == AMDGPU::SCC) {
+ SCCScore = Val;
+ } else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
+ for (MCRegUnit RU : regunits(Reg))
+ VMem[RU].Scores[T] = Val;
+ } else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
+ auto STy = getSgprScoresIdx(T);
+ for (MCRegUnit RU : regunits(Reg))
+ SGPRs[RU].Scores[STy] = Val;
+ }
}
- void setScoreByInterval(RegInterval Interval, InstCounterType CntTy,
- unsigned Score);
+ void setVMemScore(VMEMID TID, InstCounterType T, unsigned Val) {
+ VMem[TID].Scores[T] = Val;
+ }
- void setScoreByOperand(const MachineInstr *MI, const MachineOperand &Op,
- InstCounterType CntTy, unsigned Val);
+ void setScoreByOperand(const MachineOperand &Op, InstCounterType CntTy,
+ unsigned Val);
const SIInsertWaitcnts *Context;
@@ -768,26 +784,39 @@ class WaitcntBrackets {
unsigned LastFlat[NUM_INST_CNTS] = {0};
// Remember the last GDS operation.
unsigned LastGDS = 0;
- // wait_cnt scores for every vgpr.
- // Keep track of the VgprUB and SgprUB to make merge at join efficient.
- int VgprUB = -1;
- int SgprUB = -1;
- unsigned VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS] = {{0}};
- // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
- // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
- // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps the
- // X_CNT score.
- unsigned SgprScores[2][SQ_MAX_PGM_SGPRS] = {{0}};
+
+ /// The score tracking logic is fragmented as follows:
+ /// - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
+ /// - SGPRs: SGPR RegUnits
+ /// - SCC
+
+ struct VGPRInfo {
+ // Scores for all instruction counters.
+ unsigned Scores[NUM_INST_CNTS] = {0};
+ // Bitmask of the VmemTypes of VMEM instructions for this VGPR.
+ unsigned VMEMTypes = 0;
+ };
+
+ struct SGPRInfo {
+ // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
+ // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
+ // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps
+ // the X_CNT score.
+ unsigned Scores[2] = {0};
+ };
+
+ DenseMap<VMEMID, VGPRInfo> VMem; // VGPR + LDS DMA
+ DenseMap<MCRegUnit, SGPRInfo> SGPRs;
+
// Reg score for SCC.
unsigned SCCScore = 0;
// The unique instruction that has an SCC write pending, if there is one.
const MachineInstr *PendingSCCWrite = nullptr;
- // Bitmask of the VmemTypes of VMEM instructions that might have a pending
- // write to each vgpr.
- unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+
// Store representative LDS DMA operations. The only useful info here is
// alias info. One store is kept per unique AAInfo.
- SmallVector<const MachineInstr *, NUM_LDS_VGPRS - 1> LDSDMAStores;
+ // Entry zero is the "generic" entry that applies to all LDSDMA stores.
+ SmallVector<const MachineInstr *> LDSDMAStores;
};
class SIInsertWaitcntsLegacy : public MachineFunctionPass {
@@ -813,82 +842,10 @@ class SIInsertWaitcntsLegacy : public MachineFunctionPass {
} // end anonymous namespace
-RegInterval WaitcntBrackets::getRegInterval(const MachineInstr *MI,
- const MachineOperand &Op) const {
- if (Op.getReg() == AMDGPU::SCC)
- return {SCC, SCC + 1};
-
- const SIRegisterInfo *TRI = Context->TRI;
- const MachineRegisterInfo *MRI = Context->MRI;
-
- if (!TRI->isInAllocatableClass(Op.getReg()))
- return {-1, -1};
-
- // A use via a PW operand does not need a waitcnt.
- // A partial write is not a WAW.
- assert(!Op.getSubReg() || !Op.isUndef());
-
- RegInterval Result;
-
- MCRegister MCReg = AMDGPU::getMCReg(Op.getReg(), *Context->ST);
- unsigned RegIdx = TRI->getHWRegIndex(MCReg);
-
- const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Op.getReg());
- unsigned Size = TRI->getRegSizeInBits(*RC);
-
- // AGPRs/VGPRs are tracked every 16 bits, SGPRs by 32 bits
- if (TRI->isVectorRegister(*MRI, Op.getReg())) {
- unsigned Reg = RegIdx << 1 | (AMDGPU::isHi16Reg(MCReg, *TRI) ? 1 : 0);
- assert(!Context->ST->hasMAIInsts() || Reg < AGPR_OFFSET);
- Result.first = Reg;
- if (TRI->isAGPR(*MRI, Op.getReg()))
- Result.first += AGPR_OFFSET;
- assert(Result.first >= 0 && Result.first < SQ_MAX_PGM_VGPRS);
- assert(Size % 16 == 0);
- Result.second = Result.first + (Size / 16);
-
- if (Size == 16 && Context->ST->hasD16Writes32BitVgpr()) {
- // Regardless of which lo16/hi16 is used, consider the full 32-bit
- // register used.
- if (AMDGPU::isHi16Reg(MCReg, *TRI))
- Result.first -= 1;
- else
- Result.second += 1;
- }
- } else if (TRI->isSGPRReg(*MRI, Op.getReg()) && RegIdx < SQ_MAX_PGM_SGPRS) {
- // SGPRs including VCC, TTMPs and EXEC but excluding read-only scalar
- // sources like SRC_PRIVATE_BASE.
- Result.first = RegIdx + NUM_ALL_VGPRS;
- Result.second = Result.first + divideCeil(Size, 32);
- } else {
- return {-1, -1};
- }
-
- return Result;
-}
-
-void WaitcntBrackets::setScoreByInterval(RegInterval Interval,
- InstCounterType CntTy,
- unsigned Score) {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- if (RegNo < NUM_ALL_VGPRS) {
- VgprUB = std::max(VgprUB, RegNo);
- VgprScores[CntTy][RegNo] = Score;
- } else if (RegNo < NUM_ALL_ALLOCATABLE) {
- SgprUB = std::max(SgprUB, RegNo - NUM_ALL_VGPRS);
- SgprScores[getSgprScoresIdx(CntTy)][RegNo - NUM_ALL_VGPRS] = Score;
- } else {
- assert(RegNo == SCC);
- SCCScore = Score;
- }
- }
-}
-
-void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
- const MachineOperand &Op,
+void WaitcntBrackets::setScoreByOperand(const MachineOperand &Op,
InstCounterType CntTy, unsigned Score) {
- RegInterval Interval = getRegInterval(MI, Op);
- setScoreByInterval(Interval, CntTy, Score);
+ assert(Op.isReg());
+ setRegScore(Op.getReg().asMCReg(), CntTy, Score);
}
// Return true if the subtarget is one that enables Point Sample Acceleration
@@ -911,12 +868,12 @@ bool WaitcntBrackets::hasPointSampleAccel(const MachineInstr &MI) const {
// one that has outstanding writes to vmem-types different than VMEM_NOSAMPLER
// (this is the type that a point sample accelerated instruction effectively
// becomes)
-bool WaitcntBrackets::hasPointSamplePendingVmemTypes(
- const MachineInstr &MI, RegInterval Interval) const {
+bool WaitcntBrackets::hasPointSamplePendingVmemTypes(const MachineInstr &MI,
+ MCPhysReg Reg) const {
if (!hasPointSampleAccel(MI))
return false;
- return hasOtherPendingVmemTypes(Interval, VMEM_NOSAMPLER);
+ return hasOtherPendingVmemTypes(Reg, VMEM_NOSAMPLER);
}
void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
@@ -943,57 +900,52 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// All GDS operations must protect their address register (same as
// export.)
if (const auto *AddrOp = TII->getNamedOperand(Inst, AMDGPU::OpName::addr))
- setScoreByOperand(&Inst, *AddrOp, EXP_CNT, CurrScore);
+ setScoreByOperand(*AddrOp, EXP_CNT, CurrScore);
if (Inst.mayStore()) {
if (const auto *Data0 =
TII->getNamedOperand(Inst, AMDGPU::OpName::data0))
- setScoreByOperand(&Inst, *Data0, EXP_CNT, CurrScore);
+ setScoreByOperand(*Data0, EXP_CNT, CurrScore);
if (const auto *Data1 =
TII->getNamedOperand(Inst, AMDGPU::OpName::data1))
- setScoreByOperand(&Inst, *Data1, EXP_CNT, CurrScore);
+ setScoreByOperand(*Data1, EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst) && !SIInstrInfo::isGWS(Inst) &&
Inst.getOpcode() != AMDGPU::DS_APPEND &&
Inst.getOpcode() != AMDGPU::DS_CONSUME &&
Inst.getOpcode() != AMDGPU::DS_ORDERED_COUNT) {
for (const MachineOperand &Op : Inst.all_uses()) {
if (TRI->isVectorRegister(*MRI, Op.getReg()))
- setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+ setScoreByOperand(Op, EXP_CNT, CurrScore);
}
}
} else if (TII->isFLAT(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isMIMG(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isMTBUF(Inst)) {
if (Inst.mayStore())
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (TII->isMUBUF(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isLDSDIR(Inst)) {
// LDSDIR instructions attach the score to the destination.
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
EXP_CNT, CurrScore);
} else {
if (TII->isEXP(Inst)) {
@@ -1003,18 +955,18 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// score.
for (MachineOperand &DefMO : Inst.all_defs()) {
if (TRI->isVGPR(*MRI, DefMO.getReg())) {
- setScoreByOperand(&Inst, DefMO, EXP_CNT, CurrScore);
+ setScoreByOperand(DefMO, EXP_CNT, CurrScore);
}
}
}
for (const MachineOperand &Op : Inst.all_uses()) {
if (TRI->isVectorRegister(*MRI, Op.getReg()))
- setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+ setScoreByOperand(Op, EXP_CNT, CurrScore);
}
}
} else if (T == X_CNT) {
for (const MachineOperand &Op : Inst.all_uses())
- setScoreByOperand(&Inst, Op, T, CurrScore);
+ setScoreByOperand(Op, T, CurrScore);
} else /* LGKM_CNT || EXP_CNT || VS_CNT || NUM_INST_CNTS */ {
// Match the score to the destination registers.
//
@@ -1026,9 +978,9 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// Special cases where implicit register defs exists, such as M0 or VCC,
// but none with memory instructions.
for (const MachineOperand &Op : Inst.defs()) {
- RegInterval Interval = getRegInterval(&Inst, Op);
if (T == LOAD_CNT || T == SAMPLE_CNT || T == BVH_CNT) {
- if (Interval.first >= NUM_ALL_VGPRS)
+ if (!Context->TRI->isVectorRegister(*Context->MRI,
+ Op.getReg())) // TODO: add wrapper
continue;
if (updateVMCntOnly(Inst)) {
// updateVMCntOnly should only leave us with VGPRs
@@ -1041,11 +993,11 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// this with another potential dependency
if (hasPointSampleAccel(Inst))
TypesMask |= 1 << VMEM_NOSAMPLER;
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo)
- VgprVmemTypes[RegNo] |= TypesMask;
+ for (MCRegUnit RU : regunits(Op.getReg().asMCReg()))
+ VMem[RU].VMEMTypes |= TypesMask;
}
}
- setScoreByInterval(Interval, T, CurrScore);
+ setScoreByOperand(Op, T, CurrScore);
}
if (Inst.mayStore() &&
(TII->isDS(Inst) || TII->mayWriteLDSThroughDMA(Inst))) {
@@ -1076,19 +1028,19...
[truncated]
|
You can get the unit's "root" register and test that. See MCRegUnitRootIterator. AMDGPU does not use ad hoc aliasing so I think every unit should have exactly one root. |
Is it worth adding this (potentially expensive?) query just so I can merge a few methods together though ? Perhaps I can try it in another diff on top of this ? This diff already changes a lot of thing, I want to make sure it doesn't become too big to review or debug in case of an issue. |
3ec5466
to
34a77ec
Compare
Clean up the tracking logic to rely on register units. The pass was already "reinventing" the concept just to deal with 16 bit registers. There are no test changes, functionality is the same, except we can now track more LDS DMA IDs if we need it. The debug prints also changed a bit because we now talk in terms of register units. This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the `memcpy` built-in). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses. This still isn't as clean as I'd like it to be though. There is a mix of "VMEMID", "LDS DMA ID", "SGPR RegUnit" and "PhysReg" in the API of WaitCntBrackets. There is no type safety to avoid mix-ups as these are all integers. We could add another layer of abstraction on top, but I feel like it's going to add too much code/boilerplate for such a small issue.
34a77ec
to
70baf17
Compare
The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units.
There are no test changes because functionality didn't change, except:
1 << 16
)This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the
memcpy
built-in on a big test file).I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else.
One issue I've had with that is that I'm not sure how to tell if a RU is a SGPR or VGPR. I don't think RUs always considered registers so I can really fetch their regclass.