Skip to content

Conversation

@rampitec
Copy link
Collaborator

If we have 1024 VGPRs available we need to give priority to the
allocation of these registers where operands can only use low 256.
That is noteably scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to get a room for these
scale registers.

Allocation priority itself does not eliminate spilling completely
in large kernels, although helps to some degree. Increasing spill
weight of a restricted class on top of it helps.

If we have 1024 VGPRs available we need to give priority to the
allocation of these registers where operands can only use low 256.
That is noteably scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to get a room for these
scale registers.

Allocation priority itself does not eliminate spilling completely
in large kernels, although helps to some degree. Increasing spill
weight of a restricted class on top of it helps.
Copy link
Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@rampitec rampitec requested a review from qcolombet November 13, 2025 23:32
@rampitec rampitec marked this pull request as ready for review November 13, 2025 23:32
@llvmbot
Copy link
Member

llvmbot commented Nov 13, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

If we have 1024 VGPRs available we need to give priority to the
allocation of these registers where operands can only use low 256.
That is noteably scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to get a room for these
scale registers.

Allocation priority itself does not eliminate spilling completely
in large kernels, although helps to some degree. Increasing spill
weight of a restricted class on top of it helps.


Full diff: https://github.com/llvm/llvm-project/pull/167978.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.h (+11)
  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.td (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
index a6af25dfd7d6f..28ab2137b193c 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
@@ -501,6 +501,17 @@ class SIRegisterInfo final : public AMDGPUGenRegisterInfo {
 
   SmallVector<StringLiteral>
   getVRegFlagsOfReg(Register Reg, const MachineFunction &MF) const override;
+
+  float
+  getSpillWeightScaleFactor(const TargetRegisterClass *RC) const override {
+    // Prioritize VGPR_32_Lo256 over other classes which may occupy registers
+    // beyond v256.
+    return AMDGPUGenRegisterInfo::getSpillWeightScaleFactor(RC) *
+           ((RC == &AMDGPU::VGPR_32_Lo256RegClass ||
+             RC == &AMDGPU::VReg_64_Lo256_Align2RegClass)
+                ? 2.0
+                : 1.0);
+  }
 };
 
 namespace AMDGPU {
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
index abe12c17ae76c..5cff5f2248b02 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
@@ -644,7 +644,7 @@ def VGPR_32_Lo128 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg1
 // Identical to VGPR_32 except it only contains the low 256 (Lo256) registers.
 def VGPR_32_Lo256 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
                                     (add (sequence "VGPR%u", 0, 255))> {
-  let AllocationPriority = 0;
+  let AllocationPriority = !add(3, !mul(BaseClassPriority, BaseClassScaleFactor));
   let GeneratePressureSet = 0;
   let Size = 32;
   let Weight = 1;

@rampitec
Copy link
Collaborator Author

I was unable to come up with a reasonably small testcase so far.

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No test changes?

Comment on lines +507 to +514
// Prioritize VGPR_32_Lo256 over other classes which may occupy registers
// beyond v256.
return AMDGPUGenRegisterInfo::getSpillWeightScaleFactor(RC) *
((RC == &AMDGPU::VGPR_32_Lo256RegClass ||
RC == &AMDGPU::VReg_64_Lo256_Align2RegClass)
? 2.0
: 1.0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't require manual overriding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some level this is even right. Using this or that register is not more expensive. That is when it comes to spilling it matters.

def VGPR_32_Lo256 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
(add (sequence "VGPR%u", 0, 255))> {
let AllocationPriority = 0;
let AllocationPriority = !add(3, !mul(BaseClassPriority, BaseClassScaleFactor));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try setting CostPerUse? I think that makes more sense than changing priority

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's for registers, right? In this case it shall be per regclass. I.e., even if I increase it for low 256, it will be the same for a smaller and larger size. Here it is really a property of operand, not a register itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More or less that would make almost everything worse. Setting CostPerUse to non-zero for all low 256 VGPRs will persuade RA to allocate registers above 256 by default. At the end you will have programs using registers starting just from v256 and nothing at the range v0-v255.

@rampitec
Copy link
Collaborator Author

No test changes?

No, at least not yet. The real test is huge and I didn't reproduce spilling in anything smaller yet.

@rampitec
Copy link
Collaborator Author

No test changes?

No, at least not yet. The real test is huge and I didn't reproduce spilling in anything smaller yet.

This seems to be very difficult to force RA spilling in a small test, it has to be really huge.

@rampitec
Copy link
Collaborator Author

rampitec commented Nov 15, 2025

This seems to be very difficult to force RA spilling in a small test, it has to be really huge.

The minimal working testcase llvm-reduce left me with is 102K of IR, 416 IR instructions. Do you want me to add it?

@rampitec
Copy link
Collaborator Author

This seems to be very difficult to force RA spilling in a small test, it has to be really huge.

The minimal working testcase llvm-reduce left me with is 102K of IR, 416 IR instructions. Do you want me to add it?

That's the test: #168163. It seems I cannot make it smaller.

@rampitec
Copy link
Collaborator Author

Even this is not a magic bullet. I still can seduce RA to spill in the pathological very big case. Still correct, but will spill on lo256 operands. I guess this is the nature of having limited VGPR classes and inability to encode all operands to use all possible registers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants