-
Notifications
You must be signed in to change notification settings - Fork 15.5k
[AMDGPU] Allow folding of non-subregs through REG_SEQUENCE #151033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -730,14 +730,11 @@ bool SIFoldOperandsImpl::updateOperand(FoldCandidate &Fold) const { | |
| } | ||
| } | ||
|
|
||
| // Rework once the VS_16 register class is updated to include proper | ||
| // 16-bit SGPRs instead of 32-bit ones. | ||
| if (Old.getSubReg() == AMDGPU::lo16 && TRI->isSGPRReg(*MRI, New->getReg())) | ||
| Old.setSubReg(AMDGPU::NoSubRegister); | ||
| Old.setSubReg(New->getSubReg()); | ||
| if (New->getReg().isPhysical()) { | ||
| Old.substPhysReg(New->getReg(), *TRI); | ||
| } else { | ||
| Old.substVirtReg(New->getReg(), New->getSubReg(), *TRI); | ||
| Old.substVirtReg(New->getReg(), 0, *TRI); | ||
|
Comment on lines
-740
to
+737
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did have to change this to not pass in any subreg in order to make |
||
| Old.setIsUndef(New->isUndef()); | ||
| } | ||
| return true; | ||
|
|
@@ -1150,10 +1147,14 @@ void SIFoldOperandsImpl::foldOperand( | |
| if (UseOp->isReg() && OpToFold.isReg()) { | ||
| if (UseOp->isImplicit()) | ||
| return; | ||
| // Allow folding from SGPRs to 16-bit VGPRs. | ||
|
|
||
| MachineInstr *SourceInstruction = MRI->getVRegDef(UseOp->getReg()); | ||
| // Allow folding from SGPRs to 16-bit VGPRs | ||
| // or folding of non-subregs through REG_SEQUENCES. | ||
| if (UseOp->getSubReg() != AMDGPU::NoSubRegister && | ||
| (UseOp->getSubReg() != AMDGPU::lo16 || | ||
| !TRI->isSGPRReg(*MRI, OpToFold.getReg()))) | ||
| !TRI->isSGPRReg(*MRI, OpToFold.getReg())) && | ||
| !SourceInstruction->isRegSequence()) | ||
| return; | ||
| } | ||
|
|
||
|
|
@@ -1452,6 +1453,35 @@ void SIFoldOperandsImpl::foldOperand( | |
| return; | ||
| } | ||
|
|
||
| // FIXME: If we properly encode the 32-bit aligned register requirement for | ||
| // these DS_GWS instructions, this can be removed. | ||
| if (!FoldingImmLike && OpToFold.isReg() && ST->needsAlignedVGPRs()) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is too much special casing for an unrelated hack (really we should have an aligned 32-bit register class for this case) If you use getRegClassConstraintEffectForVReg instead of checking the specific operand to compute the constraint, do you avoid the need for this.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think that will work actually. I was going to suggest checking if the operand is tied, but that hack is not actually tying the operand and nothing is really ensuring that the two operands will remain the same
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we could remove this hack properly now. We can now use RegClassByHwMode, and will then just need to adjust the selection patterns to pad with undef on the relevant targets
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you elaborate on your comment? Is your suggestion that I add an aligned 32-bit RC, and then use RegClassByHwMode to have the relevant DS_GWS instructions use that RC on the subtargets that require it? If we do not have a 32-bit aligned RC, I don't understand how this feature and padding during selection helps enforce the alignment at the time of folding.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would be a separate PR, since it's not really related to this. Adding an even aligned 32-bit register class would be difficult. It would be easier to model this as a 64-bit operand, and continue doing the pad-with-undef it does today. The issue is the hardware treats this like a 64-bit value anyway, so we can model it as a 64-bit input. It would require doing something like: And using that instead of VGPR_32 for the special case operands in the GWS instruction definitions. Then we'd need separate patterns for the FeatureRequiresAlignedVGPRs and !FeatureRequiresAlignedVGPRs cases
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Due to the way that subregisters are handled when folding the source of a copy through REG_SEQUENCE, we only ever try to fold if the subregister operand in the REG_SEQUENCE exactly matches the subregister of the use of the REG_SEQUENCE. To make sure I understand, are the high 32-bits of the 64-bit read required to be undef, or are they just discarded? If it is required, then I am not sure if we should be folding into these instructions anyway. If it is not, then I think we would never fold the 32-bit operand into the 64-bit operand (maybe with special casing the REG_SEQUENCE recursion for these instructions, along with special casing on the Register Class constraining).
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the GWS cases, the high 32-bits are just discarded. The requirement is just the register used is even aligned, which currently we can only encode by using the 64-bit aligned register classes. This case is far removed from something this pass should have to worry about. If the instruction were properly defined, you wouldn't have to worry about it. Ideally instructions accurately express their constraints and don't require special casing anywhere else
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes sense to me that this isn't something core for the pass to worry about, I just don't understand how RegClassByHwMode solves this issue. I did add a FIXME comment. There are two pieces that we would need to fix it: RegClassByHwMode for these specific instructions, having access to a 32-bit aligned register class. Without having a 32-bit aligned register class, I don't think we would ever actually perform the folding without some special casing anyway. As you mentioned, this PR is not for addressing this underlying issue. Do you think this needs to be addressed before this PR can be accepted? Additionally, I borrowed this code from
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean like this: #169372
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same bug will exist for images with 32-bit addresses but the same thing should be applied |
||
| unsigned Opc = UseMI->getOpcode(); | ||
| // Special case for DS_GWS instructions that only use 32 bits but hardware | ||
| // treats it as a 64 bit read. | ||
| if (Opc == AMDGPU::DS_GWS_INIT || Opc == AMDGPU::DS_GWS_SEMA_BR || | ||
| Opc == AMDGPU::DS_GWS_BARRIER) { | ||
| const TargetRegisterClass *RC = | ||
| TRI->getRegClassForReg(*MRI, OpToFold.getReg()); | ||
| assert(RC); | ||
|
|
||
| const auto isAlignedReg = [&OpToFold, &UseOp, &UseMI, &RC, | ||
| this](AMDGPU::OpName OpName) -> bool { | ||
| const MachineOperand *Op = TII->getNamedOperand(*UseMI, OpName); | ||
| if (Op != UseOp) | ||
| return true; | ||
| Register Reg = OpToFold.getReg(); | ||
| assert(!Reg.isPhysical()); | ||
| return TRI->getRegSizeInBits(*RC) > 32 && | ||
| !(TRI->getChannelFromSubReg(OpToFold.getSubReg()) & 1) && | ||
| TRI->isProperlyAlignedRC(*RC); | ||
| }; | ||
|
|
||
| if (!isAlignedReg(AMDGPU::OpName::data0)) | ||
| return; | ||
| } | ||
| } | ||
|
|
||
| // FIXME: We could try to change the instruction from 64-bit to 32-bit | ||
| // to enable more folding opportunities. The shrink operands pass | ||
| // already does this. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@broxigarchen Is this previous piece of code necessary? Currently, and in #128929 when this
ifwas introduced, this branch is not taken in any of the tests. None of the tests fold into a lo16.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is to handle case like:
fold to
since inst_true16 can take both vgpr16 and sreg32 for 16bit operands.
However I run a quick test and seems the this case is not added to the FoldCandidate and thus not triggered. Maybe we should fix this? @Sisyph to help commenting on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have thought the line was exercised when added. Maybe the calling code flow changed some assumed precondition. I observed you can add
assert(Old.getSubReg() == AMDGPU::NoSubRegister)at this line. Is it expected we never have subregisters on Old in this function?I don't think the new version of the code in this PR is correct if Old did have a subregister (not sure about the old version).