-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[RISCV] Use C.ADD when OR is not compressible due to register restriction #156044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…tions c.or requires that all the operands be the gprc register class, but c.add does not. As a result, we can use c.add for disjoint or to allow additional compression. This patch does the transform extremely late (when converting to MCInst) so that we only emit an OR as an ADD if the difference actually reduces code size. I haven't touched the register allocator hint mechanism (yet), so this is only catching cases which naturally end up reusing one of the source registers. This is a (likely much better) alternative to llvm#155669. When I first wrote that, I hadn't reazlied that we already propoagate disjoint onto the RISCV::OR MachineInst Node. Note there is a small correctness risk with this change - if we forgot to drop the disoint flag somewhere this could cause miscompiles, and I don't think we have another use of the flag this late in the backend.
@llvm/pr-subscribers-backend-risc-v Author: Philip Reames (preames) Changesc.or requires that all the operands be the gprc register class, but c.add does not. As a result, we can use c.add for disjoint or to allow additional compression. This patch does the transform extremely late (when converting to MCInst) so that we only emit an OR as an ADD if the difference actually reduces code size. I haven't touched the register allocator hint mechanism (yet), so this is only catching cases which naturally end up reusing one of the source registers. This is a (likely much better) alternative to #155669. When I first wrote that, I hadn't realized that we already propagate disjoint onto the RISCV::OR MachineInst Node. Note there is a small correctness risk with this change - if we forgot to drop the disioint flag somewhere this could cause miscompiles, and I don't think we have another use of the flag this late in the backend. Full diff: https://github.com/llvm/llvm-project/pull/156044.diff 2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp b/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
index 83566b1c57782..f10d2a1ccba58 100644
--- a/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
+++ b/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
@@ -1178,7 +1178,21 @@ bool RISCVAsmPrinter::lowerToMCInst(const MachineInstr *MI, MCInst &OutMI) {
if (lowerRISCVVMachineInstrToMCInst(MI, OutMI, STI))
return false;
- OutMI.setOpcode(MI->getOpcode());
+ unsigned Opcode = MI->getOpcode();
+ // If we have a disjoint OR which isn't compressible as an c.or, we can
+ // convert it to a c.add which doesn't have the gprc register restriction.
+ if (STI->hasStdExtZca() && Opcode == RISCV::OR &&
+ MI->getFlag(MachineInstr::Disjoint)) {
+ Register Rd = MI->getOperand(0).getReg();
+ Register Rs1 = MI->getOperand(1).getReg();
+ Register Rs2 = MI->getOperand(2).getReg();
+ if ((Rd == Rs1 || Rd == Rs2) &&
+ !(RISCV::GPRCRegClass.contains(Rd) &&
+ RISCV::GPRCRegClass.contains(Rs1) &&
+ RISCV::GPRCRegClass.contains(Rs2)))
+ Opcode = RISCV::ADD;
+ }
+ OutMI.setOpcode(Opcode);
for (const MachineOperand &MO : MI->operands()) {
MCOperand MCOp;
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
index d9bb007a10f71..157ab0bd30d70 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll
@@ -1532,7 +1532,7 @@ define <16 x i8> @buildvec_v16i8_loads_contigous(ptr %p) {
; RVA22U64-NEXT: slli a4, a4, 24
; RVA22U64-NEXT: slli a5, a5, 32
; RVA22U64-NEXT: slli a1, a1, 40
-; RVA22U64-NEXT: or a6, a6, a2
+; RVA22U64-NEXT: add a6, a6, a2
; RVA22U64-NEXT: or t2, a4, a3
; RVA22U64-NEXT: or t1, a1, a5
; RVA22U64-NEXT: lbu a4, 8(a0)
@@ -1907,7 +1907,7 @@ define <16 x i8> @buildvec_v16i8_loads_gather(ptr %p) {
; RVA22U64-NEXT: slli a4, a4, 24
; RVA22U64-NEXT: slli a5, a5, 32
; RVA22U64-NEXT: slli a1, a1, 40
-; RVA22U64-NEXT: or a7, a7, a2
+; RVA22U64-NEXT: add a7, a7, a2
; RVA22U64-NEXT: or t3, a4, a3
; RVA22U64-NEXT: or t2, a1, a5
; RVA22U64-NEXT: lbu a4, 93(a0)
@@ -1918,13 +1918,13 @@ define <16 x i8> @buildvec_v16i8_loads_gather(ptr %p) {
; RVA22U64-NEXT: slli t0, t0, 56
; RVA22U64-NEXT: slli a4, a4, 8
; RVA22U64-NEXT: or a3, t0, a6
-; RVA22U64-NEXT: or a4, t1, a4
+; RVA22U64-NEXT: add a4, a4, t1
; RVA22U64-NEXT: lbu a5, 161(a0)
; RVA22U64-NEXT: lbu a1, 154(a0)
; RVA22U64-NEXT: lbu a0, 163(a0)
; RVA22U64-NEXT: slli t4, t4, 16
; RVA22U64-NEXT: slli a5, a5, 24
-; RVA22U64-NEXT: or a5, a5, t4
+; RVA22U64-NEXT: add a5, a5, t4
; RVA22U64-NEXT: slli a2, a2, 32
; RVA22U64-NEXT: slli a0, a0, 40
; RVA22U64-NEXT: or a0, a0, a2
@@ -3083,7 +3083,7 @@ define <8 x i8> @buildvec_v8i8_pack(i8 %e1, i8 %e2, i8 %e3, i8 %e4, i8 %e5, i8 %
; RVA22U64-NEXT: slli a2, a2, 16
; RVA22U64-NEXT: slli a3, a3, 24
; RVA22U64-NEXT: slli a1, a1, 8
-; RVA22U64-NEXT: or a5, a5, t0
+; RVA22U64-NEXT: add a5, a5, t0
; RVA22U64-NEXT: or a4, a7, a4
; RVA22U64-NEXT: or a2, a2, a3
; RVA22U64-NEXT: or a0, a0, a1
|
You can test this locally with the following command:git-clang-format --diff origin/main HEAD --extensions cpp -- llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
View the diff from clang-format here.diff --git a/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp b/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
index f10d2a1cc..f0e4631e5 100644
--- a/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
+++ b/llvm/lib/Target/RISCV/RISCVAsmPrinter.cpp
@@ -1186,10 +1186,9 @@ bool RISCVAsmPrinter::lowerToMCInst(const MachineInstr *MI, MCInst &OutMI) {
Register Rd = MI->getOperand(0).getReg();
Register Rs1 = MI->getOperand(1).getReg();
Register Rs2 = MI->getOperand(2).getReg();
- if ((Rd == Rs1 || Rd == Rs2) &&
- !(RISCV::GPRCRegClass.contains(Rd) &&
- RISCV::GPRCRegClass.contains(Rs1) &&
- RISCV::GPRCRegClass.contains(Rs2)))
+ if ((Rd == Rs1 || Rd == Rs2) && !(RISCV::GPRCRegClass.contains(Rd) &&
+ RISCV::GPRCRegClass.contains(Rs1) &&
+ RISCV::GPRCRegClass.contains(Rs2)))
Opcode = RISCV::ADD;
}
OutMI.setOpcode(Opcode);
|
unsigned Opcode = MI->getOpcode(); | ||
// If we have a disjoint OR which isn't compressible as an c.or, we can | ||
// convert it to a c.add which doesn't have the gprc register restriction. | ||
if (STI->hasStdExtZca() && Opcode == RISCV::OR && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy with the choice to do this here, honestly. This is mostly a rote function, and I don't think users would expect changes to happen in it.
Given you're not changing register allocation, can we do it in one of two places:
- A CompressPat, with
isCompressOnly=true
. I realise we might not be able to control the priority of the two patterns, but I think right now the system uses file order? - RISCVMakeCompressible - where we're sort-of expecting optimisations like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A CompressPat, with isCompressOnly=true. I realise we might not be able to control the priority of the two patterns, but I think right now the system uses file order?
I don't think we can check the flag in a CompressPat can we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, no, indeed we cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy with the choice to do this here, honestly. This is mostly a rote function, and I don't think users would expect changes to happen in it.
Understood, happy to rework if we see a better option.
Given you're not changing register allocation
Hold on. Not changing register allocation for now. I'm expecting to return to the register allocation in an upcoming patch; I was just staging work to make the diff more understandable. Does this change your take at all?
* A CompressPat, with `isCompressOnly=true`. I realise we might not be able to control the priority of the two patterns, but I think right now the system uses file order?
Craig addressed this idea.
* RISCVMakeCompressible - where we're sort-of expecting optimisations like this.
This only performs size optimizations, and the while this change does improve size, it's not gated on any of the Os/Oz flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not changing register allocation for now. I'm expecting to return to the register allocation in an upcoming patch; I was just staging work to make the diff more understandable. Does this change your take at all?
Not fundamentally - any changes to the register allocation would be in a completely different part of the codebase anyway, and would still need code in a later pass to turn or+disjoint into add.
[RISCVMakeCompressible] only performs size optimizations, and the while this change does improve size, it's not gated on any of the Os/Oz flags.
Sure, but this patch is still an optimisation about trying to make a specific pattern more amenable to future compression - your code is introducing add
with the knowledge that the compress patterns for add
cover more instances of the instruction vs the compress patterns for or
, which to my mind is essentially what RISCVMakeCompressible
is doing.
I don't entirely understand why this optimisation warrants a more special treatment than those in RISCVMakeCompressible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When looking at this code, noticed a cleanup of the existing code: #156482
Honestly, I don't really get your objection to this change at all so I suspect we're talking past each other. You'd previously said this routine was "mostly a rote function, and I don't think users would expect changes to happen in it". This routine already lowers all the V pseudo's and had special casing for the patchable function entry stuff (admittedly, that I'm now removing in the patch linked above). The sole caller has custom lowering for a bunch of MIs; this logic could be easily moved there. Is your objection that I'm adding an optimization into something otherwise purely functional?
I don't entirely understand why this optimisation warrants a more special treatment than those in RISCVMakeCompressible.
RISCVMakeCompressible is currently gated by MinSize. That's my entire objection. If you'd rather that pass had a section which ran unconditionally, and a second which ran only if MinSize, I'm happy to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your objection that I'm adding an optimization into something otherwise purely functional?
Broadly, yes. All the MIs with special handling in that function are Pseudos, not well-formed executable/encodable instructions. I don't think someone debugging codegen would expect that function to modify instructions in this way, but maybe it would be obvious with a -print-after-all
or similar invocation.
RISCVMakeCompressible is currently gated by MinSize. That's my entire objection.
Understood on the MinSize vs OptSize vs no-gating on MakeCompressible. We want to run that pass at OptSize as well as MinSize (we have an internal patch for this that should go up in the next day or so). A quick look at https://reviews.llvm.org/D92105 shows that there was some debate about Oz vs Os and you suggested the pass would be good even at O2. I'm not sure these decisions have been revisited since then with the community.
I'm about to be out for a week, so:
- I'm not going to block this patch.
- I'm in favour of revisiting when RISCVMakeCompressible runs.
- After that we can make a decision whether move this optimisation in there, or whether it's still better to be in this location.
I hope this is reasonable for a fix-forward approach, which should unblock any register allocation improvements you might want to do.
Just as a litmus test, I applied a modified version of this patch that unconditionally selects Obviously this isn't a proof of the absence of any lurking issues, but at least we know that the obvious litmus test is fine. |
I spoke with Philip earlier. We believe the transforms done by RISCVOptWInstrs invalidate the Disjoint flag, but don't clear the flag. For example, we can no longer guarantee the upper bits are disjoint if we optimize based on demanded bits. I'm not sure the intended transform in this patch is violated by RISCVOptWInstrs I just know the flag itself does not accurately describe the upper bits as it should. This makes me uncomfortable using the flag. |
c.or requires that all the operands be the gprc register class, but c.add does not. As a result, we can use c.add for disjoint or to allow additional compression.
This patch does the transform extremely late (when converting to MCInst) so that we only emit an OR as an ADD if the difference actually reduces code size.
I haven't touched the register allocator hint mechanism (yet), so this is only catching cases which naturally end up reusing one of the source registers.
This is a (likely much better) alternative to #155669. When I first wrote that, I hadn't realized that we already propagate disjoint onto the RISCV::OR MachineInst Node.
Note there is a small correctness risk with this change - if we forgot to drop the disioint flag somewhere this could cause miscompiles, and I don't think we have another use of the flag this late in the backend.