@dianqk dianqk commented Sep 14, 2025

Fixes #115862.

https://llvm.org/docs/Passes.html#licm-loop-invariant-code-motion says:

Hoisting operations out of loops is a canonicalization transform. It enables and simplifies subsequent optimizations in the middle-end. Rematerialization of hoisted instructions to reduce register pressure is the responsibility of the back-end, which has more accurate information about register pressure and also handles other optimizations than LICM that increase live-ranges.

I agree with this, but I cannot find any back-end pass that does it. MachineSink is what I'm looking for, but it's not enabled by default. After #117247, I don't think MachineSink is suitable for rematerializing these instructions. I also don't really understand the code magic there.

The first commit rematerializes all instructions before Machine LICM. This is easy for me to understand; we only need to improve Machine LICM if we find a problem. The compile-time impact is shown at https://llvm-compile-time-tracker.com/compare.php?from=a4993a27fb005c2c65e065e9d7703533f4d26bd2&to=8cb8bb00cce43634330626e5224cefb46696919c&stat=instructions:u.

The second commit is my attempt to put this into MachineSink instead; its compile-time impact is shown at https://llvm-compile-time-tracker.com/compare.php?from=a4993a27fb005c2c65e065e9d7703533f4d26bd2&to=fa91ca83e6fee687eae647d55a30190667a45954&stat=instructions%3Au.

I haven't added any test cases yet because I'd like to hear some ideas on this first.

With the first commit, the new result for the reduced example that I added in #115862 (comment) is:

+ clang -dumpversion
21.1.0
+ gcc -dumpversion
14.3.0
+ clang -O3 main.c
+ perf stat -e instructions:u ./a.out

 Performance counter stats for './a.out':

     2,550,317,381      instructions:u

       0.186548993 seconds time elapsed

       0.185919000 seconds user
       0.000000000 seconds sys


+ clang-dev -O3 main.c
+ perf stat -e instructions:u ./a.out

 Performance counter stats for './a.out':

       528,662,772      instructions:u

       0.019543261 seconds time elapsed

       0.018484000 seconds user
       0.001030000 seconds sys


+ gcc -O3 main.c
+ perf stat -e instructions:u ./a.out

 Performance counter stats for './a.out':

       453,161,286      instructions:u

       0.014953744 seconds time elapsed

       0.013878000 seconds user
       0.001064000 seconds sys
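
To put the three instruction counts side by side, the ratios can be worked out directly. This is only an arithmetic sketch: the constant names and the `ratio` helper are made up for illustration, and the numbers are copied from the `perf stat` output above.

```cpp
#include <cassert>
#include <cstdint>

// Instruction counts copied from the perf output above.
constexpr std::uint64_t kClang21  = 2550317381ULL; // clang 21.1.0
constexpr std::uint64_t kClangDev = 528662772ULL;  // clang-dev (first commit)
constexpr std::uint64_t kGcc      = 453161286ULL;  // gcc 14.3.0

// Hypothetical helper: how many times larger `a` is than `b`.
inline double ratio(std::uint64_t a, std::uint64_t b) {
  assert(b != 0 && "cannot divide by zero");
  return static_cast<double>(a) / static_cast<double>(b);
}
```

`ratio(kClang21, kClangDev)` comes out at roughly 4.8 and `ratio(kClangDev, kGcc)` at roughly 1.17, so on this example the patched build executes about 4.8 times fewer instructions than clang 21.1.0 and about 17% more than gcc.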

@dianqk dianqk force-pushed the licm-rematerialization branch from fa91ca8 to 9ffa46a Compare September 14, 2025 13:37
@nikic nikic left a comment

Thanks for looking into this.

The terminology used here is a bit confusing. "Rematerialization" in this context usually implies that a copy of the instruction is generated in the loop. This is something regalloc can do to avoid spills. As far as I can tell, this is not what you are doing here -- this is plain sinking, not rematerialization.

I'm not particularly familiar with these transforms, so I don't have much to say here. I'm not sure whether this approach of first sinking everything and then trying to hoist again makes sense -- it seems like this would likely end up overshooting in the other direction and end up moving too many calculations into the loop. It's hard to say without seeing how this affects codegen in practice.

@nikic nikic requested a review from preames September 14, 2025 16:13
@sharkautarch

@dianqk it looks like the remaining hoisted instructions in the test C file from #115862, which MachineSink currently fails to sink with -mllvm -sink-insts-to-avoid-spills=1, are skipped because of this line:

if (I.getNumDefs() > 1)

I inferred this from the debug output for some of the hoisted instructions:

CycleSink: Analysing candidate: %239:gr32 = nsw ADD32ri %234:gr32(tied-def 0), 2, implicit-def dead $eflags
CycleSink: Instruction added as candidate.
CycleSink: Analysing candidate: %240:gr32 = nsw ADD32ri %235:gr32(tied-def 0), 2, implicit-def dead $eflags
CycleSink: Instruction added as candidate.
...
CycleSink: Analysing candidate: %0:gr64_nosp = MOVSX64rr32 %239:gr32
CycleSink: Instruction added as candidate.
CycleSink: Analysing candidate: %1:gr64_nosp = MOVSX64rr32 %240:gr32
CycleSink: Instruction added as candidate.
...
AggressiveCycleSink: Finding sink block for: %1:gr64_nosp = MOVSX64rr32 %240:gr32
AggressiveCycleSink:   Analysing use: 0x55d4e30cee10AggressiveCycleSink: Sinking instruction to block: %bb.8
AggressiveCycleSink: Finding sink block for: %0:gr64_nosp = MOVSX64rr32 %239:gr32
AggressiveCycleSink:   Analysing use: 0x55d4e30cecc8AggressiveCycleSink: Sinking instruction to block: %bb.8

Portion of the Machine IR showing said hoisted instructions (printed by compiling w/ -mllvm --print-changed=diff-quiet -mllvm --filter-print-funcs=loop) in function loop:

*** IR Dump After Machine code sinking (machine-sink) on loop ***
 # Machine code for function loop: IsSSA, TracksLiveness
 Function Live Ins: $edi in %232, $rsi in %233, $edx in %234, $ecx in %235, $r8d in %236, $r9d in %237
 
 bb.0.entry:
   successors: %bb.1(0x80000000); %bb.1(100.00%)
   liveins: $edi, $rsi, $edx, $ecx, $r8d, $r9d
   %237:gr32 = COPY $r9d
   %236:gr32 = COPY $r8d
   %235:gr32 = COPY $ecx
   %234:gr32 = COPY $edx
   %233:gr64 = COPY $rsi
   %232:gr32 = COPY $edi
   %239:gr32 = nsw ADD32ri %234:gr32(tied-def 0), 2, implicit-def dead $eflags
   %240:gr32 = nsw ADD32ri %235:gr32(tied-def 0), 2, implicit-def dead $eflags
-  %0:gr64_nosp = MOVSX64rr32 %239:gr32
-  %1:gr64_nosp = MOVSX64rr32 %240:gr32

This indicates that AggressiveCycleSink does not sink the hoisted add instructions because they have more than one def, but it does sink the sign extensions of their results because those have only one def.
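
The effect of the def count can be illustrated with a small stand-alone model. This is a sketch only: `ToyOperand`, `ToyInstr`, `numDefs` and `rejectedByDefCount` are hypothetical names and are not LLVM's `MachineInstr` API. The two instructions are transcribed from the dump above, under the assumption that the dead `$eflags` def is counted like any other def.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a machine operand; this is NOT llvm::MachineOperand.
struct ToyOperand {
  std::string Reg;
  bool IsDef;
  bool IsImplicit;
  bool IsDead;
};

// Toy stand-in for a machine instruction.
struct ToyInstr {
  std::string Opcode;
  std::vector<ToyOperand> Operands;
};

// Counts every def: explicit or implicit, dead or not.
inline unsigned numDefs(const ToyInstr &I) {
  unsigned N = 0;
  for (const ToyOperand &Op : I.Operands) {
    assert((!Op.IsDead || Op.IsDef) && "only defs are dead in this toy model");
    if (Op.IsDef)
      ++N;
  }
  return N;
}

// Same shape as the quoted check: more than one def means "skip".
inline bool rejectedByDefCount(const ToyInstr &I) { return numDefs(I) > 1; }

// %239:gr32 = nsw ADD32ri %234:gr32(tied-def 0), 2, implicit-def dead $eflags
inline ToyInstr makeAdd32ri() {
  return {"ADD32ri",
          {{"%239", true, false, false},
           {"%234", false, false, false},
           {"$eflags", true, true, true}}};
}

// %0:gr64_nosp = MOVSX64rr32 %239:gr32
inline ToyInstr makeMovsx64rr32() {
  return {"MOVSX64rr32",
          {{"%0", true, false, false}, {"%239", false, false, false}}};
}
```

In this model `rejectedByDefCount(makeAdd32ri())` is true and `rejectedByDefCount(makeMovsx64rr32())` is false, which matches the split seen in the debug output.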

dianqk commented Sep 15, 2025

> Thanks for looking into this.
>
> The terminology used here is a bit confusing. "Rematerialization" in this context usually implies that a copy of the instruction is generated in the loop. This is something regalloc can do to avoid spills. As far as I can tell, this is not what you are doing here -- this is plain sinking, not rematerialization.
>
> I'm not particularly familiar with these transforms, so I don't have much to say here. I'm not sure whether this approach of first sinking everything and then trying to hoist again makes sense -- it seems like this would likely end up overshooting in the other direction and end up moving too many calculations into the loop. It's hard to say without seeing how this affects codegen in practice.

Thanks for your explanation. Rematerialization handles register spills, not the hoisted instructions.
I just realized I overlooked the fact that the hoisted instructions are not always executed, even in the loop body. This is probably an issue that needs to be addressed. I haven't checked the performance impact of these instructions yet.

return true;
}

bool llvm::mayLoadFromGOTOrConstantPool(MachineInstr &MI) {
Contributor

Why is this check so specific? Should it really be checking for invariant loads?

Member Author

Perhaps it's unnecessary. I'll check it later.

return true;
}

bool llvm::mayLoadFromGOTOrConstantPool(MachineInstr &MI) {
Contributor

Suggested change
-  bool llvm::mayLoadFromGOTOrConstantPool(MachineInstr &MI) {
+  bool llvm::mayLoadFromGOTOrConstantPool(const MachineInstr &MI) {

const. Also, could this just be a member function of MachineInstr directly?


for (MachineMemOperand *MemOp : MI.memoperands())
  if (const PseudoSourceValue *PSV = MemOp->getPseudoValue())
    if (PSV->isGOT() || PSV->isConstantPool())
Contributor

PSV->isConstant? Or MemOp->isInvariant? Not sure if we verify those are consistent

sharkautarch commented Sep 15, 2025

@dianqk
following up on my previous comment (#158479 (comment))
In your specific testcase from the aforementioned issue, it seems like those hoisted adds had two defs:

  • one explicit def: dst register
  • one implicit def: eflags register

It seems that in this case, the eflags def of said add insn isn't actually used by anything,
which makes me wonder if it'd be better to open a new PR that patches the
if (I.getNumDefs() > 1) check in the AggressiveCycleSink code so that, when I.getNumDefs() > 1, it would:

  • check whether there's only one explicit def and, if so, whether all implicit defs are just on $eflags (early exit if either condition is false)
  • treat instruction I as having only one def (meaning, proceed to sink instruction I) if the implicit def(s) is/are marked dead

That should allow -mllvm -sink-insts-to-avoid-spills=1 to handle most instructions hoisted by the middle-end LICM
(and it should still handle most selects, since most selects should be lowered to an x86 test or cmp insn followed by cmovCC, and the test and cmp insns should only have an eflags def).

I would imagine that the only otherwise sinkable insns that AggressiveCycleSink would still be unable to handle are arithmetic insns whose eflags def is not used by any JCC, but is used by one or more cmovCC/setCC/sbb/etc., likely in the same BB as the arithmetic insn.
I would reckon that the easiest case to handle is arithmetic insn + setCC, whereas cmovCC and sbb would additionally require considering the src def(s) of the cmovCC/sbb insn(s).

Either way, I definitely think it'd be best to first create a PR that just handles the simpler case of arithmetic insns with one explicit def and one $eflags def, whose $eflags def isn't used by any following insn.
EDIT: I just realized that the implicit $eflags defs of the aforementioned hoisted add insns are already marked as implicit-def dead $eflags,
and according to the MIR register flags documentation (https://llvm.org/docs/MIRLangRef.html#register-flags),
implicit-def dead means that the implicit def is marked as unused,
so improving the AggressiveCycleSink heuristic should be simpler than I originally thought.
EDIT 2: simple patch that should hopefully achieve the same result as @dianqk's draft PR:
main...sharkautarch:llvm-project:MIRAggrSink_ignoreDeadDefs
EDIT 3: seems like my above patch leads to an assert fail in one specific case:
ld.lld: /tmp/makepkg/llvm-git/src/llvm-project/llvm/lib/CodeGen/InlineSpiller.cpp:1012: bool (anonymous namespace)::InlineSpiller::foldMemoryOperand(ArrayRef<std::pair<MachineInstr *, unsigned int>>, MachineInstr *): Assertion 'MO->isDead() && "Cannot fold physreg def"' failed.

Seems to be caused by my patch making MachineSink sink a MOV32r0
@RKSimon any idea of how to check for MOV32r0 without explicitly checking for that specific opcode?
Hmm… my one idea would be that, since MOV32r0 is printed as %gr:<dst> = MOV32r0 implicit-def dead $eflags, which makes it seem like MOV32r0 doesn’t internally hold a reg use
maybe I can just avoid sinking MOV32r0 by not sinking any instructions that have more than one def and no reg uses…

EDIT 4: Not sinking multi-def insns that don't have any reg uses works to prevent the assert from happening
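
The heuristic described across these edits could look roughly like the following stand-alone sketch. It is a guess at the shape of the idea, not the contents of the linked branch: `ToyOperand`, `ToyInstr`, `numDefs`, `numLiveDefs`, `hasRegUse` and `isSinkCandidate` are all hypothetical names, and the model ignores everything except def and use flags.

```cpp
#include <cassert>
#include <vector>

// Toy operand; NOT llvm::MachineOperand. Immediates have IsReg == false.
struct ToyOperand {
  bool IsReg;
  bool IsDef;
  bool IsDead;
};

using ToyInstr = std::vector<ToyOperand>;

// All register defs, dead or not.
inline unsigned numDefs(const ToyInstr &I) {
  unsigned N = 0;
  for (const ToyOperand &Op : I)
    if (Op.IsReg && Op.IsDef)
      ++N;
  return N;
}

// Register defs that are not marked dead.
inline unsigned numLiveDefs(const ToyInstr &I) {
  unsigned N = 0;
  for (const ToyOperand &Op : I)
    if (Op.IsReg && Op.IsDef && !Op.IsDead)
      ++N;
  return N;
}

// True if the instruction reads at least one register.
inline bool hasRegUse(const ToyInstr &I) {
  for (const ToyOperand &Op : I)
    if (Op.IsReg && !Op.IsDef)
      return true;
  return false;
}

// Hypothetical candidate filter following EDIT through EDIT 4 above.
inline bool isSinkCandidate(const ToyInstr &I) {
  assert(numLiveDefs(I) <= numDefs(I) && "live defs are a subset of all defs");
  if (numLiveDefs(I) > 1)
    return false; // several live results: keep the existing rejection
  if (numDefs(I) > 1 && !hasRegUse(I))
    return false; // multi-def with no register use, the shape MOV32r0 has
  return true;    // e.g. one live def plus a dead $eflags def
}
```

Under these assumptions an instruction shaped like the hoisted `ADD32ri` (one live def, one register use, one dead def) is accepted, while one shaped like `MOV32r0` (one live def, one dead def, no register use) is still rejected.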

@jayfoad jayfoad self-requested a review September 19, 2025 10:20
if (!MI.isSafeToMove(DontMoveAcrossStore))
  return false;
// Dont sink GOT or constant pool loads.
if (MI.mayLoad() && !mayLoadFromGOTOrConstantPool(MI))
Contributor

Is this conditional reversed?

If it's a load, and it doesn't load from GOT or constant pool, return false.
Looking at the definition, this predicate definitely returns true for GOT/CP loads, so I think this is a logical error.

This was likely copy-pasted from MachineSink, because that's where I saw this first.
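
The two readings of the condition can be compared in a small truth-table sketch. `LoadKind` and the helper names are hypothetical and only model the boolean logic; whether the condition as written or the code comment reflects the intended behaviour is exactly the open question in this thread.

```cpp
#include <cassert>
#include <initializer_list>

// Toy classification of an instruction with respect to loads.
enum class LoadKind { NotALoad, GOTOrConstantPool, OtherMemory };

inline bool mayLoad(LoadKind K) { return K != LoadKind::NotALoad; }

inline bool loadsFromGOTOrConstantPool(LoadKind K) {
  return K == LoadKind::GOTOrConstantPool;
}

// The condition as it appears in the diff; true means "return false" (do not sink).
inline bool rejectedAsWritten(LoadKind K) {
  return mayLoad(K) && !loadsFromGOTOrConstantPool(K);
}

// The condition that the code comment describes.
inline bool rejectedPerComment(LoadKind K) {
  return mayLoad(K) && loadsFromGOTOrConstantPool(K);
}

// The two readings disagree on every actual load, so the direction matters.
inline void checkReadingsDisagreeOnLoads() {
  for (LoadKind K : {LoadKind::GOTOrConstantPool, LoadKind::OtherMemory})
    assert(rejectedAsWritten(K) != rejectedPerComment(K));
}
```

As written, the check rejects ordinary loads and lets GOT and constant-pool loads through; the comment describes the opposite. Non-loads pass under both readings.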

Member Author

Well, it does seem that way. I'll recheck this.

return false;
// Instruction not safe to move.
bool DontMoveAcrossStore = true;
if (!MI.isSafeToMove(DontMoveAcrossStore))
Contributor

Is there any way to make this less restrictive than always assuming an aliasing store? Some way to do analysis such that we can determine that moving across a store isn't an issue? In my personal case, which is a matrix multiplication loading from/storing to global arrays, this condition blocks sinking.

Pre-ISel, PRE/LICM is happy to hoist everything out and cause problems, but once we get here, we would be unable to sink things back in.

> Is there any way to make this less restrictive than always assuming an aliasing store? Some way to do analysis such that we can determine that moving across a store isn't an issue?

I think technically there is preexisting code that could be reused to do load/store aliasing checks...
See:

bool MachineSinking::hasStoreBetween(MachineBasicBlock *From,

Though IMO that'd probably be out of scope of the draft PR...

dianqk commented Sep 23, 2025

> @dianqk following up on my previous comment (#158479 (comment))
> In your specific testcase from the aforementioned issue, it seems like those hoisted adds had two defs:
>
> * one explicit def: dst register
> * one implicit def: eflags register
>
> ...
>
> It seems that in this case, the eflags def of said add insn isn't actually used by anything,
> which makes me wonder if it'd be better to open a new PR that patches the if (I.getNumDefs() > 1) check in the AggressiveCycleSink code so that, when I.getNumDefs() > 1, it would:
>
> * check whether there's only one _explicit_ def and, if so, whether all implicit defs are just on `$eflags` (early exit if either condition is false)
> * treat instruction `I` as having only one def (meaning, proceed to sink instruction `I`) if the implicit def(s) is/are marked dead
>
> That should allow -mllvm -sink-insts-to-avoid-spills=1 to handle most instructions hoisted by the middle-end LICM (and it should still handle most selects, since most selects should be lowered to an x86 test or cmp insn followed by cmovCC, and the test and cmp insns should only have an eflags def).
>
> I would imagine that the only otherwise sinkable insns that AggressiveCycleSink would still be unable to handle are arithmetic insns whose eflags def is not used by any JCC, but is used by one or more cmovCC/setCC/sbb/etc., likely in the same BB as the arithmetic insn. I would reckon that the easiest case to handle is arithmetic insn + setCC, whereas cmovCC and sbb would additionally require considering the src def(s) of the cmovCC/sbb insn(s).
>
> Either way, I definitely think it'd be best to first create a PR that just handles the simpler case of arithmetic insns with one explicit def and one $eflags def, whose $eflags def isn't used by any following insn.
> EDIT: I just realized that the implicit $eflags defs of the aforementioned hoisted add insns are already marked as implicit-def dead $eflags, and according to the MIR register flags documentation (https://llvm.org/docs/MIRLangRef.html#register-flags), implicit-def dead means that the implicit def is marked as unused, so improving the AggressiveCycleSink heuristic should be simpler than I originally thought.
> EDIT 2: simple patch that should hopefully achieve the same result as @dianqk's draft PR: main...sharkautarch:llvm-project:MIRAggrSink_ignoreDeadDefs
> EDIT 3: seems like my above patch leads to an assert fail in one specific case:
> ld.lld: /tmp/makepkg/llvm-git/src/llvm-project/llvm/lib/CodeGen/InlineSpiller.cpp:1012: bool (anonymous namespace)::InlineSpiller::foldMemoryOperand(ArrayRef<std::pair<MachineInstr *, unsigned int>>, MachineInstr *): Assertion 'MO->isDead() && "Cannot fold physreg def"' failed.
>
> Seems to be caused by my patch making MachineSink sink a MOV32r0
> @RKSimon any idea of how to check for MOV32r0 without explicitly checking for that specific opcode?
> Hmm… my one idea would be that, since MOV32r0 is printed as %gr:<dst> = MOV32r0 implicit-def dead $eflags, which makes it seem like MOV32r0 doesn’t internally hold a reg use
> maybe I can just avoid sinking MOV32r0 by not sinking any instructions that have more than one def and no reg uses…
>
> EDIT 4: Not sinking multi-def insns that don't have any reg uses works to prevent the assert from happening

It sounds fine, but SinkInstsIntoCycle is false by default. We need an approach that works with the default options.

preames commented Sep 23, 2025

Just to follow up on a point of confusion upthread. Rematerialization during register allocation (i.e. splitting and spilling) may duplicate instructions to their uses. This was said above, but there seemed to be a misunderstanding that the original instruction would survive. If all uses of the original instruction are rematerialized (not guaranteed if e.g. we split a live interval), then the original instruction would become dead, and should be deleted.

I'll note that the rematerialization heuristics are very delicate, and have lots of subtle interactions w/ flags such as CheapAsAMove.
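
The distinction between plain sinking and rematerialization drawn in this thread can be summarised with a deliberately tiny model. It is a sketch with hypothetical names (`Placement`, `sinkOnce`, `rematerialize`) that tracks only how many instances of an instruction exist afterwards; it says nothing about the heuristics mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// Where instances of one instruction end up after a transform.
struct Placement {
  std::size_t InstancesNearUses; // instances placed at or near the uses
  bool OriginalRemains;          // is the original definition still in place?
};

// Plain sinking: the single instruction is moved, never duplicated.
inline Placement sinkOnce() { return {1, false}; }

// Rematerialization: one copy per rematerialized use. The original becomes
// dead, and can be deleted, only once every use has received its own copy.
inline Placement rematerialize(std::size_t NumUses,
                               std::size_t NumRematerializedUses) {
  assert(NumRematerializedUses <= NumUses && "cannot rematerialize more uses than exist");
  return {NumRematerializedUses, NumRematerializedUses < NumUses};
}
```

For example, `rematerialize(3, 3)` leaves three copies and no original, while `rematerialize(3, 2)` leaves two copies and keeps the original alive for the remaining use, as can happen when a live interval is split.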

MachineBasicBlock *Preheader = Cycle->getCyclePreheader();
assert(Preheader && "Cycle sink needs a preheader block");
MachineBasicBlock *SinkBlock = nullptr;
const MachineOperand &MO = I.getOperand(0);
@sharkautarch sharkautarch Sep 23, 2025

Because this only checks the first def of MachineInstr I, and, unlike in aggressivelySinkIntoCycle(), candidates for rematerializeIntoCycle() with more than one non-dead def aren't rejected, there's probably a correctness issue here.

You should change this to either account for all non-dead defs of I, or simply return early if I has more than one non-dead def. For the latter approach, you can simply copy my code here: main...sharkautarch:llvm-project:MIRAggrSink_ignoreDeadDefs

Successfully merging this pull request may close these issues.

[MachineLICM][MachineSink] Sinking invariants into cycle