[SystemZ] Eliminate call sequence instructions early. #77812
Do we need to support the inline-asm alignstack attribute? In that case we need to do a similar handling of INLINEASM[_BR] instructions to set the adjustsStack flag. My understanding is however that the stack is always aligned at 8 bytes and therefore this is not needed...? |
// remove these nodes. Given that these nodes start out as a glued sequence
// it seems best to remove them here after instruction selection and
// scheduling. NB: MIR testing does not work (yet) for call frames with
// this.
Can you elaborate what that problem with MIR testing is, and whether we need to fix it before landing this upstream?
wow - now that I checked it turns out that MIR actually does save this info as part of the .mir file. So this 'NB' should just be removed. Sorry - I thought I was in new territory here...
@@ -88,13 +88,16 @@ entry:
ret i64 %retval
}

; TODO: Unfortunately the lgdr is scheduled below the COPY from $r1d, causing
; an overlap and thus an extra copy.
Does the presence of these pseudos help with scheduling around calls in general? If yes, maybe we can leave them in, and just set the stack size argument always to zero?
There seems to be a slight help over SPEC:
2017_A_main/ 2017
Spill|Reload : 609379 609657 +278
Copies : 1017860 1018237 +377
That's a bit unfortunate as there is supposed to be handling for physreg copies in the MachineScheduler.
I did some experiments per your suggestion by keeping them but setting the value to zero. As they have the side-effects flag, they do affect other optimizers. If I remove those pseudos in ExpandPostRAPseudos, they live past the PEI pass and then the CFG optimizations are hindered (fewer duplicated return blocks, etc.).
I think it would be nice to take these out of the equation by simply not using them as they do not make much sense on our target. In theory it should actually be better to not have them around as they model false side-effects. If MachineScheduler could be fixed to not make more COPYs and if the AdjustsStack flag was set early in SelectionDAGISel, I think it would be nice to just remove them. Unfortunately we can't keep them "anonymously" without being mapped as frame instructions, as they do not get removed in time (unless we would insert a new "process" hook in PEI to do that).
So maybe for now we could just keep them as before and set them to 0. However, unfortunately that doesn't quite work either as PEI for some reason resets this value to 0 as it seems to want to recompute it even though it is set per isMaxCallFrameSizeComputed() ... :-/
I don't think this flag currently causes any special action on Z. (But it would be good to try a few test cases to verify that!) However, if and when that might change in the future, it wouldn't be good if the generic infrastructure around the flag is somehow broken ... As a more general question, would we be the first platform to not use the adjuststack pseudos, or is there precedent? If the latter, how are they handling this issue? |
My simple test case shows that this flag actually has an effect:
SelectionDAGISel (.cpp:674) actually checks for this attribute with MI.isStackAligningInlineAsm() and sets the HasCalls flag. So regardless of the AdjustsStack flag, the HasCalls flag causes the reg save area to be created in the prolog, with or without this patch. I wonder why AdjustsStack is not set here directly instead of searching for it in PEI... Doesn't HasCalls always imply AdjustsStack? The only place the AdjustsStack flag is used in SystemZFrameLowering is in isXPLeafCandidate(), but there HasCalls has already been checked, so it wouldn't matter in this case. PEI uses adjustsStack(), if the target returns false from targetHandlesStackFrameRounding(), to round up the stack frame size to meet the stack alignment. I wonder if we could return true here instead, as on ELF it seems SP is always aligned properly to 8 bytes, and in XPLINK there seems to be a rounding to 64 bytes which would cover the 32 byte stack alignment.
I saw that R600 is already doing this, and in PEI::calculateCallFrameInfo() there is already:
Not sure really what that GPU is doing, but it seems very different from X86/SystemZ. |
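The rounding argument in the comment above is plain arithmetic, and a small standalone sketch may make it easier to check. It assumes nothing about LLVM: `roundUpTo` and `xplinkCallFrameSize` are hypothetical names, with the second one modelled on the `std::max(64U, alignTo(..., 64))` expression that appears in the XPLINK frame lowering in this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round Size up to the next multiple of Align. Written out by hand so the
// sketch has no LLVM dependency; Align must be a power of two.
uint64_t roundUpTo(uint64_t Size, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 &&
         "Align must be a power of two");
  return (Size + Align - 1) & ~(Align - 1);
}

// Hypothetical stand-in for the XPLINK rule quoted in the diff: the call
// frame is at least 64 bytes and always a multiple of 64.
uint64_t xplinkCallFrameSize(uint64_t MaxCallFrameSize) {
  return std::max<uint64_t>(64, roundUpTo(MaxCallFrameSize, 64));
}
```

Any multiple of 64 is also a multiple of 32, which is the sense in which a 64-byte rounding "covers" a 32-byte stack alignment. Whether that is sufficient for PEI to skip its own rounding is the open question in the comment, not something this sketch settles.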
Patch updated with a version that gives zero change on SPEC and in tests:
|
This seems like an extreme measure to fix the reported issue. We're just missing a helper function to preserve the frame size attributes during a block size split? Seems easier to keep the call frame handling in line with other targets instead of doing something different |
I agree it would solve the current issue fairly well to add a helper function in MachineBasicBlock that would do this. But on the other hand, I wonder if it's primarily only x86 that needs the CALLSEQ pseudos around to do SP adjustments around each call. If that's the case, maybe the default should actually be to compute the MaxCallFrameSize during instruction selection? One step in that direction, then, would be to eliminate them during finalize-isel for SystemZ. I have myself found the frame lowering to be quite complex to work with, and it doesn't help to have those pseudos around. It's confusing that they are emitted and then not used until PEI to compute this value by looking at all the calls, when in fact that is something already known during isel. And adding to this by worrying about preserving this value across MBBs is making things even worse... So this is not only to fix the issue, but also a nice cleanup of the SystemZ backend, I think. Does this make sense?
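For reference, the alternative being discussed (a helper that preserves the call frame size when a block is split) can be sketched on a toy model. Everything below is hypothetical and is not the LLVM API: `Block`, `Instr`, `callFrameSizeAt` and `splitAt` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op { CallSeqStart, CallSeqEnd, Other };

struct Instr {
  Op Opcode;
  uint32_t Bytes; // Only meaningful for CallSeqStart.
};

struct Block {
  uint32_t EntryCallFrameSize = 0; // Call frame size in effect on entry.
  std::vector<Instr> Instrs;
};

// Call frame size in effect just before Instrs[Index].
uint32_t callFrameSizeAt(const Block &B, std::size_t Index) {
  assert(Index <= B.Instrs.size() && "Index out of range");
  uint32_t Size = B.EntryCallFrameSize;
  for (std::size_t I = 0; I < Index; ++I) {
    if (B.Instrs[I].Opcode == Op::CallSeqStart)
      Size = B.Instrs[I].Bytes;
    else if (B.Instrs[I].Opcode == Op::CallSeqEnd)
      Size = 0;
  }
  return Size;
}

// Split B before Instrs[Index]. The tail moves to the returned block, which
// records the call frame size that was in effect at the split point.
Block splitAt(Block &B, std::size_t Index) {
  Block Tail;
  Tail.EntryCallFrameSize = callFrameSizeAt(B, Index);
  auto First = B.Instrs.begin() + static_cast<std::ptrdiff_t>(Index);
  Tail.Instrs.assign(First, B.Instrs.end());
  B.Instrs.erase(First, B.Instrs.end());
  return Tail;
}
```

The bookkeeping in `splitAt` is exactly what every block-splitting transformation would have to remember to do. If the pseudos are gone before such transformations run, there is nothing left to keep in sync.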
Is it only x86? That would be useful to know
I think so, but it would help to better understand exactly if/why this is the way it is only for x86 |
I don't know if this is relevant, but: the reason I introduced |
I've had a bit of a closer look, and it turns out it is more targets than just x86. In general, the callseq pseudos seem to have two main functions:
This second aspect is only used on some platforms, however. This depends on how the stack space needed to hold outgoing function arguments is allocated: on some platforms, this happens during the call sequence (e.g. via "push" instructions on x86), on others (like SystemZ), the function prolog always allocates enough space for all calls in the function ahead of time, and on yet others this decision is made on a per-function basis (e.g. depending on whether the function also contains dynamic stack allocation). Platforms where The default definition of All other platforms either always reserve the call frame in the prolog, or else always use an FP whenever the call frame is not reserved in the prolog. In both cases, frame index elimination does not require the call frame size.
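The last point, that frame index elimination only needs the current call frame size when the call frame is not reserved in the prolog and locals are addressed relative to SP, can be written down as a small predicate. This is a sketch under that reading of the comment; `FramePolicy`, `needsCallFrameSize` and `spRelativeOffset` are hypothetical names, not LLVM interfaces.

```cpp
#include <cassert>
#include <cstdint>

struct FramePolicy {
  bool ReservedCallFrame; // Prolog allocates the outgoing-argument area.
  bool UsesFramePointer;  // Locals are addressed via FP rather than SP.
};

// The call frame size matters for frame index elimination only if SP moves
// around calls and locals are SP-relative.
bool needsCallFrameSize(FramePolicy P) {
  return !P.ReservedCallFrame && !P.UsesFramePointer;
}

// SP-relative offset of a local that sits OffsetFromPrologSP bytes above the
// SP value established by the prolog, at a point where CurrentCallFrameSize
// bytes of outgoing arguments may have been pushed (stack grows downward).
int64_t spRelativeOffset(bool ReservedCallFrame, int64_t OffsetFromPrologSP,
                         int64_t CurrentCallFrameSize) {
  assert(CurrentCallFrameSize >= 0 && "call frame size cannot be negative");
  if (ReservedCallFrame)
    return OffsetFromPrologSP; // SP does not move around calls.
  return OffsetFromPrologSP + CurrentCallFrameSize;
}
```

With a reserved call frame, as on SystemZ, the second function never looks at the call frame size, which is why the per-call values carried by the pseudos are not needed after the maximum has been recorded.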
@llvm/pr-subscribers-backend-systemz
Author: Jonas Paulsson (JonPsson1)
Changes
On SystemZ, the outgoing argument area which is big enough for all calls in the function is created once during the prolog, as opposed to adjusting the stack around each call. The call-sequence instructions are therefore not really useful any more than to compute the maximum call frame size, which has so far been done by PEI, but can just as well be done at an earlier point.
This patch removes the mapping of the CallFrameSetupOpcode and CallFrameDestroyOpcode and instead computes the MaxCallFrameSize directly after instruction selection and then removes the ADJCALLSTACK pseudos. This removes the confusing pseudos and also avoids the problem of having to keep the call frame size accurate when creating new MBBs.
This fixes #76618 which exposed the need to maintain the call frame size when splitting blocks (which was not done).
Full diff: https://github.com/llvm/llvm-project/pull/77812.diff
6 Files Affected:
diff --git a/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp b/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
index 80c994a32ea96a..dc7e6589b48af2 100644
--- a/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
@@ -66,22 +66,6 @@ SystemZFrameLowering::create(const SystemZSubtarget &STI) {
return std::make_unique<SystemZELFFrameLowering>();
}
-MachineBasicBlock::iterator SystemZFrameLowering::eliminateCallFramePseudoInstr(
- MachineFunction &MF, MachineBasicBlock &MBB,
- MachineBasicBlock::iterator MI) const {
- switch (MI->getOpcode()) {
- case SystemZ::ADJCALLSTACKDOWN:
- case SystemZ::ADJCALLSTACKUP:
- assert(hasReservedCallFrame(MF) &&
- "ADJSTACKDOWN and ADJSTACKUP should be no-ops");
- return MBB.erase(MI);
- break;
-
- default:
- llvm_unreachable("Unexpected call frame instruction");
- }
-}
-
namespace {
struct SZFrameSortingObj {
bool IsValid = false; // True if we care about this Object.
@@ -439,6 +423,16 @@ bool SystemZELFFrameLowering::restoreCalleeSavedRegisters(
return true;
}
+static void removeCallSeqPseudos(MachineFunction &MF) {
+ // TODO: These could have been removed in finalize isel already as they are
+ // not mapped as frame instructions. See comment in emitAdjCallStack().
+ for (auto &MBB : MF)
+ for (MachineInstr &MI : llvm::make_early_inc_range(MBB))
+ if (MI.getOpcode() == SystemZ::ADJCALLSTACKDOWN ||
+ MI.getOpcode() == SystemZ::ADJCALLSTACKUP)
+ MI.eraseFromParent();
+}
+
void SystemZELFFrameLowering::processFunctionBeforeFrameFinalized(
MachineFunction &MF, RegScavenger *RS) const {
MachineFrameInfo &MFFrame = MF.getFrameInfo();
@@ -480,6 +474,8 @@ void SystemZELFFrameLowering::processFunctionBeforeFrameFinalized(
ZFI->getRestoreGPRRegs().LowGPR != SystemZ::R6D)
for (auto &MO : MRI->use_nodbg_operands(SystemZ::R6D))
MO.setIsKill(false);
+
+ removeCallSeqPseudos(MF);
}
// Emit instructions before MBBI (in MBB) to add NumBytes to Reg.
@@ -1471,6 +1467,8 @@ void SystemZXPLINKFrameLowering::processFunctionBeforeFrameFinalized(
// with existing compilers.
MFFrame.setMaxCallFrameSize(
std::max(64U, (unsigned)alignTo(MFFrame.getMaxCallFrameSize(), 64)));
+
+ removeCallSeqPseudos(MF);
}
// Determines the size of the frame, and creates the deferred spill objects.
diff --git a/llvm/lib/Target/SystemZ/SystemZFrameLowering.h b/llvm/lib/Target/SystemZ/SystemZFrameLowering.h
index 95f30e3c0d99c8..03ce8882c4de5d 100644
--- a/llvm/lib/Target/SystemZ/SystemZFrameLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZFrameLowering.h
@@ -41,9 +41,6 @@ class SystemZFrameLowering : public TargetFrameLowering {
}
bool hasReservedCallFrame(const MachineFunction &MF) const override;
- MachineBasicBlock::iterator
- eliminateCallFramePseudoInstr(MachineFunction &MF, MachineBasicBlock &MBB,
- MachineBasicBlock::iterator MI) const override;
};
class SystemZELFFrameLowering : public SystemZFrameLowering {
diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp b/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
index da4bcd7f0c66ed..196903fa4d3202 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
@@ -8197,6 +8197,30 @@ static void createPHIsForSelects(SmallVector<MachineInstr*, 8> &Selects,
MF->getProperties().reset(MachineFunctionProperties::Property::NoPHIs);
}
+MachineBasicBlock *
+SystemZTargetLowering::emitAdjCallStack(MachineInstr &MI,
+ MachineBasicBlock *BB) const {
+ MachineFunction &MF = *BB->getParent();
+ MachineFrameInfo &MFI = MF.getFrameInfo();
+ auto *TFL = Subtarget.getFrameLowering<SystemZFrameLowering>();
+ assert(TFL->hasReservedCallFrame(MF) &&
+ "ADJSTACKDOWN and ADJSTACKUP should be no-ops");
+ // Get the MaxCallFrameSize value and clear the NumBytes value to not
+ // confuse the verifier. Keep them around as scheduling barriers around
+ // call arguments even though they serve no further purpose as the call
+ // frame is statically reserved in the prolog.
+ uint32_t NumBytes = MI.getOperand(0).getImm();
+ if (NumBytes > MFI.getMaxCallFrameSize())
+ MFI.setMaxCallFrameSize(NumBytes);
+ // Set AdjustsStack as this is *not* mapped as a frame instruction.
+ MFI.setAdjustsStack(true);
+
+ // TODO: Fix machine scheduler and erase MI instead?
+ MI.getOperand(0).setImm(0);
+
+ return BB;
+}
+
// Implement EmitInstrWithCustomInserter for pseudo Select* instruction MI.
MachineBasicBlock *
SystemZTargetLowering::emitSelect(MachineInstr &MI,
@@ -9400,6 +9424,10 @@ getBackchainAddress(SDValue SP, SelectionDAG &DAG) const {
MachineBasicBlock *SystemZTargetLowering::EmitInstrWithCustomInserter(
MachineInstr &MI, MachineBasicBlock *MBB) const {
switch (MI.getOpcode()) {
+ case SystemZ::ADJCALLSTACKDOWN:
+ case SystemZ::ADJCALLSTACKUP:
+ return emitAdjCallStack(MI, MBB);
+
case SystemZ::Select32:
case SystemZ::Select64:
case SystemZ::Select128:
diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.h b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
index 4943c5cb703c33..7140287a886ccf 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
@@ -760,6 +760,8 @@ class SystemZTargetLowering : public TargetLowering {
MachineBasicBlock *Target) const;
// Implement EmitInstrWithCustomInserter for individual operation types.
+ MachineBasicBlock *emitAdjCallStack(MachineInstr &MI,
+ MachineBasicBlock *BB) const;
MachineBasicBlock *emitSelect(MachineInstr &MI, MachineBasicBlock *BB) const;
MachineBasicBlock *emitCondStore(MachineInstr &MI, MachineBasicBlock *BB,
unsigned StoreOpcode, unsigned STOCOpcode,
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 2a6dce863c28f1..950548abcfa92c 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -59,7 +59,7 @@ static uint64_t allOnes(unsigned int Count) {
void SystemZInstrInfo::anchor() {}
SystemZInstrInfo::SystemZInstrInfo(SystemZSubtarget &sti)
- : SystemZGenInstrInfo(SystemZ::ADJCALLSTACKDOWN, SystemZ::ADJCALLSTACKUP),
+ : SystemZGenInstrInfo(-1, -1),
RI(sti.getSpecialRegisters()->getReturnFunctionAddressRegister()),
STI(sti) {}
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.td b/llvm/lib/Target/SystemZ/SystemZInstrInfo.td
index 96ea65b6c3d881..7f3a143aad9709 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.td
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.td
@@ -13,9 +13,9 @@ def IsTargetELF : Predicate<"Subtarget->isTargetELF()">;
// Stack allocation
//===----------------------------------------------------------------------===//
-// The callseq_start node requires the hasSideEffects flag, even though these
-// instructions are noops on SystemZ.
-let hasNoSchedulingInfo = 1, hasSideEffects = 1 in {
+// These pseudos carry values needed to compute the MaxcallFrameSize of the
+// function. The callseq_start node requires the hasSideEffects flag.
+let usesCustomInserter = 1, hasNoSchedulingInfo = 1, hasSideEffects = 1 in {
def ADJCALLSTACKDOWN : Pseudo<(outs), (ins i64imm:$amt1, i64imm:$amt2),
[(callseq_start timm:$amt1, timm:$amt2)]>;
def ADJCALLSTACKUP : Pseudo<(outs), (ins i64imm:$amt1, i64imm:$amt2),
|
Patch rebased with some reworded comments, otherwise same as before: compute the MaxCallFrameSize during custom insertion (isel) and then set all values to 0. The pseudos are not mapped as frame instructions anymore, so they are only left as scheduling boundaries. Even though there seems to be a slight advantage for the phys-reg arguments (spilling/copies), perhaps that doesn't matter. If benchmarks show no actual difference, maybe they could just be removed during isel regardless. Alternatively, as stated in the comment, maybe the MIScheduler could be improved here. We could keep them as pseudos, but then PEI will recompute them and if they are 0 that will not work without first fixing PEI somehow. Unfortunately it seems like other targets want to recompute it for some reason, so I'm not sure how to handle this without adding yet another special case... It is however better to not have this, in the sense that they do not serve any purpose.
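As a standalone illustration of the two variants discussed here (record the maximum and zero the operands, versus record and then erase the pseudos), the following toy model may help. It is only loosely modelled on `emitAdjCallStack()` in the diff; the `Instr` and `FrameInfo` types and both function names are hypothetical and do not use any LLVM classes.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum class Opcode { AdjCallStackDown, AdjCallStackUp, Call, Other };

struct Instr {
  Opcode Op;
  uint32_t NumBytes; // Call frame size operand; unused for non-pseudos.
};

struct FrameInfo {
  uint32_t MaxCallFrameSize = 0;
  bool AdjustsStack = false;
};

static bool isCallSeqPseudo(const Instr &I) {
  return I.Op == Opcode::AdjCallStackDown || I.Op == Opcode::AdjCallStackUp;
}

// Variant 1: remember the largest call frame, then zero the operand so the
// pseudo survives only as a scheduling barrier.
void recordAndClearCallFrameSizes(std::vector<Instr> &Instrs, FrameInfo &FI) {
  for (Instr &I : Instrs) {
    if (!isCallSeqPseudo(I))
      continue;
    FI.MaxCallFrameSize = std::max(FI.MaxCallFrameSize, I.NumBytes);
    FI.AdjustsStack = true;
    I.NumBytes = 0;
  }
}

// Variant 2: once the maximum has been recorded, drop the pseudos entirely.
void eraseCallSeqPseudos(std::vector<Instr> &Instrs) {
  Instrs.erase(std::remove_if(Instrs.begin(), Instrs.end(), isCallSeqPseudo),
               Instrs.end());
}
```

In both variants the maximum is known as soon as the pseudos have been visited once, so nothing later in the pipeline has to rescan the calls to recover it.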
Do you have any numbers how benchmarks would be affected by just removing the instructions early?
Do we know what specifically could be improved? (I guess we'd have to identify some actual regression first ...) |
See above: #77812 (comment)
I can give it a try and see... |
Right - I meant do we know whether these few extra instructions actually cause any visible difference in the results anywhere?
Thanks! |
I rebuilt everything, and to my surprise I now saw different numbers than before: main/nfc patch <-> patch but erasing MI
So at this moment it looks like that small variation can go either way depending on the shape of the compiler on the given day. I also ran spec. With all benchmarks, ("mini"):
That looked like a slight general regression. However, I also ran a few select benchmarks in parallel with the "full" run:
With these runs it was a general tie on those select benchmarks. Given that the spilling has varied a bit both ways, I don't see much point in looking at the MI scheduler. There is already a phys-reg heuristic there, so it should work well in general. All the benchmarks have not been run fully, but it looks to me that it should be ok to simply remove the MIs and trust that the mi-scheduler will handle it without the scheduling barriers. |
In SystemZ only. Try removing callseq instructions. was 741b28ae
OK, this looks really like more in the noise to me. In that case I agree we should prefer to keep the code as simple as possible, and just remove the MIs early. |
ok - patch updated to remove the MI pseudo early. Some test changes where one was a bit of an example of where the mi-scheduler actually does mess things up a bit:
The LGDR was in the presence of ADJCALLSTACKDOWN below it, and is moved just above it by the mi-scheduler. For some reason, here without the ADJCALLSTACKDOWN, it is also moved above the COPY from $r1d, which is unfortunate as that causes an overlap of %3 which is to be COPY:ed to the same register. I don't think there is any "tracking" of which vreg ranges go to/from which physreg and with that avoid unnecessary overlaps like this. Maybe this is rare so it's not worth the effort? Probably not a quick thing to fix, but this is a good example of this problem. Maybe mark the test with a TODO if this is acceptable for now? |
Yes, I think this is really something that should be fixed in the register allocator (looks like a parallel-copy problem). But for now I think this patch is fine as is. Feel free to add a TODO to the test case, otherwise this LGTM.
The ABI uses a reserved call frame, which is statically reserved in the prologue. Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early. My solution here is to not insert them in the first place. The calculation of the reserved frame size is now done during call lowering.