Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SystemZ] Eliminate call sequence instructions early. #77812

Merged
merged 9 commits into from
Mar 28, 2024

Conversation

JonPsson1
Copy link
Contributor

On SystemZ, the outgoing argument area which is big enough for all calls in the function is created once during the prolog, as opposed to adjusting the stack around each call. The call-sequence instructions are therefore not really useful any more than to compute the maximum call frame size, which has so far been done by PEI, but can just as well be done at an earlier point.

This patch removes the mapping of the CallFrameSetupOpcode and CallFrameDestroyOpcode and instead computes the MaxCallFrameSize directly after instruction selection and then removes the ADJCALLSTACK pseudos. This removes the confusing pseudos and also avoids the problem of having to keep the call frame size accurate when creating new MBBs.

This fixes #76618 which exposed the need to maintain the call frame size when splitting blocks (which was not done).

@JonPsson1
Copy link
Contributor Author

Do we need to support the inline-asm alignstack attribute? In that case we need to do a similar handling of INLINEASM[_BR] instructions to set the adjustsStack flag. My understanding is however that the stack is always aligned at 8 bytes and therefore this is not needed...?

// remove these nodes. Given that these nodes start out as a glued sequence
// it seems best to remove them here after instruction selection and
// scheduling. NB: MIR testing does not work (yet) for call frames with
// this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate what that problem with MIR testing is, and whether we need to fix it before landing this upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow - now that I checked it turns out that MIR actually does save this info as part of the .mir file. So this 'NB' should just be removed. Sorry - I thought I was in new territory here...

@@ -88,13 +88,16 @@ entry:
ret i64 %retval
}

; TODO: Unfortunately the lgdr is scheduled below the COPY from $r1d, causing
; an overlap and thus an extra copy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the presence of these pseudos help with scheduling around calls in general? If yes, maybe we can leave them in, and just set the stack size argument always to zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a slight help over SPEC:

                         2017_A_main/                   2017
Spill|Reload   :               609379               609657     +278
Copies         :              1017860              1018237     +377

That's a bit unfortunate as there are supposed to be handling for physreg copies in the machinescheduler.

I did some experiments per your suggestion by keeping them but setting the value to zero. As they have the side-effects flag, they do affect other optimizers. If I remove those pseudos in ExpandPostRAPseudos, they live past the PEI pass and then the CFG optimizations are hindred (less dublicated return blocks ets).

I think it would be nice to take these out of the equation by simply not using them as they do not make much sense on our target. In theory it should actually be better to not have them around as they model false side-effects. If MachineScheduler could be fixed to not make more COPYs and if the AdjustsStack flag was set early in SelectionDAGISel, I think it would be nice to just remove them. Unfortunately we can't keep them "anonymously" without being mapped as frame instructions, as they do not get removed in time (unless we would insert a new "process" hook in PEI to do that).

So maybe for now we could just keep them as before and set them to 0. However, unfortunately that doesn't quite work either as PEI for some reason resets this value to 0 as it seems to want to recompute it even though it is set per isMaxCallFrameSizeComputed() ... :-/

@uweigand
Copy link
Member

Do we need to support the inline-asm alignstack attribute? In that case we need to do a similar handling of INLINEASM[_BR] instructions to set the adjustsStack flag. My understanding is however that the stack is always aligned at 8 bytes and therefore this is not needed...?

I don't think this flag currently causes any special action on Z. (But it would be good to try a few test cases to verify that!) However, if and when that might change in the future, it wouldn't be good if the generic infrastructure around the flag is somehow broken ...

As a more general question, would we be the first platform to not use the adjuststack pseudos, or is there precedent? If the latter, how are they handling this issue?

@JonPsson1
Copy link
Contributor Author

Do we need to support the inline-asm alignstack attribute? In that case we need to do a similar handling of INLINEASM[_BR] instructions to set the adjustsStack flag. My understanding is however that the stack is always aligned at 8 bytes and therefore this is not needed...?

I don't think this flag currently causes any special action on Z. (But it would be good to try a few test cases to verify that!) However, if and when that might change in the future, it wouldn't be good if the generic infrastructure around the flag is somehow broken ...

My simple test case shows that this flag actually has an effect:

define i64 @f1() {
  %val = call i64 asm alignstack "blah $0 $1", "=r,a" (i8 1)
  ret i64 %val
}

SelectionDAGISel (.cpp:674) actually checks for this attribute with MI.isStackAligningInlineAsm() and sets the HasCalls flag. So regardless of the AdjustsStack flag, the HasCall flag causes the reg save area to be created in the prolog, with or w/out this patch.

I wonder why not the AdjustsStack is not set here directly instead of searching for it in PEI... Does not HasCalls always imply AdjustsStack?

The only place the AdjustsStack flag is used in SystemZFrameLowering is in isXPLeafCandidate(), but there HasCalls has been checked before already, so it wouldn't matter in this case.

PEI uses adjustsStack() if target returns false from targetHandlesStackFrameRounding() to round up the stack frame size to meet the stack alignment. I wonder if we could return true here instead as on ELF it seems SP is always aligned properly to 8 bytes, and in XPLINK there seems to be a rounding to 64 bytes which would cover the 32 byte stack alignment.

As a more general question, would we be the first platform to not use the adjuststack pseudos, or is there precedent? If the latter, how are they handling this issue?

I saw that R600 is already doing this, and in PEI::calculateCallFrameInfo() there is already:

  // Early exit for targets which have no call frame setup/destroy pseudo
  // instructions.
  if (FrameSetupOpcode == ~0u && FrameDestroyOpcode == ~0u)
    return;

Not sure really what that GPU is doing, but it seems very different from X86/SystemZ.

@JonPsson1
Copy link
Contributor Author

Patch updated with a version that gives zero change on SPEC and in tests:

  • remove the frame instruction mapping of these pseudos so that they are not recognized and thereby avoid the re-computation of MaxCallFrameSize in PEI.
  • Set them all to 0 instead of removing them during isel and instead remove them in processFunctionBeforeFrameFinalized(), during PEI.

@JonPsson1
Copy link
Contributor Author

JonPsson1 commented Jan 15, 2024

@jayfoad @topperc @MatzeB @arsenm Does this approach seem right for a target that never needs the call sequence instructions around the calls in the first place?

@arsenm
Copy link
Contributor

arsenm commented Jan 18, 2024

This seems like an extreme measure to fix the reported issue. We're just missing a helper function to preserve the frame size attributes during a block size split? Seems easier to keep the call frame handling in line with other targets instead of doing something different

@jayfoad
Copy link
Contributor

jayfoad commented Jan 18, 2024

@jayfoad @topperc @MatzeB @arsenm Does this approach seem right for a target that never needs the call sequence instructions around the calls in the first place?

No objection from me.

@JonPsson1
Copy link
Contributor Author

This seems like an extreme measure to fix the reported issue. We're just missing a helper function to preserve the frame size attributes during a block size split? Seems easier to keep the call frame handling in line with other targets instead of doing something different

I agree it would solve the current issue fairly well to add a helper function in MachineBasicBlock that would do this. But on the other hand, I wonder if it's only x86 primarily that needs the CALLSEQ pseudos around to do SP adjustments around each call. If that's the case, maybe the default should actually be to compute the MaxCallFrameSize during instruction selection? One step in that direction then, would be to eliminate them during finalize-isel for SystemZ.

I have found myself the Frame Lowering to be quite complex to work with and it doesn't help to have those pseudos around. It's confusing that they are emitted and then used not until PEI to compute this value looking at all the calls, when in fact that is something known already during isel. And adding to this by worrying about preserving this value across MBBs is making things even worse...

So this is not only to fix the issue, but also a nice cleanup of the SystemZ backend, I think. Does this make sense?

@arsenm
Copy link
Contributor

arsenm commented Jan 25, 2024

I agree it would solve the current issue fairly well to add a helper function in MachineBasicBlock that would do this. But on the other hand, I wonder if it's only x86 primarily that needs the CALLSEQ pseudos around to do SP adjustments around each call. If that's the case, maybe the default should actually be to compute the MaxCallFrameSize during instruction selection? One step in that direction then, would be to eliminate them during finalize-isel for SystemZ.

Is it only x86? That would be useful to know

So this is not only to fix the issue, but also a nice cleanup of the SystemZ backend, I think. Does this make sense?

I think so, but it would help to better understand exactly if/why this is the way it is only for x86

@jayfoad
Copy link
Contributor

jayfoad commented Jan 26, 2024

I agree it would solve the current issue fairly well to add a helper function in MachineBasicBlock that would do this. But on the other hand, I wonder if it's only x86 primarily that needs the CALLSEQ pseudos around to do SP adjustments around each call. If that's the case, maybe the default should actually be to compute the MaxCallFrameSize during instruction selection? One step in that direction then, would be to eliminate them during finalize-isel for SystemZ.

I don't know if this is relevant, but: the reason I introduced MachineBasicBlock::CallFrameSize was for ARM, which can create a call sequence with a non-zero frame size and then later split the block in the middle of the call sequence in ARMTargetLowering::EmitStructByval.

@uweigand
Copy link
Member

I think so, but it would help to better understand exactly if/why this is the way it is only for x86

I've had a bit of a closer look, and it turns out it is more targets than just x86. In general, the callseq pseudos seem to have to main functions:

  • They act as (pre-RA) scheduling barriers to attempt to keep call setup sequences closer together. This is done on all platforms; I'm not sure if this is required for correctness anywhere, but it does appear to have performance benefits on SystemZ at least.
  • They carry a "frame size" value, which is used on some targets to modify frame index elimination (i.e. the distance between SP and the frame varies from its default value while within a call sequence). Where required, this is of course needed for functional correctness.

This second aspect is only used on some platforms, however. This depends on how the stack space needed to hold outgoing function arguments is allocated: some some platforms, this happens during the call sequence (e.g. via "push" instructions on x86), in others (like SystemZ), the function prolog always allocates enough space for all calls in the function ahead of time, and on yet others this decision is made on a per-function basis (e.g. depending on whether the function also contains dynamic stack allocation).

Platforms where canSimplifyCallFramePseudos returns true do not need (or use) that frame size value. This is the case on SystemZ for all functions, in particular. However, even on those platforms the machine verifier requires that this value is maintained correctly whenever a basic block is split - although all those correctly maintained values will just be ignored in the end.

The default definition of canSimplifyCallFramePseudos is always true. However, that can change if the platform overrides either canSimplifyCallFramePseudos or hasReservedCallFrame. Looking at all platforms that do so, it seems the following platforms actually require the frame size: X86, AArch64/ARM, M86K, Mips, and WebAssembly.

All other platforms either always reserve the call frame in the prolog, or else always use an FP whenever the call frame is not reserved in the prolog. In both cases, frame index eliminiation does not require the call frame size.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 27, 2024

@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)

Changes

On SystemZ, the outgoing argument area which is big enough for all calls in the function is created once during the prolog, as opposed to adjusting the stack around each call. The call-sequence instructions are therefore not really useful any more than to compute the maximum call frame size, which has so far been done by PEI, but can just as well be done at an earlier point.

This patch removes the mapping of the CallFrameSetupOpcode and CallFrameDestroyOpcode and instead computes the MaxCallFrameSize directly after instruction selection and then removes the ADJCALLSTACK pseudos. This removes the confusing pseudos and also avoids the problem of having to keep the call frame size accurate when creating new MBBs.

This fixes #76618 which exposed the need to maintain the call frame size when splitting blocks (which was not done).


Full diff: https://github.com/llvm/llvm-project/pull/77812.diff

6 Files Affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp (+14-16)
  • (modified) llvm/lib/Target/SystemZ/SystemZFrameLowering.h (-3)
  • (modified) llvm/lib/Target/SystemZ/SystemZISelLowering.cpp (+28)
  • (modified) llvm/lib/Target/SystemZ/SystemZISelLowering.h (+2)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp (+1-1)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.td (+3-3)
diff --git a/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp b/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
index 80c994a32ea96a..dc7e6589b48af2 100644
--- a/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
@@ -66,22 +66,6 @@ SystemZFrameLowering::create(const SystemZSubtarget &STI) {
   return std::make_unique<SystemZELFFrameLowering>();
 }
 
-MachineBasicBlock::iterator SystemZFrameLowering::eliminateCallFramePseudoInstr(
-    MachineFunction &MF, MachineBasicBlock &MBB,
-    MachineBasicBlock::iterator MI) const {
-  switch (MI->getOpcode()) {
-  case SystemZ::ADJCALLSTACKDOWN:
-  case SystemZ::ADJCALLSTACKUP:
-    assert(hasReservedCallFrame(MF) &&
-           "ADJSTACKDOWN and ADJSTACKUP should be no-ops");
-    return MBB.erase(MI);
-    break;
-
-  default:
-    llvm_unreachable("Unexpected call frame instruction");
-  }
-}
-
 namespace {
 struct SZFrameSortingObj {
   bool IsValid = false;     // True if we care about this Object.
@@ -439,6 +423,16 @@ bool SystemZELFFrameLowering::restoreCalleeSavedRegisters(
   return true;
 }
 
+static void removeCallSeqPseudos(MachineFunction &MF) {
+  // TODO: These could have been removed in finalize isel already as they are
+  // not mapped as frame instructions. See comment in emitAdjCallStack().
+  for (auto &MBB : MF)
+    for (MachineInstr &MI : llvm::make_early_inc_range(MBB))
+      if (MI.getOpcode() == SystemZ::ADJCALLSTACKDOWN ||
+          MI.getOpcode() == SystemZ::ADJCALLSTACKUP)
+        MI.eraseFromParent();
+}
+
 void SystemZELFFrameLowering::processFunctionBeforeFrameFinalized(
     MachineFunction &MF, RegScavenger *RS) const {
   MachineFrameInfo &MFFrame = MF.getFrameInfo();
@@ -480,6 +474,8 @@ void SystemZELFFrameLowering::processFunctionBeforeFrameFinalized(
       ZFI->getRestoreGPRRegs().LowGPR != SystemZ::R6D)
     for (auto &MO : MRI->use_nodbg_operands(SystemZ::R6D))
       MO.setIsKill(false);
+
+  removeCallSeqPseudos(MF);
 }
 
 // Emit instructions before MBBI (in MBB) to add NumBytes to Reg.
@@ -1471,6 +1467,8 @@ void SystemZXPLINKFrameLowering::processFunctionBeforeFrameFinalized(
   // with existing compilers.
   MFFrame.setMaxCallFrameSize(
       std::max(64U, (unsigned)alignTo(MFFrame.getMaxCallFrameSize(), 64)));
+
+  removeCallSeqPseudos(MF);
 }
 
 // Determines the size of the frame, and creates the deferred spill objects.
diff --git a/llvm/lib/Target/SystemZ/SystemZFrameLowering.h b/llvm/lib/Target/SystemZ/SystemZFrameLowering.h
index 95f30e3c0d99c8..03ce8882c4de5d 100644
--- a/llvm/lib/Target/SystemZ/SystemZFrameLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZFrameLowering.h
@@ -41,9 +41,6 @@ class SystemZFrameLowering : public TargetFrameLowering {
   }
 
   bool hasReservedCallFrame(const MachineFunction &MF) const override;
-  MachineBasicBlock::iterator
-  eliminateCallFramePseudoInstr(MachineFunction &MF, MachineBasicBlock &MBB,
-                                MachineBasicBlock::iterator MI) const override;
 };
 
 class SystemZELFFrameLowering : public SystemZFrameLowering {
diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp b/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
index da4bcd7f0c66ed..196903fa4d3202 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
@@ -8197,6 +8197,30 @@ static void createPHIsForSelects(SmallVector<MachineInstr*, 8> &Selects,
   MF->getProperties().reset(MachineFunctionProperties::Property::NoPHIs);
 }
 
+MachineBasicBlock *
+SystemZTargetLowering::emitAdjCallStack(MachineInstr &MI,
+                                        MachineBasicBlock *BB) const {
+  MachineFunction &MF = *BB->getParent();
+  MachineFrameInfo &MFI = MF.getFrameInfo();
+  auto *TFL = Subtarget.getFrameLowering<SystemZFrameLowering>();
+  assert(TFL->hasReservedCallFrame(MF) &&
+         "ADJSTACKDOWN and ADJSTACKUP should be no-ops");
+  // Get the MaxCallFrameSize value and clear the NumBytes value to not
+  // confuse the verifier. Keep them around as scheduling barriers around
+  // call arguments even though they serve no further purpose as the call
+  // frame is statically reserved in the prolog.
+  uint32_t NumBytes = MI.getOperand(0).getImm();
+  if (NumBytes > MFI.getMaxCallFrameSize())
+    MFI.setMaxCallFrameSize(NumBytes);
+  // Set AdjustsStack as this is *not* mapped as a frame instruction.
+  MFI.setAdjustsStack(true);
+
+  // TODO: Fix machine scheduler and erase MI instead?
+  MI.getOperand(0).setImm(0);
+
+  return BB;
+}
+
 // Implement EmitInstrWithCustomInserter for pseudo Select* instruction MI.
 MachineBasicBlock *
 SystemZTargetLowering::emitSelect(MachineInstr &MI,
@@ -9400,6 +9424,10 @@ getBackchainAddress(SDValue SP, SelectionDAG &DAG) const {
 MachineBasicBlock *SystemZTargetLowering::EmitInstrWithCustomInserter(
     MachineInstr &MI, MachineBasicBlock *MBB) const {
   switch (MI.getOpcode()) {
+  case SystemZ::ADJCALLSTACKDOWN:
+  case SystemZ::ADJCALLSTACKUP:
+    return emitAdjCallStack(MI, MBB);
+
   case SystemZ::Select32:
   case SystemZ::Select64:
   case SystemZ::Select128:
diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.h b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
index 4943c5cb703c33..7140287a886ccf 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
@@ -760,6 +760,8 @@ class SystemZTargetLowering : public TargetLowering {
                                   MachineBasicBlock *Target) const;
 
   // Implement EmitInstrWithCustomInserter for individual operation types.
+  MachineBasicBlock *emitAdjCallStack(MachineInstr &MI,
+                                      MachineBasicBlock *BB) const;
   MachineBasicBlock *emitSelect(MachineInstr &MI, MachineBasicBlock *BB) const;
   MachineBasicBlock *emitCondStore(MachineInstr &MI, MachineBasicBlock *BB,
                                    unsigned StoreOpcode, unsigned STOCOpcode,
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 2a6dce863c28f1..950548abcfa92c 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -59,7 +59,7 @@ static uint64_t allOnes(unsigned int Count) {
 void SystemZInstrInfo::anchor() {}
 
 SystemZInstrInfo::SystemZInstrInfo(SystemZSubtarget &sti)
-    : SystemZGenInstrInfo(SystemZ::ADJCALLSTACKDOWN, SystemZ::ADJCALLSTACKUP),
+    : SystemZGenInstrInfo(-1, -1),
       RI(sti.getSpecialRegisters()->getReturnFunctionAddressRegister()),
       STI(sti) {}
 
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.td b/llvm/lib/Target/SystemZ/SystemZInstrInfo.td
index 96ea65b6c3d881..7f3a143aad9709 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.td
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.td
@@ -13,9 +13,9 @@ def IsTargetELF           : Predicate<"Subtarget->isTargetELF()">;
 // Stack allocation
 //===----------------------------------------------------------------------===//
 
-// The callseq_start node requires the hasSideEffects flag, even though these
-// instructions are noops on SystemZ.
-let hasNoSchedulingInfo = 1, hasSideEffects = 1 in {
+// These pseudos carry values needed to compute the MaxcallFrameSize of the
+// function.  The callseq_start node requires the hasSideEffects flag.
+let usesCustomInserter = 1, hasNoSchedulingInfo = 1, hasSideEffects = 1 in {
   def ADJCALLSTACKDOWN : Pseudo<(outs), (ins i64imm:$amt1, i64imm:$amt2),
                                 [(callseq_start timm:$amt1, timm:$amt2)]>;
   def ADJCALLSTACKUP   : Pseudo<(outs), (ins i64imm:$amt1, i64imm:$amt2),

@JonPsson1
Copy link
Contributor Author

Patch rebased with some reworded comments, otherwise same as before: Compute the MaxCallFrameSize during custom insertion (isel) and then set all values to 0. The pseudos are not mapped as frame instructions anymore, so they are only left as scheduling boundaries.

It may be that even though there seem to be some slight advantage for the phys-reg arguments (spilling/copys), perhaps that doesn't matter. If benchmarks show no actual difference, maybe they could just be removed during isel regardless. Alternatively, as stated in the comment, maybe the MIScheduler could be improved here.

We could keep them as pseudos, but then PEI will recompute them and if they are 0 that will not work without first fixing PEI somehow. Unfortunately it seems like other targets want to recompute it for some reason, so not sure how to handle without adding yet another special casing... It is however better to not have this in the sense that they do not serve any purpose.

@uweigand
Copy link
Member

It may be that even though there seem to be some slight advantage for the phys-reg arguments (spilling/copys), perhaps that doesn't matter. If benchmarks show no actual difference, maybe they could just be removed during isel regardless.

Do you have any numbers how benchmarks would be affected by just removing the instructions early?

Alternatively, as stated in the comment, maybe the MIScheduler could be improved here.

Do we know what specifically could be improved? (I guess we'd have to identify some actual regression first ...)

@JonPsson1
Copy link
Contributor Author

Do you have any numbers how benchmarks would be affected by just removing the instructions early?

See above: #77812 (comment)

Do we know what specifically could be improved? (I guess we'd have to identify some actual regression first ...)

I can give it a try and see...

@uweigand
Copy link
Member

Do you have any numbers how benchmarks would be affected by just removing the instructions early?

See above: #77812 (comment)

Right - I meant do we know whether these few extra instructions actually cause any visible difference in the results anywhere?

Do we know what specifically could be improved? (I guess we'd have to identify some actual regression first ...)

I can give it a try and see...

Thanks!

@JonPsson1
Copy link
Contributor Author

I rebuilt everything, and to my surprise I now saw different numbers than before:

main/nfc patch <-> patch but erasing MI

Spill|Reload   :               608814               608692     -122
Copies         :              1021002              1020968      -34

So at this moment it looks like that small variation can go either way depending on the shape of the compiler on the given day.

I also ran spec. With all benchmarks, ("mini"):

Improvements "eraseMI":
0.981: i541.leela_r 
0.990: f538.imagick_r 
0.996: i525.x264_r 

Regressions "eraseMI":
1.013: i505.mcf_r 
1.011: i523.xalancbmk_r 
1.010: i520.omnetpp_r 
1.009: i502.gcc_r 
1.007: f544.nab_r 
1.006: i557.xz_r 

That looked like a slight general regression. However, I also ran a few select benchmarks in parallel with the "full" run:

Improvements "eraseMI":
0.987: i523.xalancbmk_r 
0.991: f544.nab_r 
0.996: i541.leela_r 

Regressions "eraseMI":
1.013: i520.omnetpp_r 
1.006: i505.mcf_r 
1.004: i531.deepsjeng_r 
1.002: i502.gcc_r 
1.001: i557.xz_r 

With these runs it was a general tie on those select benchmarks.

Given that the spilling has varied a bit both ways, I don't see much point in looking at the MI scheduler. There is already a phys-reg heuristic there, so it should work well in general.

All the benchmarks have not been run fully, but it looks to me that it should be ok to simply remove the MIs and trust that the mi-scheduler will handle it without the scheduling barriers.

@uweigand
Copy link
Member

OK, this looks really like more in the noise to me. In that case I agree we should prefer to keep the code as simple as possible, and just remove the MIs early.

@JonPsson1
Copy link
Contributor Author

ok - patch updated to remove the MI pseudo early. Some test changes where one was a bit of an example of where the mi-scheduler actually does mess things up a bit:


0B      bb.0 (%ir-block.0):
          liveins: $r1d, $f0d, $r5d
16B       %2:addr64bit = COPY $r5d
32B       %1:fp64bit = COPY $f0d
48B       %0:gr64bit = COPY $r1d
64B       %3:gr64bit = LGDR %1:fp64bit
80B       %4:gr64bit = ADA_ENTRY_VALUE target-flags(<unknown>) @pass_vararg3, %2:addr64bit, 8, implicit-def dead $cc :: (dereferenceable invariant load (s64))
96B       %6:gr64bit = ADA_ENTRY_VALUE target-flags(<unknown>) @pass_vararg3, %2:addr64bit, 0, implicit-def dead $cc :: (dereferenceable invariant load (s64))
112B      $r6d = COPY %4:gr64bit
128B      $r1d = COPY %3:gr64bit
144B      $r2d = COPY %0:gr64bit
160B      $r5d = COPY %6:gr64bit
176B      CallBASR_XPLINK64 killed $r6d, $r1d, killed $r2d, $r5d, <regmask $f8d $f9d $f10d $f11d $f12d $f13d $f14d $f15d $f8q $f9q $f12q $f13q $f8s $f9s $f10s $f11s $f12s $f13s $f14s $f15s $r8d $r9d $r10d $r11d 
$r12d $r13d $r14d $r15d $r8h $r9h $r10h $r11h $r12h and 15 more...>, implicit-def dead $r7d, implicit-def dead $cc, implicit $fpc, implicit-def $r3d
192B      %8:gr64bit = COPY killed $r3d
208B      $r3d = COPY %8:gr64bit
224B      Return_XPLINK implicit killed $r3d

# End machine code for function call_vararg_both0.

# *** IR Dump After Machine Instruction Scheduler (machine-scheduler) ***:
# Machine code for function call_vararg_both0: NoPHIs, TracksLiveness, TiedOpsRewritten
Function Live Ins: $r1d in %0, $f0d in %1, $r5d in %2

0B      bb.0 (%ir-block.0):
          liveins: $r1d, $f0d, $r5d
16B       %2:addr64bit = COPY $r5d
80B       %4:gr64bit = ADA_ENTRY_VALUE target-flags(<unknown>) @pass_vararg3, %2:addr64bit, 8, implicit-def dead $cc :: (dereferenceable invariant load (s64))
96B       %6:gr64bit = ADA_ENTRY_VALUE target-flags(<unknown>) @pass_vararg3, %2:addr64bit, 0, implicit-def dead $cc :: (dereferenceable invariant load (s64))
104B      %1:fp64bit = COPY $f0d
112B      %3:gr64bit = LGDR %1:fp64bit
120B      %0:gr64bit = COPY $r1d
128B      $r6d = COPY %4:gr64bit
136B      $r1d = COPY %3:gr64bit
144B      $r2d = COPY %0:gr64bit
160B      $r5d = COPY %6:gr64bit
176B      CallBASR_XPLINK64 killed $r6d, $r1d, killed $r2d, $r5d, <regmask $f8d $f9d $f10d $f11d $f12d $f13d $f14d $f15d $f8q $f9q $f12q $f13q $f8s $f9s $f10s $f11s $f12s $f13s $f14s $f15s $r8d $r9d $r10d $r11d 

The LGDR was in the presence of ADJCALLSTACKDOWN below it, and is moved just above it by the mi-scheduler. For some reason, here without the ADJCALLSTACKDOWN, it is also moved above the COPY from $r1d, which is unfortunate as that causes an overlap of %3 which is to be COPY:ed to the same register.

I don't think there is any "tracking" of which vreg ranges go to/from which physreg and with that avoid unnecessary overlaps like this. Maybe this is rare so it's not worth the effort? Probably not a quick thing to fix, but this is a good example of this problem. Maybe mark the test with a TODO if this is acceptable for now?

@uweigand
Copy link
Member

Yes, I think this is really something that should be fixed in the register allocate (looks like a parallel-copy problem). But for now I think this patch is fine as is. Feel free to add a TODO to the test case, otherwise this LGTM.

@JonPsson1 JonPsson1 merged commit 16b7cc6 into llvm:main Mar 28, 2024
3 of 4 checks passed
@JonPsson1 JonPsson1 deleted the Callseq branch March 28, 2024 17:26
redstar added a commit to redstar/llvm-m88k that referenced this pull request Mar 29, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 1, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 5, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 5, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 6, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 7, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 8, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 12, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 12, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 15, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 19, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 21, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 26, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 29, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request Apr 30, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request May 3, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request May 7, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
redstar added a commit to redstar/llvm-m88k that referenced this pull request May 30, 2024
The ABI uses a reserved call frame, which is statically reserved in the prologue.
Following the discussion in llvm/llvm-project#77812, these pseudo instructions can be removed early.
My solution here is to not insert them in the first place.
The calculation of the reserved frame size is now done during call lowering.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Select128 problem
5 participants