Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RISC-V Machine Outliner should reach parity with other platforms #89822

Open
ilovepi opened this issue Apr 23, 2024 · 9 comments
Open

RISC-V Machine Outliner should reach parity with other platforms #89822

ilovepi opened this issue Apr 23, 2024 · 9 comments
Labels
backend:RISC-V enhancement Improving things as opposed to bug fixing, e.g. new or missing feature llvm:codegen llvm:codesize Code size issues metabug

Comments

@ilovepi
Copy link
Contributor

ilovepi commented Apr 23, 2024

Machine Outliner can lead to significant size savings, even for 32-bit embedded targets, see https://www.linaro.org/blog/reducing-code-size-with-llvm-machine-outliner-on-32-bit-arm-targets/

Both Arm and AArch64 backends support variety of outlining strategies:

https://github.com/llvm/llvm-project/blob/89c95effe82c09b9a42408f4823409331f8fa266/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp#L5678-5778

/// Constants defining how certain sequences should be outlined.
/// This encompasses how an outlined function should be called, and what kind of
/// frame should be emitted for that outlined function.
///
/// \p MachineOutlinerDefault implies that the function should be called with
/// a save and restore of LR to the stack.
///
/// That is,
///
/// I1 Save LR OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// I3 Restore LR I2
/// I3
/// RET
///
/// * Call construction overhead: 3 (save + BL + restore)
/// * Frame construction overhead: 1 (ret)
/// * Requires stack fixups? Yes
///
/// \p MachineOutlinerTailCall implies that the function is being created from
/// a sequence of instructions ending in a return.
///
/// That is,
///
/// I1 OUTLINED_FUNCTION:
/// I2 --> B OUTLINED_FUNCTION I1
/// RET I2
/// RET
///
/// * Call construction overhead: 1 (B)
/// * Frame construction overhead: 0 (Return included in sequence)
/// * Requires stack fixups? No
///
/// \p MachineOutlinerNoLRSave implies that the function should be called using
/// a BL instruction, but doesn't require LR to be saved and restored. This
/// happens when LR is known to be dead.
///
/// That is,
///
/// I1 OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// I3 I2
/// I3
/// RET
///
/// * Call construction overhead: 1 (BL)
/// * Frame construction overhead: 1 (RET)
/// * Requires stack fixups? No
///
/// \p MachineOutlinerThunk implies that the function is being created from
/// a sequence of instructions ending in a call. The outlined function is
/// called with a BL instruction, and the outlined function tail-calls the
/// original call destination.
///
/// That is,
///
/// I1 OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// BL f I2
/// B f
/// * Call construction overhead: 1 (BL)
/// * Frame construction overhead: 0
/// * Requires stack fixups? No
///
/// \p MachineOutlinerRegSave implies that the function should be called with a
/// save and restore of LR to an available register. This allows us to avoid
/// stack fixups. Note that this outlining variant is compatible with the
/// NoLRSave case.
///
/// That is,
///
/// I1 Save LR OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// I3 Restore LR I2
/// I3
/// RET
///
/// * Call construction overhead: 3 (save + BL + restore)
/// * Frame construction overhead: 1 (ret)
/// * Requires stack fixups? No

While the RISC-V backend has some outlining support, it's fairly limited compared to Arm and AArch64 backends, see
std::optional<outliner::OutlinedFunction>
RISCVInstrInfo::getOutliningCandidateInfo(
std::vector<outliner::Candidate> &RepeatedSequenceLocs) const {
// First we need to filter out candidates where the X5 register (IE t0) can't
// be used to setup the function call.
auto CannotInsertCall = [](outliner::Candidate &C) {
const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo();
return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI);
};
llvm::erase_if(RepeatedSequenceLocs, CannotInsertCall);
// If the sequence doesn't have enough candidates left, then we're done.
if (RepeatedSequenceLocs.size() < 2)
return std::nullopt;
unsigned SequenceSize = 0;
for (auto &MI : RepeatedSequenceLocs[0])
SequenceSize += getInstSizeInBytes(MI);
// call t0, function = 8 bytes.
unsigned CallOverhead = 8;
for (auto &C : RepeatedSequenceLocs)
C.setCallInfo(MachineOutlinerDefault, CallOverhead);
// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled.
unsigned FrameOverhead = 4;
if (RepeatedSequenceLocs[0]
.getMF()
->getSubtarget<RISCVSubtarget>()
.hasStdExtCOrZca())
FrameOverhead = 2;
return outliner::OutlinedFunction(RepeatedSequenceLocs, SequenceSize,
FrameOverhead, MachineOutlinerDefault);
}

We should improve the RISC-V Machine Outliner to be on par with Arm and AArch64.

Off the top of my head we should take the following steps:

  1. Perform Gap analysis between the ARM and RISC-V machine outliners
  2. Discuss the findings with RISC-V backend maintainers: @topperc @preames @MaskRay @asb @jrtc27
  3. Create a design implementation plan to improve support
  4. Implement improved support upstream

CC: @petrhosek

@ilovepi ilovepi added enhancement Improving things as opposed to bug fixing, e.g. new or missing feature backend:RISC-V llvm:codegen llvm:codesize Code size issues labels Apr 23, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Apr 23, 2024

@llvm/issue-subscribers-backend-risc-v

Author: Paul Kirth (ilovepi)

Machine Outliner can lead to significant size savings, even for 32-bit embedded targets, see https://www.linaro.org/blog/reducing-code-size-with-llvm-machine-outliner-on-32-bit-arm-targets/

Both Arm and AArch64 backends support variety of outlining strategies:

https://github.com/llvm/llvm-project/blob/89c95effe82c09b9a42408f4823409331f8fa266/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp#L5678-5778

/// Constants defining how certain sequences should be outlined.
/// This encompasses how an outlined function should be called, and what kind of
/// frame should be emitted for that outlined function.
///
/// \p MachineOutlinerDefault implies that the function should be called with
/// a save and restore of LR to the stack.
///
/// That is,
///
/// I1 Save LR OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// I3 Restore LR I2
/// I3
/// RET
///
/// * Call construction overhead: 3 (save + BL + restore)
/// * Frame construction overhead: 1 (ret)
/// * Requires stack fixups? Yes
///
/// \p MachineOutlinerTailCall implies that the function is being created from
/// a sequence of instructions ending in a return.
///
/// That is,
///
/// I1 OUTLINED_FUNCTION:
/// I2 --> B OUTLINED_FUNCTION I1
/// RET I2
/// RET
///
/// * Call construction overhead: 1 (B)
/// * Frame construction overhead: 0 (Return included in sequence)
/// * Requires stack fixups? No
///
/// \p MachineOutlinerNoLRSave implies that the function should be called using
/// a BL instruction, but doesn't require LR to be saved and restored. This
/// happens when LR is known to be dead.
///
/// That is,
///
/// I1 OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// I3 I2
/// I3
/// RET
///
/// * Call construction overhead: 1 (BL)
/// * Frame construction overhead: 1 (RET)
/// * Requires stack fixups? No
///
/// \p MachineOutlinerThunk implies that the function is being created from
/// a sequence of instructions ending in a call. The outlined function is
/// called with a BL instruction, and the outlined function tail-calls the
/// original call destination.
///
/// That is,
///
/// I1 OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// BL f I2
/// B f
/// * Call construction overhead: 1 (BL)
/// * Frame construction overhead: 0
/// * Requires stack fixups? No
///
/// \p MachineOutlinerRegSave implies that the function should be called with a
/// save and restore of LR to an available register. This allows us to avoid
/// stack fixups. Note that this outlining variant is compatible with the
/// NoLRSave case.
///
/// That is,
///
/// I1 Save LR OUTLINED_FUNCTION:
/// I2 --> BL OUTLINED_FUNCTION I1
/// I3 Restore LR I2
/// I3
/// RET
///
/// * Call construction overhead: 3 (save + BL + restore)
/// * Frame construction overhead: 1 (ret)
/// * Requires stack fixups? No

While the RISC-V backend has some outlining support, it's fairly limited compared to Arm and AArch64 backends, see
std::optional<outliner::OutlinedFunction>
RISCVInstrInfo::getOutliningCandidateInfo(
std::vector<outliner::Candidate> &RepeatedSequenceLocs) const {
// First we need to filter out candidates where the X5 register (IE t0) can't
// be used to setup the function call.
auto CannotInsertCall = [](outliner::Candidate &C) {
const TargetRegisterInfo *TRI = C.getMF()->getSubtarget().getRegisterInfo();
return !C.isAvailableAcrossAndOutOfSeq(RISCV::X5, *TRI);
};
llvm::erase_if(RepeatedSequenceLocs, CannotInsertCall);
// If the sequence doesn't have enough candidates left, then we're done.
if (RepeatedSequenceLocs.size() < 2)
return std::nullopt;
unsigned SequenceSize = 0;
for (auto &MI : RepeatedSequenceLocs[0])
SequenceSize += getInstSizeInBytes(MI);
// call t0, function = 8 bytes.
unsigned CallOverhead = 8;
for (auto &C : RepeatedSequenceLocs)
C.setCallInfo(MachineOutlinerDefault, CallOverhead);
// jr t0 = 4 bytes, 2 bytes if compressed instructions are enabled.
unsigned FrameOverhead = 4;
if (RepeatedSequenceLocs[0]
.getMF()
->getSubtarget<RISCVSubtarget>()
.hasStdExtCOrZca())
FrameOverhead = 2;
return outliner::OutlinedFunction(RepeatedSequenceLocs, SequenceSize,
FrameOverhead, MachineOutlinerDefault);
}

We should improve the RISC-V Machine Outliner to be on par with Arm and AArch64.

Off the top of my head we should take the following steps:

  1. Perform Gap analysis between the ARM and RISC-V machine outliners
  2. Discuss the findings with RISC-V backend maintainers: @topperc @preames @MaskRay @asb @jrtc27
  3. Create a design implementation plan to improve support
  4. Implement improved support upstream

CC: @petrhosek

@ilovepi
Copy link
Contributor Author

ilovepi commented May 2, 2024

As a first pass at a gap analysis, @hiraditya @petrhosek and I looked at the Aarch64 and ARM32 Machine outliner bits.

We noticed several things are handled in a more sophisticated manner than in RISC-V.

  1. There seems to be more effort spent on identifying equivalent instructions/instruction sequences that the outliner can take advantage of. This seems to include handling for various addressing modes.

  2. There are more cost modeling adjustments in ARM32. This is probably worth duplicating.

  3. Stack frame handling, seems to be much more involved in the Arm outliners, with many routines and heuristics trying to optimize and maximize outlining opportunities. This is likely worth addressing for RISC-V.

  4. Handling for various extensions may be challenging. Likely we should first focus on profiles and then on supporting other common extensions. Arm outliniers handle PAC-ret and BTI sequences, and RISC-V will probably need similar mechanism for its various extensions.

  5. The Aarch64 implementation in particular seems to use a more advanced method for ranking/scoring outlining candidates.

A related observation is that I think RISC-V is different form other platforms in that most of its relocation sequences allow for interleaving instructions between operations in a fixed sequence. TLSDESC access is a good example of this. We wonder if there should be some cannonicalization of particular instruction sequences to maximize outlining opportunities. I think this is worth considering, but shouldn't be prioritized over basic improvements to identifying instruction sequences, cost modeling, and stack frame handling.

Lastly, we know that RISC-V heavily uses TableGen, and wonder if we can similarly generate some of these rules, or augment the information available to the outliner from the existing instruction information.

@ilovepi
Copy link
Contributor Author

ilovepi commented May 6, 2024

I watched the 2023 LLVM Dev Mtg - Profiling Based Global Machine Outlining presentation, and I'm wondering if we can leverage that to do some further analysis and find code sequences that the existing heuristics miss, or identify cases where one backend can do significantly better in a systematic way.

@appujee
Copy link
Contributor

appujee commented May 6, 2024

I watched the 2023 LLVM Dev Mtg - Profiling Based Global Machine Outlining presentation, and I'm wondering if we can leverage that to do some further analysis and find code sequences that the existing heuristics miss, or identify cases where one backend can do significantly better in a systematic way.

That's a good idea for sure! In the short team i believe focusing on items 1,2, and 4 (from our analysis) will give maximum ROI.

@ilovepi
Copy link
Contributor Author

ilovepi commented May 6, 2024

I think the next steps are to take those items and file issues w/ enough context that anyone will have sufficient information to make progress. Likely we'll be the ones to do the work, but if we can make it easier for other contributors to grab work items, we should. I'll try to have that done by May 10, and then we can start improving things w/o stepping all over one another.

@ilovepi
Copy link
Contributor Author

ilovepi commented May 13, 2024

Looking at the RISC-V side, why do we only support outlining when we can make a call w/ t0? https://github.com/llvm/llvm-project/blob/69937982dbdd73172ec06580f6f93616edca8e9e/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp#L2819C1-L2824C5

@topperc
Copy link
Collaborator

topperc commented May 13, 2024

Looking at the RISC-V side, why do we only support outlining when we can make a call w/ t0? https://github.com/llvm/llvm-project/blob/69937982dbdd73172ec06580f6f93616edca8e9e/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp#L2819C1-L2824C5

t0 or ra are the only registers we should use to make a call. Those are the only registers defined to work with the return address prediction stack.

So I think what we're missing is code to save t0(or ra) if they aren't available to use. AArch64 has findRegisterToSaveLRTo

@ilovepi
Copy link
Contributor Author

ilovepi commented May 13, 2024

Looking at the RISC-V side, why do we only support outlining when we can make a call w/ t0? https://github.com/llvm/llvm-project/blob/69937982dbdd73172ec06580f6f93616edca8e9e/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp#L2819C1-L2824C5

t0 or ra are the only registers we should use to make a call. Those are the only registers defined to work with the return address prediction stack.

So I think what we're missing is code to save t0(or ra) if they aren't available to use. AArch64 has findRegisterToSaveLRTo

That's what I was getting at. It seems like we could be doing more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:RISC-V enhancement Improving things as opposed to bug fixing, e.g. new or missing feature llvm:codegen llvm:codesize Code size issues metabug
Projects
None yet
Development

No branches or pull requests

4 participants