[CodeGen][MachinePipeliner] Limit register pressure when scheduling #74807

Merged · 1 commit into llvm:main on Jan 22, 2024

Conversation

Contributor

@kasuga-fj kasuga-fj commented Dec 8, 2023

In software pipelining, when searching for the Initiation Interval (II), MachinePipeliner tries to reduce register pressure, but doesn't check how many variables can actually be alive at the same time. As a result, a lot of register spills/fills can be generated after register allocation, which might cause performance degradation. To prevent such cases, this patch adds a check phase that calculates the maximum register pressure of the scheduled loop and rejects the schedule if the pressure is too high. This can be enabled by specifying pipeliner-register-pressure. Additionally, the II search range is currently fixed at 10, which is too small to find a schedule when the above algorithm is applied. Therefore, this patch also adds a new option pipeliner-ii-search-range to specify the length of the range to search. There is one more new option, pipeliner-register-pressure-margin, which can be used to set a register pressure limit lower than the actual one, for a more conservative analysis.

Discourse thread: https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725
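
To make the idea concrete, here is a minimal, self-contained C++ sketch of this kind of check. The names and data structures are hypothetical and greatly simplified (for example, it ignores cross-iteration liveness and register classes); it is not the actual MachinePipeliner implementation.

```cpp
// Hypothetical, simplified illustration of the register-pressure check
// (NOT the actual MachinePipeliner code): reject a candidate schedule if
// the number of simultaneously live values exceeds the register file size
// minus a conservative margin.
#include <algorithm>
#include <cstdio>
#include <vector>

struct LiveRange {
  int DefCycle; // cycle in which the value is defined
  int LastUse;  // last cycle in which the value is used
};

// Maximum number of values live at the same time over the schedule.
static int maxPressure(const std::vector<LiveRange> &Ranges, int NumCycles) {
  int Max = 0;
  for (int C = 0; C < NumCycles; ++C) {
    int Live = 0;
    for (const LiveRange &R : Ranges)
      if (R.DefCycle <= C && C <= R.LastUse)
        ++Live;
    Max = std::max(Max, Live);
  }
  return Max;
}

// A check analogous to the one enabled by pipeliner-register-pressure; the
// limit is lowered by a margin, as pipeliner-register-pressure-margin does.
static bool acceptSchedule(const std::vector<LiveRange> &Ranges, int NumCycles,
                           int NumRegs, int Margin) {
  return maxPressure(Ranges, NumCycles) <= NumRegs - Margin;
}

int main() {
  std::vector<LiveRange> Ranges = {{0, 3}, {1, 4}, {2, 2}, {0, 5}};
  std::printf("accepted: %d\n",
              acceptSchedule(Ranges, /*NumCycles=*/6, /*NumRegs=*/4,
                             /*Margin=*/1));
}
```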

Note that this patch provides only minimal functionality and is disabled by default. We'd like to do the following in the future:

  • We are developing another patch to support MachinePipeliner for AArch64. When applying that patch, we'd like to enable pipeliner-register-pressure in AArch64.
    • This doesn't mean enabling MachinePipeliner by default on AArch64. That is, when MachinePipeliner is enabled on AArch64, pipeliner-register-pressure will be enabled by default.
    • pipeliner-register-pressure-margin will remain available for fine tuning.
    • There are other architectures that support MachinePipeliner, for example, PowerPC. We are not going to determine whether or not to enable the functionality of this patch on those architectures.
  • We will then add other features to MachinePipeliner, for example, an improved II search method (see the sketch after this list).
    • Currently, the II search range is fixed at 10, and increasing it might make compile time longer. We believe that a linear search over II is not very efficient and would like to implement a variant of binary search, which would allow increasing the range (via pipeliner-ii-search-range) without degrading compile time.
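
A rough sketch of that idea follows. The helper names are hypothetical (this is not LLVM code), and it assumes for simplicity that schedulability is monotone in II, i.e. once some II can be scheduled within the pressure limit, every larger II can too, which need not hold in practice:

```cpp
// Illustrative sketch only (hypothetical helpers, not LLVM code).
#include <cstdio>
#include <functional>

// Current style of search: try each II in [MII, MII + SearchRange) and take
// the first one for which an acceptable schedule is found.
static int linearSearchII(int MII, int SearchRange,
                          const std::function<bool(int)> &CanSchedule) {
  for (int II = MII; II < MII + SearchRange; ++II)
    if (CanSchedule(II))
      return II;
  return -1; // no feasible II found in the range
}

// Binary-search variant: if schedulability were monotone in II, the smallest
// feasible II in [MII, MaxII] could be found with O(log(MaxII - MII))
// scheduling attempts, making a much larger search range affordable.
static int binarySearchII(int MII, int MaxII,
                          const std::function<bool(int)> &CanSchedule) {
  int Lo = MII, Hi = MaxII, Best = -1;
  while (Lo <= Hi) {
    int Mid = Lo + (Hi - Lo) / 2;
    if (CanSchedule(Mid)) {
      Best = Mid;
      Hi = Mid - 1;
    } else {
      Lo = Mid + 1;
    }
  }
  return Best;
}

int main() {
  // Dummy feasibility predicate: pretend any II >= 20 fits the pressure limit.
  auto CanSchedule = [](int II) { return II >= 20; };
  std::printf("linear: %d, binary: %d\n",
              linearSearchII(11, 10, CanSchedule),
              binarySearchII(11, 100, CanSchedule));
}
```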

EDIT: Here is a brief summary of the performance improvement we observed when applying this patch. It's measured by using the loop based on https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584 with some modification. Please see #65609 (comment) for more details.

|                    | II | cycles | cycles (MVE) |
|--------------------|----|--------|--------------|
| without this patch | 11 | 29.3   | 19.6         |
| with this patch    | 20 | 16.5   | 15.7         |

@kasuga-fj
Contributor Author

@ceseo Could you please check?

@kasuga-fj
Contributor Author

ping

@ceseo
Contributor

ceseo commented Dec 15, 2023

@kasuga-fj could you please check the failing check? It seems to be crashing the MLIR tests on Windows.

@ceseo
Contributor

ceseo commented Dec 15, 2023

@luporl Could you please help with this review?

@kasuga-fj
Contributor Author

Thank you for your reply! The failure is now resolved (just rebased)

@luporl
Contributor

luporl commented Dec 18, 2023

@luporl Could you please help with this review?

Yes, I can take a look at the changes, but please note that I'm not familiar with this part of llvm.

@kasuga-fj
Contributor Author

ping

@ceseo
Contributor

ceseo commented Jan 2, 2024

ping

Hi. I plan to take a look at this later this week / early next week.

@ceseo
Contributor

ceseo commented Jan 2, 2024

In software pipelining, when searching for the Initiation Interval (II), MachinePipeliner tries to reduce register pressure, but doesn't check how many variables can actually alive at the same time. This can result a lot of register spills/fills can be generated after register allocation, which might cause performance degradation.

Do you have any benchmark numbers showing this performance degradation and how this patch improves it?

@luporl
Contributor

luporl commented Jan 3, 2024

A few suggestions to improve the description text:

how many variables can actually alive at the same time.

how many variables can actually be alive at the same time.

This can result a lot of register spills/fills can be generated after register allocation, ...

As a result, a lot of register spills/fills can be generated after register allocation, ...

* We are developing another patch to support `MachinePipeliner` for AArch64. When applying the patch, we'd like to enable this patch in AArch64.
  • We are developing another patch to support MachinePipeliner for AArch64. When applying that patch, we'd like to enable pipeliner-register-pressure in AArch64. (if I understood it correctly)
  * This doesn't mean enabling `MachinePipeliner` by default on AArch64.

So, does it mean that, when MachinePipeliner is enabled on AArch64, pipeliner-register-pressure will be enabled by default?

What about pipeliner-ii-search-range and pipeliner-register-pressure-margin?
Will the improved II search method reduce the need to use pipeliner-ii-search-range?
And will pipeliner-register-pressure-margin remain available for fine tuning?

@luporl
Contributor

luporl commented Jan 3, 2024

In software pipelining, when searching for the Initiation Interval (II), MachinePipeliner tries to reduce register pressure, but doesn't check how many variables can actually alive at the same time. This can result a lot of register spills/fills can be generated after register allocation, which might cause performance degradation.

Do you have any benchmark numbers showing this performance degradation and how this patch improves it?

IIUIC, by following the discourse thread and checking the results at #65609 (comment), this patch reduces the number of cycles needed to execute the loop of https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584, with some modifications. In this test code, with this patch applied, II would be changed from 11 to 20, to avoid spills/fills, which results in the number of cycles per iteration going down from 29.3 to 16.5, without MVE (#65609), and from 19.6 to 15.7 with MVE. Is that correct?

It would be nice to add a short version of the results to the description of this patch, to give an idea of the performance improvement, without the need to go through discourse and the MVE patch.

Also, it would be nice if the modifications made to the test code could be made public, so that others can try to reproduce the results. Are there any improvements with the unmodified version too?

Finally, it would help if you could try this patch with other benchmarks, like SPEC CPU 2017, if it's not too much work, to check how it impacts the performance of other workloads. This will be important when considering enabling pipeliner-register-pressure by default.

(Several review comment threads on llvm/lib/CodeGen/MachinePipeliner.cpp, now resolved.)
@kasuga-fj
Contributor Author

Thank you for your reply! Here are the answers to your questions (the description has also been updated).

So, does it mean that, when MachinePipeliner is enabled on AArch64, pipeliner-register-pressure will be enabled by default?

Yes. We'd like to.

What about pipeliner-ii-search-range and pipeliner-register-pressure-margin? Will the improved II search method reduce the need to use pipeliner-ii-search-range? And will pipeliner-register-pressure-margin remain available for fine tuning?

As you said, pipeliner-register-pressure-margin will remain available for fine tuning. I'm not sure if our II search method improvement reduces the need to use pipeliner-ii-search-range. But it may allow us to use the obvious upper limit of II (the sum of the latencies of all instructions in the loop) rather than a user-specified one, without a large compile-time degradation.

IIUIC, by following the discourse thread and checking the results at #65609 (comment), this patch reduces the number of cycles needed to execute the loop of https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584, with some modifications. In this test code, with this patch applied, II would be changed from 11 to 20, to avoid spills/fills, which results in the number of cycles per iteration going down from 29.3 to 16.5, without MVE (#65609), and from 19.6 to 15.7 with MVE. Is that correct?

You are right. Sorry for making you go out of your way to find it.

Also, it would be nice if the modifications made to the test code could be made public, so that others can try to reproduce the results.

My colleague @ytmukai is working to publish it. I believe it will be published soon.

Are there any improvements with the unmodified version too?

The improvement without modification has not been confirmed because the analysis required for the pipeliner doesn't work well and MachinePipeliner cannot be applied. We recognize that this is an issue and would like to resolve it.

Finally, it would help if you could try this patch with other benchmarks, like SPEC CPU 2017, if it's not too much work, to check how it impacts the performance of other workloads.

For the same reason as above, we've not been able to check the performance with other benchmarks. Please let this be future work.


github-actions bot commented Jan 4, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@kasuga-fj kasuga-fj requested a review from luporl January 4, 2024 11:56
@ceseo
Contributor

ceseo commented Jan 4, 2024

IIUIC, by following the discourse thread and checking the results at #65609 (comment), this patch reduces the number of cycles needed to execute the loop of https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584, with some modifications. In this test code, with this patch applied, II would be changed from 11 to 20, to avoid spills/fills, which results in the number of cycles per iteration going down from 29.3 to 16.5, without MVE (#65609), and from 19.6 to 15.7 with MVE. Is that correct?

Thanks, I missed that.

It would be nice to add a short version of the results to the description of this patch, to give an idea of the performance improvement, without the need to go through discourse and the MVE patch.

Also, it would be nice if the modifications made to the test code could be made public, so that others can try to reproduce the results. Are there any improvements with the unmodified version too?

I'd add a benchmark to the LLVM test-suite. It would make it easier to catch any future performance regressions.

@luporl
Contributor

luporl commented Jan 5, 2024

Thanks for all the answers and improvements!

Are there any improvements with the unmodified version too?

The improvement without modification has not been confirmed because the analysis required for the pipeliner doesn't work well and MachinePipeliner cannot be applied. We recognize that this is an issue and would like to resolve it.

Finally, it would help if you could try this patch with other benchmarks, like SPEC CPU 2017, if it's not too much work, to check how it impacts the performance of other workloads.

For the same reason as above, we've not been able to check the performance with other benchmarks. Please let this be future work.

So is the issue caused by some loops failing to match MachinePipeliner's expectations and then being skipped by it?
If this is the case I don't see any issue in letting this be a future work.

Contributor

@luporl luporl left a comment


LGTM, but it would be better if someone who knows this part better could take a look.

Comment on lines 11 to 17
# CHECK: Rejecte the schedule because of too high register pressure
# CHECK: Try to schedule with 24
# CHECK: Rejecte the schedule because of too high register pressure
# CHECK: Try to schedule with 25
# CHECK: Rejecte the schedule because of too high register pressure
# CHECK: Try to schedule with 26
# CHECK: Rejecte the schedule because of too high register pressure
Contributor


These messages need to be updated, to fix the CI error.

Contributor Author


Oh, sorry, I forgot. Thanks!

@kasuga-fj
Contributor Author

Thank you for your review!

So is the issue caused by some loops failing to match MachinePipeliner's expectations and then being skipped by it?

Yes. However, we are working on resolving this issue and have recently been able to increase the number of programs to which MachinePipeliner can be applied. We've not yet confirmed the details of the results, but we may be able to present improvements in other benchmarks in the near future.

Once we have the results, I will ask someone familiar with this part to review the patch.

Collaborator

@davemgreen davemgreen left a comment


Hi. I gave this a try on some Arm tests with it enabled, and it seemed to look OK. A few things got a little better and nothing seemed to break, which is a good sign.

I read through the code and it LGTM.

@kasuga-fj
Contributor Author

Thanks for checking.

In software pipelining, when searching for the Initiation Interval (II),
`MachinePipeliner` tries to reduce register pressure, but doesn't check
how many variables can actually be alive at the same time. As a result,
a lot of register spills/fills can be generated after register
allocation, which might cause performance degradation. To prevent such
cases, this patch adds a check phase that calculates the maximum
register pressure of the scheduled loop and rejects it if the pressure
is too high. This can be enabled by specifying
`pipeliner-register-pressure`. Additionally, the II search range is
currently fixed at 10, which is too small to find a schedule when the
above algorithm is applied. Therefore this patch also adds a new option
`pipeliner-ii-search-range` to specify the length of the range to
search. There is one more new option
`pipeliner-register-pressure-margin`, which can be used to set a
register pressure limit lower than the actual one for conservative analysis.

Discourse thread: https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725
@kasuga-fj kasuga-fj merged commit 7556626 into llvm:main Jan 22, 2024
4 checks passed