[CodeGen][MachinePipeliner] Limit register pressure when scheduling #74807

Merged · 1 commit into llvm:main on Jan 22, 2024

Conversation

Contributor

@kasuga-fj kasuga-fj commented Dec 8, 2023

In software pipelining, when searching for the Initiation Interval (II), MachinePipeliner tries to reduce register pressure, but doesn't check how many variables can actually be alive at the same time. As a result, a lot of register spills/fills can be generated after register allocation, which might cause performance degradation. To prevent such cases, this patch adds a check phase that calculates the maximum register pressure of the scheduled loop and rejects the schedule if the pressure is too high. This can be enabled by specifying pipeliner-register-pressure. Additionally, the II search range is currently fixed at 10, which is too small to find a schedule when the above algorithm is applied. Therefore, this patch also adds a new option pipeliner-ii-search-range to specify the length of the range to search. There is one more new option, pipeliner-register-pressure-margin, which can be used to set a register pressure limit lower than the actual one, for a more conservative analysis.

Discourse thread: https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725
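
To make the idea concrete, here is a minimal, self-contained C++ sketch of this kind of check. The names and data structures are hypothetical and greatly simplified (for example, it ignores cross-iteration liveness and register classes); it is not the actual MachinePipeliner implementation.

```cpp
// Hypothetical, simplified illustration of the register-pressure check
// (NOT the actual MachinePipeliner code): reject a candidate schedule if
// the number of simultaneously live values exceeds the register file size
// minus a conservative margin.
#include <algorithm>
#include <cstdio>
#include <vector>

struct LiveRange {
  int DefCycle; // cycle in which the value is defined
  int LastUse;  // last cycle in which the value is used
};

// Maximum number of values live at the same time over the schedule.
static int maxPressure(const std::vector<LiveRange> &Ranges, int NumCycles) {
  int Max = 0;
  for (int C = 0; C < NumCycles; ++C) {
    int Live = 0;
    for (const LiveRange &R : Ranges)
      if (R.DefCycle <= C && C <= R.LastUse)
        ++Live;
    Max = std::max(Max, Live);
  }
  return Max;
}

// A check analogous to the one enabled by pipeliner-register-pressure; the
// limit is lowered by a margin, as pipeliner-register-pressure-margin does.
static bool acceptSchedule(const std::vector<LiveRange> &Ranges, int NumCycles,
                           int NumRegs, int Margin) {
  return maxPressure(Ranges, NumCycles) <= NumRegs - Margin;
}

int main() {
  std::vector<LiveRange> Ranges = {{0, 3}, {1, 4}, {2, 2}, {0, 5}};
  std::printf("accepted: %d\n",
              acceptSchedule(Ranges, /*NumCycles=*/6, /*NumRegs=*/4,
                             /*Margin=*/1));
}
```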

Note that this patch provides only minimal functionality and is disabled by default. We'd like to do the following in the future:

  • We are developing another patch to support MachinePipeliner for AArch64. When applying that patch, we'd like to enable pipeliner-register-pressure in AArch64.
    • This doesn't mean enabling MachinePipeliner by default on AArch64. That is, when MachinePipeliner is enabled on AArch64, pipeliner-register-pressure will be enabled by default.
    • pipeliner-register-pressure-margin will remain available for fine tuning.
    • There are other architectures that support MachinePipeliner, for example, PowerPC. We are not going to determine whether or not to enable the functionality of this patch on those architectures.
  • We will then add other features to MachinePipeliner, for example, an improved II search method (see the sketch after this list).
    • Currently, the II search range is fixed at 10, and increasing it might make compile time longer. We believe that a linear search over II is not very efficient and would like to implement a variant of binary search, which would allow increasing the range (via pipeliner-ii-search-range) without degrading compile time.
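
A rough sketch of that idea follows. The helper names are hypothetical (this is not LLVM code), and it assumes for simplicity that schedulability is monotone in II, i.e. once some II can be scheduled within the pressure limit, every larger II can too, which need not hold in practice:

```cpp
// Illustrative sketch only (hypothetical helpers, not LLVM code).
#include <cstdio>
#include <functional>

// Current style of search: try each II in [MII, MII + SearchRange) and take
// the first one for which an acceptable schedule is found.
static int linearSearchII(int MII, int SearchRange,
                          const std::function<bool(int)> &CanSchedule) {
  for (int II = MII; II < MII + SearchRange; ++II)
    if (CanSchedule(II))
      return II;
  return -1; // no feasible II found in the range
}

// Binary-search variant: if schedulability were monotone in II, the smallest
// feasible II in [MII, MaxII] could be found with O(log(MaxII - MII))
// scheduling attempts, making a much larger search range affordable.
static int binarySearchII(int MII, int MaxII,
                          const std::function<bool(int)> &CanSchedule) {
  int Lo = MII, Hi = MaxII, Best = -1;
  while (Lo <= Hi) {
    int Mid = Lo + (Hi - Lo) / 2;
    if (CanSchedule(Mid)) {
      Best = Mid;
      Hi = Mid - 1;
    } else {
      Lo = Mid + 1;
    }
  }
  return Best;
}

int main() {
  // Dummy feasibility predicate: pretend any II >= 20 fits the pressure limit.
  auto CanSchedule = [](int II) { return II >= 20; };
  std::printf("linear: %d, binary: %d\n",
              linearSearchII(11, 10, CanSchedule),
              binarySearchII(11, 100, CanSchedule));
}
```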

EDIT: Here is a brief summary of the performance improvement we observed when applying this patch. It's measured by using the loop based on https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584 with some modification. Please see #65609 (comment) for more details.

|                    | II | cycles | cycles (MVE) |
|--------------------|----|--------|--------------|
| without this patch | 11 | 29.3   | 19.6         |
| with this patch    | 20 | 16.5   | 15.7         |

@kasuga-fj
Contributor Author

@ceseo Could you please check?

@kasuga-fj
Contributor Author

ping

@ceseo
Contributor

ceseo commented Dec 15, 2023

@kasuga-fj could you please check the failing check? It seems to be crashing the MLIR tests on Windows.

@ceseo
Contributor

ceseo commented Dec 15, 2023

@luporl Could you please help with this review?

@kasuga-fj
Contributor Author

Thank you for your reply! The failure is now resolved (just rebased)

@luporl
Contributor

luporl commented Dec 18, 2023

@luporl Could you please help with this review?

Yes, I can take a look at the changes, but please note that I'm not familiar with this part of llvm.

@kasuga-fj
Contributor Author

ping

@ceseo
Contributor

ceseo commented Jan 2, 2024

ping

Hi. I plan to take a look at this later this week / early next week.

@ceseo
Contributor

ceseo commented Jan 2, 2024

In software pipelining, when searching for the Initiation Interval (II), MachinePipeliner tries to reduce register pressure, but doesn't check how many variables can actually alive at the same time. This can result a lot of register spills/fills can be generated after register allocation, which might cause performance degradation.

Do you have any benchmark numbers showing this performance degradation and how this patch improves it?

@luporl
Contributor

luporl commented Jan 3, 2024

A few suggestions to improve the description text:

how many variables can actually alive at the same time.

how many variables can actually be alive at the same time.

This can result a lot of register spills/fills can be generated after register allocation, ...

As a result, a lot of register spills/fills can be generated after register allocation, ...

* We are developing another patch to support `MachinePipeliner` for AArch64. When applying the patch, we'd like to enable this patch in AArch64.
  • We are developing another patch to support MachinePipeliner for AArch64. When applying that patch, we'd like to enable pipeliner-register-pressure in AArch64. (if I understood it correctly)
  * This doesn't mean enabling `MachinePipeliner` by default on AArch64.

So, does it mean that, when MachinePipeliner is enabled on AArch64, pipeliner-register-pressure will be enabled by default?

What about pipeliner-ii-search-range and pipeliner-register-pressure-margin?
Will the improved II search method reduce the need to use pipeliner-ii-search-range?
And will pipeliner-register-pressure-margin remain available for fine tuning?

@luporl
Contributor

luporl commented Jan 3, 2024

In software pipelining, when searching for the Initiation Interval (II), MachinePipeliner tries to reduce register pressure, but doesn't check how many variables can actually alive at the same time. This can result a lot of register spills/fills can be generated after register allocation, which might cause performance degradation.

Do you have any benchmark numbers showing this performance degradation and how this patch improves it?

IIUIC, by following the discourse thread and checking the results at #65609 (comment), this patch reduces the number of cycles needed to execute the loop of https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584, with some modifications. In this test code, with this patch applied, II would be changed from 11 to 20, to avoid spills/fills, which results in the number of cycles per iteration going down from 29.3 to 16.5, without MVE (#65609), and from 19.6 to 15.7 with MVE. Is that correct?

It would be nice to add a short version of the results to the description of this patch, to give an idea of the performance improvement, without the need to go through discourse and the MVE patch.

Also, it would be nice if the modifications made to the test code could be made public, so that others can try to reproduce the results. Are there any improvements with the unmodified version too?

Finally, it would help if you could try this patch with other benchmarks, like SPEC CPU 2017, if it's not too much work, to check how it impacts the performance of other workloads. This will be important when considering enabling pipeliner-register-pressure by default.

(Several review comment threads on llvm/lib/CodeGen/MachinePipeliner.cpp, now resolved.)
@kasuga-fj
Contributor Author

Thank you for your reply! Here are the answers to your questions (the description has also been updated).

So, does it mean that, when MachinePipeliner is enabled on AArch64, pipeliner-register-pressure will be enabled by default?

Yes. We'd like to.

What about pipeliner-ii-search-range and pipeliner-register-pressure-margin? Will the improved II search method reduce the need to use pipeliner-ii-search-range? And will pipeliner-register-pressure-margin remain available for fine tuning?

As you said, pipeliner-register-pressure-margin will remain available for fine tuning. I'm not sure if our II search method improvement reduces the need to use pipeliner-ii-search-range. But it may allow us to use the obvious upper limit of II (the sum of the latencies of all instructions in the loop) rather than a user-specified one, without a large compile-time degradation.

IIUIC, by following the discourse thread and checking the results at #65609 (comment), this patch reduces the number of cycles needed to execute the loop of https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584, with some modifications. In this test code, with this patch applied, II would be changed from 11 to 20, to avoid spills/fills, which results in the number of cycles per iteration going down from 29.3 to 16.5, without MVE (#65609), and from 19.6 to 15.7 with MVE. Is that correct?

You are right. Sorry for making you go out of your way to find it.

Also, it would be nice if the modifications made to the test code could be made public, so that others can try to reproduce the results.

My colleague @ytmukai is working to publish it. I believe it will be published soon.

Are there any improvements with the unmodified version too?

The improvement without modification has not been confirmed because the analysis required for the pipeliner doesn't work well and MachinePipeliner cannot be applied. We recognize that this is an issue and would like to resolve it.

Finally, it would help if you could try this patch with other benchmarks, like SPEC CPU 2017, if it's not too much work, to check how it impacts the performance of other workloads.

For the same reason as above, we've not been able to check the performance with other benchmarks. Please let this be future work.


github-actions bot commented Jan 4, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@kasuga-fj kasuga-fj requested a review from luporl January 4, 2024 11:56
@ceseo
Contributor

ceseo commented Jan 4, 2024

IIUIC, by following the discourse thread and checking the results at #65609 (comment), this patch reduces the number of cycles needed to execute the loop of https://github.com/AMReX-Codes/amrex/blob/9e35dc19489dc5d312e92781cb0471d282cf8370/Src/LinearSolvers/MLMG/AMReX_MLNodeLap_2D_K.H#L584, with some modifications. In this test code, with this patch applied, II would be changed from 11 to 20, to avoid spills/fills, which results in the number of cycles per iteration going down from 29.3 to 16.5, without MVE (#65609), and from 19.6 to 15.7 with MVE. Is that correct?

Thanks, I missed that.

It would be nice to add a short version of the results to the description of this patch, to give an idea of the performance improvement, without the need to go through discourse and the MVE patch.

Also, it would be nice if the modifications made to the test code could be made public, so that others can try to reproduce the results. Are there any improvements with the unmodified version too?

I'd add a benchmark to the LLVM test-suite. It would make it easier to catch any future performance regressions.

@luporl
Contributor

luporl commented Jan 5, 2024

Thanks for all the answers and improvements!

Are there any improvements with the unmodified version too?

The improvement without modification has not been confirmed because the analysis required for the pipeliner doesn't work well and MachinePipeliner cannot be applied. We recognize that this is an issue and would like to resolve it.

Finally, it would help if you could try this patch with other benchmarks, like SPEC CPU 2017, if it's not too much work, to check how it impacts the performance of other workloads.

For the same reason as above, we've not been able to check the performance with other benchmarks. Please let this be future work.

So is the issue caused by some loops failing to match MachinePipeliner's expectations and then being skipped by it?
If this is the case I don't see any issue in letting this be a future work.

Contributor

@luporl luporl left a comment


LGTM, but it would be better if someone who knows this part better could take a look.

Comment on lines 11 to 17
# CHECK: Rejecte the schedule because of too high register pressure
# CHECK: Try to schedule with 24
# CHECK: Rejecte the schedule because of too high register pressure
# CHECK: Try to schedule with 25
# CHECK: Rejecte the schedule because of too high register pressure
# CHECK: Try to schedule with 26
# CHECK: Rejecte the schedule because of too high register pressure
Contributor


These messages need to be updated, to fix the CI error.

Contributor Author


Oh, sorry, I forgot. Thanks!

@kasuga-fj
Contributor Author

Thank you for your review!

So is the issue caused by some loops failing to match MachinePipeliner's expectations and then being skipped by it?

Yes. However, we are working on resolving this issue and have recently been able to increase the number of programs to which MachinePipeliner can be applied. We've not yet confirmed the details of the results, but we may be able to present improvements in other benchmarks in the near future.

Once we have the results, I will ask someone familiar with this part to review the patch.

Collaborator

@davemgreen davemgreen left a comment


Hi. I gave this a try on some Arm tests with it enabled, and it seemed to look OK. A few things got a little better and nothing seemed to break, which is a good sign.

I read through the code and it LGTM.

@kasuga-fj
Contributor Author

Thanks for checking.

In software pipelining, when searching for the Initiation Interval (II),
`MachinePipeliner` tries to reduce register pressure, but doesn't check
how many variables can actually be alive at the same time. As a result,
a lot of register spills/fills can be generated after register
allocation, which might cause performance degradation. To prevent such
cases, this patch adds a check phase that calculates the maximum
register pressure of the scheduled loop and rejects it if the pressure
is too high. This can be enabled by specifying
`pipeliner-register-pressure`. Additionally, the II search range is
currently fixed at 10, which is too small to find a schedule when the
above algorithm is applied. Therefore this patch also adds a new option
`pipeliner-ii-search-range` to specify the length of the range to
search. There is one more new option
`pipeliner-register-pressure-margin`, which can be used to set a
register pressure limit lower than the actual one for conservative analysis.

Discourse thread: https://discourse.llvm.org/t/considering-register-pressure-when-deciding-initiation-interval-in-machinepipeliner/74725
@kasuga-fj kasuga-fj merged commit 7556626 into llvm:main Jan 22, 2024
4 checks passed