-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64 #16754
Conversation
…ng into ldp/stp on AArch64 Macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp[1]. For example, it can merge: ``` str w20, [sp, openjdk#16] str w10, [sp, openjdk#20] ``` into ``` stp w20, w10, [sp, openjdk#16] ``` But C2 may generate a sequence like: ``` str x21, [sp, openjdk#8] str w20, [sp, openjdk#16] str x19, [sp, openjdk#24] <--- str w10, [sp, openjdk#20] <--- Before sorting str x11, [sp, openjdk#40] str w13, [sp, openjdk#48] str x16, [sp, openjdk#56] ``` We can't do any merging for non-adjacent loads or stores. The patch is to sort the spilling or unspilling sequence in the order of offset during instruction scheduling and bundling phase. After that, we can get a new sequence: ``` str x21, [sp, openjdk#8] str w20, [sp, openjdk#16] str w10, [sp, openjdk#20] <--- str x19, [sp, openjdk#24] <--- After sorting str x11, [sp, openjdk#40] str w13, [sp, openjdk#48] str x16, [sp, openjdk#56] ``` Then macro-assembler can do ld/st merging: ``` str x21, [sp, openjdk#8] stp w20, w10, [sp, openjdk#16] <--- Merged str x19, [sp, openjdk#24] str x11, [sp, openjdk#40] str w13, [sp, openjdk#48] str x16, [sp, openjdk#56] ``` To justify the patch, we run `HelloWorld.java` ``` public class HelloWorld { public static void main(String [] args) { System.out.println("Hello World!"); } } ``` with `java -Xcomp -XX:-TieredCompilation HelloWorld`. Before the patch, macro-assembler can do ld/st merging for 3688 times. After the patch, the number of ld/st merging increases to 3871 times, by ~5 %. Tested tier1~3 on x86 and AArch64. [1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
👋 Welcome back fgao! A progress list of the required criteria for merging this PR into |
Webrevs
|
src/hotspot/share/opto/output.cpp
Outdated
// Comparison between reg -> stack and reg -> stack | ||
if (OptoReg::is_stack(first_dst_lo) && OptoReg::is_stack(second_dst_lo) && | ||
OptoReg::is_reg(first_src_lo) && OptoReg::is_reg(second_src_lo)) { | ||
return _regalloc->reg2offset(first_dst_lo) > _regalloc->reg2offset(second_dst_lo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return _regalloc->reg2offset(first_dst_lo) > _regalloc->reg2offset(second_dst_lo); | |
return _regalloc->reg2offset(first_dst_lo) - _regalloc->reg2offset(second_dst_lo); |
src/hotspot/share/opto/output.cpp
Outdated
@@ -2271,6 +2277,29 @@ Node * Scheduling::ChooseNodeToBundle() { | |||
return _available[0]; | |||
} | |||
|
|||
bool Scheduling::compare_two_spill_nodes(Node* first, Node* second) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool Scheduling::compare_two_spill_nodes(Node* first, Node* second) { | |
int Scheduling::compare_two_spill_nodes(Node* first, Node* second) { |
src/hotspot/share/opto/output.cpp
Outdated
// Return true only when the stack offset of the first spill node is | ||
// greater than the stack offset of the second one. Otherwise, return false. | ||
// When compare_two_spill_nodes(first, second) returns true, we think that | ||
// "second" should be scheduled before "first" in the final basic block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Return true only when the stack offset of the first spill node is | |
// greater than the stack offset of the second one. Otherwise, return false. | |
// When compare_two_spill_nodes(first, second) returns true, we think that | |
// "second" should be scheduled before "first" in the final basic block. | |
// Return an integer less than, equal to, or greater than zero if the stack offset of the | |
// first argument is respectively less than, equal to, or greater than the second. | |
src/hotspot/share/opto/output.cpp
Outdated
break; | ||
} else if (_current_latency[_available[i]->_idx] == latency && | ||
n->is_MachSpillCopy() && _available[i]->is_MachSpillCopy() && | ||
compare_two_spill_nodes(n, _available[i])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare_two_spill_nodes(n, _available[i])) { | |
compare_two_spill_nodes(n, _available[i]) > 0) { |
src/hotspot/share/opto/output.cpp
Outdated
// Insert in latency order (insertion sort) | ||
// Insert in latency order (insertion sort). If two MachSpillCopyNodes | ||
// for stack spilling or unspilling have the same latency, we sort | ||
// them in the order of stack offset. Some backends (aarch64) may also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// them in the order of stack offset. Some backends (aarch64) may also | |
// them in the order of stack offset. Some ports (e.g. aarch64) may also |
src/hotspot/share/opto/output.cpp
Outdated
// Comparison between stack -> reg and stack -> reg | ||
if (OptoReg::is_stack(first_src_lo) && OptoReg::is_stack(second_src_lo) && | ||
OptoReg::is_reg(first_dst_lo) && OptoReg::is_reg(second_dst_lo)) { | ||
return _regalloc->reg2offset(first_src_lo) > _regalloc->reg2offset(second_src_lo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return _regalloc->reg2offset(first_src_lo) > _regalloc->reg2offset(second_src_lo); | |
return _regalloc->reg2offset(first_src_lo) - _regalloc->reg2offset(second_src_lo); |
src/hotspot/share/opto/output.cpp
Outdated
return _regalloc->reg2offset(first_dst_lo) > _regalloc->reg2offset(second_dst_lo); | ||
} | ||
|
||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return false; | |
return 0; // Not comparable. |
This is a good idea, although the real-world gains are small. I'd wonder if this was worth doing for non-AArch64 ports, although even on others sorting the accesses into order might help. |
Hi @theRealAph , thanks a lot for your review! All comments have been resolved in the new commit.
Yeah, that's also bothering me. I'm not sure if it benefits other ports. Do you think if we need convert the change to aarch64-only? Thanks. |
#ifdefs are probably wrose, so I'd leave it as it is. We need another reviewer. |
@fg1417 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 60 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
@theRealAph thanks for your review! May I have another review? Maybe @vnkozlov? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable. It may help other platform's pre-fetchers because you are ordering memory access.
I will run testing before approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My tier1-4,xcomp,stress testing passed.
Thanks for all your reviewing and test work! @vnkozlov @theRealAph I'll integrate it if there is no other comment. |
/integrate |
Going to push as commit 3ccd02f.
Your commit was automatically rebased without conflicts. |
Macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp.[1]
For example, it can merge:
into
But C2 may generate a sequence like:
We can't do any merging for non-adjacent loads or stores.
The patch is to sort the spilling or unspilling sequence in the order of offset during instruction scheduling and bundling phase. After that, we can get a new sequence:
Then macro-assembler can do ld/st merging:
To justify the patch, we run
HelloWorld.java
with
java -Xcomp -XX:-TieredCompilation HelloWorld
.Before the patch, macro-assembler can do ld/st merging for 3688 times. After the patch, the number of ld/st merging increases to 3871 times, by ~5 %.
Tested tier1~3 on x86 and AArch64.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16754/head:pull/16754
$ git checkout pull/16754
Update a local copy of the PR:
$ git checkout pull/16754
$ git pull https://git.openjdk.org/jdk.git pull/16754/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16754
View PR using the GUI difftool:
$ git pr show -t 16754
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16754.diff
Webrev
Link to Webrev Comment