8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64 #16754

Closed

@fg1417 fg1417 commented Nov 21, 2023

Macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp.[1]

For example, it can merge:

str     w20, [sp, #16]
str     w10, [sp, #20]

into

stp     w20, w10, [sp, #16]

But C2 may generate a sequence like:

str     x21, [sp, #8]
str     w20, [sp, #16]
str     x19, [sp, #24] <---
str     w10, [sp, #20] <--- Before sorting
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]

We can't do any merging for non-adjacent loads or stores.

The patch sorts the spill/unspill sequence by stack offset during the instruction scheduling and bundling phase. After that, we get a new sequence:

str     x21, [sp, #8]
str     w20, [sp, #16]
str     w10, [sp, #20] <---
str     x19, [sp, #24] <--- After sorting
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]

Then macro-assembler can do ld/st merging:

str     x21, [sp, #8]
stp     w20, w10, [sp, #16] <--- Merged
str     x19, [sp, #24]
str     x11, [sp, #40]
str     w13, [sp, #48]
str     x16, [sp, #56]
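
As a rough illustration of why the ordering matters, here is a minimal sketch of a peephole that only pairs neighbouring stores. It is a simplified model, not the HotSpot macro-assembler: `SpillStore`, `can_pair`, `count_merges`, `sorted_by_offset` and `example_sequence` are hypothetical names, and the pairing rule checks only equal access size and ascending adjacency, while the real check referenced in [1] has further conditions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one sp-relative spill store (not a HotSpot type).
struct SpillStore {
  int reg;     // register number, for illustration only
  int size;    // access size in bytes: 4 for wN, 8 for xN
  int offset;  // byte offset from sp
};

// Simplified pairing rule (assumption): same size, and the second store
// starts exactly where the first one ends.
inline bool can_pair(const SpillStore& a, const SpillStore& b) {
  return a.size == b.size && b.offset == a.offset + a.size;
}

// Count the pairs a peephole would merge if it only ever looks at two
// neighbouring instructions in the stream.
inline int count_merges(const std::vector<SpillStore>& seq) {
  int merges = 0;
  for (std::size_t i = 0; i + 1 < seq.size();) {
    if (can_pair(seq[i], seq[i + 1])) {
      ++merges;
      i += 2;  // both stores are consumed by the merged stp
    } else {
      ++i;
    }
  }
  return merges;
}

// Return a copy of the sequence ordered by ascending stack offset.
inline std::vector<SpillStore> sorted_by_offset(std::vector<SpillStore> seq) {
  std::stable_sort(seq.begin(), seq.end(),
                   [](const SpillStore& a, const SpillStore& b) {
                     return a.offset < b.offset;
                   });
  return seq;
}

// The "before sorting" sequence from the description above.
inline std::vector<SpillStore> example_sequence() {
  return {{21, 8, 8},  {20, 4, 16}, {19, 8, 24}, {10, 4, 20},
          {11, 8, 40}, {13, 4, 48}, {16, 8, 56}};
}
```

Under this model the unsorted sequence yields no merges, and the offset-sorted one yields a single merge (w20/w10 at offsets 16 and 20), which matches the example above.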

To justify the patch, we run HelloWorld.java

public class HelloWorld {
    public static void main(String [] args) {
        System.out.println("Hello World!");
    }
}

with java -Xcomp -XX:-TieredCompilation HelloWorld.

Before the patch, the macro-assembler performs ld/st merging 3688 times. After the patch, that number rises to 3871, an increase of about 5%.

Tested tier1~3 on x86 and AArch64.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64 (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16754/head:pull/16754
$ git checkout pull/16754

Update a local copy of the PR:
$ git checkout pull/16754
$ git pull https://git.openjdk.org/jdk.git pull/16754/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16754

View PR using the GUI difftool:
$ git pr show -t 16754

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16754.diff


[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079

bridgekeeper bot commented Nov 21, 2023

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 21, 2023

openjdk bot commented Nov 21, 2023

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 21, 2023

mlbridge bot commented Nov 21, 2023

Webrevs

// Comparison between reg -> stack and reg -> stack
if (OptoReg::is_stack(first_dst_lo) && OptoReg::is_stack(second_dst_lo) &&
    OptoReg::is_reg(first_src_lo) && OptoReg::is_reg(second_src_lo)) {
  return _regalloc->reg2offset(first_dst_lo) > _regalloc->reg2offset(second_dst_lo);

Suggested change
- return _regalloc->reg2offset(first_dst_lo) > _regalloc->reg2offset(second_dst_lo);
+ return _regalloc->reg2offset(first_dst_lo) - _regalloc->reg2offset(second_dst_lo);

@@ -2271,6 +2277,29 @@ Node * Scheduling::ChooseNodeToBundle() {
  return _available[0];
}

bool Scheduling::compare_two_spill_nodes(Node* first, Node* second) {

Suggested change
- bool Scheduling::compare_two_spill_nodes(Node* first, Node* second) {
+ int Scheduling::compare_two_spill_nodes(Node* first, Node* second) {

Comment on lines 172 to 175
// Return true only when the stack offset of the first spill node is
// greater than the stack offset of the second one. Otherwise, return false.
// When compare_two_spill_nodes(first, second) returns true, we think that
// "second" should be scheduled before "first" in the final basic block.

Suggested change
- // Return true only when the stack offset of the first spill node is
- // greater than the stack offset of the second one. Otherwise, return false.
- // When compare_two_spill_nodes(first, second) returns true, we think that
- // "second" should be scheduled before "first" in the final basic block.
+ // Return an integer less than, equal to, or greater than zero if the stack offset of the
+ // first argument is respectively less than, equal to, or greater than the second.
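
To make the proposed return convention concrete, here is a minimal self-contained sketch of a three-way comparison in the style of `strcmp`. It is not the HotSpot code: `Loc`, `SpillCopy`, `spill_to`, `reload_from` and `compare_spill_copies` are hypothetical stand-ins, and plain `int` offsets replace the `_regalloc->reg2offset(...)` lookups.

```cpp
#include <cassert>

// Hypothetical stand-ins for where a spill copy reads from and writes to.
enum class Loc { Reg, Stack };

struct SpillCopy {
  Loc src;
  Loc dst;
  int src_offset;  // meaningful only when src == Loc::Stack
  int dst_offset;  // meaningful only when dst == Loc::Stack
};

// Helper: a register -> stack spill to the given offset.
inline SpillCopy spill_to(int offset) {
  return {Loc::Reg, Loc::Stack, 0, offset};
}

// Helper: a stack -> register reload from the given offset.
inline SpillCopy reload_from(int offset) {
  return {Loc::Stack, Loc::Reg, offset, 0};
}

// Three-way comparison: negative, zero, or positive when the stack offset of
// `first` is less than, equal to, or greater than that of `second`.
// Copies of different kinds are treated as not comparable and yield 0.
inline int compare_spill_copies(const SpillCopy& first, const SpillCopy& second) {
  // reg -> stack compared with reg -> stack: order by destination slot.
  if (first.dst == Loc::Stack && second.dst == Loc::Stack &&
      first.src == Loc::Reg && second.src == Loc::Reg) {
    return first.dst_offset - second.dst_offset;
  }
  // stack -> reg compared with stack -> reg: order by source slot.
  if (first.src == Loc::Stack && second.src == Loc::Stack &&
      first.dst == Loc::Reg && second.dst == Loc::Reg) {
    return first.src_offset - second.src_offset;
  }
  return 0;  // Not comparable.
}
```

One consequence of this convention is that "equal offsets" and "not comparable" both return 0, so a caller that tests `> 0` treats the two cases the same way.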

  break;
} else if (_current_latency[_available[i]->_idx] == latency &&
           n->is_MachSpillCopy() && _available[i]->is_MachSpillCopy() &&
           compare_two_spill_nodes(n, _available[i])) {

Suggested change
- compare_two_spill_nodes(n, _available[i])) {
+ compare_two_spill_nodes(n, _available[i]) > 0) {

- // Insert in latency order (insertion sort)
+ // Insert in latency order (insertion sort). If two MachSpillCopyNodes
+ // for stack spilling or unspilling have the same latency, we sort
+ // them in the order of stack offset. Some backends (aarch64) may also

Suggested change
- // them in the order of stack offset. Some backends (aarch64) may also
+ // them in the order of stack offset. Some ports (e.g. aarch64) may also
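
The insertion rule described in that comment can be sketched as follows. This is a simplified model rather than the scheduler itself: `Candidate`, `add_to_available` and `offsets_after_inserting` are hypothetical names, and the primary key (ascending latency, break at the first entry with a greater latency) is an assumption, since the quoted hunks only show the tie-break branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a schedulable node.
struct Candidate {
  int latency;
  bool is_spill;     // plays the role of is_MachSpillCopy()
  int stack_offset;  // meaningful only when is_spill is true
};

// Insertion sort step. Assumption: the list is kept in ascending latency
// order. For equal latency, a spill candidate is placed in front of the
// first spill entry whose stack offset is smaller (comparison result > 0).
inline void add_to_available(std::vector<Candidate>& available, const Candidate& n) {
  std::size_t i = 0;
  for (; i < available.size(); ++i) {
    if (available[i].latency > n.latency) {
      break;
    }
    if (available[i].latency == n.latency && n.is_spill &&
        available[i].is_spill &&
        n.stack_offset - available[i].stack_offset > 0) {
      break;
    }
  }
  available.insert(available.begin() + static_cast<std::ptrdiff_t>(i), n);
}

// Insert equal-latency spill candidates in the given order and return the
// resulting list as stack offsets.
inline std::vector<int> offsets_after_inserting(const std::vector<int>& offsets) {
  std::vector<Candidate> available;
  for (int off : offsets) {
    add_to_available(available, Candidate{1, true, off});
  }
  std::vector<int> result;
  for (const Candidate& c : available) {
    result.push_back(c.stack_offset);
  }
  return result;
}
```

In this model, inserting equal-latency spills with offsets 16, 24, 20 leaves the list as 24, 20, 16. How list position maps to the final instruction order depends on how the scheduler consumes the list, which this sketch does not model.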

// Comparison between stack -> reg and stack -> reg
if (OptoReg::is_stack(first_src_lo) && OptoReg::is_stack(second_src_lo) &&
    OptoReg::is_reg(first_dst_lo) && OptoReg::is_reg(second_dst_lo)) {
  return _regalloc->reg2offset(first_src_lo) > _regalloc->reg2offset(second_src_lo);

Suggested change
- return _regalloc->reg2offset(first_src_lo) > _regalloc->reg2offset(second_src_lo);
+ return _regalloc->reg2offset(first_src_lo) - _regalloc->reg2offset(second_src_lo);

  return _regalloc->reg2offset(first_dst_lo) > _regalloc->reg2offset(second_dst_lo);
}

return false;

Suggested change
- return false;
+ return 0; // Not comparable.

@theRealAph

This is a good idea, although the real-world gains are small. I'd wonder if this was worth doing for non-AArch64 ports, although even on others sorting the accesses into order might help.

fg1417 commented Nov 23, 2023

Hi @theRealAph , thanks a lot for your review! All comments have been resolved in the new commit.

> This is a good idea, although the real-world gains are small. I'd wonder if this was worth doing for non-AArch64 ports, although even on others sorting the accesses into order might help.

Yeah, that's also bothering me. I'm not sure whether it benefits other ports. Do you think we need to make the change AArch64-only? Thanks.

@theRealAph

#ifdefs are probably worse, so I'd leave it as it is. We need another reviewer.


openjdk bot commented Nov 23, 2023

@fg1417 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64

Reviewed-by: aph, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 60 new commits pushed to the master branch:

  • 1bb250c: 8261837: SIGSEGV in ciVirtualCallTypeData::translate_from
  • 5f7f2c4: 8320249: tools/jpackage/share/AddLauncherTest.java#id1 fails intermittently on Windows in verifyDescription
  • 6871a2f: 8320803: Update SourceVersion.RELEASE_22 description for language changes
  • 82967f4: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy
  • f0a12c5: 8320763: Fix spacing arround assignment in spec.gmk.in
  • 12e983a: 8194743: Compiler implementation for Statements before super()
  • 5e24aaf: 8320001: javac crashes while adding type annotations to the return type of a constructor
  • f9e9131: 8319703: Serial: Remove generationSpec
  • a006d7e: 8294549: configure script should detect unsupported path
  • 4977922: 8320330: Improve implementation of RShift Value
  • ... and 50 more: https://git.openjdk.org/jdk/compare/e47cf611c9490225e50a548787cbba66ab147058...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 23, 2023
fg1417 commented Nov 27, 2023

@theRealAph thanks for your review!

May I have another review? Maybe @vnkozlov? Thanks.


@vnkozlov vnkozlov left a comment

Looks reasonable. It may help other platforms' prefetchers because you are ordering memory accesses.
I will run testing before approval.


@vnkozlov vnkozlov left a comment

My tier1-4,xcomp,stress testing passed.

fg1417 commented Nov 28, 2023

Thanks for all your review and testing work! @vnkozlov @theRealAph

I'll integrate it if there is no other comment.

fg1417 commented Nov 29, 2023

/integrate


openjdk bot commented Nov 29, 2023

Going to push as commit 3ccd02f.
Since your change was applied there have been 85 commits pushed to the master branch:

  • 2c4c6c9: 8320049: PKCS10 would not discard the cause when throw SignatureException on invalid key
  • f93b18f: 8320932: [BACKOUT] dsymutil command leaves around temporary directories
  • ce4e6e2: 8320915: Update copyright year in build files
  • 21d361e: 8320525: G1: G1UpdateRemSetTrackingBeforeRebuild::distribute_marked_bytes accesses partially unloaded klass
  • dc256fb: 8320061: [nmt] Multiple issues with peak accounting
  • adad132: 8320767: Use := wherever possible in spec.gmk.in
  • 69c0b24: 8320714: java/util/Locale/LocaleProvidersRun.java and java/util/ResourceBundle/modules/visibility/VisibilityTest.java timeout after passing
  • 66ae6d5: 8320899: Select the correct Makefile when running make in build directory
  • ebbef62: 8320769: Remove ill-adviced "make install" target
  • 86bb804: 8320863: dsymutil command leaves around temporary directories
  • ... and 75 more: https://git.openjdk.org/jdk/compare/e47cf611c9490225e50a548787cbba66ab147058...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 29, 2023
@openjdk openjdk bot closed this Nov 29, 2023
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Nov 29, 2023
@openjdk openjdk bot removed the rfr Pull request is ready for review label Nov 29, 2023

openjdk bot commented Nov 29, 2023

@fg1417 Pushed as commit 3ccd02f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
