8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) #20098
Conversation
👋 Welcome back galder! A progress list of the required criteria for merging this PR into …
@galderz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be: …

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 82 new commits pushed to the …

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jaskarth, @jddarcy, @rwestrel, @eme64, @chhagedorn) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type …
Webrevs
The C2 changes look nice! I just added one comment here about style. It would also be good to add some IR tests checking that the intrinsic is creating MaxL/MinL nodes before macro expansion, and a microbenchmark to compare results.
```c++
//------------------------------inline_long_min_max------------------------------
bool LibraryCallKit::inline_long_min_max(vmIntrinsics::ID id) {
  assert(callee()->signature()->size() == 4, "minL/maxL has 2 parameters of size 2 each.");
  Node *a = argument(0);
```
```diff
-  Node *a = argument(0);
+  Node* a = argument(0);
```
And the same for b and n as well.
Overall, looks fine. So, there will be … Also, it's a bit confusing to see int variant names w/o basic type (…).
/reviewers 2
Core libs changes look fine; bumping review count for the remainder of the PR.
Thanks for the review. +1 to the IR tests, I'll work on those. Re: microbenchmark, what do you have exactly in mind? For vectorization performance there is … I would not expect performance differences in linux/x64 or darwin/aarch64.
Yeah, I think a simple benchmark that tests for long min/max vectorization and reduction would be good. I worry that checking performance manually like in … could miss regressions.
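A minimal sketch of the kernels such a benchmark could exercise, written as plain Java (in a real benchmark these would be wrapped in JMH `@Benchmark` methods; the class and method names here are hypothetical):

```java
// Sketch of long min/max kernels a JMH benchmark could wrap.
// Names are hypothetical; a real benchmark would annotate the callers
// with @Benchmark and consume results via a Blackhole.
public class LongMinMaxKernels {
    // Reduction: collapses the whole array into a single long,
    // carrying a cross-iteration dependency on 'result'.
    static long maxReduction(long[] a) {
        long result = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            result = Math.max(result, a[i]);
        }
        return result;
    }

    // Element-wise (non-reduction): each iteration is independent,
    // so it is a straightforward vectorization candidate.
    static void maxElementWise(long[] a, long[] b, long[] out) {
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }
}
```

Comparing both shapes matters because, as discussed below, reductions and lanewise operations take different paths in SuperWord.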
I've been working on some JMH benchmarks and I'm seeing some strange results that I need to investigate further. I will update the PR when I have found the reason(s).
@galderz This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Working on it.
@galderz in the benchmark did you collect the mispredicts/branches?
@franz1981 No I hadn't done so until now, but I will be tracking those more closely.

Context: I have been running some reduction JMH benchmarks and I could see a big drop in non-AVX-512 performance compared to the unpatched code. E.g.

```java
@Benchmark
public long reductionSingleLongMax() {
    long result = 0;
    for (int i = 0; i < size; i++) {
        final long v = 11 * aLong[i];
        result = Math.max(result, v);
    }
    return result;
}
```

This is caused by keeping the Max/Min nodes in the IR, which get translated into …

FYI: A similar situation can be replicated with reduction benchmarks that use max/min integer, but for the code to fall back into …

I also need to see what the performance looks like on a system with AVX-512, and also look at how non-reduction JMH benchmarks behave on systems with/without AVX-512. Finally, I'm also looking at an experiment to see what would happen if cmovl was implemented with branch+mov instead.
* movl only moves 4 bytes, which is not enough here; movq is needed, which moves 8 bytes, a Java long.
Before noting the regressions, it's worth noting that this PR also improves performance in certain scenarios. I will summarise those tomorrow. Here's a summary of the regressions:

**Regression 1**

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when SuperWord finds the pattern not profitable, HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions: …

**Regression 2**

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when the platform does not support the vector instructions needed to achieve this (e.g. AVX-512 quad-word vpmax/vpmin), HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions: …

**Regression 3**

Given a loop with a long min/max non-reduction pattern (e.g. …) …

Possible solutions: …
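The "one side of the branch taken near 100% of the time" condition behind these regressions can be made concrete with a small, purely illustrative snippet: input ordering alone determines how skewed the branch is.

```java
// Hypothetical illustration of the "extreme probability" inputs behind
// the regressions: with ascending data the max updates on every
// iteration (branch ~100% taken); with descending data it never
// updates after the first element (branch ~0% taken thereafter).
public class BranchSkewDemo {
    // Counts how often the "update" side of the max branch is taken.
    static int countMaxUpdates(long[] a) {
        long result = Long.MIN_VALUE;
        int updates = 0;
        for (long v : a) {
            if (v > result) {   // the branch a cmov would replace
                result = v;
                updates++;
            }
        }
        return updates;
    }
}
```

With such skew, a branch predictor is almost always right, so branchy code beats the unconditional data dependency of a cmov.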
@galderz Thanks for the summary of regressions! Yes, there are plenty of speedups, I assume primarily because of …

All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%); you listed no others. That matches my intuition that branching code is usually better than cmove in extreme-probability cases.

As for possible solutions: in all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. …).

Does that make sense, or am I missing something?
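To make the "use branching code instead of scalar cmove" suggestion concrete, here is a hypothetical sketch of the two semantically equivalent scalar forms being discussed; only the machine code the JIT emits for them differs, never the result:

```java
// Two semantically equivalent scalar long-max reductions.
// With this PR, the Math.max form keeps a MaxL node that macro-expands
// to a conditional move (cmovq on x64); the explicit-branch form is
// what "using branching code" would correspond to. Names hypothetical.
public class MaxForms {
    static long maxWithMathMax(long[] a) {
        long result = Long.MIN_VALUE;
        for (long v : a) {
            result = Math.max(result, v); // cmov candidate after expansion
        }
        return result;
    }

    static long maxWithBranch(long[] a) {
        long result = Long.MIN_VALUE;
        for (long v : a) {
            if (v > result) { // predictor wins when probability is extreme
                result = v;
            }
        }
        return result;
    }
}
```

Since both forms agree on every input, choosing between them is purely a profile-guided code-generation decision.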
+1, and to the rest of the suggestions. Shall I create a JDK bug for this? Do we need JDK bug(s) for these? If so, how many: 1 or 2?
Also, I've started a discussion on jmh-dev to see if there's a way to minimise pollution of …
@galderz about: …

This should already be covered by these, and I will handle that eventually with the Cost-Model RFE JDK-8340093: …
@galderz about: …

I looked at … So it seems that here lanewise min/max are supported for AVX2. But it seems that's different for reductions: …

So it seems maybe we could improve the AVX2 coverage for reductions. But honestly, I will probably find this issue again once I work on the other reductions above and run the benchmarks. I think that will make it easier to investigate all of this. I will, for example, adjust the IR rules, and then it will be apparent where there are cases that are not covered.
@galderz you said you would add some extra comments, then I will review again :)
@eme64 I've added the comment that was pending from your last review. I've also merged latest master. |
Looks good, thanks for all the updates :)
I'm launching another round of testing on our side ;)
@eme64 I've run tier[1-3] locally and it looked good overall. I had to update jtreg and noticed this failure, but I don't think it's related to this PR: …

I've created JDK-8351409 to address this.
Good work and collection of all the data!
/integrate
Thanks @eme64 @rwestrel @chhagedorn for your patience with this!

/sponsor
Going to push as commit 4e51a8c.
Your commit was automatically rebased without conflicts.
This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.

Currently vectorization does not kick in for loops containing either of these calls because of the following error: …

The control flow is due to the Java implementation of these methods, e.g. …
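For reference, the library implementations are simple ternaries, which the compiler sees as compare-and-branch control flow (reproduced here in a standalone class mirroring what `java.lang.Math` does):

```java
// Mirrors the java.lang.Math implementations of max/min for longs:
// the ternary compiles to compare-and-branch (CmpL + Bool) control
// flow, which is what blocks vectorization before this patch.
public class LongMinMax {
    public static long max(long a, long b) {
        return (a >= b) ? a : b;
    }

    public static long min(long a, long b) {
        return (a <= b) ? a : b;
    }
}
```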
This patch intrinsifies the calls, replacing the CmpL + Bool nodes with MaxL/MinL nodes respectively.
By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
E.g.
Applying the same changes to `ReductionPerf` as in #13056, we can compare the results before and after.

Before the patch, on darwin/aarch64 (M1): …

After the patch, on darwin/aarch64 (M1): …
This patch does not add platform-specific backend implementations for the MaxL/MinL nodes.
Therefore, it still relies on the macro expansion to transform those into CMoveL.
I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
The failure I got is CODETOOLS-7903745 so unrelated to these changes.
Progress
Issue
Reviewers
Reviewing
Using `git`

Checkout this PR locally:

```
$ git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098
$ git checkout pull/20098
```

Update a local copy of the PR:

```
$ git checkout pull/20098
$ git pull https://git.openjdk.org/jdk.git pull/20098/head
```

Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 20098
```

View PR using the GUI difftool:

```
$ git pr show -t 20098
```

Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20098.diff
Using Webrev
Link to Webrev Comment