
Conversation

@galderz
Contributor

@galderz galderz commented Jul 9, 2024

This patch intrinsifies Math.max(long, long) and Math.min(long, long) in order to help improve vectorization performance.

Currently vectorization does not kick in for loops containing either of these calls because of the following error:

VLoop::check_preconditions: failed: control flow in loop not allowed

The control flow comes from the Java implementation of these methods, e.g.

public static long max(long a, long b) {
    return (a >= b) ? a : b;
}

This patch intrinsifies the calls, replacing the CmpL + Bool node patterns with MaxL/MinL nodes respectively.
As a result, SuperWord no longer finds control flow in the loop and can carry out the vectorization.
E.g.

SuperWord::transform_loop:
    Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt strip_mined
 518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
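
For illustration, here is a minimal reduction loop of the shape this patch enables (class and method names are hypothetical, modeled on the ReductionPerf-style tests mentioned below):

```java
// Hypothetical example of a loop shape affected by this patch: a long max
// reduction. Before the patch, the compare/branch inside Math.max showed up
// as control flow in the loop body, so SuperWord refused to vectorize it;
// with MaxL nodes in the IR, the loop body is branch-free and can vectorize.
class LongMaxReduction {
    static long maxReduce(long[] a) {
        long result = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            result = Math.max(result, a[i]); // intrinsified to a MaxL node
        }
        return result;
    }
}
```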

Applying the same changes to ReductionPerf as in #13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1):

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
                                                         1     1     0     0
==============================
TEST SUCCESS

long min   1155
long max   1173

After the patch, on darwin/aarch64 (M1):

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
                                                         1     1     0     0
==============================
TEST SUCCESS

long min   1042
long max   1042

This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
Therefore, it still relies on macro expansion to transform those nodes into CMoveL.

I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:tier1                     2500  2500     0     0
>> jtreg:test/jdk:tier1                               2413  2412     1     0 <<
   jtreg:test/langtools:tier1                         4556  4556     0     0
   jtreg:test/jaxp:tier1                                 0     0     0     0
   jtreg:test/lib-test:tier1                            33    33     0     0
==============================

The failure I got is CODETOOLS-7903745, so it is unrelated to these changes.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098
$ git checkout pull/20098

Update a local copy of the PR:
$ git checkout pull/20098
$ git pull https://git.openjdk.org/jdk.git pull/20098/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20098

View PR using the GUI difftool:
$ git pr show -t 20098

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20098.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 9, 2024

👋 Welcome back galder! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 9, 2024

@galderz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)

Reviewed-by: roland, epeter, chagedorn, darcy

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 82 new commits pushed to the master branch:

  • 7e3bc81: 8351216: ZGC: Store NUMA node count
  • 82eb780: 8351349: GSSUtil.createSubject has outdated access control context and policy related text
  • c3db667: 8351542: LIBMANAGEMENT_OPTIMIZATION remove special optimization settings
  • 375722f: 8351839: RISC-V: Fix base offset calculation introduced in JDK-8347489
  • 4c5956d: 8350866: [x86] Add C1 intrinsics for CRC32-C
  • 9c00331: 8330469: C2: Remove or change "PrintOpto && VerifyLoopOptimizations" as printing code condition
  • c18494d: 8351108: ImageIO.write(..) fails with exception when writing JPEG with IndexColorModel
  • 86860ca: 8346916: [REDO] align_up has potential overflow
  • a33b1f7: 8345298: RISC-V: Add riscv backend for Float16 operations - scalar
  • 6241d09: 8351861: RISC-V: add simple assert at arrays_equals_v
  • ... and 72 more: https://git.openjdk.org/jdk/compare/7314efc9483c5db6ecccd9215c04d78818e6a9a2...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jaskarth, @jddarcy, @rwestrel, @eme64, @chhagedorn) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 9, 2024
@openjdk

openjdk bot commented Jul 9, 2024

@galderz The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jul 9, 2024
@mlbridge

mlbridge bot commented Jul 9, 2024

Member

@jaskarth jaskarth left a comment


The C2 changes look nice! I just added one comment here about style. It would also be good to add some IR tests checking that the intrinsic is creating MaxL/MinL nodes before macro expansion, and a microbenchmark to compare results.

//------------------------------inline_long_min_max------------------------------
bool LibraryCallKit::inline_long_min_max(vmIntrinsics::ID id) {
assert(callee()->signature()->size() == 4, "minL/maxL has 2 parameters of size 2 each.");
Node *a = argument(0);
Member


Suggested change
Node *a = argument(0);
Node* a = argument(0);

And the same for b and n as well.

@iwanowww
Contributor

Overall, looks fine.

So, there will be inline_min_max, inline_fp_min_max, and inline_long_min_max, which vary only slightly. I'd prefer to see them unified. (Or, at least, enhance inline_min_max to cover the minL/maxL cases.)

Also, it's a bit confusing to see the int variant names without the basic type (_min/_minL vs _minI/_minL). Please clean that up along the way. (FTR I'm also fine with handling the renaming as a separate change.)

@jddarcy
Member

jddarcy commented Jul 12, 2024

/reviewers 2 reviewer

@jddarcy
Member

jddarcy commented Jul 12, 2024

Core libs changes look fine; bumping the review count for the remainder of the PR.

@openjdk

openjdk bot commented Jul 12, 2024

@jddarcy
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

@galderz
Contributor Author

galderz commented Jul 17, 2024

The C2 changes look nice! I just added one comment here about style. It would also be good to add some IR tests checking that the intrinsic is creating MaxL/MinL nodes before macro expansion, and a microbenchmark to compare results.

Thanks for the review. +1 to the IR tests, I'll work on those.

Re: microbenchmark - what exactly do you have in mind? For vectorization performance there is ReductionPerf, though it's not a microbenchmark per se. Do you want a microbenchmark for the performance of vectorized max/min long? For non-vectorization performance there is MathBench.

I would not expect performance differences in MathBench because the backend is still the same and this change really benefits vectorization. I've run the min/max long tests on darwin/aarch64 and linux/x64 and indeed I see no difference:

linux/x64

Benchmark          (seed)   Mode  Cnt        Score       Error   Units
MathBench.maxLong       0  thrpt    8  1464197.164 ± 27044.205  ops/ms # base
MathBench.minLong       0  thrpt    8  1469917.328 ± 25397.401  ops/ms # base
MathBench.maxLong       0  thrpt    8  1469615.250 ± 17950.429  ops/ms # patched
MathBench.minLong       0  thrpt    8  1456290.514 ± 44455.727  ops/ms # patched

darwin/aarch64

Benchmark          (seed)   Mode  Cnt        Score        Error   Units
MathBench.maxLong       0  thrpt    8  1739341.447 ± 210983.444  ops/ms # base
MathBench.minLong       0  thrpt    8  1659547.649 ± 260554.159  ops/ms # base
MathBench.maxLong       0  thrpt    8  1660449.074 ± 254534.725  ops/ms # patched
MathBench.minLong       0  thrpt    8  1729728.021 ±  16327.575  ops/ms # patched

@jaskarth
Member

Do you want a microbenchmark for the performance of vectorized max/min long?

Yeah, I think a simple benchmark that tests for long min/max vectorization and reduction would be good. I worry that checking performance manually like in ReductionPerf can lead to harder-to-interpret results than with a microbenchmark, especially with VM warmup 😅 Thanks for looking into this!

@galderz
Contributor Author

galderz commented Jul 24, 2024

I've been working on some JMH benchmarks and I'm seeing some strange results that I need to investigate further. I will update the PR when I have found the reason(s).

@bridgekeeper

bridgekeeper bot commented Aug 21, 2024

@galderz This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@galderz
Contributor Author

galderz commented Aug 27, 2024

Working on it

@franz1981

@galderz in the benchmark, did you collect the mispredicts/branches?

@galderz
Contributor Author

galderz commented Sep 9, 2024

@franz1981 No I hadn't done so until now, but I will be tracking those more closely.

Context:

I have been running some reduction JMH benchmarks and I could see a big drop in non-AVX-512 performance compared to the unpatched code. E.g.

    @Benchmark
    public long reductionSingleLongMax() {
        long result = 0;
        for (int i = 0; i < size; i++) {
            final long v = 11 * aLong[i];
            result = Math.max(result, v);
        }
        return result;
    }

This is caused by keeping the Max/Min nodes in the IR, which get translated into cmpq+cmovlq instructions (via the macro expansion). The code gets unrolled, but a dependency chain on the current max value remains. In the unpatched code the intrinsic does not kick in and the standard ternary operation is used, which gets translated into normal control flow. The system handles this better thanks to branch prediction. @franz1981's comment is precisely about this. I need to enhance the benchmark to control the branchiness of the test (e.g. how often it goes to one side or the other of a max/min call) and measure the mispredictions, branches, etc.
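
As a rough illustration only (Java sketches, not the actual generated machine code), the two scalar forms can be contrasted as follows; the branchless select mirrors what a cmov-style lowering does, and makes each iteration depend on the previous result:

```java
// Illustrative sketch (an assumption about the machine-level difference, not
// the emitted code). The branchy form lets the CPU speculate past a
// well-predicted compare; the branchless select (analogous to cmpq+cmov)
// serializes iterations on the running max.
class MaxForms {
    // Branchy form: updates only when a new max is found; cheap when the
    // branch is very one-sided and well predicted.
    static long branchyMax(long[] a) {
        long result = a[0];
        for (int i = 1; i < a.length; i++) {
            if (a[i] > result) {
                result = a[i];
            }
        }
        return result;
    }

    // Branchless select: the mask computation depends on the previous result
    // every iteration, forming a serial dependency chain. (Sketch only: the
    // subtraction can overflow for values near Long.MIN/MAX_VALUE.)
    static long selectMax(long[] a) {
        long result = a[0];
        for (int i = 1; i < a.length; i++) {
            long mask = (result - a[i]) >> 63; // all ones iff result < a[i]
            result = (a[i] & mask) | (result & ~mask);
        }
        return result;
    }
}
```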

FYI: A similar situation can be replicated with reduction benchmarks that use max/min integer, but for the code to fall back to cmov, both AVX and SSE have to be turned off.

I also need to see what the performance looks like on a system with AVX-512, and also look at how non-reduction JMH benchmarks behave on systems with/without AVX-512.

Finally, I'm also looking at an experiment to see what would happen if cmovl was implemented with branch+mov instead.

* movl only moves 4 bytes, which is not enough here; movq, which moves 8 bytes (a Java long), is needed.
@galderz
Contributor Author

galderz commented Feb 26, 2025

That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability.
What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here.

I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create.

Before noting the regressions, it's worth noting that this PR also improves performance in certain scenarios. I will summarise those tomorrow.

Here's a summary of the regressions:

Regression 1

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when SuperWord finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) make SuperWord recognise these scenarios as profitable.

Regression 2

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when the platform does not support the vector instructions to achieve this (e.g. AVX-512 quad-word vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) find a way to use other vector instructions (vpcmp+vpblend+vmov?)
b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided

Regression 3

Given a loop with a long min/max non-reduction pattern (e.g. longLoopMax) with one side of the branch taken near 100% of the time, when the platform does not vectorize it (either for lack of CPU instruction support, or because SuperWord finds it not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) find a way to use other vector instructions (e.g. longLoopMax vectorizes with AVX2 and might also do with earlier instruction sets)
b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided.
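
For clarity, here is a sketch of the non-reduction pattern referred to as longLoopMax above (the name is taken from the discussion; the exact benchmark body is an assumption):

```java
// Hypothetical sketch of the non-reduction long max pattern discussed above:
// each output element is computed independently, so there is no
// cross-iteration dependency chain, unlike the reduction case.
class LongLoopMax {
    static void longLoopMax(long[] a, long[] b, long[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = Math.max(a[i], b[i]); // lanewise max; maps to MaxV when vectorized
        }
    }
}
```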

@eme64
Contributor

eme64 commented Feb 27, 2025

@galderz Thanks for the summary of regressions! Yes, there are plenty of speedups, I assume primarily because of Long.min/max vectorization, but possibly also because the operation can now "float" out of a loop for example.

All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%); you listed no others. That matches my intuition that branching code is usually better than cmove in extreme-probability cases.

As for possible solutions: in all Regression 1-3 cases, it seems the issue is scalar cmove. So in all cases a possible solution is using branching code (i.e. cmp+mov). To me, these are the follow-up RFEs:

  • Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the Integer.min/max cases, which have the same issue.
  • Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional.
  • Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional.

Does that make sense, or am I missing something?

@galderz
Contributor Author

galderz commented Feb 27, 2025

Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the Integer.min/max cases, which have the same issue.

+1 and the rest of suggestions. Shall I create a JDK bug for this?

Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional.
Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional.

Do we need JDK bug(s) for these? If so, how many? 1 or 2?

@galderz
Contributor Author

galderz commented Feb 27, 2025

Also, I've started a discussion on jmh-dev to see if there's a way to minimise pollution of Math.min(II) compilation. As a follow-up to #20098 (comment), I looked at where the other Math.min(II) calls are coming from, and a big chunk seems related to the JMH infrastructure.

@eme64
Contributor

eme64 commented Mar 6, 2025

@galderz about:

Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional.

This should already be covered by the issues below, and I will eventually handle it with the Cost-Model RFE JDK-8340093:

  • JDK-8345044 Sum of array elements not vectorized
    • (min/max of array)
  • JDK-8336000 C2 SuperWord: report that 2-element reductions do not vectorize
    • You would for example see that on aarch64 machines with only neon/asimd support you can have at most 2 longs per vector, because the max vector length is 128 bits.

@eme64
Contributor

eme64 commented Mar 6, 2025

@galderz about:

Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional.

I looked at src/hotspot/cpu/x86/x86.ad
bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType bt) {

   1774     case Op_MaxV:                                                                                                                                                                                                             
   1775     case Op_MinV:
   1776       if (UseSSE < 4 && is_integral_type(bt)) {
   1777         return false;
   1778       }
...

So it seems that here lanewise min/max are supported for AVX2. But it seems that's different for reductions:

   1818     case Op_MinReductionV:
   1819     case Op_MaxReductionV:                                                                                                                                                                                                    
   1820       if ((bt == T_INT || is_subword_type(bt)) && UseSSE < 4) {
   1821         return false;
   1822       } else if (bt == T_LONG && (UseAVX < 3 || !VM_Version::supports_avx512vlbwdq())) {
   1823         return false;
   1824       }
...

So it seems maybe we could improve the AVX2 coverage for reductions. But honestly, I will probably find this issue again once I work on the other reductions above, and run the benchmarks. I think that will make it easier to investigate all of this. I will for example adjust the IR rules, and then it will be apparent where there are cases that are not covered.

@eme64
Contributor

eme64 commented Mar 6, 2025

@galderz you said you would add some extra comments, then I will review again :)

@galderz
Contributor Author

galderz commented Mar 7, 2025

@eme64 I've added the comment that was pending from your last review. I've also merged latest master.

Copy link
Contributor

@eme64 eme64 left a comment


Looks good, thanks for all the updates :)

@eme64
Contributor

eme64 commented Mar 7, 2025

I'm launching another round of testing on our side ;)

@galderz
Contributor Author

galderz commented Mar 7, 2025

@eme64 I've run tier[1-3] locally and the results looked good overall. I had to update jtreg and noticed this failure, but I don't think it's related to this PR:

java.lang.AssertionError: gtest execution failed; exit code = 2. the failed tests: [codestrings::validate_vm]
	at GTestWrapper.main(GTestWrapper.java:98)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335)
	at java.base/java.lang.Thread.run(Thread.java:1447)

@galderz
Contributor Author

galderz commented Mar 7, 2025

As for possible solutions. In all Regression 1-3 cases, it seems the issue is scalar cmove. So actually in all cases a possible solution is using branching code (i.e. cmp+mov). So to me, these are the follow-up RFE's:

  • Detect "extreme" probability scalar cmove, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the Integer.min/max cases, which have the same issue.

I've created JDK-8351409 to address this.

@eme64
Contributor

eme64 commented Mar 7, 2025

@galderz Excellent. Testing looks all good on our side. Yes I think what you saw was unrelated.
@rwestrel Could you give this a last quick scan? Then I think you can integrate :)

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 10, 2025
Member

@chhagedorn chhagedorn left a comment


Good work and collection of all the data!

@galderz
Contributor Author

galderz commented Mar 13, 2025

/integrate

@galderz
Contributor Author

galderz commented Mar 13, 2025

Thanks @eme64 @rwestrel @chhagedorn for your patience with this!

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Mar 13, 2025
@openjdk

openjdk bot commented Mar 13, 2025

@galderz
Your change (at version 1aa690d) is now ready to be sponsored by a Committer.

@rwestrel
Contributor

/sponsor

@openjdk

openjdk bot commented Mar 13, 2025

Going to push as commit 4e51a8c.
Since your change was applied there have been 82 commits pushed to the master branch:

  • 7e3bc81: 8351216: ZGC: Store NUMA node count
  • 82eb780: 8351349: GSSUtil.createSubject has outdated access control context and policy related text
  • c3db667: 8351542: LIBMANAGEMENT_OPTIMIZATION remove special optimization settings
  • 375722f: 8351839: RISC-V: Fix base offset calculation introduced in JDK-8347489
  • 4c5956d: 8350866: [x86] Add C1 intrinsics for CRC32-C
  • 9c00331: 8330469: C2: Remove or change "PrintOpto && VerifyLoopOptimizations" as printing code condition
  • c18494d: 8351108: ImageIO.write(..) fails with exception when writing JPEG with IndexColorModel
  • 86860ca: 8346916: [REDO] align_up has potential overflow
  • a33b1f7: 8345298: RISC-V: Add riscv backend for Float16 operations - scalar
  • 6241d09: 8351861: RISC-V: add simple assert at arrays_equals_v
  • ... and 72 more: https://git.openjdk.org/jdk/compare/7314efc9483c5db6ecccd9215c04d78818e6a9a2...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 13, 2025
@openjdk openjdk bot closed this Mar 13, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Mar 13, 2025
@openjdk

openjdk bot commented Mar 13, 2025

@rwestrel @galderz Pushed as commit 4e51a8c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
