Skip to content

8309583: AArch64: Optimize firstTrue() when amount of elements < 8 #14373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

changpeng1997
Copy link
Contributor

@changpeng1997 changpeng1997 commented Jun 8, 2023

This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.

VectorMask.firstTrue() should return VLEGNTH when vector mask is all false 1. Current implementation uses rbit and then clz 2 to count leading zeros, then uses csel 3 (conditional select) to get the smaller value between VLENGTH and the number of unset lanes to ensure correctness.

This patch sets the 16th or 32nd bit as 1, when there are only 2 or 4 elements in boolean masks, before rbit and clz. With this trick, maximum value calculated in such case will be VLENGTH (2 or 4).

Test:
All vector and vectorapi test passed.

Performance:
The benchmark functions are in MaskQueryOperationsBenchmark.java [4]. This patch also modifies above benchmark to measure mask operations' performance more effectively.

Following data is collected on a 128-bit Neon machine.

Benchmark (inputs) Mode Before After Units
MaskQueryOperationsBenchmark.testFirstTrueInt 1 thrpt 5952.670 7298.491 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueInt 2 thrpt 5951.513 7297.620 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueInt 3 thrpt 5953.048 7298.072 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong 1 thrpt 3496.990 4003.188 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong 2 thrpt 3497.755 4002.577 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong 3 thrpt 3500.085 4002.471 ops/ms


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8309583: AArch64: Optimize firstTrue() when amount of elements < 8 (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14373/head:pull/14373
$ git checkout pull/14373

Update a local copy of the PR:
$ git checkout pull/14373
$ git pull https://git.openjdk.org/jdk.git pull/14373/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14373

View PR using the GUI difftool:
$ git pr show -t 14373

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14373.diff

Webrev

Link to Webrev Comment

This patch optimizes VectorMask.firstTrue() on Neon when there are 2
or 4 elements in vector registers.

VectorMask.firstTrue() should return VLEGNTH when vector mask is all
false [1]. Current implementation uses rbit and then clz [2] to count
leading zeros, then uses csel [3] (conditional select) to get the
smaller value between VLENGTH and the number of unset lanes to ensure
correctness.

This patch sets the 16th or 32nd bit as 1, when there are only 2 or 4
elements in boolean masks, before rbit and clz. With this trick, maximum
value calculated in such case will be VLENGTH (2 or 4).

Test:
All vector and vectorapi test passed.

Performance:
The benchmark function is like:

```
@benchmark
public static int testInt() {
    int res = 0;
    for (int i = 0; i < LENGTH; i += INT_SPECIES.length()) {
        VectorMask<Integer> m = VectorMask.fromArray(INT_SPECIES, ia, i);
        res += m.firstTrue();
    }

    return res;
}
```

Following data is collected on a 128-bit Neon machine.

Benchmark     Before     After     Unit
testInt       22214.740  25627.833 ops/ms
testLong      11649.898  13698.535 ops/ms

[1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
[2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
[3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-

Change-Id: I4a2de805ffa4469f88d510c96617eae165f0e025
@bridgekeeper
Copy link

bridgekeeper bot commented Jun 8, 2023

👋 Welcome back changpeng1997! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 8, 2023
@openjdk
Copy link

openjdk bot commented Jun 8, 2023

@changpeng1997 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jun 8, 2023
@mlbridge
Copy link

mlbridge bot commented Jun 8, 2023

Webrevs

@theRealAph
Copy link
Contributor

Where is the benchmark? You don't seem to have included it in this PR.

@changpeng1997
Copy link
Contributor Author

changpeng1997 commented Jun 19, 2023

Where is the benchmark? You don't seem to have included it in this PR.

@theRealAph

Sorry for the delay. Original performance was measured by a simple benchmark only measuring firstTrue()'s performance written by myself. When I wanted to add it to JDK I found an existing benchmark used to measure different mask operations' performance (jdk/test/micro/org/openjdk/bench/jdk/incubator/vector/MaskQueryOperationsBenchmark.java at master · openjdk/jdk · GitHub). I tried to measure firstTrue()'s performance by this benchmark, but I found Blackhole‘s proportion of hottest region is too high, like following:

[Hottest Methods (after inlining)].............................................................. 
57.15% c2, level 4 org.openjdk.jmh.infra.Blackhole::consumeFull, version 867 
41.99% c2, level 4 org.openjdk.bench.jdk.incubator.vector.jmh_generated.MaskQueryOperationsBenchmark_testFirstTrueLong_jmhTest::testFirstTrueLong_thrpt_jmhStub, version 883.

So I spent some time on fixing this benchmark to measure mask operations' performance effectively. After this update, the proportion of blackhole is below 10% for each benchmark function. And I also updated the performance of firstTrue() measured by this benchmark when there are only 2 or 4 elements in boolean masks.

@theRealAph
Copy link
Contributor

Sorry for the delay. Original performance was measured by a simple benchmark only measuring firstTrue()'s performance written by myself. When I wanted to add it to JDK I found an existing benchmark used to measure different mask operations' performance (jdk/test/micro/org/openjdk/bench/jdk/incubator/vector/MaskQueryOperationsBenchmark.java at master · openjdk/jdk · GitHub). I tried to measure firstTrue()'s performance by this benchmark, but I found Blackhole‘s proportion of hottest region is too high, like following:

Can you please send the entire output of JMH? Blackhole should not appear at all in the output because it's been intrinsified. I'd like to know why the intrinsic isn't working for you.

@changpeng1997
Copy link
Contributor Author

Sorry for the delay. Original performance was measured by a simple benchmark only measuring firstTrue()'s performance written by myself. When I wanted to add it to JDK I found an existing benchmark used to measure different mask operations' performance (jdk/test/micro/org/openjdk/bench/jdk/incubator/vector/MaskQueryOperationsBenchmark.java at master · openjdk/jdk · GitHub). I tried to measure firstTrue()'s performance by this benchmark, but I found Blackhole‘s proportion of hottest region is too high, like following:

Can you please send the entire output of JMH? Blackhole should not appear at all in the output because it's been intrinsified. I'd like to know why the intrinsic isn't working for you.

Output before this patch: https://gist.github.com/changpeng1997/734aa176577bfff56f5a87db9c8db69a
Output after this patch: https://gist.github.com/changpeng1997/73098069b8f814310d6606dfd7dc56c5

@theRealAph
Copy link
Contributor

Blackhole mode: full + dont-inline hint (default, use -Djmh.blackhole.autoDetect=true to auto-detect)

Could you ty this?

@theRealAph
Copy link
Contributor

theRealAph commented Jun 21, 2023

Something is wrong with your setup.
You should be seeing this:
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)

not this:
# Blackhole mode: full + dont-inline hint (default, use -Djmh.blackhole.autoDetect=true to auto-detect)

@shipilev
Copy link
Member

Output before this patch: https://gist.github.com/changpeng1997/734aa176577bfff56f5a87db9c8db69a
Output after this patch: https://gist.github.com/changpeng1997/73098069b8f814310d6606dfd7dc56c5

Blackhole mode autodetection was added in JMH 1.33, and enabled in JMH 1.34. The logs above say they run with JMH 1.33. Current version is 1.36, you need to upgrade, @changpeng1997.

Also, I notice that your before/after logs use different JVM modes, one uses release, and another uses fastdebug. These are not comparable.

@theRealAph
Copy link
Contributor

So I'm looking at the results of the patch and I see:

Before

Benchmark                                           (inputs)  Mode  Cnt   Score   Error  Units
MaskQueryOperationsBenchmark.testFirstTrueInt              1  avgt    3  69.547 ± 2.837  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              2  avgt    3  69.549 ± 0.497  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              3  avgt    3  69.506 ± 1.360  ns/op

After:

Benchmark                                           (inputs)  Mode  Cnt   Score   Error  Units
MaskQueryOperationsBenchmark.testFirstTrueInt              1  avgt    3  58.955 ± 0.838  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              2  avgt    3  58.690 ± 2.940  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              3  avgt    3  58.923 ± 1.088  ns/op

which corresponds with a change from

            0x00000001158ef748:   fmov	x11, d16
            0x00000001158ef74c:   rbit	x11, x11
            0x00000001158ef750:   clz	x11, x11
            0x00000001158ef754:   lsr	w11, w11, #3
           ;; 0x4
            0x00000001158ef758:   orr	w8, wzr, #0x4
            0x00000001158ef75c:   cmp	w11, w8
            0x00000001158ef760:   csel	w11, w8, w11, ge  // ge = tcont

to

            0x0000000115f3f8e8:   fmov	x14, d16
            0x0000000115f3f8ec:   orr	x14, x14, #0x100000000
            0x0000000115f3f8f0:   rbit	x14, x14
            0x0000000115f3f8f4:   clz	x14, x14
            0x0000000115f3f8f8:   lsr	w14, w14, #3

That's a pretty decent speedup when you consider that the benchmark is dominated by memory operations and vector->core register moves.

@theRealAph
Copy link
Contributor

If we care about memory ops, note that we can get a useful speedup with match(Set dst (VectorMaskFirstTrue (LoadVector mem))) but perhaps that's not worth doing.

Benchmark                                           (inputs)  Mode  Cnt   Score   Error  Units
MaskQueryOperationsBenchmark.testFirstTrueInt              1  avgt    3  49.591 ± 0.477  ns/op

I will say that in general if you have to work in the core integer processor on an in-memory vector , it might be worth loading straight into core registers rather than going via the SIMD regs. Maybe we should write a general-purpose function that bypasses the SIMD unit in all such cases.

Copy link
Contributor

@theRealAph theRealAph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@openjdk
Copy link

openjdk bot commented Jun 22, 2023

⚠️ @changpeng1997 the full name on your profile does not match the author name in this pull requests' HEAD commit. If this pull request gets integrated then the author name from this pull requests' HEAD commit will be used for the resulting commit. If you wish to push a new commit with a different author name, then please run the following commands in a local repository of your personal fork:

$ git checkout optimize_firsttrue2e4e_neon
$ git commit --author='Preferred Full Name <you@example.com>' --allow-empty -m 'Update full name'
$ git push

@openjdk
Copy link

openjdk bot commented Jun 22, 2023

@changpeng1997 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8309583: AArch64: Optimize firstTrue() when amount of elements < 8

Reviewed-by: aph, eliu

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 82 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @e1iu) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 22, 2023
@changpeng1997
Copy link
Contributor Author

Something is wrong with your setup. You should be seeing this: # Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)

not this: # Blackhole mode: full + dont-inline hint (default, use -Djmh.blackhole.autoDetect=true to auto-detect)

@theRealAph Sorry for the delay, I was on holiday last week.

I found that we can avoid the effects of blackhole by using -Djmh.blackhole.autoDetect=true, so I have reset this benchmark.

Following is the performance of testFirstTrueInt and testFirstTrueLong before and after this patch:

Benchmark                                       bits    (inputs)  Mode  Before      After      Units
MaskQueryOperationsBenchmark.testFirstTrueInt   128           1  thrpt  520650.354  580091.081 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueInt   128           2  thrpt  520677.937  580391.661 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueInt   128           3  thrpt  519967.269  580705.642 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong  128           1  thrpt  518563.126  575941.490 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong  128           2  thrpt  517329.190  578848.383 ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong  128           3  thrpt  517987.339  577601.752 ops/ms

And following are the corresponding JMH output:
before my patch: https://gist.github.com/changpeng1997/3ebe4b7beea93716d9f29d4ef71641af
after my patch: https://gist.github.com/changpeng1997/d306957370eb0bdbb8e71b601440cdaa

We can see the C2 code of firstTrue().

@changpeng1997
Copy link
Contributor Author

Output before this patch: https://gist.github.com/changpeng1997/734aa176577bfff56f5a87db9c8db69a
Output after this patch: https://gist.github.com/changpeng1997/73098069b8f814310d6606dfd7dc56c5

Blackhole mode autodetection was added in JMH 1.33, and enabled in JMH 1.34. The logs above say they run with JMH 1.33. Current version is 1.36, you need to upgrade, @changpeng1997.

Also, I notice that your before/after logs use different JVM modes, one uses release, and another uses fastdebug. These are not comparable.

@shipilev Thanks! Sorry for this mistake.

@changpeng1997
Copy link
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jun 26, 2023
@openjdk
Copy link

openjdk bot commented Jun 26, 2023

@changpeng1997
Your change (at version 8ce4ba8) is now ready to be sponsored by a Committer.

@e1iu
Copy link
Member

e1iu commented Jun 27, 2023

/sponsor

@openjdk
Copy link

openjdk bot commented Jun 27, 2023

Going to push as commit 45b581b.
Since your change was applied there have been 107 commits pushed to the master branch:

  • 87e6fab: 8310873: Re-enable locked_create_entry symbol check in runtime/NMT/CheckForProperDetailStackTrace.java for RISC-V
  • 39fa4e6: 8310489: New test runtime/ClassInitErrors/TestStackOverflowDuringInit.java failed
  • 46add3f: 8310909: java.io.InvalidObjectException has redundant @since tag
  • 8f5b677: 8310908: Non-standard @since tag in com.sun.java.accessibility.util.package-info
  • a197ee7: 8310838: Correct range notations in MethodTypeDesc specification
  • 7c6a28f: 8310922: java/lang/Class/forName/ForNameNames.java fails after being added by JDK-8310242
  • 7db2f08: 8310242: Clarify the name parameter to Class::forName
  • 297c799: 8301492: Modernize equals() method of ResourceBundle.CacheKey and Bundles.CacheKey
  • a08352f: 8305671: javac rejects semicolons in compilation units with no imports
  • ff9a754: 8310459: [BACKOUT] 8304450: [vectorapi] Refactor VectorShuffle implementation
  • ... and 97 more: https://git.openjdk.org/jdk/compare/959a61fdd483c9523764b9ba0972f59ca06db0ee...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 27, 2023
@openjdk openjdk bot closed this Jun 27, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Jun 27, 2023
@openjdk
Copy link

openjdk bot commented Jun 27, 2023

@e1iu @changpeng1997 Pushed as commit 45b581b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants