8309583: AArch64: Optimize firstTrue() when amount of elements < 8 #14373
This patch optimizes VectorMask.firstTrue() on Neon when a vector register holds 2 or 4 elements.

VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit and then clz [2] to count the leading zeros of the bit-reversed mask, then uses csel [3] (conditional select) to take the smaller of VLENGTH and the number of unset lanes, which ensures correctness.

This patch sets the 16th or 32nd bit to 1 before rbit and clz when the boolean mask has only 2 or 4 elements. With this trick, the maximum value calculated in these cases is VLENGTH (2 or 4).

Test:
All vector and vectorapi tests passed.

Performance:
The benchmark function looks like this:

```
@Benchmark
public static int testInt() {
    int res = 0;
    for (int i = 0; i < LENGTH; i += INT_SPECIES.length()) {
        VectorMask<Integer> m = VectorMask.fromArray(INT_SPECIES, ia, i);
        res += m.firstTrue();
    }
    return res;
}
```

The following data was collected on a 128-bit Neon machine.

Benchmark   Before      After       Unit
testInt     22214.740   25627.833   ops/ms
testLong    11649.898   13698.535   ops/ms

[1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
[2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
[3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-

Change-Id: I4a2de805ffa4469f88d510c96617eae165f0e025
Where is the benchmark? You don't seem to have included it in this PR. |
Sorry for the delay. The original numbers came from a simple benchmark I wrote myself that measures only firstTrue(). When I went to add it to the JDK, I found an existing benchmark that measures the performance of different mask operations (jdk/test/micro/org/openjdk/bench/jdk/incubator/vector/MaskQueryOperationsBenchmark.java in openjdk/jdk). I tried to measure firstTrue() with this benchmark, but Blackhole's share of the hottest region was too high, like the following:
So I spent some time fixing this benchmark so that it measures mask operations effectively. After this update, Blackhole's share is below 10% for each benchmark function. I also updated the firstTrue() results, measured with this benchmark, for boolean masks with only 2 or 4 elements. |
Can you please send the entire output of JMH? Blackhole should not appear at all in the output because it's been intrinsified. I'd like to know why the intrinsic isn't working for you. |
Output before this patch: https://gist.github.com/changpeng1997/734aa176577bfff56f5a87db9c8db69a |
Could you try this? |
Something is wrong with your setup. not this: |
Blackhole mode autodetection was added in JMH 1.33, and enabled in JMH 1.34. The logs above say they run with JMH 1.33. Current version is 1.36, you need to upgrade, @changpeng1997. Also, I notice that your before/after logs use different JVM modes, one uses |
So I'm looking at the results of the patch and I see: Before
After:
which corresponds with a change from
to
That's a pretty decent speedup when you consider that the benchmark is dominated by memory operations and vector->core register moves. |
If we care about memory ops, note that we can get a useful speedup with
I will say that, in general, if you have to work in the core integer processor on an in-memory vector, it might be worth loading straight into core registers rather than going via the SIMD regs. Maybe we should write a general-purpose function that bypasses the SIMD unit in all such cases. |
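To make the suggestion above concrete, here is a minimal Java-level sketch of reading a small in-memory boolean mask straight into an integer and scanning it there. It is only an illustration of the idea, not HotSpot code: the class `ScalarMaskLoadSketch`, the helper `firstTrue`, and the layout (one 0/1 byte per lane in a `byte[]`, read little-endian) are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; names and layout are assumptions, not JDK internals.
final class ScalarMaskLoadSketch {

    // Returns the index of the first true lane among `lanes` lanes stored as
    // 0/1 bytes starting at `offset`, or `lanes` if none is set.
    // Only 2 and 4 lanes are handled, matching the cases discussed in this PR.
    static int firstTrue(byte[] mask, int offset, int lanes) {
        if (lanes != 2 && lanes != 4) {
            throw new IllegalArgumentException("lanes must be 2 or 4");
        }
        ByteBuffer buf = ByteBuffer.wrap(mask).order(ByteOrder.LITTLE_ENDIAN);
        // Load all lanes with a single scalar read; lane 0 ends up in the low byte.
        long bits = (lanes == 2)
                ? (buf.getShort(offset) & 0xFFFFL)
                : (buf.getInt(offset) & 0xFFFFFFFFL);
        // A sentinel bit just above the last lane bounds the result at `lanes`.
        long sentinel = 1L << (8 * lanes);
        return Long.numberOfTrailingZeros(bits | sentinel) >>> 3;
    }

    public static void main(String[] args) {
        byte[] mask = {0, 0, 0, 1, 0, 1};
        System.out.println(firstTrue(mask, 0, 4)); // lanes 0..3 -> prints 3
        System.out.println(firstTrue(mask, 4, 2)); // lanes 4..5 -> prints 1
        System.out.println(firstTrue(new byte[4], 0, 4)); // all false -> prints 4
    }
}
```

Whether a load like this actually beats the SIMD path would have to be measured; the sketch only shows that the scan itself needs nothing beyond ordinary integer operations.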
OK.
|
@changpeng1997 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 82 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @e1iu) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
@theRealAph Sorry for the delay, I was on holiday last week. I found that we can avoid the effects of blackhole by using Following is the performance of
And following are the corresponding JMH output: We can see the C2 code of firstTrue(). |
@shipilev Thanks! Sorry for this mistake. |
/integrate |
@changpeng1997 |
/sponsor |
Going to push as commit 45b581b.
Your commit was automatically rebased without conflicts. |
@e1iu @changpeng1997 Pushed as commit 45b581b. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.
VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit and then clz [2] to count the leading zeros of the bit-reversed mask, then uses csel [3] (conditional select) to take the smaller of VLENGTH and the number of unset lanes, which ensures correctness.
This patch sets the 16th or 32nd bit to 1 before rbit and clz when the boolean mask has only 2 or 4 elements. With this trick, the maximum value calculated in these cases is VLENGTH (2 or 4).
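The two approaches can be compared in a small scalar model. This is a sketch, not the code in aarch64_vector.ad: it assumes the mask sits in a 64-bit value with one byte per lane (0x00 for false, 0x01 for true) and lane 0 in the least significant byte, and it uses `Long.numberOfTrailingZeros` as a stand-in for the rbit + clz pair. The class and method names are hypothetical.

```java
// Scalar model of the two instruction sequences; names are illustrative only.
final class FirstTrueSketch {

    // Before the patch: count the zero bits below the first set lane
    // (rbit + clz), convert bits to lanes, then clamp to VLENGTH (csel).
    static int firstTrueWithSelect(long laneBytes, int vlength) {
        // For an all-false mask this yields 64 >>> 3 == 8, hence the clamp.
        int lane = Long.numberOfTrailingZeros(laneBytes) >>> 3;
        return Math.min(lane, vlength);
    }

    // After the patch: set a sentinel bit just above the last lane, so the
    // count can never exceed VLENGTH and no clamp is needed.
    // Only valid for vlength < 8: with 8 lanes there is no spare bit
    // in a 64-bit register to hold the sentinel.
    static int firstTrueWithSentinel(long laneBytes, int vlength) {
        long sentinel = 1L << (8 * vlength); // zero-based bit 16 for 2 lanes, bit 32 for 4
        return Long.numberOfTrailingZeros(laneBytes | sentinel) >>> 3;
    }

    public static void main(String[] args) {
        long allFalse = 0L;
        long lane1True = 0x0100L;      // lane 1 set
        long lane3True = 0x01000000L;  // lane 3 set
        System.out.println(firstTrueWithSelect(allFalse, 2));    // prints 2
        System.out.println(firstTrueWithSentinel(allFalse, 2));  // prints 2
        System.out.println(firstTrueWithSentinel(lane1True, 2)); // prints 1
        System.out.println(firstTrueWithSentinel(lane3True, 4)); // prints 3
        System.out.println(firstTrueWithSentinel(allFalse, 4));  // prints 4
    }
}
```

In this model both methods agree for every 2-lane and 4-lane mask; the sentinel version simply trades the compare-and-select for a single OR.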
Test:
All vector and vectorapi tests passed.
Performance:
The benchmark functions are in MaskQueryOperationsBenchmark.java [4]. This patch also modifies that benchmark so that it measures the performance of mask operations more effectively.
The following data was collected on a 128-bit Neon machine.
Benchmark                                        (inputs)  Mode   Before    After     Units
MaskQueryOperationsBenchmark.testFirstTrueInt    1         thrpt  5952.670  7298.491  ops/ms
MaskQueryOperationsBenchmark.testFirstTrueInt    2         thrpt  5951.513  7297.620  ops/ms
MaskQueryOperationsBenchmark.testFirstTrueInt    3         thrpt  5953.048  7298.072  ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong   1         thrpt  3496.990  4003.188  ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong   2         thrpt  3497.755  4002.577  ops/ms
MaskQueryOperationsBenchmark.testFirstTrueLong   3         thrpt  3500.085  4002.471  ops/ms
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14373/head:pull/14373
$ git checkout pull/14373
Update a local copy of the PR:
$ git checkout pull/14373
$ git pull https://git.openjdk.org/jdk.git pull/14373/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14373
View PR using the GUI difftool:
$ git pr show -t 14373
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14373.diff