8366333: AArch64: Enhance SVE subword type implementation of vector compress #27188
Conversation
The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions.

This approach is significantly slower than on architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations, such as looping over the active elements, extraction, and insertion, I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects:

1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction.
2. For partial subword types, operations on the highest half are unnecessary because those bits are invalid.

This pull request introduces the following changes:

1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput.
2. Eliminates unnecessary compress operations for partial subword-type cases.
3. For `sve_compress_byte`, one fewer temporary register is used to alleviate potential register pressure.

Benchmark results demonstrate that these changes significantly improve performance. Benchmarks on an Nvidia Grace machine with 128-bit SVE:

```
Benchmark                Unit    Before   Error  After    Error  Uplift
Byte128Vector.compress   ops/ms  4846.97  26.23  6638.56  31.60  1.36
Byte64Vector.compress    ops/ms  2447.69  12.95  7167.68  34.49  2.92
Short128Vector.compress  ops/ms  7174.88  40.94  8398.45   9.48  1.17
Short64Vector.compress   ops/ms  3618.72   3.04  8618.22  10.91  2.38
```

This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed.
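The widen/compact/narrow/merge pipeline described above can be modeled in scalar Java. This is an illustrative sketch of the data flow only; the method names are mine, not the HotSpot macroassembler code, and it models lane semantics rather than the actual SVE instructions:

```java
public class CompressModel {
    // Scalar model of SVE `compact`: packs active 32-bit lanes toward
    // lane 0 and zeroes the remaining lanes.
    static int[] compact(int[] lanes, boolean[] active) {
        int[] out = new int[lanes.length];
        int k = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (active[i]) out[k++] = lanes[i];
        }
        return out;
    }

    static int count(boolean[] m) {
        int n = 0;
        for (boolean b : m) if (b) n++;
        return n;
    }

    // Model of the byte-compress pipeline the PR description outlines:
    // widen the low and high halves to 32 bits, compact each half,
    // narrow back, then place the compacted high half immediately after
    // the low half's survivors (what the merge step achieves).
    static byte[] compressBytes(byte[] src, boolean[] mask) {
        int half = src.length / 2;
        int[] lo = new int[half], hi = new int[half];
        boolean[] mlo = new boolean[half], mhi = new boolean[half];
        for (int i = 0; i < half; i++) {
            lo[i] = src[i];        mlo[i] = mask[i];        // widen low half
            hi[i] = src[half + i]; mhi[i] = mask[half + i]; // widen high half
        }
        int[] clo = compact(lo, mlo), chi = compact(hi, mhi);
        int nlo = count(mlo);
        byte[] dst = new byte[src.length];                  // narrow + merge
        for (int i = 0; i < half; i++) dst[i] = (byte) clo[i];
        for (int i = 0; i < count(mhi); i++) dst[nlo + i] = (byte) chi[i];
        return dst;
    }
}
```

For example, compressing `{1,2,3,4,5,6,7,8}` with active lanes 0, 2, 5, 6 packs `1,3,6,7` into the low lanes and zeroes the rest.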
👋 Welcome back erifan! A progress list of the required criteria for merging this PR into

@erifan This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 203 new commits pushed to the

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @iwanowww, @jatin-bhateja, @XiaohongGong) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
@erifan this pull request can not be integrated into

```
git checkout JDK-8366333-compress
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
```
Would it make sense to additionally run the relevant benchmarks on other popular aarch64 platforms, such as Graviton, to make sure the improvements are seen there as well?

@galderz Yeah, absolutely. These are the test results on an AWS Graviton3 V1 machine; we can see a similar performance gain.
eme64
left a comment
Drive-by comments, going on vacation soon so don't depend on me fully reviewing this any time soon ;)
@erifan I'm going to be out of the office for 3 weeks, so feel free to ask others for reviews :)
erifan
left a comment
Thanks for your review @eme64. Have a nice trip!
```
// Example input: src = q p n m l k j i h g f e d c b a, one character is 8 bits.
//              mask = 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1, one character is 1 bit.
// Expected result: dst = 0 0 0 0 0 0 0 0 0 0 0 p i g c a
sve_dup(vtmp3, B, 0);
```
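The expected result in that comment can be sanity-checked with a few lines of scalar Java. Note that in the comment's notation lane 0 is the rightmost character, so the strings must be reversed before indexing by lane; this helper is mine, purely for checking the example, and is not part of the patch:

```java
public class ExampleCheck {
    // The comment prints lane 15 first, so reverse the strings to get
    // lane order (lane 0 = rightmost character).
    static String compress(String srcHighToLow, String maskHighToLow) {
        String src  = new StringBuilder(srcHighToLow).reverse().toString();
        String mask = new StringBuilder(maskHighToLow).reverse().toString();
        StringBuilder dst = new StringBuilder();
        for (int lane = 0; lane < src.length(); lane++) {
            if (mask.charAt(lane) == '1') dst.append(src.charAt(lane)); // pack active lanes
        }
        return dst.toString(); // survivors in low-to-high lane order
    }

    public static void main(String[] args) {
        // Active lanes 0, 2, 6, 8, 14 hold a, c, g, i, p.
        System.out.println(compress("qpnmlkjihgfedcba", "0100000101000101")); // prints "acgip"
    }
}
```

Reading "acgip" from high lane to low gives exactly the `p i g c a` shown in the comment.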
For clarity, you could declare a local `FloatRegister vzr = vtmp3` and refer to it at all use sites. That would make things clearer.
Done, thanks!
The following reads slightly better, but it's up to you how to shape it.
```
FloatRegister vzr = vtmp3;
sve_dup(vzr, B, 0);
```
Done, thanks!
```diff
 void C2_MacroAssembler::sve_compress_short(FloatRegister dst, FloatRegister src, PRegister mask,
-                                           FloatRegister vtmp1, FloatRegister vtmp2,
+                                           FloatRegister vtmp, FloatRegister vtmp_zr,
                                            PRegister pgtmp) {
```
On code style: it's confusing to see a temp register used in a non-destructive way to pass a constant. If you want to save on materializing an all-zero vector constant, I suggest naming it differently (e.g., `zr`) and putting the argument before `vtmp`.
Done.
erifan
left a comment
@iwanowww I have addressed all of your suggestions, thanks for your review.
iwanowww
left a comment
Looks good.
```
@Test
@IR(counts = { IRNode.COMPRESS_VB, "= 1" },
    applyIfCPUFeature = { "sve", "true" })
```
Hi @erifan,
Nice work!
Can you please also enable these tests for x86? Following are the relevant features:
- CompressVB -> avx512_vbmi2, avx512_vl
- CompressVS -> avx512_vbmi2, avx512_vl
- CompressVI/VF -> avx512f, avx512vl
- CompressVL/VD -> avx512f, avx512vl

PS: avx512_vbmi2 is missing from test/IREncodingPrinter.java

FYI, currently we don't support sub-word compression intrinsics on AVX2/E-core targets. I created a vectorized algorithm without any x86 backend change, just using vector APIs, and it showed a 12x improvement.
```
PROMPT> java -cp . --add-modules=jdk.incubator.vector short_vector_compress 0
WARNING: Using incubator modules: jdk.incubator.vector
[ baseline time] 976 ms [res] 429507073
PROMPT> java -cp . --add-modules=jdk.incubator.vector short_vector_compress 1
WARNING: Using incubator modules: jdk.incubator.vector
[ withopt time] 80 ms [res] 429507073
```
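The pure-Vector-API algorithm behind these numbers isn't included in the comment. Purely as an illustration of the widening idea such an algorithm could use on avx512f targets (short lanes widened to int lanes, which do have native compress via `vpcompressd`, then narrowed back), here is a scalar Java model; this is my guess at the general shape, not jatin-bhateja's actual code:

```java
public class ShortCompressSketch {
    // Hypothetical model: widen each short to an int lane, compress the
    // int lanes (the operation with native avx512f/avx512vl support),
    // then narrow the packed lanes back to shorts.
    static short[] compressViaWidening(short[] src, boolean[] mask) {
        int[] widened = new int[src.length];
        for (int i = 0; i < src.length; i++) widened[i] = src[i]; // widen
        int[] packed = new int[src.length];
        int k = 0;
        for (int i = 0; i < src.length; i++) {                    // int-lane compress
            if (mask[i]) packed[k++] = widened[i];
        }
        short[] dst = new short[src.length];
        for (int i = 0; i < src.length; i++) dst[i] = (short) packed[i]; // narrow
        return dst;
    }
}
```

In a real Vector API version the same three steps would be lane-parallel operations on `IntVector`, which is why it can pay off even without a subword compress instruction.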
Done, please help me check if it is correct, thank you! I have tested it locally.
Hi @iwanowww @jatin-bhateja, I have addressed your comments, thanks for your review!
jatin-bhateja
left a comment
Thanks @erifan, verified the IR test changes.
/contributor add @jatin-bhateja

@erifan

Hi, can I integrate this patch now? Could any Oracle friends help me with internal testing of this patch? Thanks~
XiaohongGong
left a comment
LGTM! Thanks! Reviewed internally.
I have tested many different configurations on both aarch64 and x64, including 128/256/512-bit SVE2/SVE/NEON, AVX3/2/1, and SSE4/3/2/1. All tests passed, so I'll integrate the PR. Thanks, all!

/sponsor
Going to push as commit 2de8d58.
Your commit was automatically rebased without conflicts.
@XiaohongGong @erifan Pushed as commit 2de8d58. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Contributors: <jbhateja@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
```
$ git fetch https://git.openjdk.org/jdk.git pull/27188/head:pull/27188
$ git checkout pull/27188
```

Update a local copy of the PR:
```
$ git checkout pull/27188
$ git pull https://git.openjdk.org/jdk.git pull/27188/head
```

Using Skara CLI tools

Checkout this PR locally:
```
$ git pr checkout 27188
```

View PR using the GUI difftool:
```
$ git pr show -t 27188
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27188.diff