-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8351623: VectorAPI: Refactor subword gather load and add SVE implementation #24679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back xgong! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@XiaohongGong The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
|
/label hotspot-compiler |
|
/label remove graal |
|
@XiaohongGong |
|
@XiaohongGong |
|
Hi @jatin-bhateja , could you please help take a look at this PR especially the X86 part? Thanks a lot! |
| * @library /test/lib / | ||
| * @modules jdk.incubator.vector | ||
| * | ||
| * @run driver compiler.vectorapi.VectorGatherSubwordTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use @run main instead of @run driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look at this PR! I think it's fine using @run main instead.
|
@XiaohongGong I had a quick look at your changes and PR description. I wonder if you could split some of the refactoring into a separate PR? That would make it easier to review. Currently, you basically have x64 changes, aarch64 changes, Java library changes, and C2 changes. That's a lot at once. And it would basically require the review from a lot of different people at once. Splitting would make it easier to review, less work for the reviewer. It would ensure everybody can look at a smaller change set, and that would also increase the quality of the code after review, I think. What do you think? |
Thanks for looking at this PR @eme64 ! It's a good idea splitting this PR as smaller ones. I will consider about this. Maybe I can do a refactoring first, and then implement the compiler support for AArch64 as a followed-up PR. WDYT? |
That sounds excellent :) |
|
I‘d like to close this PR and split the change with two new PRs. Thanks for the review! |
Summary:
JDK-8318650 added the hotspot intrinsifying of subword gather load APIs for X86 platforms [1]. This patch aims at implementing the equivalent functionality for AArch64 SVE platform. In addition to the AArch64 backend support, this patch also refactors the API implementation in Java side and the compiler mid-end part to make the operations more efficient and maintainable across different architectures.
Background:
Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices stored in an int array. SVE provides native vector gather load instructions for byte/short types using an int vector saving indices (see [2][3]).
The number of loaded elements must match the index vector's element count. Since int elements are 4/2 times larger than byte/short elements, and given
MaxVectorSizeconstraints, the operation may need to be splitted into multiple parts.Using a 128-bit byte vector gather load as an example, there are four scenarios with different
MaxVectorSize:MaxVectorSize = 16, byte_vector_size = 16:Example:
MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2:Example:
MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4:Example:
MaxVectorSize > 64, byte_vector_size < MaxVectorSize / 4:memory access.
Example:
Main changes:
LoadVectorGather/LoadVectorGatherMaskedIR for subword types. This patch removes the memory offset input and add it to the memory baseaddrin IR level for architectures that need the index array like X86. This not only simplifies the backend implementation, but also saves some add operations. Additionally, it unifies the IR for all types.Testing:
UseAVXflags on X86 andUseSVEflags on AArch64Performance:
The performance of corresponding JMH benchmarks improve 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture. Following is the performance data:
And here is the performance data on a X86 avx512 system, which shows the performance can improve at most 39%.
[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en
Progress
Integration blocker
Issue
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24679/head:pull/24679$ git checkout pull/24679Update a local copy of the PR:
$ git checkout pull/24679$ git pull https://git.openjdk.org/jdk.git pull/24679/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24679View PR using the GUI difftool:
$ git pr show -t 24679Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24679.diff
Using Webrev
Link to Webrev Comment