8351623: VectorAPI: Add SVE implementation of subword gather load operation #26236
Conversation
👋 Welcome back xgong! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.

❗ This change is not yet ready to be integrated.
@XiaohongGong The following label will be automatically applied to this pull request:

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on an SVE machine with a larger vector size (e.g. 512-bit)? Thanks a lot in advance!
Hi @XiaohongGong, thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512-bit machines). We will get back to you with the results soon.

Testing on 256-bit SVE machines is fine with me. Thanks so much for your help!
@XiaohongGong thanks for your work! Tier1 - tier3 passed on our 256-bit SVE machines.
@XiaohongGong Please correct me if I'm missing something or got anything wrong. Taking one case as an example, we can get the target result directly by truncating and merging the two gather-load results. If so, the current design could perhaps be simplified.
Good! Thanks so much for your help! |
Yes, you are right! This can work for truncating and merging two gather-load results. But we have to consider other scenarios together: 1) no merging is needed; 2) four gather-loads plus merging are needed. To make the IR itself simple and to unify the inputs for all types across kinds of architectures, I chose to pass a single index vector to each gather-load node. For cases that need more than one gather, I chose to generate multiple gather-load nodes and merge their results afterwards.

As a summary, each gather-load IR handles one native gather operation, and the merging is expressed with separate nodes.
@XiaohongGong thanks for your reply. This idea generally looks good to me. For case 2, can we improve the design so that we can also consistently define the semantics of the new node across all cases?
Thanks! Regarding the definition of the new node:

Maybe I can define the vector type of the node as the result's vector type. The implementation would look like:

Or more commonly:

From the IR level, which one do you think is better?
That makes sense to me. Thanks for your explanation!

I like this idea! The first one looks better.

Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!

Thanks! I'd suggest also highlighting that in the node's documentation.

Thanks for your point~
Hi @fg1417, the latest commit refactored the whole IR patterns and backend implementation.

Main changes

IR implementation

Performance change

We observe about 4% ~ 9% uplift on some micro-benchmarks. No significant regressions are observed.
fg1417
left a comment
Thanks for updating it. Looks good on my end.
It might be helpful to have Reviewers take a look.
Thanks a lot for your review and test!
Hi, could anyone please help take a look at this PR? Thanks so much!

Hi @RealFYang, not sure whether there is any plan to support the subword gather-load for RVV; it would be much appreciated if we could get feedback from another architecture's side. Would you mind taking a look at this PR? Thanks a lot in advance!
ping~ |
Hi, could anyone please help take a look at this PR? Thanks a lot in advance!
Hi @eme64, could you please help take a look at this PR? Thanks a lot in advance!
shqking
left a comment
LGTM
erifan
left a comment
LGTM
eme64
left a comment
Looks very interesting. I have a first series of questions / comments :)
There is definitely a tradeoff between complexity in the backend and in the C2 IR, so I'm still trying to wrap my head around that decision. I'm just afraid that adding more very specific C2 IR nodes makes it more complicated to do optimizations in the C2 IR.
// Unpack elements from the lower or upper half of the source
// predicate and place in elements of twice their size within
// the destination predicate.
unnecessary empty line
This empty line is auto-generated by the m4 file. I tried some methods to clean it up, but all failed, so I have to keep it as it is.
// Concatenate elements from two source vectors by narrowing the elements to half size. Put
// the narrowed elements from the first source vector to the lower half of the destination
// vector, and the narrowed elements from the second source vector to the upper half.
//
// e.g. vec1 = [0d 0c 0b 0a], vec2 = [0h 0g 0f 0e]
//      dst  = [h g f e d c b a]
//
class VectorConcatenateNode : public VectorNode {
That semantic is not quite what I would expect from Concatenate. Maybe we can call it something else?
VectorConcatenateAndNarrowNode?
Have you considered using 2x Cast + Concatenate instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the unslice or some other operation that concatenates two vectors.
> That semantic is not quite what I would expect from Concatenate. Maybe we can call it something else? VectorConcatenateAndNarrowNode?
Yeah, VectorConcatenateAndNarrowNode would be a much better match. I just thought the name would be too long. I will change it in the next commit. Thanks for your suggestion!
> Have you considered using 2x Cast + Concatenate instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the unslice or some other operation that concatenates two vectors.
Would using 2x Cast + Concatenate make the IRs and match rule more complex? A mere concatenate would be something like vector slice in the Vector API: it concatenates two vectors into one with an index denoting the merging position, and it requires that the two input vectors and the dst vector have the same vector type. Hence, if we want to separate this operation into cast and concatenate, the IRs would be (assume the original type of v1/v2 is 4-int; the result type should be 8-short):

1. Narrow the two input vectors: v1 = VectorCast(v1) (4-short); v2 = VectorCast(v2) (4-short). The vector length is unchanged while the element size is halved, so the vector length in bytes is halved as well.
2. Resize v1 and v2 to double the vector length, with the higher bits cleared: v1 = VectorReinterpret(v1) (8-short); v2 = VectorReinterpret(v2) (8-short).
3. Concatenate v1 and v2 like slice, with the position at the middle of the vector length: v = VectorSlice(v1, v2, 4) (8-short).

If we want to merge these IRs in the backend, would the match rule be more complex? I will think it over.
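As a plain-Java sketch of the semantics under discussion: narrow each lane of the two sources to half size, pack the first source into the lower half of the result and the second into the upper half (matching the vec1/vec2 example in the quoted node comment). The class and method names here are illustrative, not JDK-internal API.

```java
// Scalar model of the proposed concatenate-and-narrow operation.
public class ConcatNarrow {
    // Narrow each int lane to short; v1 fills the lower half of dst,
    // v2 fills the upper half.
    static short[] concatNarrow(int[] v1, int[] v2) {
        short[] dst = new short[v1.length + v2.length];
        for (int i = 0; i < v1.length; i++) {
            dst[i] = (short) v1[i];
        }
        for (int i = 0; i < v2.length; i++) {
            dst[v1.length + i] = (short) v2[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        // vec1 = [0d 0c 0b 0a], vec2 = [0h 0g 0f 0e] -> [h g f e d c b a]
        short[] r = concatNarrow(new int[]{0xa, 0xb, 0xc, 0xd},
                                 new int[]{0xe, 0xf, 0x10, 0x11});
        if (r[0] != 0xa || r[4] != 0xe || r[7] != 0x11)
            throw new AssertionError("unexpected concat result");
    }
}
```

The 2x Cast + Reinterpret + Slice decomposition debated above computes the same result; the question is only where the pattern gets fused.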
I'm not saying I know that this alternative would be better. I'm just worried about having extra IR nodes, and then optimizations are more complex / just don't work because we don't handle all nodes.
I started looking at the PR and it looks appealing to simplify VM intrinsics and lift more code into Java. In other words, subword gather operation can be coded as a composition of operations on int vectors. Have you considered that?
It doesn't solve the problem of how to reliably match a complex graph into a single instruction though. The Matcher favors tree representation, but there are multiple ways to work around that. Personally, I'd prefer to address it separately.
For now, a dedicated node to concatenate vectors looks appropriate (please note there's an existing PackNode et al).
It can be either exposed through VM intrinsic or substituted for a well-known complex IR shape during IGVN (like the one you depicted). The nice thing is it'll uniformly cover all usages irrespective of whether they come from Vector API implementation itself or from user code.
In the context of Vector API, the plan was to expose generic element rearranges/shuffles through API, but then enable various strength-reductions to optimize well-known/popular shapes. Packing multiple vectors perfectly fits that effort.
> I started looking at the PR and it looks appealing to simplify VM intrinsics and lift more code into Java. In other words, subword gather operation can be coded as a composition of operations on int vectors. Have you considered that?
Thanks so much for looking at this PR! Yes, personally I think we can move these op generations to the Java level for the subword gather operation, and I also considered this when I started working on this task. However, this may break the current backend implementation for other architectures like x86. I'm not sure whether moving to Java will also be friendly for non-SVE arches. Per my understanding, subword gather depends much more on the backend solution.
> For now, a dedicated node to concatenate vectors looks appropriate (please, note there's existing PackNode et al). It can be either exposed through VM intrinsic or substituted for a well-known complex IR shape during IGVN (like the one you depicted). The nice thing is it'll uniformly cover all usages irrespective of whether they come from Vector API implementation itself or from user code.
> In the context of Vector API, the plan was to expose generic element rearranges/shuffles through API, but then enable various strength-reductions to optimize well-known/popular shapes. Packing multiple vectors perfectly fits that effort.
Thanks for your inputs on the IR choice. I agree with you about adding such a vector concatenate node in C2. And if we decide to move the complex implementation to Java-level, we'd better also add such an API for vector concatenate, right?
> if we decide to move the complex implementation to Java-level, we'd better also add such an API for vector concatenate, right?
There's already a generic shuffle operation present (rearrange), but there are precedents where more specific operations became part of the API for convenience reasons (e.g., slice/unslice). So a dedicated operation for vector concatenation may be well-justified.
> However, this may break current backend implementation for other architectures like X86. I'm not sure whether moving to Java will be also friendly for non-SVE arches. Per my understanding, subword gather depends much more on the backend solution.
IMO that's a clear sign that current abstraction is way too ad-hoc and platform-specific. x86 ISA lacks native support, so the operation is emulated with hand-written assembly. If there's a less performant implementation, but which relies on a uniform cross-platform VM interface, it'll be a clear winner.
The PR, as it is now, introduces a new IR representation which complicates things even more. Instead, I'd encourage you to work on a uniform API even if x86 won't be immediately migrated.
I see, and it makes sense to me. Thanks for your suggestion. I will try moving the complex operations to the API level next.
// Unpack the elements to twice size.
class VectorMaskWidenNode : public VectorNode {
Can you add a visual example like above for VectorConcatenateNode, please?
Did you consider the alternative of Extract + Cast? Not sure if that would be better; you know more about the code complexity. It would just allow us to have one fewer node.
C2 only has the Extract node to extract a single element from a vector, right? Extracting the lowest part can be implemented with VectorReinterpret easily, but how about the higher parts? Maybe this can also be implemented with operations like slice? But it seems this would also make the IR more complex. For Cast, we have VectorCastMask now, but it assumes the vector length is the same for input and output, so a VectorReinterpret or a VectorExtract is still needed.
I can try separating the IR, but I guess an additional new node is still necessary.
> It would just allow us to have one fewer nodes.

This is also what I expect, really.
It would just be nice to build on "simple" building blocks and not have too many complex nodes, that have very special semantics (widen + split into two). It just means that the IR optimizations have to take care of more special cases, rather than following simple rules/optimizations because every IR node does a relatively simple thing.
Maybe you find out that we really need a complex node, and can provide good arguments. Looking forward to what you find :)
Hi @iwanowww, regarding the operation of extending the higher half of a vector mask to twice the element size, do you have any better idea? To split the gather operation for a subword type, we usually need to split the input mask as well, especially for SVE, where the vector mask needs the same element type as the data. I need to extract part of the original vector mask and extend it to the int type. For the Vector API, I think we can either use a similar vector slice for a mask, or a vector extract API. WDYT?
Note that SVE has native support for this via the PUNPKHI [1] instruction.
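As a plain-Java model of the mask-widen operation being discussed (and of what PUNPKLO/PUNPKHI do on SVE predicates): take the lower or upper half of a subword mask and widen it so each element covers twice the size, which halves the lane count. Names and shapes are illustrative only.

```java
// Scalar model of the proposed VectorMaskWiden semantics.
public class MaskWiden {
    // Pick the lower (upperHalf == false) or upper half of the source
    // mask; the destination has half the lane count, with each lane
    // conceptually twice as wide.
    static boolean[] maskWiden(boolean[] mask, boolean upperHalf) {
        int half = mask.length / 2;
        boolean[] dst = new boolean[half];
        int base = upperHalf ? half : 0;
        for (int i = 0; i < half; i++) {
            dst[i] = mask[base + i];
        }
        return dst;
    }

    public static void main(String[] args) {
        boolean[] m = {true, false, true, true};
        boolean[] lo = maskWiden(m, false);  // lanes 0..1
        boolean[] hi = maskWiden(m, true);   // lanes 2..3
        if (lo.length != 2 || !lo[0] || lo[1] || !hi[0] || !hi[1])
            throw new AssertionError("unexpected widen result");
    }
}
```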
Hi @eme64, I just pushed a commit which added more comments and assertions in the code. This is just a simple fix addressing part of your comments. Regarding the IR refinement, I need more time to take a look. So could you please take another look at the changes relative to the method rename/comments/assertions? Thanks a lot in advance!
//
// SVE requires vector indices for gather-load/scatter-store operations on all
// data types.
static bool gather_scatter_requires_index_in_address(BasicType bt) {
I know I agreed to this naming, but I looked at the signature of Gather again:
LoadVectorGatherNode(Node* c, Node* mem, Node* adr, const TypePtr* at, const TypeVect* vt, Node* indices)
I'm a little confused now what is the address that your name references. Is it the adr? I think not, because that is the base address, right? Can you clarify a little more? Maybe add to the documentation of the gather and scatter node as well, if you think that helps?
Actually, you already did add documentation to the gather / scatter nodes now. And based on your explanation there, I suggest you rename the method here to:
gather_scatter_requires_indices_from_array
This would say that the indices come from an array, rather than a vector register.
The current name we agreed on confuses me because it suggests that the index may already be in the address adr, but that does not make much sense.
Ok, gather_scatter_requires_indices_from_array sounds better to me. I will change it soon.

> I'm a little confused now what is the address that your name references. Is it the adr? I think not, because that is the base address, right? Can you clarify a little more? Maybe add to the documentation of the gather and scatter node as well, if you think that helps?

It means the indices input is an address that holds the indexes if this method returns true; otherwise, indices is a vector register. You are right that it has no relationship with the adr input, which is the memory base address.
// Load Vector from memory via index map. The index map is usually a vector of indices
// that has the same vector type as the node's bottom type. For non-subword types, it must
// be. However, for subword types, the basic type of index is int. Hence, the index map
// can be either a vector with int elements or an address which saves the int indices.
Very nice, that helps!
eme64
left a comment
@XiaohongGong I'm going to be away on vacation for about 3 weeks now. So I won't be able to continue with the review until I'm back.
Maybe @vnkozlov or @iwanowww can review instead. Maybe @PaulSandoz or @jatin-bhateja would like to look at it too. If they do, I would want them to consider if the approach with the special vector nodes VectorConcatenateAndNarrow and VectorMaskWiden are really desirable. The complexity needs to go somewhere, but I'm not sure if it is better in the C2 IR or in the backend.
Intuitively this seems like the right way to think about it. Although I don't have a proposed solution, I am really just agreeing with the above sentiment: a compositional solution, if possible, with the right primitive building blocks will likely be superior.
Thanks for your input @PaulSandoz! And I agree with making the IR simple enough. I'm now working on finding a better way for these two complex operations. Hope I can fix it soon. Thanks!
Hi @iwanowww, @PaulSandoz, @eme64,

Hope you're doing well! I've created a prototype that moves the implementation to the Java API level, as suggested (see: XiaohongGong#8). This refactoring has resulted in significantly cleaner and more maintainable code. Thanks for your insightful feedback @iwanowww! However, it also introduces new issues that we have to consider: the codegen might not be optimal, and generating the optimal instruction sequence would need more effort. Following are the details:

As a summary, the implementation of this API itself is clean, but it introduces more overhead, especially for SVE. It's not easy for me to conclude whether the Java change wins or not. Any suggestion on this? Thanks!
@XiaohongGong would it help if the intrinsic also accepted a part parameter?

Thanks for your input @PaulSandoz! Yes, I think passing a part parameter would help.
I suspect it's likely more complex overall adding a slice operation to mask, which is really only needed for a specific case. (A more general operation would be compress/expand of the mask bits, but I don't believe there are hardware instructions for such operations on mask registers.) In my view, adding a part parameter is a compromise and seems less complex than requiring N index vectors, and it fits with a general pattern we have around parts of the vector. It moves the specialized operation requirements on the mask into the area where it is needed, rather than trying to generalize in a manner that I don't think is appropriate in the mask API.

Yes, I agree with you. Personally, I'd prefer not to introduce such APIs for a vector mask.

Yeah, it sounds reasonable that an API can finish a simple task and then choose to move the results to a different part of a vector based on an offset.
Hi @iwanowww, @PaulSandoz, and @eme64: I've recently completed a prototype that moves the implementation into the Java API level. Do you think it would be a good time to open a draft PR for easier review? Below is a brief summary of the changes compared with the previous version.

Main idea

Advantages

Limitations

I plan to rebase and update the compiler-change PR using the same node and match rules as well, so we can clearly compare both approaches. Any thoughts or feedback would be much appreciated. Thanks so much!

Best Regards,
Nice work, @XiaohongGong! I haven't closely looked at the patch yet, but I very much like the general direction. I don't consider the performance regression in the default Java implementation a big deal. In the future, we can rethink how default implementations are handled for operations which lack hardware/VM intrinsic support.
Thank you very much for your input so far; it's been extremely helpful. I have an additional concern regarding the masked case.

I am unsure of the best approach for implementing the mask slice operation.
Yes, you have to slice the mask, whether it be represented as a mask/predicate register or as a vector. There's no way around that, and we have to deal with the current limitations in hardware. As a further compromise, we can in Java convert the mask to a vector and rearrange it, then pass the vector representation of the mask to the scatter/gather intrinsic. The intrinsic can then, if it chooses, convert it back to a mask/predicate register if that is the best form.

IIUC we have agreed for non-masked subword scatter/gather to compose by parts using the intrinsic. That seems good, and it looks like we can do the same for masked subword scatter/gather, as above, but it may not be the most efficient for the platform. Do you have any use cases for masked subword scatter/gather? Given the lack of underlying hardware support, it seems focusing on getting the non-masked version working well, and the masked version working ok, is a pragmatic way forward.
Yes, converting the mask to a vector will be the way to resolve this. Do you think it's better to define a private VectorMask function for the slice operation? The function could be implemented with the corresponding vector slice APIs. Although this function is not friendly to SVE performance, it wins on unifying the implementation.

Currently, I do not have specific use cases for masked subword gather or scatter operations. However, I would like to ensure support for these APIs on SVE in case they become relevant for future Java workloads. Compared to having no intrinsic support at all, using intrinsified APIs, even if not fully optimized, can still significantly improve performance, right?
If it helps, just add a utility method that does the slice/rearrange mask<->vector conversion, but given your use case I expect it only to be used in one location, so perhaps keep it close to there. It may be that you don't need full slice functionality, since you only care about the part of the mask elements that was rearranged to the start of the vector, and therefore don't need to zero out the remaining parts that are not relevant. (The same happens for conversion by parts.) Since we don't yet have any slice intrinsic, I think that would be OK and we could revisit later. Ideally we should be able to optimize rearranges of vectors using constant shuffles with recognizable patterns.
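As a plain-array sketch of the compromise discussed above: represent the mask as a vector of 0/-1 lanes, move the lanes of one part to the start, and leave the remaining lanes as zero (they are not relevant, as noted). The class and method names are hypothetical, not Vector API methods.

```java
// Scalar model of converting one part of a mask into vector form for
// a per-part gather intrinsic.
public class MaskToVector {
    // Lanes of the selected part are moved to the start as -1 (all
    // bits set) or 0; trailing lanes stay 0 and are ignored.
    static byte[] maskPartAsVector(boolean[] mask, int part, int partLanes) {
        byte[] dst = new byte[mask.length];
        for (int i = 0; i < partLanes; i++) {
            dst[i] = mask[part * partLanes + i] ? (byte) -1 : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        boolean[] m = {true, false, true, true};
        byte[] v = maskPartAsVector(m, 1, 2);  // part 1 covers lanes 2..3
        if (v[0] != -1 || v[1] != -1 || v[2] != 0 || v[3] != 0)
            throw new AssertionError("unexpected mask vector");
    }
}
```

In the real implementation this rearrangement would be a constant-shuffle `rearrange` on the mask's vector form, which the compiler could then pattern-match.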
Makes sense to me. Thanks for all your inputs! I will create a PR for the Java-level refactor and x86 modifications first. We can have more discussion then.
This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform.
Background
Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for byte/short types using int vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. int elements). Hence, the total size is 32 * elem_num bits, where elem_num is the number of loaded elements in the vector register.

Implementation
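As a scalar model of the gather semantics described above: each lane i loads src[indices[i]] when its mask bit is set, and zero otherwise. The class and method names below are illustrative, not JDK-internal API.

```java
// Scalar model of a masked subword gather load.
public class GatherModel {
    // dst[i] = mask[i] ? src[indices[i]] : 0
    static byte[] gatherBytes(byte[] src, int[] indices, boolean[] mask) {
        byte[] dst = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            dst[i] = mask[i] ? src[indices[i]] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40, 50};
        int[] idx = {4, 0, 2, 2};
        boolean[] m = {true, true, false, true};
        byte[] r = gatherBytes(src, idx, m);
        if (r[0] != 50 || r[1] != 10 || r[2] != 0 || r[3] != 30)
            throw new AssertionError("unexpected gather result");
    }
}
```

The point of the size arithmetic above is that each of these int indices occupies a full 32-bit lane in the hardware index vector, even though the loaded element is only 8 or 16 bits.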
Challenges
Due to size differences between int indices (32-bit) and byte/short data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

For a 512-bit SVE machine, loading a byte vector with different vector species requires different approaches. Use ByteVector.SPECIES_512 as an example: the indices occupy 64 * 32 bits, which is 4 times the SVE vector register size.

Solution
The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
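The number of per-part gather-load IRs the mid-end has to generate follows directly from the size arithmetic above; a minimal sketch (names are illustrative), assuming 32-bit indices:

```java
// Sketch: how many native SVE gather-loads a subword vector needs.
public class GatherSplit {
    // elemNum subword lanes need elemNum * 32 bits of int index data,
    // split across SVE registers of sveRegBits each (at least one).
    static int gatherCount(int elemNum, int sveRegBits) {
        return Math.max(1, (elemNum * 32) / sveRegBits);
    }

    public static void main(String[] args) {
        // ByteVector.SPECIES_512 on a 512-bit SVE machine:
        // 64 lanes -> 64 * 32 = 2048 bits of indices -> 4 gather-loads.
        if (gatherCount(64, 512) != 4)
            throw new AssertionError("unexpected gather count");
    }
}
```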
Here are the main changes:

- The gather_scatter_needs_vector_index() matcher.
- VectorSliceNode for result merging.
- VectorMaskWidenNode for mask splitting and type conversion for masked gather-load.

Testing:
Performance:
The performance of the corresponding JMH benchmarks improves by 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture. Following is the performance data:
[1] https://bugs.openjdk.org/browse/JDK-8355563
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en
Reviewing
Using git

Checkout this PR locally:

$ git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236
$ git checkout pull/26236

Update a local copy of the PR:

$ git checkout pull/26236
$ git pull https://git.openjdk.org/jdk.git pull/26236/head

Using Skara CLI tools

Checkout this PR locally:

$ git pr checkout 26236

View PR using the GUI difftool:

$ git pr show -t 26236

Using diff file

Download this PR as a diff file:

https://git.openjdk.org/jdk/pull/26236.diff