-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8303762: Optimize vector slice operation with constant index using VPALIGNR instruction #24104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ALIGNR instruction
|
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@jatin-bhateja The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
|
/label add hotspot-compiler-dev |
|
@jatin-bhateja |
|
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a |
|
/keepalive |
|
@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset. |
|
@jatin-bhateja this pull request can not be integrated into git checkout JDK-8303762
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
|
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a |
3d09134 to
c0b9eea
Compare
c0b9eea to
607a8fc
Compare
|
Performance after AVX2 backend modifications |
Webrevs
|
|
Performance on AVX512 machine |
|
Adding additional notes on implementation: A) Slice:-
Thus, current support implicitly covers all three 3 variants of slice APIs. B) Similar extensions to optimize Unslice with constant index:-
To ease the review process, I plan to optimize the unslice API with a constant index by extending the newly added expander in a follow-up patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work @jatin-bhateja! This PR also provides help on AArch64 that we also have plan to do the same intrinsifaction in our side.
| return false; // operand unboxing failed | ||
| } | ||
|
|
||
| Node* origin_node = gvn().intcon(origin->get_con() * type2aelembytes(elem_bt)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q1: Is it possible that just passing origin->get_con() to VectorSliceNode in case there are architectures that need it directly? Or, maybe we'd better add comment telling that the origin passed to VectorSliceNode is adjust to bytes.
Q2: If origin is not a constant, and there is an architecture that support the index as a variable, will the code crash here? Can we just limit the origin to a constant for this intrinsifaction in this PR? We can consider to extend it to variable in case any architecture has such requirement. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q1: Is it possible that just passing
origin->get_con()toVectorSliceNodein case there are architectures that need it directly? Or, maybe we'd better add comment telling that the origin passed toVectorSliceNodeis adjust to bytes.
Added comments.
Q2: If
originis not a constant, and there is an architecture that support the index as a variable, will the code crash here? Can we just limit theoriginto a constant for this intrinsifaction in this PR? We can consider to extend it to variable in case any architecture has such a requirement. WDYT?
Currently, inline expander only supports constant origin. I have added a check to fail intrinsification and inline fallback using the hybrid call generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your updating! So maybe the matcher function supports_vector_slice_with_non_constant_index() could also be removed totally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, idea here is just to intrinsify a perticular scenario where slice index is a constant value and not burden the inline expander with full-blown intrinsification of all possible control paths without impacting the performance.
|
|
||
| class VectorSliceNode : public VectorNode { | ||
| public: | ||
| VectorSliceNode(Node* vec1, Node* vec2, Node* origin, const TypeVect* vt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have specific value for origin like zero or vlen? If so, maybe simply Identity is better to be added as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have specific value for
originlike zero or vlen? If so, maybe simply Identity is better to be added as well.
Done, Thanks!, also added a new IR test to complement the code changes.
| @OutputTimeUnit(TimeUnit.MILLISECONDS) | ||
| @State(Scope.Thread) | ||
| @Fork(jvmArgs = {"--add-modules=jdk.incubator.vector"}) | ||
| public class VectorSliceBenchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that it has the micro benchmarks for slice/unslice under test/micro/org/openjdk/bench/jdk/incubator/vector/operation on panama-vector. Can we reuse those JMHs to check the benchmark improvement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that it has the micro benchmarks for slice/unslice under
test/micro/org/openjdk/bench/jdk/incubator/vector/operationon panama-vector. Can we reuse those JMHs to check the benchmark improvement?
All those are the ones with variable slice index , slice kernel performance of those benchmarks on AVX2 and AVX512 targets are at par with baseline, and deviations are statistically insignificant due to error margins.
New benchmark complements the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Make sense to me. Thanks!
d55fa59 to
70c2293
Compare
|
@jatin-bhateja Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information. |
|
@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a |
|
/keepalive |
|
@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset. |
| "Ljdk/internal/vm/vector/VectorSupport$Vector;" \ | ||
| "Ljdk/internal/vm/vector/VectorSupport$Vector;" \ | ||
| "Ljdk/internal/vm/vector/VectorSupport$VectorSliceOp;)" \ | ||
| "Ljdk/internal/vm/vector/VectorSupport$Vector;") \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this \ is not aligned ?
| "Ljdk/internal/vm/vector/VectorSupport$Vector;" \ | ||
| "Ljdk/internal/vm/vector/VectorSupport$VectorSliceOp;)" \ | ||
| "Ljdk/internal/vm/vector/VectorSupport$Vector;") \ | ||
| do_name(vector_slice_name, "sliceOp") \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| public static final VectorSpecies<Byte> BSP = ByteVector.SPECIES_PREFERRED; | ||
| public static final VectorSpecies<Short> SSP = ShortVector.SPECIES_PREFERRED; | ||
| public static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED; | ||
| public static final VectorSpecies<Long> LSP = LongVector.SPECIES_PREFERRED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation supports floating point types, but why doesn't the test include fp types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to consider partial cases. I looked at the aarch64 situation and found that different implementations are needed for partial and non-partial cases. The test indices in test/jdk/jdk/incubator/vector/ are randomly generated, so it might be better to test different vector species here.
| static final VectorSpecies<Byte> bspecies = ByteVector.SPECIES_PREFERRED; | ||
| static final VectorSpecies<Short> sspecies = ShortVector.SPECIES_PREFERRED; | ||
| static final VectorSpecies<Integer> ispecies = IntVector.SPECIES_PREFERRED; | ||
| static final VectorSpecies<Long> lspecies = LongVector.SPECIES_PREFERRED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, no fp types ?
| .slice(1, ByteVector.fromArray(BSP, bsrc2, i)) | ||
| .intoArray(bdst, i); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this optimization also benefits the slice variant with mask, could you add some tests for it as well?
| .slice(i & (bspecies.length() - 1), ByteVector.fromArray(bspecies, bsrc2, i)) | ||
| .intoArray(bdst, i); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, add a benchmark for the slice variant with mask ?
|
|
||
| @Benchmark | ||
| public void shortVectorSliceWithConstantIndex1() { | ||
| for (int i = 0; i < sspecies.loopBound(sdst.length); i += bspecies.length()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo ? bspecies -> sspecies and the following cases.
| ByteVector.fromArray(BSP, bsrc1, i) | ||
| .slice(0, ByteVector.fromArray(BSP, bsrc2, i)) | ||
| .intoArray(bdst, i); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind adding a correctness check for these tests, for byte type, like:
@DontInline
static void verifyVectorSliceByte(int origin) {
for (int i = 0; i < BSP.loopBound(SIZE); i += BSP.length()) {
int index = i;
for (int j = i + origin; j < i + BSP.length(); j++) {
Asserts.assertEquals(bsrc1[j], bdst[index++]);
}
for (int j = i; j < i + origin; j++) {
Asserts.assertEquals(bsrc2[j], bdst[index++]);
}
}
}
| public void test16BSliceIndexByte() { | ||
| for (int i = 0; i < BSP.loopBound(SIZE); i += BSP.length()) { | ||
| ByteVector.fromArray(BSP, bsrc1, i) | ||
| .slice(16, ByteVector.fromArray(BSP, bsrc2, i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16 may out of bounds when this test is run with option -XX:MaxVectorSize=8
|
Hi @erifan , Thanks for your comments. I will address them soon, please keep reviewing in the meantime :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jatin-bhateja I have no further comments, great work. After this PR is merged, I will complete the backend optimization of the aarch64 part based on it. Thanks!
| ins_pipe( pipe_slow ); | ||
| %} | ||
|
|
||
| instruct vector_slice_const_origin_LT16B_reg(vec dst, vec src1, vec src2, immI origin) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| instruct vector_slice_const_origin_LT16B_reg(vec dst, vec src1, vec src2, immI origin) | |
| instruct vector_slice_const_origin_EQ16B_reg(vec dst, vec src1, vec src2, immI origin) |
Or
| instruct vector_slice_const_origin_LT16B_reg(vec dst, vec src1, vec src2, immI origin) | |
| instruct vector_slice_const_origin_16B_reg(vec dst, vec src1, vec src2, immI origin) |
| public static final VectorSpecies<Byte> BSP = ByteVector.SPECIES_PREFERRED; | ||
| public static final VectorSpecies<Short> SSP = ShortVector.SPECIES_PREFERRED; | ||
| public static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED; | ||
| public static final VectorSpecies<Long> LSP = LongVector.SPECIES_PREFERRED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to consider partial cases. I looked at the aarch64 situation and found that different implementations are needed for partial and non-partial cases. The test indices in test/jdk/jdk/incubator/vector/ are randomly generated, so it might be better to test different vector species here.
Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction.
It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails.
Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java).
Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
Performance numbers:
Please share your feedback.
Best Regards,
Jatin
Progress
Issue
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104$ git checkout pull/24104Update a local copy of the PR:
$ git checkout pull/24104$ git pull https://git.openjdk.org/jdk.git pull/24104/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24104View PR using the GUI difftool:
$ git pr show -t 24104Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24104.diff
Using Webrev
Link to Webrev Comment