@jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Mar 18, 2025

This patch optimizes the Vector.slice operation with a constant index using the x86 ALIGNR instruction.
It also adds a new hybrid call generator that intrinsifies lazily and, when intrinsification fails, inlines the fallback implementation to avoid call overhead and boxing penalties when that fallback operates on vectors. The existing Vector API-based slice implementation is now the fallback code that gets inlined if intrinsification fails.

The idea is to add infrastructure that intrinsifies the fast path of selected vector APIs and otherwise inlines the fallback implementation when it is itself based on vector APIs. Existing call generators such as PredictedCallGenerator, used to handle bimorphic inlining, already use multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and is invoked during incremental inlining. It also relieves the inline expander of handling slow paths, which can easily be implemented on the library side (in Java).

Vector API jtreg tests pass at AVX level 2; remaining validation is in progress.

Performance numbers:


System : 13th Gen Intel(R) Core(TM) i3-1315U

Baseline:
Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2   9756.573          ops/ms

With opt:
Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  34122.435          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  33281.868          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9345.154          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   8283.247          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   8510.695          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   5626.367          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    960.958          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   4155.801          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1465.953          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  32748.061          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  33674.408          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2   9346.148          ops/ms
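
To read the two tables side by side, the per-benchmark speedup is the optimized score divided by the baseline score. The sketch below does that for a few rows copied from the tables above; the class and method names are illustrative and not part of the patch, and with Cnt = 2 and no error bars the ratios are only indicative.

```java
// Illustrative helper: speedup = optimized / baseline for selected rows
// copied from the baseline and "With opt" tables above.
class SliceSpeedup {
    static double speedup(double baseline, double optimized) {
        return optimized / baseline;
    }

    public static void main(String[] args) {
        String[] names = {
            "byteVectorSliceWithConstantIndex1",
            "intVectorSliceWithConstantIndex1",
            "longVectorSliceWithConstantIndex1",
            "longVectorSliceWithConstantIndex2",
        };
        double[] baseline  = {9444.444, 6085.825, 1651.334, 1642.784};
        double[] optimized = {34122.435, 8283.247, 960.958, 4155.801};
        for (int i = 0; i < names.length; i++) {
            // A ratio above 1.0 is an improvement, below 1.0 a slowdown.
            System.out.printf("%-36s %.2fx%n", names[i], speedup(baseline[i], optimized[i]));
        }
    }
}
```

A ratio below 1.0, as for longVectorSliceWithConstantIndex1 in this run, means the optimized score is lower than the baseline score.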

Please share your feedback.

Best Regards,
Jatin


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8303762: Optimize vector slice operation with constant index using VPALIGNR instruction (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104
$ git checkout pull/24104

Update a local copy of the PR:
$ git checkout pull/24104
$ git pull https://git.openjdk.org/jdk.git pull/24104/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24104

View PR using the GUI difftool:
$ git pr show -t 24104

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24104.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
bridgekeeper bot commented Mar 18, 2025

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
openjdk bot commented Mar 18, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk
openjdk bot commented Mar 18, 2025

@jatin-bhateja The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Mar 18, 2025
@jatin-bhateja
Member Author

jatin-bhateja commented Mar 18, 2025

/label add hotspot-compiler-dev

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 18, 2025
@openjdk
openjdk bot commented Mar 18, 2025

@jatin-bhateja
The hotspot-compiler label was successfully added.

@bridgekeeper
bridgekeeper bot commented May 13, 2025

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@jatin-bhateja
Member Author

jatin-bhateja commented May 18, 2025

/keepalive

@openjdk
openjdk bot commented May 18, 2025

@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset.

@openjdk
openjdk bot commented May 18, 2025

@jatin-bhateja this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8303762
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label May 18, 2025
@bridgekeeper
bridgekeeper bot commented Jul 13, 2025

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper bridgekeeper bot added oca Needs verification of OCA signatory status and removed oca Needs verification of OCA signatory status labels Jul 15, 2025
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Jul 24, 2025
@jatin-bhateja
Member Author

Performance after AVX2 backend modifications

Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  51644.530          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  48171.079          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9662.306          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2  14358.347          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2  14619.920          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6675.824          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    818.911          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   4778.321          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1612.264          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  35961.146          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  39072.170          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2  11209.685          ops/ms

@jatin-bhateja jatin-bhateja marked this pull request as ready for review July 25, 2025 13:40
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 25, 2025
@mlbridge
mlbridge bot commented Jul 25, 2025

@jatin-bhateja
Member Author

jatin-bhateja commented Jul 28, 2025

Performance on AVX512 machine

Baseline:
Benchmark                                                (size)   Mode  Cnt      Score       Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    4  35741.780 ±  1561.065  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    4  35011.929 ±  5886.902  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    4  32366.844 ±  1489.449  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    4  10636.281 ±   608.705  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    4  10750.833 ±   328.997  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    4  10257.338 ±  2027.422  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    4   5362.330 ±  4199.651  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    4   4992.399 ±  6053.641  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    4   4941.258 ±   478.193  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    4  40432.828 ± 26672.673  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    4  41300.811 ± 34342.482  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    4  36958.309 ±  1899.676  ops/ms

With opt:
Benchmark                                                (size)   Mode  Cnt      Score       Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt   10  67936.711 ±   389.783  ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt   10  70086.731 ±  5972.968  ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt   10  31879.187 ±   148.213  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt   10  17676.883 ±   217.238  ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt   10  16983.007 ±  3988.548  ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt   10   9851.266 ±    31.773  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt   10   9194.216 ±    42.772  ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt   10   8411.738 ±    33.209  ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt   10   5244.850 ±    12.214  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt   10  61233.526 ± 20472.895  ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt   10  61545.276 ± 20722.066  ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt   10  41208.718 ±  5374.829  ops/ms


@jatin-bhateja
Member Author

jatin-bhateja commented Aug 7, 2025

Adding additional notes on implementation:

A) Slice:-

  1. New inline expander and C2 IR node VectorSlice for leaf level intrinsic corresponding to Vector.slice(int)
  2. Other interfaces of slice APIs.
    • Vector.slice(int, Vector)
      • The second vector argument is the background vector, which replaces the zero broadcasted vector of the base version of API.
      • API internally calls the same intrinsic entry point as the base version.
    • Vector.slice(int, Vector, VectorMask)
      • This version of the API internally calls the above slice API with index and vector arguments, followed by an explicit blend with a broadcasted zero vector.

Thus, the current support implicitly covers all three variants of the slice API.
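
The layering described above can be sketched with a scalar reference model: one two-vector slice primitive, with the single-argument form passing a zero background and the masked form zeroing unset lanes afterwards. This is only an illustration of the semantics as described in this comment; the class name and the array-based representation are assumptions, not code from the patch.

```java
import java.util.Arrays;

// Scalar reference model of the three slice variants (illustrative names).
class SliceVariantsModel {
    // slice(origin, background): lanes origin..vlen-1 of v, then the leading lanes of background.
    static int[] slice(int[] v, int origin, int[] background) {
        int vlen = v.length;
        if (origin < 0 || origin > vlen) {
            throw new IndexOutOfBoundsException("origin=" + origin); // guard for this model
        }
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            r[i] = (j < vlen) ? v[j] : background[j - vlen];
        }
        return r;
    }

    // slice(origin): the same operation with an all-zero background.
    static int[] slice(int[] v, int origin) {
        return slice(v, origin, new int[v.length]);
    }

    // slice(origin, background, mask): two-vector slice, then unset lanes become zero.
    static int[] slice(int[] v, int origin, int[] background, boolean[] mask) {
        int[] r = slice(v, origin, background);
        for (int i = 0; i < r.length; i++) {
            if (!mask[i]) {
                r[i] = 0;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(slice(a, 1)));       // [2, 3, 4, 0]
        System.out.println(Arrays.toString(slice(a, 1, b)));    // [2, 3, 4, 10]
        System.out.println(Arrays.toString(slice(a, 1, b, m))); // [2, 0, 4, 0]
    }
}
```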

B) Similar extensions to optimize Unslice with constant index:-

  1. Similar to slice, unslice also has three interfaces.
  2. Leaf-level interface only accepts an index argument.
  3. Other variants of unslice accept unslice index, background vector, and part number.
  4. We can assume the receiver vector to be sliding over two contiguously placed background vectors.
  5. It's possible to implement all three variants of unslice using slice operations as follows.

jshell> // Input
jshell> vec
vec ==> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
jshell> vec2
vec2 ==> [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
jshell> bzvec
bzvec ==> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

jshell> // Case 1:
jshell> vec.unslice(4)
$79 ==> [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
jshell> bzvec.slice(vec.length() - 4, vec)
$80 ==> [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

jshell> // Case 2:
jshell> vec.unslice(4, vec2, 0)
$81 ==> [10, 20, 30, 40, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
jshell> vec2.blend(vec2.slice(vec2.length() - 4, vec), VectorMask.fromLong(IntVector.SPECIES_512, ((1L << (vec.length() - 4)) - 1) << 4))
$82 ==> [10, 20, 30, 40, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

jshell> // Case 3:
jshell> vec.unslice(4, vec2, 1)
$83 ==> [13, 14, 15, 16, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
jshell> vec2.blend(vec.slice(vec.length() - 4, vec2), VectorMask.fromLong(IntVector.SPECIES_512, ((1L << 4) - 1)))
$84 ==> [13, 14, 15, 16, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]

jshell> // Case 4:
jshell> vec.unslice(4, vec2, 0, VectorMask.fromLong(IntVector.SPECIES_512, 0xFF))
$85 ==> [10, 20, 30, 40, 1, 2, 3, 4, 5, 6, 7, 8, 130, 140, 150, 160]
jshell> // Current Java fallback implementation for this version is based on slice and unslice operations.
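
Cases 1 to 3 can also be checked without the Vector API using a scalar model. In the sketch below, the direct definition of unslice is inferred from the jshell outputs above, and all names are illustrative rather than taken from the patch.

```java
import java.util.Arrays;

// Scalar model (illustrative names): unslice expressed through slice + blend.
class UnsliceViaSliceModel {
    // v.slice(origin, bg): lanes origin.. of v, followed by the leading lanes of bg.
    static int[] slice(int[] v, int origin, int[] bg) {
        int vlen = v.length;
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            r[i] = (j < vlen) ? v[j] : bg[j - vlen];
        }
        return r;
    }

    // a.blend(b, mask): lane i comes from b when bit i of maskBits is set, else from a.
    static int[] blend(int[] a, int[] b, long maskBits) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = ((maskBits >>> i) & 1L) != 0 ? b[i] : a[i];
        }
        return r;
    }

    // Direct definition of v.unslice(origin, w, part), inferred from the jshell outputs.
    static int[] unslice(int[] v, int origin, int[] w, int part) {
        int vlen = v.length;
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            if (part == 0) {
                r[i] = (i < origin) ? w[i] : v[i - origin];
            } else {
                r[i] = (i < origin) ? v[vlen - origin + i] : w[i];
            }
        }
        return r;
    }

    // Builds [scale, 2*scale, ..., n*scale], matching vec and vec2 in the transcript.
    static int[] ramp(int n, int scale) {
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            r[i] = (i + 1) * scale;
        }
        return r;
    }

    // Returns true when the three slice-based rewrites match the direct definition.
    static boolean check(int origin) {
        int vlen = 16;
        int[] vec = ramp(vlen, 1), vec2 = ramp(vlen, 10), zero = new int[vlen];
        long high = ((1L << (vlen - origin)) - 1) << origin; // lanes origin..vlen-1
        long low = (1L << origin) - 1;                       // lanes 0..origin-1
        boolean c1 = Arrays.equals(unslice(vec, origin, zero, 0),
                                   slice(zero, vlen - origin, vec));
        boolean c2 = Arrays.equals(unslice(vec, origin, vec2, 0),
                                   blend(vec2, slice(vec2, vlen - origin, vec), high));
        boolean c3 = Arrays.equals(unslice(vec, origin, vec2, 1),
                                   blend(vec2, slice(vec, vlen - origin, vec2), low));
        return c1 && c2 && c3;
    }

    public static void main(String[] args) {
        System.out.println("origin 4 matches: " + check(4));
    }
}
```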

To ease the review process, I plan to optimize the unslice API with a constant index by extending the newly added expander in a follow-up patch.

@XiaohongGong XiaohongGong left a comment

Thanks for your work @jatin-bhateja! This PR also helps AArch64, where we plan to do the same intrinsification.

return false; // operand unboxing failed
}

Node* origin_node = gvn().intcon(origin->get_con() * type2aelembytes(elem_bt));

Q1: Is it possible to just pass origin->get_con() to VectorSliceNode, in case some architectures need it directly? Or maybe we'd better add a comment noting that the origin passed to VectorSliceNode is adjusted to bytes.

Q2: If origin is not a constant and an architecture supports a variable index, will the code crash here? Can we just limit the origin to a constant for this intrinsification in this PR? We can consider extending it to variables if any architecture has such a requirement. WDYT?

Member Author

Q1: Is it possible to just pass origin->get_con() to VectorSliceNode, in case some architectures need it directly? Or maybe we'd better add a comment noting that the origin passed to VectorSliceNode is adjusted to bytes.

Added comments.

Q2: If origin is not a constant and an architecture supports a variable index, will the code crash here? Can we just limit the origin to a constant for this intrinsification in this PR? We can consider extending it to variables if any architecture has such a requirement. WDYT?

Currently, the inline expander only supports a constant origin. I have added a check that fails intrinsification and inlines the fallback using the hybrid call generator.
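
The byte adjustment discussed in Q1 (origin multiplied by the element size) can be illustrated with a scalar model: a lane-level slice of int vectors gives the same result as a byte-level slice whose origin is scaled by Integer.BYTES. This is a sketch of the arithmetic only, with illustrative names; it does not model the lane-wise behavior of any particular x86 instruction.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative model: lane origin vs. byte origin for an int vector.
class ByteOriginModel {
    static byte[] toBytes(int[] lanes) {
        ByteBuffer bb = ByteBuffer.allocate(lanes.length * Integer.BYTES)
                                  .order(ByteOrder.LITTLE_ENDIAN);
        for (int x : lanes) {
            bb.putInt(x);
        }
        return bb.array();
    }

    static int[] toInts(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] lanes = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < lanes.length; i++) {
            lanes[i] = bb.getInt();
        }
        return lanes;
    }

    // Byte-granular slice: bytes byteOrigin.. of src1, then the leading bytes of src2.
    static byte[] sliceBytes(byte[] src1, byte[] src2, int byteOrigin) {
        byte[] r = new byte[src1.length];
        for (int i = 0; i < r.length; i++) {
            int j = i + byteOrigin;
            r[i] = (j < src1.length) ? src1[j] : src2[j - src1.length];
        }
        return r;
    }

    // Lane-granular slice implemented through the byte-granular one.
    static int[] sliceInts(int[] v1, int[] v2, int laneOrigin) {
        // Mirrors the scaling in the expander: origin * element size in bytes.
        int byteOrigin = laneOrigin * Integer.BYTES;
        return toInts(sliceBytes(toBytes(v1), toBytes(v2), byteOrigin));
    }

    public static void main(String[] args) {
        int[] v1 = {1, 2, 3, 4};
        int[] v2 = {10, 20, 30, 40};
        // Lane origin 1 corresponds to byte origin 4 for int lanes.
        System.out.println(Arrays.toString(sliceInts(v1, v2, 1))); // [2, 3, 4, 10]
    }
}
```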

Thanks for the update! So maybe the matcher function supports_vector_slice_with_non_constant_index() could also be removed entirely?

Member Author

Yes, the idea here is just to intrinsify the particular scenario where the slice index is a constant value, and not to burden the inline expander with full-blown intrinsification of all possible control paths, without impacting performance.


class VectorSliceNode : public VectorNode {
public:
VectorSliceNode(Node* vec1, Node* vec2, Node* origin, const TypeVect* vt)

Do we have special values for origin, like zero or vlen? If so, it may be worth adding a simple Identity as well.

Member Author

@jatin-bhateja jatin-bhateja Aug 12, 2025

Do we have special values for origin, like zero or vlen? If so, it may be worth adding a simple Identity as well.

Done, thanks! I also added a new IR test to complement the code changes.
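
The two special origins mentioned in the question follow directly from the slice definition: with origin 0 the result is the first input unchanged, and with origin equal to the vector length it is the second input. The scalar check below illustrates this under that definition; the names are illustrative and this is not the Identity code added to VectorSliceNode.

```java
import java.util.Arrays;

// Illustrative scalar check of the two trivial slice origins.
class SliceIdentityModel {
    // v1.slice(origin, v2): lanes origin.. of v1, followed by the leading lanes of v2.
    static int[] slice(int[] v1, int origin, int[] v2) {
        int vlen = v1.length;
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            r[i] = (j < vlen) ? v1[j] : v2[j - vlen];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 2, 3, 4};
        int[] v2 = {10, 20, 30, 40};
        // origin == 0 selects v1 entirely; origin == vlen selects v2 entirely.
        System.out.println(Arrays.equals(slice(v1, 0, v2), v1));
        System.out.println(Arrays.equals(slice(v1, v1.length, v2), v2));
    }
}
```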

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(jvmArgs = {"--add-modules=jdk.incubator.vector"})
public class VectorSliceBenchmark {

I remember that there are micro benchmarks for slice/unslice under test/micro/org/openjdk/bench/jdk/incubator/vector/operation on panama-vector. Can we reuse those JMH benchmarks to check the improvement?

Member Author

@jatin-bhateja jatin-bhateja Aug 13, 2025

I remember that there are micro benchmarks for slice/unslice under test/micro/org/openjdk/bench/jdk/incubator/vector/operation on panama-vector. Can we reuse those JMH benchmarks to check the improvement?

All of those use a variable slice index. Slice kernel performance of those benchmarks on AVX2 and AVX512 targets is on par with baseline, and the deviations are statistically insignificant given the error margins.

The new benchmark complements the code.

OK. Makes sense to me. Thanks!

@openjdk
openjdk bot commented Aug 13, 2025

@jatin-bhateja Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@bridgekeeper
bridgekeeper bot commented Sep 17, 2025

@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@jatin-bhateja
Member Author

/keepalive

@openjdk
openjdk bot commented Sep 23, 2025

@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset.

"Ljdk/internal/vm/vector/VectorSupport$Vector;" \
"Ljdk/internal/vm/vector/VectorSupport$Vector;" \
"Ljdk/internal/vm/vector/VectorSupport$VectorSliceOp;)" \
"Ljdk/internal/vm/vector/VectorSupport$Vector;") \
Contributor

Seems this \ is not aligned?

"Ljdk/internal/vm/vector/VectorSupport$Vector;" \
"Ljdk/internal/vm/vector/VectorSupport$VectorSliceOp;)" \
"Ljdk/internal/vm/vector/VectorSupport$Vector;") \
do_name(vector_slice_name, "sliceOp") \
Contributor

ditto

Comment on lines +42 to +45
public static final VectorSpecies<Byte> BSP = ByteVector.SPECIES_PREFERRED;
public static final VectorSpecies<Short> SSP = ShortVector.SPECIES_PREFERRED;
public static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED;
public static final VectorSpecies<Long> LSP = LongVector.SPECIES_PREFERRED;
Contributor

The implementation supports floating point types, but why doesn't the test include fp types?

Contributor

It might be better to consider partial cases. I looked at the aarch64 situation and found that different implementations are needed for partial and non-partial cases. The test indices in test/jdk/jdk/incubator/vector/ are randomly generated, so it might be better to test different vector species here.

Comment on lines +56 to +59
static final VectorSpecies<Byte> bspecies = ByteVector.SPECIES_PREFERRED;
static final VectorSpecies<Short> sspecies = ShortVector.SPECIES_PREFERRED;
static final VectorSpecies<Integer> ispecies = IntVector.SPECIES_PREFERRED;
static final VectorSpecies<Long> lspecies = LongVector.SPECIES_PREFERRED;
Contributor

Ditto, no fp types?

.slice(1, ByteVector.fromArray(BSP, bsrc2, i))
.intoArray(bdst, i);
}
}
Contributor

Since this optimization also benefits the slice variant with mask, could you add some tests for it as well?

.slice(i & (bspecies.length() - 1), ByteVector.fromArray(bspecies, bsrc2, i))
.intoArray(bdst, i);
}
}
Contributor

Ditto, add a benchmark for the slice variant with mask?


@Benchmark
public void shortVectorSliceWithConstantIndex1() {
for (int i = 0; i < sspecies.loopBound(sdst.length); i += bspecies.length()) {
Contributor

Typo? bspecies -> sspecies, and likewise in the following cases.
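
To see why the stride matters, the sketch below counts loop iterations for a short-vector loop stepped by the short lane count versus the byte lane count. The lane counts assume 512-bit vectors and the names are illustrative; the loop-bound rounding follows the documented behavior of VectorSpecies.loopBound (largest multiple of the vector length not exceeding the array length).

```java
// Illustrative model of the loop-stride issue; lane counts assume 512-bit vectors.
class StrideMismatchModel {
    // Iterations of: for (int i = 0; i < loopBound(arrayLength); i += stride)
    static int iterations(int arrayLength, int vlen, int stride) {
        int loopBound = arrayLength - (arrayLength % vlen); // largest multiple of vlen <= arrayLength
        int count = 0;
        for (int i = 0; i < loopBound; i += stride) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int size = 1024;     // benchmark array size
        int shortLanes = 32; // assumed ShortVector lane count at 512 bits
        int byteLanes = 64;  // assumed ByteVector lane count at 512 bits
        System.out.println("stride = sspecies.length(): " + iterations(size, shortLanes, shortLanes));
        System.out.println("stride = bspecies.length(): " + iterations(size, shortLanes, byteLanes));
    }
}
```

Under these assumptions the loop stepped by the byte lane count processes only every other short vector, so the benchmark would do about half the intended work per invocation.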

ByteVector.fromArray(BSP, bsrc1, i)
.slice(0, ByteVector.fromArray(BSP, bsrc2, i))
.intoArray(bdst, i);
}
Contributor

Would you mind adding a correctness check for these tests, for byte type, like:

    @DontInline
    static void verifyVectorSliceByte(int origin) {
        for (int i = 0; i < BSP.loopBound(SIZE); i += BSP.length()) {
            int index = i;
            for (int j = i + origin; j < i + BSP.length(); j++) {
                Asserts.assertEquals(bsrc1[j], bdst[index++]);
            }
            for (int j = i; j < i + origin; j++) {
                Asserts.assertEquals(bsrc2[j], bdst[index++]);
            }
        }
    }

public void test16BSliceIndexByte() {
for (int i = 0; i < BSP.loopBound(SIZE); i += BSP.length()) {
ByteVector.fromArray(BSP, bsrc1, i)
.slice(16, ByteVector.fromArray(BSP, bsrc2, i))
Contributor

16 may be out of bounds when this test is run with the option -XX:MaxVectorSize=8

@jatin-bhateja
Member Author

Hi @erifan, thanks for your comments. I will address them soon; please keep reviewing in the meantime :-)

Contributor

@erifan erifan left a comment

@jatin-bhateja I have no further comments, great work. After this PR is merged, I will complete the backend optimization of the aarch64 part based on it. Thanks!

ins_pipe( pipe_slow );
%}

instruct vector_slice_const_origin_LT16B_reg(vec dst, vec src1, vec src2, immI origin)
Contributor

Suggested change
instruct vector_slice_const_origin_LT16B_reg(vec dst, vec src1, vec src2, immI origin)
instruct vector_slice_const_origin_EQ16B_reg(vec dst, vec src1, vec src2, immI origin)

Or

Suggested change
instruct vector_slice_const_origin_LT16B_reg(vec dst, vec src1, vec src2, immI origin)
instruct vector_slice_const_origin_16B_reg(vec dst, vec src1, vec src2, immI origin)

Labels

core-libs core-libs-dev@openjdk.org graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review
