Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors #2520

Closed
wants to merge 4 commits into from

Conversation

@sviswa7
Copy link

@sviswa7 sviswa7 commented Feb 11, 2021

The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.

JBS: https://bugs.openjdk.java.net/browse/JDK-8261542

The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.

Before:
Benchmark (size) Mode Cnt Score Error Units
PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms
PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms

After:
Benchmark (size) Mode Cnt Score Error Units
PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms
PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520
$ git checkout pull/2520

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Feb 11, 2021

👋 Welcome back sviswanathan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Feb 11, 2021

@sviswa7 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Loading

@openjdk openjdk bot added the hotspot label Feb 11, 2021
@sviswa7
Copy link
Author

@sviswa7 sviswa7 commented Feb 11, 2021

/label hotspot-compiler

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Feb 11, 2021

@sviswa7
The hotspot-compiler label was successfully added.

Loading

@sviswa7 sviswa7 marked this pull request as ready for review Feb 11, 2021
@openjdk openjdk bot added the rfr label Feb 11, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 11, 2021

Loading

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Please, add a test which verifies correctness of results when this code is used. If we don't have it already.

Loading

@@ -7500,13 +7501,24 @@ instruct rearrangeB(vec dst, vec shuffle) %{
ins_pipe( pipe_slow );
%}

instruct rearrangeB_avx(vec dst, vec src, vec shuffle) %{
instruct rearrangeB_avx(legVec dst, legVec src, vec shuffle, legVec vtmp1, legVec vtmp2, rRegP scratch) %{
predicate(vector_element_basic_type(n) == T_BYTE &&
vector_length(n) == 32 && !VM_Version::supports_avx512_vbmi());
Copy link
Contributor

@vnkozlov vnkozlov Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Predicate matches bail-out condition in match_rule_supported_vector(). Does it mean this code never used before?
So you are implementing it now. Right?

Loading

Copy link
Author

@sviswa7 sviswa7 Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this rule was not used before and I am implementing it now.

Loading

match(Set dst (VectorLoadShuffle src));
effect(TEMP dst, TEMP vtmp, TEMP scratch);
format %{ "vector_load_shuffle $dst, $src\t! using $vtmp and $scratch as TEMP" %}
ins_encode %{
// Create a byte shuffle mask from short shuffle mask
// only byte shuffle instruction available on these platforms
int vlen_in_bytes = vector_length_in_bytes(this);
if (UseAVX == 0) {
Copy link
Contributor

@vnkozlov vnkozlov Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will not be executed with vector length 16 because match_rule_supported_vector() bailout with (size_in_bits == 256 && UseAVX < 2).

Loading

Copy link
Author

@sviswa7 sviswa7 Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right, but this code will execute for vector length 16 when UseAVX ==2.
It will also execure for vector length 16 when UseAVX == 3 &&
!VM_Version::supports_avx512bw.

Loading

Copy link
Contributor

@vnkozlov vnkozlov Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next assert checks <= 16 when code is guarded by (UseAVX == 0). It is not (UseAVX ==2).
Also } else { case is for UseAVX > 0 which includes AVX=1 but vpaddb() (avx3) is used there.
Seems UseAVX checks wrong here.

Loading

Copy link
Author

@sviswa7 sviswa7 Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert checks for vlen_in_bytes <= 16 (128 bits) and so is a correct check for UseAVX=0.
vpaddb is supported on AVX1/AVX2 as well.
vpaddb is supported on AVX1 for up to 128 bit and
on AVX2 for upto 256 bit and
on AVX3 (512) for upto 512 bit vectors.
I have tested this for UseAVX=0, UseAVX=1, UseAVX=2, UseAVX=3 platform.

The check is for UseAVX as with any flavor of AVX, we can use less number of instructions to do this operation.
This is because AVX allows destination to be separate from both the sources.

Please let me know if I am missing something.

Loading

Copy link
Contributor

@vnkozlov vnkozlov Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad - I missed that size is in bytes in assert. The assert is correct, as you said.
And } else { part works for AVX1 because of match_rule_supported_vector() bailout 256-bit case.
May be add assert(UseAVX > 1 || vlen_in_bytes <= 16, ).

I only have one question left - about check >= 256 in match_rule_supported_vector()

Loading

Copy link
Author

@sviswa7 sviswa7 Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the following assert on else path:

  •  assert(UseAVX > 1 || vlen_in_bytes <= 16, "required");
    

Loading

Copy link
Contributor

@vnkozlov vnkozlov Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

Loading

@sviswa7
Copy link
Author

@sviswa7 sviswa7 commented Feb 18, 2021

@vnkozlov thanks a lot for the review.
The test for slice and unslice are already part of test/jdk/jdk/incubator/vector/Byte256VectorTests.java and Short256VectorTests.java.

Loading

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Feb 18, 2021

@vnkozlov thanks a lot for the review.
The test for slice and unslice are already part of test/jdk/jdk/incubator/vector/Byte256VectorTests.java and Short256VectorTests.java.

Good.

Loading

@@ -1693,9 +1694,9 @@ const bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType
return false; // Implementation limitation due to how shuffle is loaded
} else if (size_in_bits == 256 && UseAVX < 2) {
Copy link
Contributor

@vnkozlov vnkozlov Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be >= 256?

Loading

Copy link
Author

@sviswa7 sviswa7 Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general >= 256 part is taken care of early on in match_rule_supported_vector as below:
if (!vector_size_supported(bt, vlen)) {
return false;
}
The only additional check that is being done here is for float and double 256 bit vectors that are supported on AVX=1 and will pass the vector_size_supported check.
This is because the VectorLoadShuffle cannot be performed for 256 bit vectors on AVX1 platform as it needs "integer" 256 bit instructions which are only available on AVX2.

Loading

Copy link
Contributor

@vnkozlov vnkozlov Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

Loading

@@ -1693,9 +1694,9 @@ const bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType
return false; // Implementation limitation due to how shuffle is loaded
} else if (size_in_bits == 256 && UseAVX < 2) {
Copy link
Contributor

@vnkozlov vnkozlov Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

Loading

match(Set dst (VectorLoadShuffle src));
effect(TEMP dst, TEMP vtmp, TEMP scratch);
format %{ "vector_load_shuffle $dst, $src\t! using $vtmp and $scratch as TEMP" %}
ins_encode %{
// Create a byte shuffle mask from short shuffle mask
// only byte shuffle instruction available on these platforms
int vlen_in_bytes = vector_length_in_bytes(this);
if (UseAVX == 0) {
Copy link
Contributor

@vnkozlov vnkozlov Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Feb 19, 2021

@sviswa7 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Reviewed-by: kvn, neliasso

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 113 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

Loading

@openjdk openjdk bot added the ready label Feb 19, 2021
Copy link
Contributor

@neliasso neliasso left a comment

Looks good.

Loading

@sviswa7
Copy link
Author

@sviswa7 sviswa7 commented Feb 19, 2021

/integrate

Loading

@openjdk openjdk bot closed this Feb 19, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Feb 19, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Feb 19, 2021

@sviswa7 Since your change was applied there have been 123 commits pushed to the master branch:

  • 8b4fd77: 8262042: ProblemList javax/xml/jaxp/unittest/common/prettyprint/PrettyPrintTest.java on Windows
  • 7ffa148: 8247918: Clarify Reader.skip behavior for end of stream
  • 8a1c712: 8261728: SimpleDateFormat should link to DateTimeFormatter
  • 851b2e3: 6245663: (spec) File.renameTo(File) changes the file-system object, not the File instance
  • c4f17a3: 8257925: enable more support for nested inline tags
  • 433096a: 8261744: Implement CharsetDecoder ASCII and latin-1 fast-paths
  • efbaede: 8262018: Wrong format in SAP copyright header of OsVersionTest
  • 55463b0: 8261984: Shenandoah: Remove unused ShenandoahPushWorkerQueuesScope class
  • a180a38: 8260694: (fc) Clarify FileChannel.transferFrom to better describe "no bytes available" case
  • 1b0c36b: 8261649: AArch64: Optimize LSE atomics in C++ code
  • ... and 113 more: https://git.openjdk.java.net/jdk/compare/40754f12f768ed1b32cbda187b0f43e9ee25c5c9...master

Your commit was automatically rebased without conflicts.

Pushed as commit c53acc2.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants