Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8318217: RISC-V: C2 VectorizedHashCode #16629

Closed
wants to merge 20 commits into from

Conversation

ygaevsky
Copy link
Contributor

@ygaevsky ygaevsky commented Nov 13, 2023

Hello All,

Please review these changes to support _vectorizedHashCode intrinsic on
RISC-V platform. The patch adds the "scalar" code for the intrinsic without
usage of any RVV instruction but provides manual unrolling of the appropriate
loop. The code with usage of RVV instruction could be added as follow-up of
the patch or independently.

Thanks,
-Yuri Gaevsky

P.S. My OCA has been accepted recently (ygaevsky).

Correctness checks

Testing: tier1 tests successfully passed on a RISC-V StarFive JH7110 board with Linux.

Performance results (the numbers for non-ints are similar)

StarFive JH7110 board:


ArraysHashCode:              without intrinsic      with intrinsic
-------------------------------------------------------------------------------
Benchmark  (size)  Mode  Cnt       Score     Error       Score     Error  Units
-------------------------------------------------------------------------------
multiints       0  avgt   30       2.658 ?   0.001       2.661 ?   0.004  ns/op
multiints       1  avgt   30       4.881 ?   0.011       4.892 ?   0.015  ns/op
multiints       2  avgt   30      16.109 ?   0.041      10.451 ?   0.075  ns/op
multiints       3  avgt   30      14.873 ?   0.068      11.753 ?   0.024  ns/op
multiints       4  avgt   30      17.283 ?   0.078      13.176 ?   0.044  ns/op
multiints       5  avgt   30      19.691 ?   0.136      14.723 ?   0.046  ns/op
multiints       6  avgt   30      21.727 ?   0.166      15.463 ?   0.124  ns/op
multiints       7  avgt   30      23.790 ?   0.126      18.298 ?   0.059  ns/op
multiints       8  avgt   30      23.527 ?   0.116      18.267 ?   0.046  ns/op
multiints       9  avgt   30      27.981 ?   0.303      20.453 ?   0.069  ns/op
multiints      10  avgt   30      26.947 ?   0.215      20.541 ?   0.051  ns/op
multiints      50  avgt   30      95.373 ?   0.588      69.238 ?   0.208  ns/op
multiints     100  avgt   30     177.109 ?   0.525     137.852 ?   0.417  ns/op
multiints     200  avgt   30     341.074 ?   1.363     296.832 ?   0.725  ns/op
multiints     500  avgt   30     847.993 ?   1.713     752.415 ?   1.918  ns/op
multiints    1000  avgt   30    1610.199 ?   5.424    1426.112 ?   3.407  ns/op
multiints   10000  avgt   30   16234.260 ?  26.789   14447.936 ?  26.345  ns/op
multiints  100000  avgt   30  170726.025 ? 184.003  152587.649 ? 381.964  ns/op
-------------------------------------------------------------------------------

T-Head RVB-ICE board:

ArraysHashCode:              without intrinsic      with intrinsic
------------------------------------------------------------------------------
Benchmark  (size)  Mode  Cnt       Score     Error      Score     Error  Units
------------------------------------------------------------------------------
multiints       0  avgt   30       2.780 ?   0.022      2.816 ?   0.038  ns/op
multiints       1  avgt   30       5.073 ?   0.032      5.101 ?   0.064  ns/op
multiints       2  avgt   30      14.656 ?   0.234     10.974 ?   0.118  ns/op
multiints       3  avgt   30      12.890 ?   0.064     14.168 ?   0.096  ns/op
multiints       4  avgt   30      13.715 ?   0.092     12.552 ?   0.188  ns/op
multiints       5  avgt   30      19.068 ?   0.172     13.557 ?   0.164  ns/op
multiints       6  avgt   30      18.863 ?   0.122     14.848 ?   0.086  ns/op
multiints       7  avgt   30      23.155 ?   0.123     17.300 ?   0.360  ns/op
multiints       8  avgt   30      21.656 ?   0.130     19.214 ?   1.689  ns/op
multiints       9  avgt   30      27.223 ?   0.116     20.450 ?   0.088  ns/op
multiints      10  avgt   30      26.194 ?   0.116     19.463 ?   0.411  ns/op
multiints      50  avgt   30      97.904 ?   1.396     69.662 ?   1.729  ns/op
multiints     100  avgt   30     179.607 ?   0.766    137.127 ?   3.156  ns/op
multiints     200  avgt   30     303.845 ?   2.743    235.476 ?   3.446  ns/op
multiints     500  avgt   30     752.295 ?   2.131    571.157 ?   4.258  ns/op
multiints    1000  avgt   30    1404.489 ?   5.839   1048.263 ?   3.473  ns/op
multiints   10000  avgt   30   13797.829 ?  44.821  10344.752 ?  24.183  ns/op
multiints  100000  avgt   30  135067.307 ? 254.361  98806.823 ? 131.562  ns/op
------------------------------------------------------------------------------

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8318217: RISC-V: C2 VectorizedHashCode (Sub-task - P5)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16629/head:pull/16629
$ git checkout pull/16629

Update a local copy of the PR:
$ git checkout pull/16629
$ git pull https://git.openjdk.org/jdk.git pull/16629/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16629

View PR using the GUI difftool:
$ git pr show -t 16629

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16629.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 13, 2023

👋 Welcome back ygaevsky! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 13, 2023
@openjdk
Copy link

openjdk bot commented Nov 13, 2023

@ygaevsky The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Nov 13, 2023
@mlbridge
Copy link

mlbridge bot commented Nov 13, 2023

@ygaevsky ygaevsky changed the title 8318217: RISC-V: C2 VectorizedHashCode RFR: 8318217: RISC-V: C2 VectorizedHashCode Nov 13, 2023
@openjdk openjdk bot removed the rfr Pull request is ready for review label Nov 13, 2023
@ygaevsky ygaevsky changed the title RFR: 8318217: RISC-V: C2 VectorizedHashCode 8318217: RISC-V: C2 VectorizedHashCode Nov 13, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 13, 2023
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/riscv.ad Show resolved Hide resolved
src/hotspot/cpu/riscv/riscv.ad Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/stubRoutines_riscv.cpp Outdated Show resolved Hide resolved
@Hamlin-Li
Copy link

Hamlin-Li commented Nov 14, 2023

The code with usage of RVV instruction could be added as follow-up of
the patch or independently.

Hey @ygaevsky, I can work on this real vectorized intrinsic implementation, please let me know how you think about it.
If you already had a solution or started working on it, please ignore my message.

Thanks.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/stubRoutines_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/riscv.ad Show resolved Hide resolved
src/hotspot/cpu/riscv/riscv.ad Show resolved Hide resolved
@ygaevsky
Copy link
Contributor Author

macos-x64/windows-x64 failures look unrelated.

Copy link

@Hamlin-Li Hamlin-Li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating. Looks good to me.
Just one minor comment.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
@Hamlin-Li
Copy link

It's all in the pr #16453, I don't have much other information.
The major perf opt method of that pr is do the loads from the buffer more incrementally instead of all in one go, which I think is the opposite you're doing here. Seems the major difference between this and that pr is, in that pr it has lots of more work to do between the incremental load.

@ygaevsky
Copy link
Contributor Author

ygaevsky commented Dec 7, 2023

Thanks @Hamlin-Li, I see now what you mean, that's interesting.

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, glad to see the performance numbers are back to normal. Would you mind two more tweaks? Thanks.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp Outdated Show resolved Hide resolved
@ygaevsky
Copy link
Contributor Author

ygaevsky commented Dec 8, 2023

Hi, glad to see the performance numbers are back to normal. Would you mind two more tweaks? Thanks.

Thank you for the suggestions, fixed, please take a look.

@ygaevsky
Copy link
Contributor Author

ygaevsky commented Dec 8, 2023

The failures above seem unrelated.

@RealFYang
Copy link
Member

RealFYang commented Dec 11, 2023

Thanks for the update. So I gave it a second try and some tunning. I see up to 7%+ extra improvement on licheepi-4a board (T-Head C910) with following small add-on change (no obvious change on unmatched board). This materializes the powers of 31 with direct mv instructions and avoids loading elements from _arrays_hashcode_powers_of_31 array which would involve calculation of the array address. We could further remove the _arrays_hashcode_powers_of_31 array then.

diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
index 11cbcaa48a1..fe82b7a4e74 100644
--- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
+++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
@@ -1493,16 +1493,16 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res

   beqz(cnt, DONE);

-  addiw(pow31_2, zr, 961);       // [31^^2]
   andi(chunks, cnt, ~(stride-1));
   beqz(chunks, TAIL);

+  mv(pow31_4, 923521); // [31^^4]
+  mv(pow31_3, 29791);  // [31^^3]
+  mv(pow31_2, 961);    // [31^^2]
+
   slli(chunks_end, chunks, chunks_end_shift);
   add(chunks_end, ary, chunks_end);
   andi(cnt, cnt, stride-1);      // don't forget about tail!
-  ld(pow31_4, ExternalAddress(StubRoutines::riscv::arrays_hashcode_powers_of_31()
-                              + 0 * sizeof(jint))); // [31^^3:31^^4]
-  srli(pow31_3, pow31_4, 32);

   bind(WIDE_LOOP);
   mulw(result, result, pow31_4); // 31^^4 * h
  1. licheepi-4a / without addon fix:
Benchmark                   (size)  Mode  Cnt      Score    Error  Units
ArraysHashCode.bytes             1  avgt   15     21.327 ?  0.035  ns/op
ArraysHashCode.bytes            10  avgt   15     33.195 ?  0.166  ns/op
ArraysHashCode.bytes           100  avgt   15    154.175 ?  3.433  ns/op
ArraysHashCode.bytes         10000  avgt   15  12318.680 ? 25.131  ns/op
ArraysHashCode.chars             1  avgt   15     20.965 ?  0.598  ns/op
ArraysHashCode.chars            10  avgt   15     33.097 ?  0.117  ns/op
ArraysHashCode.chars           100  avgt   15    153.510 ?  0.280  ns/op
ArraysHashCode.chars         10000  avgt   15  11881.690 ? 44.507  ns/op
ArraysHashCode.ints              1  avgt   15     21.330 ?  0.070  ns/op
ArraysHashCode.ints             10  avgt   15     33.409 ?  0.225  ns/op
ArraysHashCode.ints            100  avgt   15    154.254 ?  0.650  ns/op
ArraysHashCode.ints          10000  avgt   15  11833.894 ? 73.945  ns/op
ArraysHashCode.multibytes        1  avgt   15      3.468 ?  0.046  ns/op
ArraysHashCode.multibytes       10  avgt   15     12.412 ?  0.126  ns/op
ArraysHashCode.multibytes      100  avgt   15     75.963 ?  0.267  ns/op
ArraysHashCode.multibytes    10000  avgt   15   6587.068 ? 53.064  ns/op
ArraysHashCode.multichars        1  avgt   15      3.437 ?  0.042  ns/op
ArraysHashCode.multichars       10  avgt   15     13.019 ?  0.118  ns/op
ArraysHashCode.multichars      100  avgt   15     82.657 ?  0.244  ns/op
ArraysHashCode.multichars    10000  avgt   15   6743.844 ? 80.474  ns/op
ArraysHashCode.multiints         1  avgt   15      3.409 ?  0.036  ns/op
ArraysHashCode.multiints        10  avgt   15     13.102 ?  0.140  ns/op
ArraysHashCode.multiints       100  avgt   15     82.864 ?  1.002  ns/op
ArraysHashCode.multiints     10000  avgt   15   7107.843 ? 69.506  ns/op
ArraysHashCode.multishorts       1  avgt   15      3.475 ?  0.033  ns/op
ArraysHashCode.multishorts      10  avgt   15     12.923 ?  0.108  ns/op
ArraysHashCode.multishorts     100  avgt   15     82.498 ?  0.450  ns/op
ArraysHashCode.multishorts   10000  avgt   15   6744.477 ? 22.576  ns/op
ArraysHashCode.shorts            1  avgt   15     21.337 ?  0.077  ns/op
ArraysHashCode.shorts           10  avgt   15     33.236 ?  0.114  ns/op
ArraysHashCode.shorts          100  avgt   15    154.099 ?  0.421  ns/op
ArraysHashCode.shorts        10000  avgt   15  11876.918 ? 41.767  ns/op
  1. licheepi-4a / with add-on change:
Benchmark                   (size)  Mode  Cnt      Score    Error  Units
ArraysHashCode.bytes             1  avgt   15     21.311 ?  0.036  ns/op
ArraysHashCode.bytes            10  avgt   15     32.113 ?  0.124  ns/op
ArraysHashCode.bytes           100  avgt   15    150.476 ?  0.635  ns/op
ArraysHashCode.bytes         10000  avgt   15  11639.521 ? 16.383  ns/op
ArraysHashCode.chars             1  avgt   15     21.329 ?  0.041  ns/op
ArraysHashCode.chars            10  avgt   15     32.315 ?  0.466  ns/op
ArraysHashCode.chars           100  avgt   15    151.996 ?  1.008  ns/op
ArraysHashCode.chars         10000  avgt   15  10957.449 ? 23.898  ns/op
ArraysHashCode.ints              1  avgt   15     21.323 ?  0.035  ns/op
ArraysHashCode.ints             10  avgt   15     32.416 ?  0.170  ns/op
ArraysHashCode.ints            100  avgt   15    152.277 ?  0.555  ns/op
ArraysHashCode.ints          10000  avgt   15  11019.286 ? 53.589  ns/op
ArraysHashCode.multibytes        1  avgt   15      3.450 ?  0.026  ns/op
ArraysHashCode.multibytes       10  avgt   15     12.204 ?  0.171  ns/op
ArraysHashCode.multibytes      100  avgt   15     78.433 ?  0.357  ns/op
ArraysHashCode.multibytes    10000  avgt   15   6654.488 ? 19.664  ns/op
ArraysHashCode.multichars        1  avgt   15      3.443 ?  0.043  ns/op
ArraysHashCode.multichars       10  avgt   15     12.364 ?  0.087  ns/op
ArraysHashCode.multichars      100  avgt   15     78.246 ?  0.540  ns/op
ArraysHashCode.multichars    10000  avgt   15   6455.363 ? 30.115  ns/op
ArraysHashCode.multiints         1  avgt   15      3.441 ?  0.019  ns/op
ArraysHashCode.multiints        10  avgt   15     12.493 ?  0.063  ns/op
ArraysHashCode.multiints       100  avgt   15     78.485 ?  0.587  ns/op
ArraysHashCode.multiints     10000  avgt   15   6843.608 ? 82.197  ns/op
ArraysHashCode.multishorts       1  avgt   15      3.466 ?  0.029  ns/op
ArraysHashCode.multishorts      10  avgt   15     12.369 ?  0.144  ns/op
ArraysHashCode.multishorts     100  avgt   15     78.172 ?  0.580  ns/op
ArraysHashCode.multishorts   10000  avgt   15   6446.791 ? 13.104  ns/op
ArraysHashCode.shorts            1  avgt   15     20.971 ?  0.574  ns/op
ArraysHashCode.shorts           10  avgt   15     32.002 ?  0.642  ns/op
ArraysHashCode.shorts          100  avgt   15    152.359 ?  0.692  ns/op
ArraysHashCode.shorts        10000  avgt   15  10968.816 ? 31.404  ns/op

@ygaevsky
Copy link
Contributor Author

Avoiding memory accesses is always good idea, so your suggestion makes perfect sense, thanks, fixed. I've re-checked that the performance is at least not worse on THead/SiFive/StarFive.

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated change looks good to me. Thanks for your patience.

@openjdk
Copy link

openjdk bot commented Dec 11, 2023

@ygaevsky This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8318217: RISC-V: C2 VectorizedHashCode

Reviewed-by: mli, fyang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 410 new commits pushed to the master branch:

  • ce4b257: 8320886: Unsafe_SetMemory0 is not guarded
  • b270f30: 8318629: G1: Refine code a bit in G1RemSetTrackingPolicy::update_at_allocate
  • 486594d: 8316657: Support whitebox testing in microbenchmarks
  • ce8399f: 8321582: yield .class not parsed correctly.
  • 3c6459e: 8321641: ClassFile ModuleAttribute.ModuleAttributeBuilder::moduleVersion spec contains a copy-paste error
  • 92fd490: 8321176: [Screencast] make a second attempt on screencast failure
  • d13302f: 8321387: SegmentAllocator:allocateFrom(AddressLayout, MemorySegment) does not throw stated UnsupportedOperationException
  • ce10844: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays)
  • 5c12a18: 8320790: Update --release 22 symbol information for JDK 22 build 27
  • 7180088: 8321429: (fc) FileChannel.lock creates a FileKey containing two long index values, they could be stored as int values
  • ... and 400 more: https://git.openjdk.org/jdk/compare/4d650fe85fe780cf69070184d049a423cbc7d20e...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@Hamlin-Li, @RealFYang) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 11, 2023
@ygaevsky
Copy link
Contributor Author

Updated change looks good to me. Thanks for your patience.

Thank you for the thorough review and suggestions, @RealFYang.

Copy link

@Hamlin-Li Hamlin-Li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@ygaevsky
Copy link
Contributor Author

Thanks for your review/suggestions, @Hamlin-Li!

@ygaevsky
Copy link
Contributor Author

/sponsor

@openjdk
Copy link

openjdk bot commented Dec 12, 2023

@ygaevsky Only Committers are allowed to sponsor changes.

@ygaevsky
Copy link
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Dec 12, 2023
@openjdk
Copy link

openjdk bot commented Dec 12, 2023

@ygaevsky
Your change (at version cf988f0) is now ready to be sponsored by a Committer.

@VladimirKempik
Copy link

/sponsor

@openjdk
Copy link

openjdk bot commented Dec 12, 2023

Going to push as commit 6359b4e.
Since your change was applied there have been 410 commits pushed to the master branch:

  • ce4b257: 8320886: Unsafe_SetMemory0 is not guarded
  • b270f30: 8318629: G1: Refine code a bit in G1RemSetTrackingPolicy::update_at_allocate
  • 486594d: 8316657: Support whitebox testing in microbenchmarks
  • ce8399f: 8321582: yield .class not parsed correctly.
  • 3c6459e: 8321641: ClassFile ModuleAttribute.ModuleAttributeBuilder::moduleVersion spec contains a copy-paste error
  • 92fd490: 8321176: [Screencast] make a second attempt on screencast failure
  • d13302f: 8321387: SegmentAllocator:allocateFrom(AddressLayout, MemorySegment) does not throw stated UnsupportedOperationException
  • ce10844: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays)
  • 5c12a18: 8320790: Update --release 22 symbol information for JDK 22 build 27
  • 7180088: 8321429: (fc) FileChannel.lock creates a FileKey containing two long index values, they could be stored as int values
  • ... and 400 more: https://git.openjdk.org/jdk/compare/4d650fe85fe780cf69070184d049a423cbc7d20e...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Dec 12, 2023
@openjdk openjdk bot closed this Dec 12, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Dec 12, 2023
@openjdk
Copy link

openjdk bot commented Dec 12, 2023

@VladimirKempik @ygaevsky Pushed as commit 6359b4e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants