Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8310268: RISC-V: misaligned memory access in String.Compare intrinsic #14534

Closed
wants to merge 9 commits into from

Conversation

VladimirKempik
Copy link

@VladimirKempik VladimirKempik commented Jun 19, 2023

Please review this fix. it eliminates misaligned loads in String.compare on risc-v

for small compares ( <= 72 bytes), the instrinsic in c2_MacroAssembler.cpp is used,
it reads ( in case of UU/LL) 8 bytes per loop, and at then end, it reads tail - misaligned load of last 8 bytes from the string.

so if string length is not 8x bytes long then last load is misaligned, also it performs read/compare of some data which already was processed.

I have changed that to compare only last length%8 bytes using SHORT_STRING part of intrinsic for UL/LU. But for UU/LL I have made an optimised version.

Thanks to optimisations for conditional branching at line 947 I’ve got no perf drop on thead ( with +AvoidUnalignedAccesses) which supports misaligned access.

Improvements to inflate_XX() methods gives 3-5% improvements for UL/LU cases on thead, almost no perf change on hifive.

for large strings, the instrinsics from stubGenerator.cpp is used
for UU/LL - generate_compare_long_string_same_encoding, I have just replaced misaligned ld with load_long_misaligned. Since this tail reading is not on hot path, this give some small penalty for thead when -XX:+AvoidUnalignedAccesses.

large LU/UL comparision is done in compare_long_string_different_encoding in sutbGenerator.cpp:
These changes are partially based on feilongjiang's patch, but I have changed tail reading to prevent reading past the end of string. I have observed no perf difference between feilongjiang's and my version.

This also enables regression test for string.Compare which previously was aarch64-only

Testing: tier1 and tier2 clean on hifive.

JMH testing, hifive:
before:

Benchmark                                   (delta)  (size)  Mode  Cnt     Score    Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9     6.474 ±  1.475  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9   125.823 ±  1.947  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    10.512 ±  0.236  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9    13.032 ±  0.821  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9    18.983 ±  0.318  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9    29.925 ±  0.458  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9    30.607 ±  0.963  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   150.772 ±  1.048  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    13.979 ±  0.868  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9    18.344 ±  0.640  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   567.667 ±  5.222  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9  1060.860 ±  3.348  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9  2054.307 ±  6.988  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  4025.110 ± 16.923  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  4030.565 ± 18.757  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  4447.476 ± 21.069  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9    16.458 ±  0.640  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9    20.713 ±  0.705  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   566.969 ±  3.499  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9  1061.740 ±  3.034  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9  2050.731 ±  6.786  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  4028.398 ± 16.138  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  4034.976 ± 19.330  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  4445.266 ± 13.599  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9     7.673 ±  0.933  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9     9.495 ±  1.561  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9    14.430 ±  0.370  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9    19.752 ±  0.927  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9    30.463 ±  0.770  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9    52.556 ±  0.588  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9    53.675 ±  0.544  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   173.248 ±  0.830  ms/op

after:

Benchmark                                   (delta)  (size)  Mode  Cnt    Score    Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    6.403 ±  1.252  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    8.473 ±  1.084  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9   11.301 ±  1.067  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9   15.615 ±  0.825  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   21.007 ±  0.683  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   32.836 ±  0.639  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   30.565 ±  1.188  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   33.284 ±  0.862  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9   14.089 ±  0.563  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   17.754 ±  0.622  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   32.248 ±  0.679  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   52.433 ±  4.008  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   91.470 ±  0.927  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  174.563 ±  1.376  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  178.459 ±  1.381  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  202.416 ± 13.131  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9   15.420 ±  0.192  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   19.283 ±  0.226  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   34.827 ±  1.365  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   52.482 ±  1.852  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   94.261 ±  1.677  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  179.576 ±  3.538  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  179.983 ±  1.184  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  180.987 ±  2.206  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9   10.841 ±  0.905  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9   11.709 ±  0.517  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9   15.979 ±  0.691  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   20.225 ±  0.623  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   31.435 ±  1.017  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   53.657 ±  1.057  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   53.921 ±  1.079  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   54.215 ±  0.655  ms/op

thead:

before:

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    2.592 ? 0.008  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    3.727 ? 0.115  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.277 ? 0.076  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9    8.733 ? 0.178  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   13.334 ? 0.448  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   19.843 ? 0.234  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   19.971 ? 0.093  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   21.052 ? 0.318  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.847 ? 0.188  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   10.890 ? 0.077  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   24.353 ? 0.277  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   40.850 ? 0.591  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   72.205 ? 0.352  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  134.445 ? 0.523  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  134.866 ? 1.081  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  138.925 ? 1.241  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9    9.532 ? 0.080  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.873 ? 0.304  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   26.725 ? 0.582  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   42.855 ? 0.451  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   74.674 ? 0.817  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  136.531 ? 0.530  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  136.446 ? 0.946  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  141.521 ? 1.371  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    4.106 ? 0.012  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    5.458 ? 0.237  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9    9.530 ? 0.155  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   13.908 ? 0.390  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.331 ? 0.221  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   33.734 ? 0.213  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   33.884 ? 0.201  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   35.376 ? 0.210  ms/op

after:

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    2.701 ? 0.128  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    3.552 ? 0.024  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.533 ? 0.049  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9   10.330 ? 0.090  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   15.484 ? 0.474  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   21.892 ? 0.366  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   20.239 ? 0.102  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   22.113 ? 0.302  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.150 ? 0.065  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   10.103 ? 0.227  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   22.354 ? 0.218  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   36.192 ? 0.374  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   65.506 ? 0.301  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  124.193 ? 0.337  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  126.693 ? 0.560  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  127.654 ? 0.598  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9    9.611 ? 0.558  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.892 ? 0.672  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   23.973 ? 0.570  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   38.567 ? 1.064  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   67.992 ? 0.986  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  126.910 ? 0.738  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  128.019 ? 0.508  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  127.973 ? 0.326  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    3.903 ? 0.020  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    4.911 ? 0.044  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9   10.019 ? 0.210  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   14.479 ? 0.217  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.668 ? 0.208  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   34.244 ? 0.240  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   34.582 ? 0.220  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   34.804 ? 0.199  ms/op

after, with -XX:-AvoidUnalignedAccesses:

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    2.641 ? 0.221  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    3.486 ? 0.047  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.528 ? 0.037  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9    9.002 ? 0.037  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   13.772 ? 0.185  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   20.139 ? 0.138  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   20.238 ? 0.322  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   20.928 ? 0.171  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.100 ? 0.113  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9    9.976 ? 0.213  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   21.896 ? 0.205  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   37.569 ? 0.113  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   66.993 ? 0.730  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  126.249 ? 2.044  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  126.366 ? 0.585  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  129.772 ? 0.538  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9   10.031 ? 0.305  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.185 ? 0.629  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   24.069 ? 0.539  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   40.073 ? 0.573  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   69.349 ? 0.609  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  128.168 ? 0.843  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  128.314 ? 0.584  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  132.530 ? 0.373  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    3.658 ? 0.160  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    4.496 ? 0.026  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9    9.397 ? 0.116  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   13.752 ? 0.409  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.053 ? 0.317  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   33.385 ? 0.242  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   33.808 ? 0.301  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   35.481 ? 0.480  ms/op


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8310268: RISC-V: misaligned memory access in String.Compare intrinsic (Enhancement - P4)

Reviewers

Contributors

  • Feilong Jiang <fjiang@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14534/head:pull/14534
$ git checkout pull/14534

Update a local copy of the PR:
$ git checkout pull/14534
$ git pull https://git.openjdk.org/jdk.git pull/14534/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14534

View PR using the GUI difftool:
$ git pr show -t 14534

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14534.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 19, 2023

👋 Welcome back vkempik! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 19, 2023
@openjdk
Copy link

openjdk bot commented Jun 19, 2023

@VladimirKempik The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jun 19, 2023
@mlbridge
Copy link

mlbridge bot commented Jun 19, 2023

@VladimirKempik
Copy link
Author

/contributor add fjiang

@openjdk
Copy link

openjdk bot commented Jun 19, 2023

@VladimirKempik
Contributor Feilong Jiang <fjiang@openjdk.org> successfully added.

@RealFYang
Copy link
Member

RealFYang commented Jun 26, 2023

Hi, I am looking at this but it really takes quite some time for me understand this. I have to say that the added logic for handling unaligned accesses makes the already complex assembly code even harder to read. And I am also worried about code maintainability as previously mentioned by @theRealAph about increasing checkings for AvoidUnalignedAccesses.

After a cursory look at this change, it seems that we could make use of the newly-added load_X_misaligned assemblers for the possibly unaligned memory accesses instead of adding the extra logic under AvoidUnalignedAccesses. This might not offer us the optimal performance but it does reduce code complexity.

PS: At the same time, I think it's time to distinguish support for unaligned access for different platforms when measuring performance. We might want to manually set AvoidUnalignedAccesses to false for platforms like T-HEAD running older kernels. Thanks to: https://bugs.openjdk.org/browse/JDK-8309258, we have the proper detection for new kernels. We would add more vender-specific settings in the future.

@VladimirKempik
Copy link
Author

Hi, I am looking at this but it really takes quite some time for me understand this. I have to say that the added logic for handling unaligned accesses makes the already complex assembly code even harder to read. And I am also worried about code maintainability as previously mentioned by @theRealAph about increasing checkings for AvoidUnalignedAccesses.

After a cursory look at this change, it seems that we could make use of the newly-added load_X_misaligned assemblers for the possibly unaligned memory accesses instead of adding the extra logic under AvoidUnalignedAccesses. This might not offer us the optimal performance but it does reduce code complexity.

This is really complex change, and it's last for string intrinsics we have.

I have used load_X_misaligned where it made sense ( in compare_long UU/LL)
in compare_long UL/LU the main comparision loop was doing misaligned loads ( 4 bytes aligned but not 8 bytes aligned) for latin string. hence it required complicated fix ( thanks Feilong Jiang).

The compare_short ( in c2_macroAssember) was doing too much conditional branches in one place,it was possible to slightly reduce it.
Thanks for looking at it.

@RealFYang
Copy link
Member

RealFYang commented Jun 26, 2023

The compare_short ( in c2_macroAssember) was doing too much conditional branches in one place,it was possible to slightly reduce it. Thanks for looking at it.

I haven't checked other changes, but for the possible unaligned accesses in file src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp:
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#LL958-L971

It looks that simple calls to load_X_misaligned assemblers with proper granularity will suffice, which would be much simpler.

@VladimirKempik
Copy link
Author

VladimirKempik commented Jun 26, 2023

The compare_short ( in c2_macroAssember) was doing too much conditional branches in one place,it was possible to slightly reduce it. Thanks for looking at it.

I haven't checked other changes, but for the possible unaligned accesses in file src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#LL958-L971

It looks that simple calls to load_X_misaligned assemblers with proper granularity will suffice, which would be much simpler.

I have made a simlpified version of c2_MacroAssembler_riscv.cpp patch - VladimirKempik@656af81

which is basically - if the tail is 8 bytes - use ld/lwu, otherwise use load_X_misaligned.
The results on hifive ( I have made a version of jmh test with small string lengths)

hifive, current pr:

Benchmark                                   (delta)  (size)  Mode  Cnt   Score   Error  Units
StringCompareToDifferentLength.compareToLL        2       7  avgt    9   7.826 ± 0.803  ms/op
StringCompareToDifferentLength.compareToLL        2       8  avgt    9   8.510 ± 0.884  ms/op
StringCompareToDifferentLength.compareToLL        2      15  avgt    9   7.171 ± 0.957  ms/op
StringCompareToDifferentLength.compareToLL        2      24  avgt    9   6.469 ± 0.701  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9   7.970 ± 0.578  ms/op
StringCompareToDifferentLength.compareToLU        2       7  avgt    9   8.700 ± 0.583  ms/op
StringCompareToDifferentLength.compareToLU        2       8  avgt    9   8.079 ± 0.910  ms/op
StringCompareToDifferentLength.compareToLU        2      15  avgt    9  11.577 ± 0.650  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9  13.612 ± 0.436  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9  17.866 ± 0.922  ms/op
StringCompareToDifferentLength.compareToUL        2       7  avgt    9   8.755 ± 0.561  ms/op
StringCompareToDifferentLength.compareToUL        2       8  avgt    9  10.201 ± 0.633  ms/op
StringCompareToDifferentLength.compareToUL        2      15  avgt    9  11.568 ± 0.459  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9  15.762 ± 0.630  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9  19.614 ± 0.677  ms/op
StringCompareToDifferentLength.compareToUU        2       7  avgt    9   7.463 ± 0.306  ms/op
StringCompareToDifferentLength.compareToUU        2       8  avgt    9   6.102 ± 0.978  ms/op
StringCompareToDifferentLength.compareToUU        2      15  avgt    9   8.144 ± 1.073  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9   9.413 ± 0.959  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9  11.012 ± 0.345  ms/op

hifive, from compare_lam_2 branch:

Benchmark                                   (delta)  (size)  Mode  Cnt   Score   Error  Units
StringCompareToDifferentLength.compareToLL        2       7  avgt    9   7.899 ± 0.761  ms/op
StringCompareToDifferentLength.compareToLL        2       8  avgt    9   8.635 ± 0.626  ms/op
StringCompareToDifferentLength.compareToLL        2      15  avgt    9   8.663 ± 0.647  ms/op
StringCompareToDifferentLength.compareToLL        2      24  avgt    9   7.015 ± 0.889  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9  10.199 ± 0.671  ms/op
StringCompareToDifferentLength.compareToLU        2       7  avgt    9   9.685 ± 0.991  ms/op
StringCompareToDifferentLength.compareToLU        2       8  avgt    9   8.402 ± 0.650  ms/op
StringCompareToDifferentLength.compareToLU        2      15  avgt    9  12.259 ± 0.753  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9  13.637 ± 0.828  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9  18.201 ± 0.994  ms/op
StringCompareToDifferentLength.compareToUL        2       7  avgt    9  11.866 ± 0.791  ms/op
StringCompareToDifferentLength.compareToUL        2       8  avgt    9  10.466 ± 0.568  ms/op
StringCompareToDifferentLength.compareToUL        2      15  avgt    9  14.092 ± 0.331  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9  15.518 ± 0.314  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9  19.325 ± 0.452  ms/op
StringCompareToDifferentLength.compareToUU        2       7  avgt    9   7.422 ± 0.565  ms/op
StringCompareToDifferentLength.compareToUU        2       8  avgt    9   6.409 ± 0.748  ms/op
StringCompareToDifferentLength.compareToUU        2      15  avgt    9   8.357 ± 0.983  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9   9.453 ± 0.911  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9  11.561 ± 0.745  ms/op

Clear performance degradation when we have to go into TAIL ( cases 15, 36 for LL. 7 and 15 for LU/UL)

@RealFYang
Copy link
Member

Clear performance degradation when we have to go into TAIL ( cases 15, 36 for LL. 7 and 15 for LU/UL)

It's a tradeoff which is acceptable to me provided that we already got most of the gain as compared to JDK mainline.
In fact, I meant something more simpler :

    if (str1_isL == str2_isL) { // LL or UU
      load_long_misaligned(tmp1, Address(str1), tmp3, isLL ? 1 : 2);
      load_long_misaligned(tmp2, Address(str2), tmp3, isLL ? 1 : 2);
    } else if (isLU) { // LU case
      load_int_misaligned(tmp1, Address(str1), tmp3, false);
      load_long_misaligned(tmp2, Address(str2), tmp3, 2);
      inflate_lo32(tmp3, tmp1);
      mv(tmp1, tmp3);
    } else { // UL case
      load_int_misaligned(tmp2, Address(str2), tmp3, false);
      load_long_misaligned(tmp1, Address(str1), tmp3, 2);
      inflate_lo32(tmp3, tmp2);
      mv(tmp2, tmp3);
    }

The preceding beqz added by VladimirKempik@656af81 will have a negative impact for platforms like T-head which have fast unaligned accesses.

@VladimirKempik
Copy link
Author

Ok, updated the PR, new results for hifive:

Benchmark                                   (delta)  (size)  Mode  Cnt   Score   Error  Units
StringCompareToDifferentLength.compareToLL        2       7  avgt    9   7.472 ± 0.351  ms/op
StringCompareToDifferentLength.compareToLL        2       8  avgt    9   8.228 ± 0.801  ms/op
StringCompareToDifferentLength.compareToLL        2      15  avgt    9   8.417 ± 0.735  ms/op
StringCompareToDifferentLength.compareToLL        2      24  avgt    9   9.052 ± 0.684  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9   9.587 ± 0.688  ms/op
StringCompareToDifferentLength.compareToLU        2       7  avgt    9   9.315 ± 0.552  ms/op
StringCompareToDifferentLength.compareToLU        2       8  avgt    9   9.360 ± 0.467  ms/op
StringCompareToDifferentLength.compareToLU        2      15  avgt    9  11.735 ± 0.279  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9  14.576 ± 0.617  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9  19.208 ± 0.901  ms/op
StringCompareToDifferentLength.compareToUL        2       7  avgt    9  12.049 ± 0.559  ms/op
StringCompareToDifferentLength.compareToUL        2       8  avgt    9  11.422 ± 0.410  ms/op
StringCompareToDifferentLength.compareToUL        2      15  avgt    9  13.699 ± 0.215  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9  16.343 ± 0.652  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9  19.905 ± 0.494  ms/op
StringCompareToDifferentLength.compareToUU        2       7  avgt    9   7.008 ± 0.869  ms/op
StringCompareToDifferentLength.compareToUU        2       8  avgt    9   6.996 ± 0.835  ms/op
StringCompareToDifferentLength.compareToUU        2      15  avgt    9   8.033 ± 0.800  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9  10.105 ± 0.591  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9  11.739 ± 0.758  ms/op

@VladimirKempik
Copy link
Author

Crossbuild is failing to fetch pkgs, built risc-v locally, failures aren't related to the patch

@VladimirKempik
Copy link
Author

VladimirKempik commented Jul 11, 2023

Now that 17u backport is done, can we please get back to this PR?
we have got compare case for small strings reviewed, only left the case for large strings, where the case for UU/LL is pretty simple ( just replacing ld with misaligned_load)
and the case for large UL/LU strings is the only complex thing

@RealFYang
Copy link
Member

RealFYang commented Jul 11, 2023

Now that 17u backport is done, can we please get back to this PR? we have got compare case for small strings reviewed, only left the case for large strings, where the case for UU/LL is pretty simple ( just replacing ld with misaligned_load) and the case for large UL/LU strings is the only complex thing

Yeah, I think I will be able to take another look tomorrow. Sorry for the delay.
I think the case for large UL/LU strings is getting more complex too. Is it possible to simplify it similarly with misaligned_load?

@VladimirKempik
Copy link
Author

the case for large LU/UL was made very unfriendly to the hifive ( and misaligned case in general) from the ground up.
Doing 4 (but not 8) bytes aligned LD on every iteration.

there is not big potential for simplifying

@RealFYang
Copy link
Member

RealFYang commented Jul 13, 2023

OK. I have went through the changes in function generate_compare_long_string_different_encoding and here is what I am thinking. I see there are two purposes for these changes:

  1. Make sure that the memory accesses happened in SMALL_LOOP 8-bytes aligned on strL;
  2. Avoid unaligned accessed in the loop tail (the if-else structure with AvoidUnalignedAccesses check);

I am fine with the changes for the first purpose. In fact, my previous comment was about the second one. Since it's the loop tail, making using of misaligned_load (and thus eliminating the if block) shouldn't affect much. Please consider that.

@VladimirKempik
Copy link
Author

OK. I have went through the changes in function generate_compare_long_string_different_encoding and here is what I am thinking. I see there are two purposes for these changes:

  1. Make sure that the memory accesses happened in SMALL_LOOP 8-bytes aligned on strL;
  2. Avoid unaligned accessed in the loop tail (the if-else structure with AvoidUnalignedAccesses check);

I am fine with the changes for the first purpose. In fact, my previous comment was about the second one. Since it's the loop tail, making using of misaligned_load (and thus eliminating the if block) shouldn't affect much. Please consider that.

THank you for looking at this, I'll try simplifying the tail and will test performance as soon as I find some time.

@VladimirKempik
Copy link
Author

VladimirKempik commented Jul 20, 2023

Results for latest update, from thead
what's now happening for tail - we move pointers to strU and strL back for some amount of bytes, to have tail of size 8 and 16 respectively. then process it the usual way but in misaligned manner
+AvoidUnaligned

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    4.000 ± 0.106  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    4.562 ± 0.089  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.536 ± 0.085  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9   10.341 ± 0.287  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   15.275 ± 0.249  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   21.731 ± 0.413  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   20.255 ± 0.287  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   22.114 ± 0.641  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.615 ± 0.032  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   10.566 ± 0.096  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   21.975 ± 0.288  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   36.078 ± 0.419  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   65.567 ± 0.715  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  124.196 ± 0.636  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  126.580 ± 1.431  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  129.830 ± 1.857  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9   10.386 ± 0.368  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.981 ± 0.271  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   23.726 ± 0.532  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   37.997 ± 0.482  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   67.834 ± 0.915  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  126.500 ± 0.771  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  128.853 ± 2.059  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  132.825 ± 3.318  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    4.013 ± 0.012  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    4.845 ± 0.148  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9   10.276 ± 0.313  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   14.338 ± 0.201  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.912 ± 0.550  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   34.264 ± 0.660  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   34.557 ± 0.252  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   34.841 ± 0.380  ms/op

-AvoidUnaligned

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    2.557 ± 0.034  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    3.507 ± 0.035  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.513 ± 0.033  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9    9.095 ± 0.210  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   13.666 ± 0.134  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   20.131 ± 0.234  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   20.115 ± 0.065  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   20.865 ± 0.224  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.091 ± 0.067  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9    9.883 ± 0.109  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   22.037 ± 0.327  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   35.914 ± 0.307  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   65.673 ± 1.075  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  124.257 ± 0.722  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  126.128 ± 0.453  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  129.413 ± 1.567  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9    9.661 ± 0.440  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.106 ± 0.290  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   23.903 ± 0.441  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   38.722 ± 1.049  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   67.640 ± 0.957  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  126.744 ± 1.904  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  129.400 ± 2.463  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  130.664 ± 1.380  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    3.662 ± 0.166  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    4.552 ± 0.217  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9    9.399 ± 0.270  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   13.688 ± 0.294  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.033 ± 0.290  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   33.512 ± 0.433  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   33.796 ± 0.435  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   33.983 ± 0.152  ms/op

hifive results to follow soon

@VladimirKempik
Copy link
Author

VladimirKempik commented Jul 21, 2023

Updated results on hifive:

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    8.694 ± 0.920  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    9.583 ± 0.840  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9   11.671 ± 0.710  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9   15.860 ± 0.980  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   21.407 ± 0.509  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   32.435 ± 0.622  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   30.888 ± 0.402  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   33.265 ± 0.489  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9   14.335 ± 0.182  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   18.220 ± 0.360  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   32.331 ± 1.166  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   50.462 ± 1.495  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   91.972 ± 1.635  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  174.093 ± 0.384  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  178.736 ± 2.627  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  183.112 ± 2.090  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9   16.344 ± 0.486  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   19.724 ± 0.628  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   34.062 ± 0.513  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   52.019 ± 1.212  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   93.111 ± 1.267  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  176.734 ± 0.695  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  179.531 ± 1.215  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  184.424 ± 1.069  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    9.966 ± 0.648  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9   11.915 ± 0.662  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9   15.422 ± 0.925  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   19.726 ± 0.581  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   31.103 ± 0.738  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   51.847 ± 0.861  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   53.083 ± 0.326  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   53.209 ± 0.477  ms/op

no significant difference in performance after simplification ( and unification with -AvoidUnaligned)

ping @RealFYang

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, Thanks for the update. Taking another look.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
@VladimirKempik
Copy link
Author

VladimirKempik commented Jul 24, 2023

updated results from hifive:
good perf improvements on compareToLU and compareToUL with sizes > 72

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    8.610 ± 0.524  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    9.623 ± 0.980  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9   11.483 ± 0.607  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9   15.931 ± 0.306  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   21.179 ± 0.179  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   32.687 ± 0.713  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   31.122 ± 0.580  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   33.225 ± 0.478  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9   15.019 ± 0.631  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   18.538 ± 1.178  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   30.966 ± 1.096  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   48.397 ± 1.622  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   87.368 ± 1.432  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  164.575 ± 0.816  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  167.250 ± 1.221  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  172.279 ± 1.525  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9   16.391 ± 0.456  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   19.760 ± 0.283  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   31.841 ± 0.888  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   49.545 ± 1.115  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   88.728 ± 0.877  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  166.147 ± 1.468  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  168.843 ± 1.251  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  173.655 ± 1.518  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    9.462 ± 0.572  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9   11.976 ± 0.696  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9   15.301 ± 0.673  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   19.836 ± 0.841  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   31.328 ± 0.619  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   52.381 ± 1.249  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   53.119 ± 1.195  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   53.588 ± 1.803  ms/op

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, would you mind one more tweak? Thanks.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Show resolved Hide resolved
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
@@ -913,9 +914,8 @@ void C2_MacroAssembler::string_compare(Register str1, Register str2,
addi(cnt1, cnt1, 8);
}
addi(cnt2, cnt2, isUL ? 4 : 8);
bne(tmp1, tmp2, DIFFERENCE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to place this bne before the preceding addi(cnt2, cnt2, isUL ? 4 : 8);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doubt so. Current code allows cnt2 to be calculated by the time the "bgez(cnt2, TAIL);" executed, if the cpu can execute two opcodes at once. So it benefits inorder dual-issue cpus (like hifive u74/unmatched)

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp Outdated Show resolved Hide resolved
@openjdk
Copy link

openjdk bot commented Jul 26, 2023

@VladimirKempik This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8310268: RISC-V: misaligned memory access in String.Compare intrinsic

Co-authored-by: Feilong Jiang <fjiang@openjdk.org>
Reviewed-by: fyang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 454 new commits pushed to the master branch:

  • 1f81e5b: 8312229: Crash involving yield, switch and anonymous classes
  • e9daf4a: 8312916: Remove remaining usages of -Xdebug from test/hotspot/jtreg
  • 117f42d: 8312625: Test serviceability/dcmd/vm/TrimLibcHeapTest.java failed: RSS use increased
  • 2d05d35: 8312979: Fix assembler_aarch64.hpp after JDK-8311847
  • 78a8a99: 8312488: tools/jpackage/share/AppLauncherEnvTest.java fails with dynamically linked libstdc++
  • cb82c95: 8312415: Expand -Xlint:serial checks to enum constants with specialized class bodies
  • c6396dc: 8039165: [Doc] MessageFormat null locale generates NullPointerException
  • 36f3bae: 8312401: SymbolTable::do_add_if_needed hangs when called in InstanceKlass::add_initialization_error path with requesting length exceeds max_symbol_length
  • e554fde: 8311592: ECKeySizeParameterSpec causes too many exceptions on third party providers
  • 9606cbc: 8312524: [JVMCI] serviceability/dcmd/compiler/CompilerQueueTest.java fails
  • ... and 444 more: https://git.openjdk.org/jdk/compare/83d92672d4c2637fc37ddd873533c85a9b083904...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 26, 2023
@VladimirKempik
Copy link
Author

tier1/tier2 are fine
/integrate

@openjdk
Copy link

openjdk bot commented Jul 28, 2023

Going to push as commit d6245b6.
Since your change was applied there have been 491 commits pushed to the master branch:

  • 402cb6a: 8312201: Clean up common behavior in "page writers" and "member writers"
  • 23755f9: 8312411: MessageFormat.formatToCharacterIterator() can be improved
  • e2cb0bc: 8313204: Inconsistent order of sections in generated class documentation
  • 4ae75ca: 8313253: Rename methods in javadoc Comparators class
  • e897041: 8312262: Klass::array_klass() should return ArrayKlass pointer
  • a9a3463: 8312416: Tests in Locale should have more descriptive names
  • d9559f9: 8312612: handle WideCharToMultiByte return values
  • 34173ff: 8312574: jdk/jdk/jfr/jvm/TestChunkIntegrity.java fails with timeout
  • 47c4b99: 8312121: Fix -Wconversion warnings in tribool.hpp
  • a3d6723: 8311033: [macos] PrinterJob does not take into account Sides attribute
  • ... and 481 more: https://git.openjdk.org/jdk/compare/83d92672d4c2637fc37ddd873533c85a9b083904...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 28, 2023
@openjdk openjdk bot closed this Jul 28, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 28, 2023
@openjdk
Copy link

openjdk bot commented Jul 28, 2023

@VladimirKempik Pushed as commit d6245b6.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@VladimirKempik VladimirKempik deleted the compare_lam branch July 28, 2023 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
2 participants