Skip to content

8317721: RISC-V: Implement CRC32 intrinsic #17046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 26 commits into from

Conversation

ArsenyBochkarev
Copy link
Contributor

@ArsenyBochkarev ArsenyBochkarev commented Dec 11, 2023

Hi everyone! Please review this port of AArch64 _updateBytesCRC32, _updateByteBufferCRC32 and _updateCRC32 intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version.

Correctness checks

Tier 1/2 tests are ok.

Performance results on T-Head board

Results for enabled intrinsic:

Used test is test/micro/org/openjdk/bench/java/util/TestCRC32.java

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 24 3730.929 37.773 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 24 2126.673 2.032 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 24 1134.330 6.714 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 24 584.017 2.267 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 24 151.173 0.346 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 24 19.113 0.008 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 24 4.647 0.022 ops/ms

Results for disabled intrinsic:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 15 798.365 35.486 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 15 677.756 46.619 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 15 552.781 27.143 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 15 429.304 12.518 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 15 166.738 0.935 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 15 25.060 0.034 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 15 6.196 0.030 ops/ms

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8317721: RISC-V: Implement CRC32 intrinsic (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046
$ git checkout pull/17046

Update a local copy of the PR:
$ git checkout pull/17046
$ git pull https://git.openjdk.org/jdk.git pull/17046/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17046

View PR using the GUI difftool:
$ git pr show -t 17046

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17046.diff

Webrev

Link to Webrev Comment

@ArsenyBochkarev
Copy link
Contributor Author

/cc hotspot-compiler

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 11, 2023

👋 Welcome back ArsenyBochkarev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler hotspot-compiler-dev@openjdk.org labels Dec 11, 2023
@openjdk
Copy link

openjdk bot commented Dec 11, 2023

@ArsenyBochkarev
The hotspot-compiler label was successfully added.

@mlbridge
Copy link

mlbridge bot commented Dec 11, 2023

@ArsenyBochkarev
Copy link
Contributor Author

ArsenyBochkarev commented Dec 11, 2023

Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:

-XX:-UseZba:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 3563.320 3.326 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 1928.837 2.234 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 1005.273 1.953 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 512.550 1.718 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 130.396 0.341 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 16.319 0.073 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 3.913 0.011 ops/ms

-XX:+UseZba:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 4206.654 0.547 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 2308.843 3.565 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 1214.727 0.305 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 623.173 0.651 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 158.965 0.376 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 19.934 0.055 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 4.730 0.007 ops/ms

-XX:-UseCRC32Intrinsics:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 1390.530 42.217 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 1109.742 24.201 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 805.345 12.155 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 520.965 5.651 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 169.591 0.747 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 22.624 0.139 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.430 0.016 ops/ms

@robehn
Copy link
Contributor

robehn commented Dec 12, 2023

Thanks! (not a review, just an ack)

There are two other version we probably need also, using carry-less-multiplication.
There is a scalar clmul in Zbc and there is a vclmul in vector.

Copy link

@Hamlin-Li Hamlin-Li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments

@ArsenyBochkarev
Copy link
Contributor Author

ArsenyBochkarev commented Dec 21, 2023

Performance measurements on the same benchmark, on T-Head board. I used the -XX:TieredStopAtLevel=1 flag:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 3617.860 17.463 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 2253.626 4.739 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 1242.516 83.245 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 687.339 1.712 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 183.016 0.258 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 23.368 0.133 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.640 0.023 ops/ms

Current results with no such flag:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 4270.963 47.669 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 2448.523 72.738 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 1360.692 30.246 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 709.729 7.413 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 183.903 1.051 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 23.320 0.038 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.651 0.017 ops/ms

@Hamlin-Li
Copy link

Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:

-XX:-UseZba:

Benchmark	(count)	Mode	Cnt	Score	Error	Units

CRC32.TestCRC32.testCRC32Update	512	thrpt	12	512.550	1.718	ops/ms
CRC32.TestCRC32.testCRC32Update	2048	thrpt	12	130.396	0.341	ops/ms
CRC32.TestCRC32.testCRC32Update	16384	thrpt	12	16.319	0.073	ops/ms
CRC32.TestCRC32.testCRC32Update	65536	thrpt	12	3.913	0.011	ops/ms

-XX:+UseZba:

Benchmark	(count)	Mode	Cnt	Score	Error	Units

CRC32.TestCRC32.testCRC32Update	512	thrpt	12	623.173	0.651	ops/ms
CRC32.TestCRC32.testCRC32Update	2048	thrpt	12	158.965	0.376	ops/ms
CRC32.TestCRC32.testCRC32Update	16384	thrpt	12	19.934	0.055	ops/ms
CRC32.TestCRC32.testCRC32Update	65536	thrpt	12	4.730	0.007	ops/ms

-XX:-UseCRC32Intrinsics:

Benchmark	(count)	Mode	Cnt	Score	Error	Units

CRC32.TestCRC32.testCRC32Update	512	thrpt	12	520.965	5.651	ops/ms
CRC32.TestCRC32.testCRC32Update	2048	thrpt	12	169.591	0.747	ops/ms
CRC32.TestCRC32.testCRC32Update	16384	thrpt	12	22.624	0.139	ops/ms
CRC32.TestCRC32.testCRC32Update	65536	thrpt	12	5.430	0.016	ops/ms

Seems there is regression when count >= 512, especially when count >= 2048. And I suppose that big message is common case for CRC32 usage?

@ArsenyBochkarev
Copy link
Contributor Author

ArsenyBochkarev commented Dec 22, 2023

Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:

-XX:-UseZba:

Benchmark	(count)	Mode	Cnt	Score	Error	Units

CRC32.TestCRC32.testCRC32Update	512	thrpt	12	512.550	1.718	ops/ms
CRC32.TestCRC32.testCRC32Update	2048	thrpt	12	130.396	0.341	ops/ms
CRC32.TestCRC32.testCRC32Update	16384	thrpt	12	16.319	0.073	ops/ms
CRC32.TestCRC32.testCRC32Update	65536	thrpt	12	3.913	0.011	ops/ms

-XX:+UseZba:

Benchmark	(count)	Mode	Cnt	Score	Error	Units

CRC32.TestCRC32.testCRC32Update	512	thrpt	12	623.173	0.651	ops/ms
CRC32.TestCRC32.testCRC32Update	2048	thrpt	12	158.965	0.376	ops/ms
CRC32.TestCRC32.testCRC32Update	16384	thrpt	12	19.934	0.055	ops/ms
CRC32.TestCRC32.testCRC32Update	65536	thrpt	12	4.730	0.007	ops/ms

-XX:-UseCRC32Intrinsics:

Benchmark	(count)	Mode	Cnt	Score	Error	Units

CRC32.TestCRC32.testCRC32Update	512	thrpt	12	520.965	5.651	ops/ms
CRC32.TestCRC32.testCRC32Update	2048	thrpt	12	169.591	0.747	ops/ms
CRC32.TestCRC32.testCRC32Update	16384	thrpt	12	22.624	0.139	ops/ms
CRC32.TestCRC32.testCRC32Update	65536	thrpt	12	5.430	0.016	ops/ms

Seems there is regression when count >= 512, especially when count >= 2048. And I suppose that big message is common case for CRC32 usage?

Hmm, I don't know about common CRC32 count parameter sizes, maybe others know? 😓 Or maybe anyone knows if there are other ways to optimize such plain version of intrinsic more?

Copy link
Member

@RealFYang RealFYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having another look.

@Hamlin-Li
Copy link

Thanks for updating and continuous refinement.

image

Seems the performance gain (last column in the picture) introduced by intrinsic is getting less and less when the data size increasing.
So IMHO, when data size is big enough, it brings performance regression rather than performance gain.

@ArsenyBochkarev
Copy link
Contributor Author

ArsenyBochkarev commented Apr 2, 2024

@Hamlin-Li I modified test/micro/org/openjdk/bench/java/util/TestCRC32.java a bit to see if the regression happens on increased data, and run it on VisionFive2 with Zba enabed:

Enabled intrinsic

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 131072 thrpt 40 2.841 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 262144 thrpt 40 1.420 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 524288 thrpt 40 0.709 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 2097152 thrpt 40 0.176 0.001 ops/ms

Disabled intrinsic

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 131072 thrpt 40 2.729 0.003 ops/ms
CRC32.TestCRC32.testCRC32Update 262144 thrpt 40 1.367 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 524288 thrpt 40 0.684 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 2097152 thrpt 40 0.170 0.001 ops/ms
(count) enabled/disabled
131072 1,041040674
262144 1,038771031
524288 1.036549708
2097152 1,035294118

So since there are no regressions compared to C2-generated code with -XX:+UseZba, how about making the intrinsic Zba-exclusive?

@Hamlin-Li
Copy link

Thanks for updating.
Seems fine, but I'm not sure. Maybe see how others think about it.

Just FYI, as the trend of performance gain in this implementation is less and less as the data size grow larger, so I wonder if the CRC algorithm used in this implementation is optimal enough. Seems there're other more advanced algorithms which are supposed to bring more optimistic performance gains, and some of these algorithms are already implemented on other platforms in jdk.

@bridgekeeper
Copy link

bridgekeeper bot commented May 17, 2024

@ArsenyBochkarev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@ArsenyBochkarev
Copy link
Contributor Author

Hello again everyone! I made two changes to this PR:

  1. Now the intrinsic is Zba-exclusive.
  2. I used the slli + srliw combination instead of srli + andi and rescheduled them a bit, making use of t6 register.

Current results on StarFive VisionFive2 (with Zba) are:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 4644.129 9.566 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 2911.927 12.866 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 1538.630 5.463 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 799.100 3.216 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 205.947 0.236 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 25.880 0.069 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 6.022 0.020 ops/ms

Results for disabled intrinsic on VisionFive2 (taken from here)

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 1390.530 42.217 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 1109.742 24.201 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 805.345 12.155 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 520.965 5.651 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 169.591 0.747 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 22.624 0.139 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.430 0.016 ops/ms

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 19, 2024

@ArsenyBochkarev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@ArsenyBochkarev
Copy link
Contributor Author

Hi all! Can anyone take another look, please?

Current numbers on VisionFive2:
with intrinsic enabled:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 4644.129 9.566 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 2911.927 12.866 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 1538.630 5.463 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 799.100 3.216 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 205.947 0.236 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 25.880 0.069 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 6.022 0.020 ops/ms
CRC32.TestCRC32.testCRC32Update 131072 thrpt 40 3.030 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 262144 thrpt 40 1.514 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 524288 thrpt 40 0.757 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 2097152 thrpt 40 0.188 0.001 ops/ms

and without intrinsic enabled:

Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 64 thrpt 12 1390.530 42.217 ops/ms
CRC32.TestCRC32.testCRC32Update 128 thrpt 12 1109.742 24.201 ops/ms
CRC32.TestCRC32.testCRC32Update 256 thrpt 12 805.345 12.155 ops/ms
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 520.965 5.651 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 169.591 0.747 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 22.624 0.139 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.430 0.016 ops/ms
CRC32.TestCRC32.testCRC32Update 131072 thrpt 40 2.729 0.003 ops/ms
CRC32.TestCRC32.testCRC32Update 262144 thrpt 40 1.367 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 524288 thrpt 40 0.684 0.001 ops/ms
CRC32.TestCRC32.testCRC32Update 2097152 thrpt 40 0.170 0.001 ops/ms

Copy link
Contributor

@robehn robehn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good!

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 1, 2024
@ArsenyBochkarev
Copy link
Contributor Author

Thanks everyone for all the comments! Sanity checks for ./test/hotspot/jtreg/compiler/codegen/CRCTest.java and test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java are ok.

@ArsenyBochkarev
Copy link
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jul 1, 2024
@openjdk
Copy link

openjdk bot commented Jul 1, 2024

@ArsenyBochkarev
Your change (at version 204e6ca) is now ready to be sponsored by a Committer.

@luhenry
Copy link
Member

luhenry commented Jul 1, 2024

/sponsor

@openjdk
Copy link

openjdk bot commented Jul 1, 2024

Going to push as commit 2f4f6cc.
Since your change was applied there have been 1388 commits pushed to the master branch:

  • 3ca2bcd: 8335060: ClassCastException after JDK-8294960
  • 747e1e4: 8334295: CTW: update modules
  • 0a6ffa5: 8261242: [Linux] OSContainer::is_containerized() returns true when run outside a container
  • 71e3798: 8335308: compiler/uncommontrap/DeoptReallocFailure.java times out with SerialGC on Windows
  • c7e9ebb: 8331732: [PPC64] Unify and optimize code which converts != 0 to 1
  • 53242cd: 8335283: Build failure due to 'no_sanitize' attribute directive ignored
  • d9bcf06: 8335217: Fix memory ordering in ClassLoaderData::ChunkedHandleList
  • bb18498: 8335349: jcmd VM.classloaders "fold" option should be optional
  • 8350b1d: 8335294: Fix simple -Wzero-as-null-pointer-constant warnings in gc code
  • 5d866bf: 8335252: Reduce size of j.u.Formatter.Conversion#isValid
  • ... and 1378 more: https://git.openjdk.org/jdk/compare/fb390d202c8bbbbb87ba48fd01387feb35a1b768...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 1, 2024
@openjdk openjdk bot closed this Jul 1, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Jul 1, 2024
@openjdk
Copy link

openjdk bot commented Jul 1, 2024

@luhenry @ArsenyBochkarev Pushed as commit 2f4f6cc.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

9 participants