8275914: SHA3: changing java implementation to help C2 create high-performance code #6847
bulasevich wants to merge 2 commits into openjdk:master
@kuksenko You have improved this code before; could you share your opinion?
I think the benchmark results table tells more. :)
Hi, Boris, I'd go with option 2 for this patch and option 1 in a separate JBS issue. Safer and more straightforward that way. The only change I'd make to the option 2 patch is to try to replace raw constants with derivations of DM. E.g. in smChi() replace 0, 5, 10, 15, 20, and 25 with 0*DM, 1*DM, 2*DM, 3*DM, 4*DM, and 5*DM. In smTheta, replace a[0] with a[0*DM], a[1] with a[0*DM+1], etc. But if doing so causes C2 to fail to constant fold, revert to raw constants.

Also, going with option 2 first leaves a progression record that might be useful to future compiler optimizer implementors. The option 1 patch looks good too.

Sorry this fell off my stack with the holidays and everything else going on. Thanks
Hi, Thanks for looking at this!
In my original post I included the test suite and KAT. I have not found a ready-to-use one, so I wrote it from scratch using the test vectors provided with the KECCAK Final Algorithm Package.
There are before/after MessageDigests benchmark results for option 1/option 2 on ARM32/PPC/ARM64/AMD: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html - is it OK?

Ah yes. Sorry, I saw the replied email that left that out.

Please update the copyright date from 2020 to 2022.
… to help C2 with SHA3 impl
dd88946 to 4d15fea
/integrate

Going to push as commit c74b8f4.
Your commit was automatically rebased without conflicts.

@bulasevich Pushed as commit c74b8f4. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Background
The goal is to improve SHA3 implementation performance, as it runs up to two times slower than the native implementation (OpenSSL, measured on AMD64 and AArch64). Some hardware provides SHA3 accelerators, but most (AMD64 and most AArch64) does not.
For AArch64 hardware that does support SHA3 accelerators, an intrinsic was implemented for ARMv8.2 with SHA3 instruction support:
The SHA3 Java implementation has already been reworked to eliminate memory consumption and make it work faster:
The latter issue addressed this previously, but there is still some room to improve SHA3 performance with the current C2 implementation, which is what this PR proposes.
Option 1 (this PR)
With this change I unroll the inner loops manually, inline them manually, and use local variables (not an array) for data processing. This approach gives the best performance (see benchmark results). Without this change (with the current array-based data processing) we observe many more load/store operations than with processing in local variables, both on AMD64 and on AArch64.
Native implementations show that on AArch64 (32 general-purpose registers) the SHA-3 algorithm can hold all 25 data words and all temporary variables in registers. C2 cannot optimize as well because many registers are reserved for internal use: rscratch1, rscratch2, rmethod, rthread, etc. With the data in local variables, the number of excess loads/stores is much smaller and the performance is much better.
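To illustrate the idea on a small scale, here is a minimal sketch of the Keccak chi step applied to one row of five lanes, written once over the array and once with the loop unrolled into local variables. The class and method names (ChiLocalsSketch, chiRowArray, chiRowLocals, agree) are hypothetical and not taken from the patch, and the actual change keeps the whole 25-lane state in locals across the round, not just one row.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; not the code from this PR.
final class ChiLocalsSketch {
    static final int DM = 5; // lanes per row of the 5x5 Keccak state

    // Array-based chi on the row starting at index y:
    // every operand is read from, and every result written to, an array.
    static void chiRowArray(long[] a, int y) {
        long[] t = new long[DM];
        for (int x = 0; x < DM; x++) {
            t[x] = a[y + x] ^ (~a[y + (x + 1) % DM] & a[y + (x + 2) % DM]);
        }
        System.arraycopy(t, 0, a, y, DM);
    }

    // Same row with the loop unrolled by hand and the lanes held in locals:
    // five loads and five stores, the rest may stay in registers.
    static void chiRowLocals(long[] a, int y) {
        long a0 = a[y], a1 = a[y + 1], a2 = a[y + 2], a3 = a[y + 3], a4 = a[y + 4];
        a[y]     = a0 ^ (~a1 & a2);
        a[y + 1] = a1 ^ (~a2 & a3);
        a[y + 2] = a2 ^ (~a3 & a4);
        a[y + 3] = a3 ^ (~a4 & a0);
        a[y + 4] = a4 ^ (~a0 & a1);
    }

    // Runs both variants over all five rows of a random state and compares.
    static boolean agree(long seed) {
        long[] s1 = new Random(seed).longs(DM * DM).toArray();
        long[] s2 = s1.clone();
        for (int y = 0; y < DM * DM; y += DM) {
            chiRowArray(s1, y);
            chiRowLocals(s2, y);
        }
        return Arrays.equals(s1, s2);
    }

    public static void main(String[] args) {
        System.out.println("variants agree: " + agree(42L));
    }
}
```

Whether the locals actually end up in registers depends on the JIT and the platform; the sketch only shows the source-level shape of the transformation.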
Option 2 (alternative)
LINK: the change
This is a more conservative change which minimizes code changes, but it gives a smaller performance improvement. Please let me know if you think this change is better than the main one: in that case I will replace the main one with it within this Pull Request.
With this change I unroll the inner loops manually and use the @ForceInline annotation. Manual unrolling is necessary because C2 does not recognize that there is a counted loop that can be completely unrolled. Instead of replacing the loop with five loop bodies, C2 splits the loop into pre-, main-, and post-loops, which is not good for this case. C2 works better when the array is created locally, but in our case the array is created on object instantiation, so C2 can't prove the array length is constant. The second issue affecting performance is that methods with unrolled loops get larger and C2 does not inline them. This is addressed here by using the @ForceInline annotation.
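As a rough sketch of what manual unrolling looks like, the Keccak theta step below is written once with loops and once with the five-iteration loops unrolled by hand. The names (ThetaUnrollSketch, thetaLooped, thetaUnrolled) are hypothetical and this is not the code from the patch. The @ForceInline annotation is JDK-internal (jdk.internal.vm.annotation.ForceInline), so it appears only in a comment to keep the sketch compilable on its own.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; not the code from this PR.
final class ThetaUnrollSketch {
    static final int DM = 5; // dimension of the 5x5 lane matrix, state index = DM*y + x

    // Loop form: short counted loops over x and y.
    static void thetaLooped(long[] a) {
        long[] c = new long[DM];
        for (int x = 0; x < DM; x++) {
            c[x] = a[x] ^ a[DM + x] ^ a[2 * DM + x] ^ a[3 * DM + x] ^ a[4 * DM + x];
        }
        for (int x = 0; x < DM; x++) {
            long d = c[(x + 4) % DM] ^ Long.rotateLeft(c[(x + 1) % DM], 1);
            for (int y = 0; y < DM * DM; y += DM) {
                a[y + x] ^= d;
            }
        }
    }

    // Hand-unrolled form: constant indices, no temporary array.
    // In JDK-internal code this method would carry @ForceInline (assumption:
    // omitted here because the annotation is not accessible to user code).
    static void thetaUnrolled(long[] a) {
        long c0 = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20];
        long c1 = a[1] ^ a[6] ^ a[11] ^ a[16] ^ a[21];
        long c2 = a[2] ^ a[7] ^ a[12] ^ a[17] ^ a[22];
        long c3 = a[3] ^ a[8] ^ a[13] ^ a[18] ^ a[23];
        long c4 = a[4] ^ a[9] ^ a[14] ^ a[19] ^ a[24];
        long d0 = c4 ^ Long.rotateLeft(c1, 1);
        long d1 = c0 ^ Long.rotateLeft(c2, 1);
        long d2 = c1 ^ Long.rotateLeft(c3, 1);
        long d3 = c2 ^ Long.rotateLeft(c4, 1);
        long d4 = c3 ^ Long.rotateLeft(c0, 1);
        for (int y = 0; y < DM * DM; y += DM) {
            a[y] ^= d0;
            a[y + 1] ^= d1;
            a[y + 2] ^= d2;
            a[y + 3] ^= d3;
            a[y + 4] ^= d4;
        }
    }

    public static void main(String[] args) {
        long[] s1 = new Random(7L).longs(DM * DM).toArray();
        long[] s2 = s1.clone();
        thetaLooped(s1);
        thetaUnrolled(s2);
        System.out.println("variants agree: " + Arrays.equals(s1, s2));
    }
}
```

Both variants compute the same result; the difference is only in how much work is left for the compiler to discover.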
Benchmark results
We measured the change on four platforms: ARM32, PPC, AArch64 (with no advanced SHA-3 instructions), and AMD64, using the MessageDigests benchmark. Here are the results: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
Testing
Tested with JTREG and the SHA-3 Project (NIST) test vectors: the SHA3 implementation was run over the same vectors before and after the change, checking that the results are the same.
The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
The test vectors compiled from the Final Algorithm Package:
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt
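A known-answer check of this kind can be sketched with the standard MessageDigest API. This is not the RunKat.java tool linked above; the class and method names (Sha3KatSketch, sha3Hex, check) are hypothetical, and the two vectors are the widely published SHA3-256 digests of the empty message and of "abc", not entries copied from the MsgKAT files.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the RunKat.java tool.
final class Sha3KatSketch {
    // Hashes msg with the named algorithm and returns lowercase hex.
    static String sha3Hex(String algorithm, byte[] msg) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(msg);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(algorithm + " is not available", e);
        }
    }

    // Compares the computed digest with the expected known answer.
    static boolean check(String algorithm, byte[] msg, String expectedHex) {
        return sha3Hex(algorithm, msg).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        boolean empty = check("SHA3-256", new byte[0],
            "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a");
        boolean abc = check("SHA3-256", "abc".getBytes(StandardCharsets.US_ASCII),
            "3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532");
        System.out.println("empty message: " + empty + ", \"abc\": " + abc);
    }
}
```

Running the same vectors against builds before and after the change, and requiring every check to pass in both, gives the before/after comparison described above.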
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847
$ git checkout pull/6847
Update a local copy of the PR:
$ git checkout pull/6847
$ git pull https://git.openjdk.java.net/jdk pull/6847/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 6847
View PR using the GUI difftool:
$ git pr show -t 6847
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6847.diff