-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8337666: AArch64: SHA3 GPR intrinsic #24260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back dchuyko! A progress list of the required criteria for merging this PR into |
|
@dchuyko This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the ➡️ To integrate this PR with the above commit message to the |
|
@dchuyko This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
|
Thanks. I think we need a bit more information.
RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it? The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware. It would also be nice for the automation to choose the fastest, for any given hardware. |
It's Surface Pro X, SQ1, Cortex-A76 / A55 (with the performance profile and power supplied and according to the numbers it should be A76), 5 years old.
Do you mean OpenJDK implementations only, or OpenSSL as well (it uses same approaches but exact numbers are different)?
As for now, in OpenJDK JDK-8252204 variant is the fastest on Apple Silicon, and this variant is second on Apple Silicon and the fastest elsewhere tested. |
|
So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers? |
I'd not call Graviton 4,3,2 retro. |
I'm trying to understand what you wrote, which is very confusing. Reading it again, on Graviton 3 it is 8-14% faster than the existing fastest implementation. I don't think we should maintain multiple implementations of SHA-3 unless there is a convincing advantage one way or the other. I certainly don't want to see a precedent where we have special versions for crypto algorithms for each microarchitecture. Is 8252204 much faster that this one on Apple silicon? It would be great if we could ditch that one. |
Correct. And for newest Graviton 4 there was hope to see either no difference between this version and C2 generated code or to see 8252204 being faster than C2 generated code. However, on Graviton 4 this version is still 7-12% faster than C2 generated code, which is still faster than 8252204.
Even on M1 8252204 is 28-32% faster than this one. They seem to have 4 execution blocks per core for the accelerator instructions (unlike servers that may provide just 1 unit). It would be great if C2 could allocate scratch registers in such methods but that would complicate the entire port. |
To begin with, please isolate Is it possible to define GPR macro-instructions for instructions like |
| #ifndef R18_RESERVED | ||
| can_use_r18 = true; | ||
| #endif | ||
| bool can_use_fp = !PreserveFramePointer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can's we always use fp in a leaf function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure about external tools compatibility, like perf which was the cause of the entire +PreserveFP story.
theRealAph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm convinced. We really do need both of these intrinsics.
|
@dchuyko this pull request can not be integrated into git checkout JDK-8337666
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
Thanks, Andrew. I'm working on your previous comments (bg mode but still). Just a short update on that: macros per few instructions may make it more confusing (in terms of in-mind masm->"C" mapping) and also may wipe out a couple of tiny code micro-arrangements; macro for the main loop and other corrections make sense. |
|
GPR rol, rax and rax1 pseudo instructions were added in MacroAssembler. Main loop and "bcax"/Chi parts were extracted as functions. Main loop counter was put in fp register with fp decrement and fcmp (this variant does have a positive impact). Updated results from Graviton machines (Linux, intrinsic vs C2): |
Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>
Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>
|
@dchuyko Thanks for working on this! I have quickly scanned the code, and it looks reasonable, though I am not an intrinsics specialist. I'll not run some internal testing, feel free to ping me again in 24h. |
|
A nit: can you please fix the alignment issue in the PR description's benchmark results? |
theRealAph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks.
|
/integrate |
This is an implementation of SHA3 intrinsics for AArch64 that operates GPRs. It follows the Java implementation algorithm but eagerly uses available registers. For example, FP+R18 are used when it's allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version; on Graviton 3 it is 8-14% faster than C2 compiled version (which is faster than the current intrinsic); on Apple Silicon it is faster than C2 compiled version but slower than the ARMv8.2-SHA intrinsic. Improvements on a particular CPU depend on the input length. For instance, for Graviton 2:
(results for intermediate input lengths look like steps)
On Graviton 4 there is still a noticeable difference between the proposed implementation and C2 generated code:
and the version that uses the extension is ~1.8x slower than C2
Existing intrinsic implementation is put under a flag
UseSIMDForSHA3Intrinsicwhich is on by default where the intrinsic is enabled currently.Sanity tests were modified to cover new intrinsic variants (
-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer) on aarch64 hw. Existing test cases where intrinsic is enabled are executed with-XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic, on platforms where the sha3 extension is missing they still are cut off by isSHA3IntrinsicAvailable() predicate.The original PR #20422 has been auto-closed and the branch has been re-created on top of the new master.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24260/head:pull/24260$ git checkout pull/24260Update a local copy of the PR:
$ git checkout pull/24260$ git pull https://git.openjdk.org/jdk.git pull/24260/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24260View PR using the GUI difftool:
$ git pr show -t 24260Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24260.diff
Using Webrev
Link to Webrev Comment