@dchuyko
Member

@dchuyko dchuyko commented Mar 26, 2025

This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the algorithm of the Java implementation but eagerly uses all available registers; for example, FP and R18 are used when that is allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than the C2-compiled version; on Graviton 3 it is 8-14% faster than the C2-compiled version (which is itself faster than the current intrinsic); on Apple Silicon it is faster than the C2-compiled version but slower than the ARMv8.2-SHA intrinsic. The improvement on a particular CPU depends on the input length. For instance, for Graviton 2:

G2
Benchmark                    (digesterName)  (length)     Pct
MessageDigests.digest              SHA3-256        64  28.28%
MessageDigests.digest              SHA3-256     16384  53.58%
MessageDigests.digest              SHA3-512        64  27.97%
MessageDigests.digest              SHA3-512     16384  43.90%
MessageDigests.getAndDigest        SHA3-256        64  26.18%
MessageDigests.getAndDigest        SHA3-256     16384  52.82%
MessageDigests.getAndDigest        SHA3-512        64  24.73%
MessageDigests.getAndDigest        SHA3-512     16384  44.31%

(results for intermediate input lengths look like steps)
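
One plausible reading of the step-like shape: SHA3 is a sponge, so the work is dominated by the number of Keccak-f[1600] permutations, and that count grows by one each time the input length crosses a multiple of the rate (136 bytes for SHA3-256, 72 bytes for SHA3-512). The following is a minimal Java sketch of that arithmetic. It is not the JMH benchmark used above, the class and method names are hypothetical, and only the standard `java.security.MessageDigest` API is assumed.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha3BlockCount {
    // Sponge rate in bytes for SHA3-<digestBits>: (1600 - 2 * digestBits) / 8.
    static int rateBytes(int digestBits) {
        return (1600 - 2 * digestBits) / 8;
    }

    // Keccak-f[1600] permutations needed to absorb `length` input bytes.
    // Padding always appends at least one byte, hence the "+ 1".
    static int permutations(int digestBits, int length) {
        return length / rateBytes(digestBits) + 1;
    }

    // One-shot digest of the input, using the JDK provider for the named algorithm.
    static byte[] digest(String algorithm, byte[] input) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).digest(input);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        for (int bits : new int[] {256, 512}) {
            for (int length : new int[] {64, 16384}) {
                byte[] out = digest("SHA3-" + bits, new byte[length]);
                System.out.printf("SHA3-%d length=%d permutations=%d digestBytes=%d%n",
                        bits, length, permutations(bits, length), out.length);
            }
        }
    }
}
```

Under these assumptions a 64-byte input costs a single permutation for both digests, while 16384 bytes cost 121 permutations for SHA3-256 and 228 for SHA3-512, so fixed per-call overhead matters much more at the short length.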

On Graviton 4 there is still a noticeable difference between the proposed implementation and C2 generated code:

G4
Benchmark                    (digesterName)  (length)    Pct
MessageDigests.digest              SHA3-256        64   8.3%
MessageDigests.digest              SHA3-256     16384    11%
MessageDigests.digest              SHA3-512        64   8.4%
MessageDigests.digest              SHA3-512     16384  11.5%
MessageDigests.getAndDigest        SHA3-256        64   7.2%
MessageDigests.getAndDigest        SHA3-256     16384    11%
MessageDigests.getAndDigest        SHA3-512        64   7.3%
MessageDigests.getAndDigest        SHA3-512     16384  11.6%

and the version that uses the SHA3 extension is ~1.8x slower than C2.

The existing intrinsic implementation is put under the flag UseSIMDForSHA3Intrinsic, which is on by default wherever the intrinsic is currently enabled.

Sanity tests were modified to cover the new intrinsic variants (-XX:-UseSIMDForSHA3Intrinsic -XX:+-PreserveFramePointer) on AArch64 hardware. Existing test cases where the intrinsic is enabled are executed with -XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic; on platforms where the SHA3 extension is missing they are still cut off by the isSHA3IntrinsicAvailable() predicate.

The original PR #20422 has been auto-closed and the branch has been re-created on top of the new master.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8337666: AArch64: SHA3 GPR intrinsic (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24260/head:pull/24260
$ git checkout pull/24260

Update a local copy of the PR:
$ git checkout pull/24260
$ git pull https://git.openjdk.org/jdk.git pull/24260/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24260

View PR using the GUI difftool:
$ git pr show -t 24260

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24260.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 26, 2025

👋 Welcome back dchuyko! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 26, 2025

@dchuyko This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8337666: AArch64: SHA3 GPR intrinsic

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title from "JDK-8337666" to "8337666: AArch64: SHA3 GPR intrinsic" Mar 26, 2025
@openjdk openjdk bot added the rfr (Pull request is ready for review) label Mar 26, 2025
@openjdk

openjdk bot commented Mar 26, 2025

@dchuyko The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot (hotspot-dev@openjdk.org) label Mar 26, 2025
@mlbridge

mlbridge bot commented Mar 26, 2025

Webrevs

@bridgekeeper

bridgekeeper bot commented Apr 23, 2025

@dchuyko This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@theRealAph
Contributor

Thanks. I think we need a bit more information.

> On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version

RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it?

The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware. It would also be nice for the automation to choose the fastest, for any given hardware.

@dchuyko
Member Author

dchuyko commented Apr 24, 2025

> RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it?

It's a Surface Pro X, SQ1, Cortex-A76 / A55 (with the performance profile and power supplied, and according to the numbers, it should be running on the A76), 5 years old.

> The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware.

Do you mean OpenJDK implementations only, or OpenSSL as well (it uses the same approaches, but the exact numbers are different)?

> It would also be nice for the automation to choose the fastest, for any given hardware.

As of now, within OpenJDK the JDK-8252204 variant is the fastest on Apple Silicon; this variant is second on Apple Silicon and the fastest on everything else tested.
There is also a notable double-Keccak improvement in JDK-8348561, but that's not for SHA3.

@theRealAph
Contributor

So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?

@dchuyko
Member Author

dchuyko commented Apr 24, 2025

> So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?

I'd not call Graviton 4,3,2 retro.

@theRealAph
Contributor

>> So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?

> I'd not call Graviton 4,3,2 retro.

I'm trying to understand what you wrote, which is very confusing. Reading it again, on Graviton 3 it is 8-14% faster than the existing fastest implementation.

I don't think we should maintain multiple implementations of SHA-3 unless there is a convincing advantage one way or the other. I certainly don't want to see a precedent where we have special versions for crypto algorithms for each microarchitecture. Is 8252204 much faster than this one on Apple silicon? It would be great if we could ditch that one.

@dchuyko
Member Author

dchuyko commented Apr 24, 2025

>>> So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?

>> I'd not call Graviton 4,3,2 retro.

> I'm trying to understand what you wrote, which is very confusing. Reading it again, on Graviton 3 it is 8-14% faster than the existing fastest implementation.

Correct. And for the newest Graviton 4 the hope was to see either no difference between this version and C2-generated code, or 8252204 being faster than C2-generated code. However, on Graviton 4 this version is still 7-12% faster than C2-generated code, which is still faster than 8252204.

> I don't think we should maintain multiple implementations of SHA-3 unless there is a convincing advantage one way or the other. I certainly don't want to see a precedent where we have special versions for crypto algorithms for each microarchitecture. Is 8252204 much faster than this one on Apple silicon? It would be great if we could ditch that one.

Even on M1 8252204 is 28-32% faster than this one. They seem to have 4 execution blocks per core for the accelerator instructions (unlike servers that may provide just 1 unit).

It would be great if C2 could allocate scratch registers in such methods but that would complicate the entire port.

@theRealAph
Contributor

> Even on M1 8252204 is 28-32% faster than this one. They seem to have 4 execution blocks per core for the accelerator instructions (unlike servers that may provide just 1 unit).

> It would be great if C2 could allocate scratch registers in such methods but that would complicate the entire port.

To begin with, please isolate keccak_round() in its own function, to make it more similar to the other implementation.

Is it possible to define GPR macro-instructions for instructions like eor3 and rax1? This would make it a lot easier to understand what is going on, thereby making maintenance easier.
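
For readers unfamiliar with these instructions, here is a minimal Java sketch of what GPR equivalents of `eor3` and `rax1` compute and how they map onto the Keccak Theta step. It illustrates the semantics only and is not the MacroAssembler code from this PR; the class name, method names and the `a[x + 5 * y]` state layout are assumptions made for the example.

```java
import java.util.Arrays;

public class KeccakThetaSketch {
    // Scalar equivalent of EOR3: three-way exclusive OR.
    static long eor3(long a, long b, long c) {
        return a ^ b ^ c;
    }

    // Scalar equivalent of RAX1: a XOR (b rotated left by one bit).
    static long rax1(long a, long b) {
        return a ^ Long.rotateLeft(b, 1);
    }

    // Theta step on a 25-lane state stored as a[x + 5 * y].
    static void theta(long[] a) {
        long[] c = new long[5];
        for (int x = 0; x < 5; x++) {
            // Column parity: two chained eor3 operations cover five lanes.
            c[x] = eor3(eor3(a[x], a[x + 5], a[x + 10]), a[x + 15], a[x + 20]);
        }
        for (int x = 0; x < 5; x++) {
            // D[x] = C[x - 1] ^ rol(C[x + 1], 1), indices taken modulo 5.
            long d = rax1(c[(x + 4) % 5], c[(x + 1) % 5]);
            for (int y = 0; y < 5; y++) {
                a[x + 5 * y] ^= d;
            }
        }
    }

    public static void main(String[] args) {
        long[] state = new long[25];
        state[0] = 1L;
        theta(state);
        System.out.println(Arrays.toString(state));
    }
}
```

Naming the two operations keeps the Theta code close to its textbook form, which is the readability argument being made in the comment above.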

#ifndef R18_RESERVED
can_use_r18 = true;
#endif
bool can_use_fp = !PreserveFramePointer;

Contributor

Can't we always use fp in a leaf function?

Member Author

I'm still not sure about compatibility with external tools like perf, which was the cause of the entire +PreserveFP story.

Contributor

@theRealAph theRealAph left a comment

OK, I'm convinced. We really do need both of these intrinsics.

@openjdk

openjdk bot commented May 14, 2025

@dchuyko this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8337666
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict (Pull request has merge conflict with target branch) label May 14, 2025
@dchuyko
Member Author

dchuyko commented May 15, 2025

> OK, I'm convinced. We really do need both of these intrinsics.

Thanks, Andrew. I'm working on your previous comments (in background mode, but still). A short update on that: macros covering only a few instructions may make the code more confusing (in terms of the in-mind masm->"C" mapping) and may also wipe out a couple of tiny code micro-arrangements; a macro for the main loop and the other corrections make sense.

@openjdk openjdk bot removed the merge-conflict (Pull request has merge conflict with target branch) label May 30, 2025
@dchuyko
Member Author

dchuyko commented May 30, 2025

GPR rol, rax and rax1 pseudo-instructions were added to MacroAssembler.

Main loop and "bcax"/Chi parts were extracted as functions.
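
As a reference for what the extracted "bcax"/Chi part computes, here is a minimal Java sketch of the Chi step written in terms of a scalar `bcax`. It is an illustration under assumptions, not the stub generator code from this PR: the class name, method names and the `a[x + 5 * y]` state layout are hypothetical.

```java
import java.util.Arrays;

public class KeccakChiSketch {
    // Scalar equivalent of BCAX: a XOR (b AND NOT c).
    static long bcax(long a, long b, long c) {
        return a ^ (b & ~c);
    }

    // Chi step on a 25-lane state stored as a[x + 5 * y], one row at a time:
    // a[x] = a[x] ^ (~a[x + 1] & a[x + 2]), indices taken modulo 5.
    static void chi(long[] a) {
        for (int row = 0; row < 25; row += 5) {
            // Read the whole row first so every output uses the old values.
            long a0 = a[row];
            long a1 = a[row + 1];
            long a2 = a[row + 2];
            long a3 = a[row + 3];
            long a4 = a[row + 4];
            a[row]     = bcax(a0, a2, a1);
            a[row + 1] = bcax(a1, a3, a2);
            a[row + 2] = bcax(a2, a4, a3);
            a[row + 3] = bcax(a3, a0, a4);
            a[row + 4] = bcax(a4, a1, a0);
        }
    }

    public static void main(String[] args) {
        long[] state = new long[25];
        state[0] = 1L;
        state[2] = 1L;
        chi(state);
        System.out.println(Arrays.toString(Arrays.copyOf(state, 5)));
    }
}
```

Each row needs all five of its old lanes live at once, which hints at why the number of free scratch registers matters for a GPR implementation.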

Main loop counter was put in fp register with fp decrement and fcmp (this variant does have a positive impact).

Updated results from Graviton machines (Linux, intrinsic vs C2):

Benchmark              (digesterName)  (length)     Pct
G2
MessageDigests.digest        SHA3-256        64  +20.8%
MessageDigests.digest        SHA3-256     16384  +27.2%
G3
MessageDigests.digest        SHA3-256        64  +12.8%
MessageDigests.digest        SHA3-256     16384  +15.7%
G4
MessageDigests.digest        SHA3-256        64   +9.7%
MessageDigests.digest        SHA3-256     16384  +13.2%

dchuyko and others added 4 commits June 1, 2025 00:37
@eme64
Contributor

eme64 commented Jun 4, 2025

@dchuyko Thanks for working on this! I have quickly scanned the code, and it looks reasonable, though I am not an intrinsics specialist. I'll now run some internal testing, feel free to ping me again in 24h.

@eme64
Contributor

eme64 commented Jun 4, 2025

A nit: can you please fix the alignment issue in the PR description's benchmark results?

@dchuyko
Member Author

dchuyko commented Jun 5, 2025

> @dchuyko Thanks for working on this! I have quickly scanned the code, and it looks reasonable, though I am not an intrinsics specialist. I'll now run some internal testing, feel free to ping me again in 24h.

@eme64 thanks, are the results ok?

@dchuyko dchuyko requested a review from theRealAph June 5, 2025 09:44
Contributor

@theRealAph theRealAph left a comment

OK, thanks.

@openjdk openjdk bot added the ready (Pull request is ready to be integrated) label Jun 5, 2025
@dchuyko
Member Author

dchuyko commented Jun 5, 2025

/integrate

@openjdk

openjdk bot commented Jun 5, 2025

Going to push as commit 23f1d4f.
Since your change was applied there has been 1 commit pushed to the master branch:

  • 33ed7c1: 8358689: test/micro/org/openjdk/bench/java/net/SocketEventOverhead.java does not build after JDK-8351594

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated (Pull request has been integrated) label Jun 5, 2025
@openjdk openjdk bot closed this Jun 5, 2025
@openjdk openjdk bot removed the ready (Pull request is ready to be integrated) and rfr (Pull request is ready for review) labels Jun 5, 2025
@openjdk

openjdk bot commented Jun 5, 2025

@dchuyko Pushed as commit 23f1d4f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels

hotspot (hotspot-dev@openjdk.org), integrated (Pull request has been integrated)

3 participants