8337666: AArch64: SHA3 GPR intrinsic #24260

dchuyko · 2025-03-26T15:55:59Z

This is an implementation of SHA3 intrinsics for AArch64 that operates on GPRs. It follows the Java implementation algorithm but eagerly uses available registers. For example, FP and R18 are used when that is allowed. On simpler cores like RPi3 or Surface Pro it is 23-53% faster than the C2 compiled version; on Graviton 3 it is 8-14% faster than the C2 compiled version (which is itself faster than the current intrinsic); on Apple Silicon it is faster than the C2 compiled version but slower than the ARMv8.2-SHA intrinsic. Improvements on a particular CPU depend on the input length. For instance, for Graviton 2:

G2
Benchmark                    (digesterName)  (length)        Pct
MessageDigests.digest              SHA3-256        64     28.28%
MessageDigests.digest              SHA3-256     16384     53.58%
MessageDigests.digest              SHA3-512        64     27.97%
MessageDigests.digest              SHA3-512     16384     43.90%
MessageDigests.getAndDigest        SHA3-256        64     26.18%
MessageDigests.getAndDigest        SHA3-256     16384     52.82%
MessageDigests.getAndDigest        SHA3-512        64     24.73%
MessageDigests.getAndDigest        SHA3-512     16384     44.31%

(results for intermediate input lengths look like steps)
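One plausible reading of the step-like shape is that SHA3 absorbs input in fixed-size blocks, so the number of Keccak permutations only changes when the length crosses a block boundary. The sketch below is not taken from the PR; the class and method names are hypothetical, and the rate formula is the one from FIPS 202.

```java
class Sha3Blocks {
    // Rate in bytes for SHA3-<digestBits>: (1600 - 2 * digestBits) / 8 (FIPS 202).
    static int rateBytes(int digestBits) {
        return (1600 - 2 * digestBits) / 8;
    }

    // Padding always appends at least one byte, so an input that is an exact
    // multiple of the rate still needs one extra block.
    static int permutations(int digestBits, int length) {
        return length / rateBytes(digestBits) + 1;
    }

    public static void main(String[] args) {
        int[] lengths = {64, 135, 136, 271, 272, 16384};
        for (int bits : new int[] {256, 512}) {
            for (int len : lengths) {
                System.out.printf("SHA3-%d length=%5d rate=%3d permutations=%d%n",
                        bits, len, rateBytes(bits), permutations(bits, len));
            }
        }
    }
}
```

Under this model, every length from 0 to 135 bytes costs SHA3-256 the same single permutation, which would show up as a flat step in a per-length benchmark.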

On Graviton 4 there is still a noticeable difference between the proposed implementation and C2 generated code:

G4
Benchmark                    (digesterName)  (length)        Pct
MessageDigests.digest              SHA3-256        64       8.3%
MessageDigests.digest              SHA3-256     16384        11%
MessageDigests.digest              SHA3-512        64       8.4%
MessageDigests.digest              SHA3-512     16384      11.5%
MessageDigests.getAndDigest        SHA3-256        64       7.2%
MessageDigests.getAndDigest        SHA3-256     16384        11%
MessageDigests.getAndDigest        SHA3-512        64       7.3%
MessageDigests.getAndDigest        SHA3-512     16384      11.6%

On Graviton 4, the version that uses the SHA3 extension is ~1.8x slower than C2.

The existing intrinsic implementation is put under the flag UseSIMDForSHA3Intrinsic, which is on by default wherever the intrinsic is currently enabled.

Sanity tests were modified to cover the new intrinsic variants (-XX:-UseSIMDForSHA3Intrinsic with both -XX:+PreserveFramePointer and -XX:-PreserveFramePointer) on AArch64 hardware. Existing test cases where the intrinsic is enabled are executed with -XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic; on platforms where the SHA3 extension is missing they are still cut off by the isSHA3IntrinsicAvailable() predicate.
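For a quick manual check outside the test suite, something like the following could be run under each flag combination. It is only a sketch: the class name and iteration count are hypothetical, and whether the loop is long enough for the stub to be used on a given machine is an assumption. It uses only the standard java.security.MessageDigest API and the well-known SHA3-256 digest of the empty message.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha3FlagCheck {
    // Known SHA3-256 digest of the empty message.
    static final String EMPTY_SHA3_256 =
            "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a";

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    static String sha3_256Hex(byte[] input) {
        try {
            return hex(MessageDigest.getInstance("SHA3-256").digest(input));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA3-256 is not available in this JDK", e);
        }
    }

    // Possible invocations, using the flags named in the PR description:
    //   java -XX:+IgnoreUnrecognizedVMOptions -XX:+UseSIMDForSHA3Intrinsic Sha3FlagCheck
    //   java -XX:+IgnoreUnrecognizedVMOptions -XX:-UseSIMDForSHA3Intrinsic -XX:+PreserveFramePointer Sha3FlagCheck
    //   java -XX:+IgnoreUnrecognizedVMOptions -XX:-UseSIMDForSHA3Intrinsic -XX:-PreserveFramePointer Sha3FlagCheck
    public static void main(String[] args) {
        byte[] empty = new byte[0];
        // Repeat so that the digest code gets hot; the count is a guess.
        for (int i = 0; i < 50_000; i++) {
            if (!sha3_256Hex(empty).equals(EMPTY_SHA3_256)) {
                throw new IllegalStateException("digest mismatch at iteration " + i);
            }
        }
        System.out.println("SHA3-256 digest stayed correct");
    }
}
```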

The original PR #20422 has been auto-closed and the branch has been re-created on top of the new master.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8337666: AArch64: SHA3 GPR intrinsic (Enhancement - P4)

Reviewers

Andrew Haley (@theRealAph - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24260/head:pull/24260
$ git checkout pull/24260

Update a local copy of the PR:
$ git checkout pull/24260
$ git pull https://git.openjdk.org/jdk.git pull/24260/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24260

View PR using the GUI difftool:
$ git pr show -t 24260

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24260.diff


bridgekeeper · 2025-03-26T15:57:03Z

👋 Welcome back dchuyko! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-03-26T15:58:27Z

@dchuyko This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8337666: AArch64: SHA3 GPR intrinsic

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-03-26T15:59:21Z

@dchuyko The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-03-26T16:02:31Z

Webrevs

bridgekeeper · 2025-04-23T21:13:02Z

@dchuyko This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

theRealAph · 2025-04-24T08:19:53Z

Thanks. I think we need a bit more information.

> On simpler cores like RPi3 or Surface Pro it is 23-53% faster than C2 compiled version

RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it?

The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware. It would also be nice for the automation to choose the fastest, for any given hardware.

dchuyko · 2025-04-24T13:20:01Z

> RPi3 is Cortex A53. It is ten years old. Which Surface Pro model are you referring to? How old is it?

It's Surface Pro X, SQ1, Cortex-A76 / A55 (with the performance profile and power supplied and according to the numbers it should be A76), 5 years old.

> The comparison I'd like to see is the difference between this code and the fastest existing version of SHA3, for any given hardware.

Do you mean OpenJDK implementations only, or OpenSSL as well (it uses same approaches but exact numbers are different)?

> It would also be nice for the automation to choose the fastest, for any given hardware.

As of now, within OpenJDK the JDK-8252204 variant is the fastest on Apple Silicon, while this variant is second on Apple Silicon and the fastest on everything else tested.
There is also a notable double Keccak improvement in JDK-8348561, but that is not for SHA3.

theRealAph · 2025-04-24T14:28:39Z

So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?

dchuyko · 2025-04-24T14:33:12Z

> So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?

I'd not call Graviton 4,3,2 retro.

theRealAph · 2025-04-24T17:05:59Z

> > So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?
>
> I'd not call Graviton 4,3,2 retro.

I'm trying to understand what you wrote, which is very confusing. Reading it again, on Graviton 3 it is 8-14% faster than the existing fastest implementation.

I don't think we should maintain multiple implementations of SHA-3 unless there is a convincing advantage one way or the other. I certainly don't want to see a precedent where we have special versions for crypto algorithms for each microarchitecture. Is 8252204 much faster than this one on Apple silicon? It would be great if we could ditch that one.

dchuyko · 2025-04-24T20:23:54Z

> > > So, this is for two now-discontinued computers? Does this patch improve performance on any recently-available hardware, or is it purely for retrocomputers?
> >
> > I'd not call Graviton 4,3,2 retro.
>
> I'm trying to understand what you wrote, which is very confusing. Reading it again, on Graviton 3 it is 8-14% faster than the existing fastest implementation.

Correct. And for newest Graviton 4 there was hope to see either no difference between this version and C2 generated code or to see 8252204 being faster than C2 generated code. However, on Graviton 4 this version is still 7-12% faster than C2 generated code, which is still faster than 8252204.

> I don't think we should maintain multiple implementations of SHA-3 unless there is a convincing advantage one way or the other. I certainly don't want to see a precedent where we have special versions for crypto algorithms for each microarchitecture. Is 8252204 much faster than this one on Apple silicon? It would be great if we could ditch that one.

Even on M1 8252204 is 28-32% faster than this one. They seem to have 4 execution blocks per core for the accelerator instructions (unlike servers that may provide just 1 unit).

It would be great if C2 could allocate scratch registers in such methods but that would complicate the entire port.

theRealAph · 2025-04-25T08:24:35Z

> Even on M1 8252204 is 28-32% faster than this one. They seem to have 4 execution blocks per core for the accelerator instructions (unlike servers that may provide just 1 unit).
>
> It would be great if C2 could allocate scratch registers in such methods but that would complicate the entire port.

To begin with, please isolate keccak_round() in its own function, to make it more similar to the other implementation.

Is it possible to define GPR macro-instructions for instructions like eor3 and rax1? This would make it a lot easier to understand what is going on, thereby making maintenance easier.
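To make the request concrete, here is a scalar sketch of one Keccak-f[1600] round on 64-bit lanes, with eor3 and rax1 written as small helpers that mirror what the SIMD instructions of the same name compute. It is an illustration only, not the stub code from this PR: the class, method names and array-based state layout are hypothetical, and the single-block SHA3-256 wrapper exists just so the permutation can be checked against the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class KeccakGprSketch {
    static final long[] RC = {
        0x0000000000000001L, 0x0000000000008082L, 0x800000000000808aL, 0x8000000080008000L,
        0x000000000000808bL, 0x0000000080000001L, 0x8000000080008081L, 0x8000000000008009L,
        0x000000000000008aL, 0x0000000000000088L, 0x0000000080008009L, 0x000000008000000aL,
        0x000000008000808bL, 0x800000000000008bL, 0x8000000000008089L, 0x8000000000008003L,
        0x8000000000008002L, 0x8000000000000080L, 0x000000000000800aL, 0x800000008000000aL,
        0x8000000080008081L, 0x8000000000008080L, 0x0000000080000001L, 0x8000000080008008L};
    static final int[] ROT = {1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
                              27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44};
    static final int[] PI = {10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
                             15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1};

    // Scalar stand-ins for the instructions named in the review comment.
    static long eor3(long a, long b, long c) { return a ^ b ^ c; }
    static long rax1(long a, long b) { return a ^ Long.rotateLeft(b, 1); }

    static void keccakRound(long[] st, long rc) {
        long[] c = new long[5];
        for (int x = 0; x < 5; x++) {                    // theta: column parities
            c[x] = eor3(eor3(st[x], st[x + 5], st[x + 10]), st[x + 15], st[x + 20]);
        }
        for (int x = 0; x < 5; x++) {                    // theta: mix into each column
            long d = rax1(c[(x + 4) % 5], c[(x + 1) % 5]);
            for (int y = 0; y < 25; y += 5) st[y + x] ^= d;
        }
        long t = st[1];                                  // rho and pi
        for (int i = 0; i < 24; i++) {
            long next = st[PI[i]];
            st[PI[i]] = Long.rotateLeft(t, ROT[i]);
            t = next;
        }
        for (int y = 0; y < 25; y += 5) {                // chi
            for (int x = 0; x < 5; x++) c[x] = st[y + x];
            for (int x = 0; x < 5; x++) st[y + x] = c[x] ^ (~c[(x + 1) % 5] & c[(x + 2) % 5]);
        }
        st[0] ^= rc;                                     // iota
    }

    // SHA3-256 for messages shorter than one 136-byte block (sketch only).
    static byte[] sha3_256SingleBlock(byte[] msg) {
        final int rate = 136;
        if (msg.length >= rate) throw new IllegalArgumentException("one block only");
        byte[] block = Arrays.copyOf(msg, rate);
        block[msg.length] ^= 0x06;                       // SHA3 domain bits + first pad bit
        block[rate - 1] ^= (byte) 0x80;                  // final pad bit
        long[] st = new long[25];
        ByteBuffer in = ByteBuffer.wrap(block).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < rate / 8; i++) st[i] ^= in.getLong();
        for (long rc : RC) keccakRound(st, rc);
        ByteBuffer out = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) out.putLong(st[i]);
        return out.array();
    }

    static boolean matchesJdk(byte[] msg) {
        try {
            return Arrays.equals(MessageDigest.getInstance("SHA3-256").digest(msg),
                                 sha3_256SingleBlock(msg));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesJdk("abc".getBytes(java.nio.charset.StandardCharsets.UTF_8)));
    }
}
```

With the round isolated like this, the theta step reads as a handful of eor3 and rax1 calls, which is the readability gain the comment is asking for.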

theRealAph · 2025-04-25T09:42:19Z

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp

+#ifndef R18_RESERVED
+    can_use_r18 = true;
+#endif
+    bool can_use_fp = !PreserveFramePointer;


Can't we always use fp in a leaf function?

I'm still not sure about compatibility with external tools such as perf, which was the reason for the entire +PreserveFP story.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp

theRealAph

OK, I'm convinced. We really do need both of these intrinsics.

openjdk · 2025-05-14T13:54:33Z

@dchuyko this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8337666
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

dchuyko · 2025-05-15T09:49:54Z

> OK, I'm convinced. We really do need both of these intrinsics.

Thanks, Andrew. I'm working on your previous comments (bg mode but still). Just a short update on that: macros per few instructions may make it more confusing (in terms of in-mind masm->"C" mapping) and also may wipe out a couple of tiny code micro-arrangements; macro for the main loop and other corrections make sense.

dchuyko · 2025-05-30T18:24:22Z

GPR rol, rax and rax1 pseudo instructions were added in MacroAssembler.

Main loop and "bcax"/Chi parts were extracted as functions.
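As a rough illustration of what such helpers compute, the sketch below models rol and a bcax-style operation on Java longs and applies them to one five-lane row for the Chi step. The names and the array layout are hypothetical and not the MacroAssembler code; the comments about ROR and BIC/EOR describe one possible scalar mapping, not necessarily the one used in the stub.

```java
import java.util.Random;

class ChiRowSketch {
    // A64 has a rotate-right immediate but no rotate-left immediate, so a
    // rotate left by n can be expressed as a rotate right by 64 - n.
    static long rol(long x, int n) {
        return Long.rotateRight(x, 64 - n);
    }

    // Same result as the SIMD BCAX instruction: n ^ (m & ~k).
    // On GPRs this could be a BIC followed by an EOR.
    static long bcax(long n, long m, long k) {
        return n ^ (m & ~k);
    }

    // Chi on lanes a[off..off+4], in place. Only the first two lanes have to
    // be saved, because later lanes are still unmodified when they are read.
    static void chiRow(long[] a, int off) {
        long a0 = a[off];
        long a1 = a[off + 1];
        a[off]     = bcax(a[off],     a[off + 2], a[off + 1]);
        a[off + 1] = bcax(a[off + 1], a[off + 3], a[off + 2]);
        a[off + 2] = bcax(a[off + 2], a[off + 4], a[off + 3]);
        a[off + 3] = bcax(a[off + 3], a0,         a[off + 4]);
        a[off + 4] = bcax(a[off + 4], a1,         a0);
    }

    // Straightforward reference: out[x] = in[x] ^ (~in[x+1] & in[x+2]).
    static long[] chiRowReference(long[] in) {
        long[] out = new long[5];
        for (int x = 0; x < 5; x++) {
            out[x] = in[x] ^ (~in[(x + 1) % 5] & in[(x + 2) % 5]);
        }
        return out;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        long[] row = new long[5];
        for (int i = 0; i < 5; i++) row[i] = rnd.nextLong();
        long[] expected = chiRowReference(row);
        chiRow(row, 0);
        System.out.println(java.util.Arrays.equals(expected, row));
    }
}
```

Keeping the row in place with two temporaries matters for a GPR version, since the 25 state lanes already take up most of the general-purpose registers.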

The main loop counter was moved into a floating-point register, decremented with floating-point arithmetic and tested with fcmp (this variant does have a positive impact).

Updated results from Graviton machines (Linux, intrinsic vs C2):

Benchmark              (digesterName)  (length)  Pct
G2
MessageDigests.digest        SHA3-256        64     +20.8%
MessageDigests.digest        SHA3-256     16384     +27.2%
G3
MessageDigests.digest        SHA3-256        64     +12.8%
MessageDigests.digest        SHA3-256     16384     +15.7%
G4
MessageDigests.digest        SHA3-256        64     +9.7%
MessageDigests.digest        SHA3-256     16384     +13.2%

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp

Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>

eme64 · 2025-06-04T08:20:16Z

@dchuyko Thanks for working on this! I have quickly scanned the code, and it looks reasonable, though I am not an intrinsics specialist. I'll now run some internal testing, feel free to ping me again in 24h.

eme64 · 2025-06-04T08:21:36Z

A nit: can you please fix the alignment issue in the PR description's benchmark results?

dchuyko · 2025-06-05T09:44:12Z

> @dchuyko Thanks for working on this! I have quickly scanned the code, and it looks reasonable, though I am not an intrinsics specialist. I'll now run some internal testing, feel free to ping me again in 24h.

@eme64 thanks, are the results ok?

theRealAph

OK, thanks.

dchuyko · 2025-06-05T14:27:36Z

/integrate

openjdk · 2025-06-05T14:28:30Z

Going to push as commit 23f1d4f.
Since your change was applied there has been 1 commit pushed to the master branch:

33ed7c1: 8358689: test/micro/org/openjdk/bench/java/net/SocketEventOverhead.java does not build after JDK-8351594

Your commit was automatically rebased without conflicts.

openjdk · 2025-06-05T14:28:41Z

@dchuyko Pushed as commit 23f1d4f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

dchuyko added 2 commits March 26, 2025 18:45

SHA3 GPR intrinsic & tests

a944a54

Delete empty line

43f4594

openjdk bot changed the title ~~JDK-8337666~~ 8337666: AArch64: SHA3 GPR intrinsic Mar 26, 2025

openjdk bot added the rfr Pull request is ready for review label Mar 26, 2025

openjdk bot added the hotspot hotspot-dev@openjdk.org label Mar 26, 2025

theRealAph reviewed Apr 25, 2025

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp (outdated)

theRealAph reviewed Apr 25, 2025

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp (outdated)

theRealAph approved these changes May 14, 2025

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label May 14, 2025

dchuyko added 3 commits May 28, 2025 19:45

Merge master

0c54d74

Review suggestions

91845cb

Copyright year

a46cebe

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label May 30, 2025

Assert message

fc5bd6a

Merge branch 'openjdk:master' into JDK-8337666

e61812a

theRealAph reviewed May 31, 2025

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp (outdated)

theRealAph reviewed May 31, 2025

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp (outdated)

dchuyko and others added 4 commits June 1, 2025 00:37

Update src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp

194f77c

Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>

Update src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp

87e22c8

Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>

Merge branch 'openjdk:master' into JDK-8337666

cd24df6

No imm masking in rolw

d9cf513

dchuyko requested a review from theRealAph June 5, 2025 09:44

Merge branch 'openjdk:master' into JDK-8337666

37bda3c

theRealAph approved these changes Jun 5, 2025

openjdk bot added the ready Pull request is ready to be integrated label Jun 5, 2025

openjdk bot added the integrated Pull request has been integrated label Jun 5, 2025

openjdk bot closed this Jun 5, 2025

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 5, 2025
