Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics #10939

Closed
wants to merge 3 commits into from

Conversation

dgbo
Copy link
Member

@dgbo dgbo commented Nov 2, 2022

In JDK-8252204, when implemented SHA3 intrinsics, we use digest_length to differentiate SHA3-224, SHA3-256, SHA3-384, SHA3-512 and calculate block_size with block_size = 200 - 2 * digest_length.
However, there are two extra SHA3 instances, SHAKE256 and SHAKE128, allowing an arbitrary digest_length:

	digest_length	block_size
SHA3-224	28	144
SHA3-256	32	136
SHA3-384	48	104
SHA3-512	64	72
SHAKE128	variable	168
SHAKE256	variable	136

This causes SIGSEGV crash or hash code mismatch with test/jdk/sun/security/ec/ed/EdDSATest.java. The test calls SHAKE256 in Ed448.

The main idea of the patch is to pass the block_size to differentiate SHA3 instances.
Tests test/jdk/sun/security/ec/ed/EdDSATest.java and ./test/jdk/sun/security/provider/MessageDigest/SHA3.java both passed.
And tier1~3 passed on SHA3 supported hardware.

The SHA3 intrinsics still deliver 20%~40% performance improvement on our pre-silicon simulated platform.
The latency and throughput of crypto SHA3 ops are designed to be 1 cpu cycle and 2 execution pipes respectively.

Compared with the main stream code, the performance change with this patch are negligible on real hardware and simulation platform.
Based on the JMH results of SHA3 intirinsics, performance can be improved by ~50% on some hardware, while some hardware have ~30% regression.
These performance details are available in the comments of the issue page.
I guess the performance benefit of SHA3 intrinsics is dependent on the micro architecture, it should be switched on/off based on the running platform.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10939/head:pull/10939
$ git checkout pull/10939

Update a local copy of the PR:
$ git checkout pull/10939
$ git pull https://git.openjdk.org/jdk pull/10939/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10939

View PR using the GUI difftool:
$ git pr show -t 10939

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10939.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 2, 2022

👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 2, 2022
@openjdk
Copy link

openjdk bot commented Nov 2, 2022

@dgbo The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 2, 2022
@mlbridge
Copy link

mlbridge bot commented Nov 2, 2022

Webrevs

Copy link
Contributor

@shqking shqking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good to me. (I'm not a Reviewer)

I agree with your fix to pass block_size rather than digest_length, but I didn't fully understand your update insides stubGenerator_aarch64.cpp, i.e. those operations before rounds24_loop, since I'm not a crypto expert.
We may need some crypo expert to review that part.

Note that I rerun tier1~3 with the latest commit in this PR on sha512 feature supported hardware, and the test passed.

Thanks.

@dgbo
Copy link
Member Author

dgbo commented Nov 3, 2022

@shqking Thanks for the review.

Before the NR (==24) iterations in keccak, a0~a24 need to be ready:

// SHA3.java
    private void keccak() {
        // convert the 200-byte state into 25 lanes
        bytes2Lanes(state, lanes);
        ...
        a0 = lanes[0]; a1 = lanes[1]; ...; a24 = lanes[24];

        // process the lanes through step mappings
        for (int ir = 0; ir < NR; ir++) {
            long c0 = a0^a5^a10^a15^a20;
            long c1 = a1^a6^a11^a16^a21;
            ...

While a_i is computed as: state[i] ^= b[ofs++] in implCompress0(byte[] b, int ofs), byte2Lanes(state, lanes), a_i = lanes[i].

// SHA3.java
    @IntrinsicCandidate
    private void implCompress0(byte[] b, int ofs) {
       for (int i = 0; i < buffer.length; i++) {
           state[i] ^= b[ofs++];
       }
       keccak();
    }

Those operations before rounds24_loop can be taken as vectorial version of the Java code shown above:

  1. Load all elements of state array into vector registers (v0~v24).
  2. Load the input data into v25~v31. The number of data to load is dependent on block_size (== buffer.length).
  3. Perform state[i] ^ b[ofs].
    We don't need instructions for byte2Lanes, cause it is already done with the ld1_8B instructions.

Hope this can address your confusion.

@shqking
Copy link
Contributor

shqking commented Nov 4, 2022

Thanks for your kind explanation. Understood.
This patch is good to me. (I'm not a Reviewer)

@dgbo
Copy link
Member Author

dgbo commented Nov 14, 2022

Hi, @nick-arm @theRealAph, could you help to review this? Thanks.

@theRealAph
Copy link
Contributor

This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please?

@dgbo
Copy link
Member Author

dgbo commented Nov 14, 2022

This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please?

Thanks for the comments.

Do you mean that we need a patch (not to be merged) to support testing on processor without FEAT_SHA3?
In which, the SHA3 instructions are substituted by multiple instructions, something like: eor3 v1, v2, v3, v4 => eor v1, v2, v3; eor v1, v1, v4?

BTW, FEAT_SHA3 is supported on M1. If you happen to have one, the test can be done on it. :)
To test this on M1/MacOS, modification below is needed to enable SHA3Intriniscs by default.
Since other features, i.e. UseSHA, can not be automatically detected neither, I think it is irrelevant with this patch.

--- a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
@@ -334,15 +334,15 @@ void VM_Version::initialize() {
     FLAG_SET_DEFAULT(UseSHA256Intrinsics, false);
   }

- if (UseSHA && VM_Version::supports_sha3()) {
+ // if (UseSHA && VM_Version::supports_sha3()) {
     // Do not auto-enable UseSHA3Intrinsics until it has been fully tested on hardware
- // if (FLAG_IS_DEFAULT(UseSHA3Intrinsics)) {
- // FLAG_SET_DEFAULT(UseSHA3Intrinsics, true);
- // }
- } else if (UseSHA3Intrinsics) {
- warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.");
- FLAG_SET_DEFAULT(UseSHA3Intrinsics, false);
- }
+ if (FLAG_IS_DEFAULT(UseSHA3Intrinsics)) {
+ FLAG_SET_DEFAULT(UseSHA3Intrinsics, true);
+ }
+ //} else if (UseSHA3Intrinsics) {
+ // warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.");
+ // FLAG_SET_DEFAULT(UseSHA3Intrinsics, false);
+ //}

@theRealAph
Copy link
Contributor

This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please?

Thanks for the comments.

Do you mean that we need a patch (not to be merged) to support testing on processor without FEAT_SHA3? In which, the SHA3 instructions are substituted by multiple instructions, something like: eor3 v1, v2, v3, v4 => eor v1, v2, v3; eor v1, v1, v4?

Yes, exactly. So the intrinsic will work everywhere. Why not merge it? It seems obvious to me.

BTW, FEAT_SHA3 is supported on M1. If you happen to have one, the test can be done on it. :) To test this on M1/MacOS, modification below is needed to enable SHA3Intriniscs by default. Since other features, i.e. UseSHA, can not be automatically detected neither, I think it is irrelevant with this patch.

Do you know why SHA3 isn't enabled on M1?

@dgbo
Copy link
Member Author

dgbo commented Nov 15, 2022

This looks right, but I don't think I can test it, which I usually would do with a patch this complicated. When we have a processor without FEAT_SHA3) we should define BCAX, EOR3, RAX1, and XAR as macros. Could you do that, please?

Thanks for the comments.
Do you mean that we need a patch (not to be merged) to support testing on processor without FEAT_SHA3? In which, the SHA3 instructions are substituted by multiple instructions, something like: eor3 v1, v2, v3, v4 => eor v1, v2, v3; eor v1, v1, v4?

Yes, exactly. So the intrinsic will work everywhere. Why not merge it? It seems obvious to me.

I see, great idea. But I'm afraid this cannot easily be done due to the high register pressure of the keccak() algorithm loop.

Three registers are passed to xar Vd, Vn, Vm, #imm, which operation equals to tmp = Vn ROR Vm; Vd = ROR(tmp[127:64], imm):ROR(tmp[63:0], imm).
Registers Vm/Vn will be read by operations below, we need another tmp register, or we will get clobbered Vm/Vn when translating xar to NEON instructions.
And we cannot find a free tmp register, almost every register has useful data for future use.

The other three SHA3 instructions have similar register issue.

BTW, FEAT_SHA3 is supported on M1. If you happen to have one, the test can be done on it. :) To test this on M1/MacOS, modification below is needed to enable SHA3Intriniscs by default. Since other features, i.e. UseSHA, can not be automatically detected neither, I think it is irrelevant with this patch.

Do you know why SHA3 isn't enabled on M1?

FEAT_SHA3 is not detected on MacOS, this code snippet below would be enough for FEAT_SHA3.
I think there are some other hardware features, e.g. FEAT_SHA1, FEAT_SHA256, FEAT_SHA512, are not detected on MacOS too.
We may need another PR to fix it.

diff --git a/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp b/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp
index 45cd77b3ba5..0ac6b4a56d1 100644
--- a/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp
+++ b/src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp
@@ -68,6 +68,8 @@ void VM_Version::get_os_cpu_info() {
   if (cpu_has("hw.optional.armv8_crc32"))     _features |= CPU_CRC32;
   if (cpu_has("hw.optional.armv8_1_atomics")) _features |= CPU_LSE;
 
+  if (cpu_has("hw.optional.armv8_2_sha3")) _features |= CPU_SHA3;
+
   int cache_line_size;
   int hw_conf_cache_line[] = { CTL_HW, HW_CACHELINE };
   sysctllen = sizeof(cache_line_size);

@theRealAph
Copy link
Contributor

Hmm, okay. Looks like there's work to do on this. I'll approve this patch, but we really must get MacOS fixed for JDK 20.

@openjdk
Copy link

openjdk bot commented Nov 15, 2022

@dgbo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8295698: AArch64: test/jdk/sun/security/ec/ed/EdDSATest.java failed with -XX:+UseSHA3Intrinsics

Reviewed-by: haosun, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 183 new commits pushed to the master branch:

  • 6ead2b0: 8296548: Improve MD5 intrinsic for x86_64
  • bd3acbe: 8297089: [BACKOUT] JDK-8297088 Update LCMS to 2.14
  • 6a60d31: 8295369: Update LCMS to 2.14
  • 0cbf084: 8296969: C1: PrintC1Statistics is broken after JDK-8292878
  • f662a06: 8296970: Remove sysThreadAvailableStackWithSlack from hotspot-symbols
  • 7357a1a: 8296889: Race condition when cancelling a request
  • 87530e6: 8296913: Correct enable preview idiom in JdbLastErrorTest.java
  • fafe682: 8295861: get rid of list argument in debug agent's removeNode() API
  • 216c6f6: 8294881: test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003/TestDescription.java fails
  • 6aef3a4: 8262435: Clarify the behavior of a few inherited ZipInputStream methods
  • ... and 173 more: https://git.openjdk.org/jdk/compare/37107fc1574a4191987420d88f7182e63c7da60c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 15, 2022
@dgbo
Copy link
Member Author

dgbo commented Nov 16, 2022

Hmm, okay. Looks like there's work to do on this. I'll approve this patch, but we really must get MacOS fixed for JDK 20.

Thanks for the review. For MacOS issue, I have filed an issue, see https://bugs.openjdk.org/browse/JDK-8297092.
Because we only have a limited testing environment for MacOS, I've asked @shqking for help, he is willing to fix it.

We have two reviews for this SHA3 PR, and it has been fully tested on several real processors.
I am going to integrate this, @theRealAph, would you like to sponsor this? Thanks.

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Nov 16, 2022
@openjdk
Copy link

openjdk bot commented Nov 16, 2022

@dgbo
Your change (at version e22c50f) is now ready to be sponsored by a Committer.

@TobiHartmann
Copy link
Member

/sponsor

@openjdk
Copy link

openjdk bot commented Nov 17, 2022

Going to push as commit 2f728d0.
Since your change was applied there have been 224 commits pushed to the master branch:

  • b9db16a: 8288717: Add a means to close idle connections in HTTP/2 connection pool
  • 9f8b6d2: 8296437: NMT incurs costs if disabled
  • e81359f: 8296170: Refactor stack-locking path in C2_MacroAssembler::fast_unlock()
  • 502fa3e: 8296912: C2: CreateExNode::Identity fails with assert(i < _max) failed: oob: i=1, _max=1
  • 5795c76: 8296222: SwingEventMonitor - installListeners(Component , int ) - CELLEDITOR - bug
  • b9d6e83: 8296906: VMError::controlled_crash crashes with wrong code and address
  • cd9c688: 8276064: CheckCastPP with raw oop input floats below a safepoint
  • d61720a: 8218885: Restore pop_frame and force_early_return functionality for Graal
  • dd9aa72: 8296083: javax/swing/JTree/6263446/bug6263446.java fails intermittently on a VM
  • cc44419: 8295407: C2 crash: Error: ShouldNotReachHere() in multiple vector tests with -XX:-MonomorphicArrayCheck -XX:-UncommonNullCast
  • ... and 214 more: https://git.openjdk.org/jdk/compare/37107fc1574a4191987420d88f7182e63c7da60c...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 17, 2022
@openjdk openjdk bot closed this Nov 17, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Nov 17, 2022
@openjdk
Copy link

openjdk bot commented Nov 17, 2022

@TobiHartmann @dgbo Pushed as commit 2f728d0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dgbo dgbo deleted the 8295698-EdDSATest-crash branch November 19, 2022 03:22
__ BIND(sha3_384_or_224);
__ tbz(digest_length, 2, rounds24_loop); // bit 2 cleared? SHA-384
// block_size == 136, bit4 == 0 and bit5 == 0, SHA3-256 or SHAKE256
__ andw(c_rarg5, block_size, 48);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does c_rarg5 serve as a temporary register here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a save-on-call register. If it was active, it should been saved.
Similar usages can be found at other intrinsics, such as

.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
5 participants