8275914: SHA3: changing java implementation to help C2 create high-performance code #6847

Closed
bulasevich wants to merge 2 commits into openjdk:master from bulasevich:sha3_inline_unroll_locals_8275914

Conversation

@bulasevich
Contributor

@bulasevich bulasevich commented Dec 15, 2021

Background

The goal is to improve the performance of the SHA3 implementation, which runs up to two times slower than the native implementation (OpenSSL, measured on AMD64 and AArch64). Some hardware provides SHA3 accelerators, but most (AMD64 and most AArch64) does not.

For AArch64 hardware that does support SHA3 accelerators, an intrinsic was implemented for ARMv8.2 with SHA3 instructions support:

  • JDK-8252204: AArch64: Implement SHA3 accelerator/intrinsic

The SHA3 Java implementation has already been reworked to reduce memory consumption and make it run faster:

  • JDK-8157495: SHA-3 Hash algorithm performance improvements (~12x speedup)

That issue already improved performance, but there is still room to improve SHA3 performance with the current C2 implementation, which is what this PR proposes.

Option 1 (this PR)

With this change I unroll the inner loops manually, inline the code manually, and process the data in local variables instead of an array. This approach gives the best performance (see benchmark results). Without this change (with the current array-based data processing) we observe many more load/store operations than with processing in local variables, on both AMD64 and AArch64.
Native implementations show that on AArch64 (32 general-purpose registers) the SHA-3 algorithm can hold all 25 data lanes and all temporary variables in registers. C2 can't optimize as well because several registers are reserved for internal use: rscratch1, rscratch2, rmethod, rthread, etc. With the data in local variables the number of excess loads/stores is much smaller and the performance is much better.
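To illustrate the difference between array-based and local-variable processing, here is a minimal sketch that applies the Keccak chi step to a single five-lane row for several rounds. It is not the code from this PR: the class and method names are hypothetical, and the real implementation works on all 25 lanes and all five step mappings. The array form touches memory in every round, while the locals form loads the row once and stores it once.

```java
final class ChiLocalsSketch {
    /** Array form: every round reads and writes the row through the array. */
    static void chiRoundsArray(long[] a, int rounds) {
        for (int r = 0; r < rounds; r++) {
            // a[0] and a[1] are overwritten before their last use, so save them.
            long a0 = a[0], a1 = a[1];
            for (int x = 0; x < 3; x++) {
                a[x] ^= ~a[x + 1] & a[x + 2];
            }
            a[3] ^= ~a[4] & a0;
            a[4] ^= ~a0 & a1;
        }
    }

    /** Locals form: load once, keep the row in local variables, store once. */
    static void chiRoundsLocals(long[] a, int rounds) {
        long a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4];
        for (int r = 0; r < rounds; r++) {
            long t0 = a0 ^ (~a1 & a2);
            long t1 = a1 ^ (~a2 & a3);
            long t2 = a2 ^ (~a3 & a4);
            long t3 = a3 ^ (~a4 & a0);
            long t4 = a4 ^ (~a0 & a1);
            a0 = t0; a1 = t1; a2 = t2; a3 = t3; a4 = t4;
        }
        a[0] = a0; a[1] = a1; a[2] = a2; a[3] = a3; a[4] = a4;
    }
}
```

Both methods compute the same result; whether the locals form actually stays in registers depends on the JIT compiler and the platform.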

Option 2 (alternative)

LINK: the change

This is a more conservative change that minimizes code changes, but it gives a smaller performance improvement. Please let me know if you think this change is better than the main one: in that case I will replace it within this pull request.

With this change I unroll the inner loops manually and use the @ForceInline annotation. Manual unrolling is necessary because C2 does not recognize that the loop is a counted loop that can be completely unrolled. Instead of replacing the loop with five copies of the loop body, C2 splits it into pre-, main- and post-loops, which does not help in this case. C2 does better when the array is created locally, but in our case the array is created on object instantiation, so C2 can't prove that the array length is constant. The second issue affecting performance is that methods with unrolled loops get larger and C2 does not inline them. This is addressed here by using the @ForceInline annotation.
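As a rough illustration of the manual unrolling described above, the sketch below computes the theta column parities of a 25-lane state in loop form and in hand-unrolled form. The class and method names are hypothetical and this is not the code of the alternative change. The JDK-internal annotation jdk.internal.vm.annotation.ForceInline is only mentioned in a comment, because it is not accessible to ordinary application code.

```java
final class ThetaUnrollSketch {
    /** Loop form: a counted loop over the five columns. */
    static void columnParityLoop(long[] a, long[] c) {
        for (int x = 0; x < 5; x++) {
            c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];
        }
    }

    /**
     * Hand-unrolled form: five copies of the loop body with constant indices.
     * Inside the JDK, a method like this would additionally be annotated with
     * @ForceInline so that its larger size does not prevent inlining.
     */
    static void columnParityUnrolled(long[] a, long[] c) {
        c[0] = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20];
        c[1] = a[1] ^ a[6] ^ a[11] ^ a[16] ^ a[21];
        c[2] = a[2] ^ a[7] ^ a[12] ^ a[17] ^ a[22];
        c[3] = a[3] ^ a[8] ^ a[13] ^ a[18] ^ a[23];
        c[4] = a[4] ^ a[9] ^ a[14] ^ a[19] ^ a[24];
    }
}
```

The two methods are functionally equivalent; the unrolled one simply removes the need for the compiler to prove the trip count.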

Benchmark results

We measured the change on four platforms: ARM32, PPC, AArch64 (with no advanced SHA-3 instructions) and AMD64, using the MessageDigests benchmark. Here are the results: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html

  • fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on ARM32/PPC/AArch64/AMD64
  • fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on ARM32/PPC/AArch64/AMD64

Testing

Tested with JTREG and the SHA-3 Project (NIST) test vectors: I ran the SHA3 implementation over the same vectors before and after the change and checked that the results are the same.
The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
The test vectors were compiled from the Final Algorithm Package:
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt
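
For readers unfamiliar with known-answer testing, the following is a minimal sketch of the idea, not the RunKat.java tool linked above. It hashes a message with the JDK's SHA-3 provider through the standard java.security.MessageDigest API and compares the hex digest with an expected value. The class and method names are hypothetical; the expected value in the usage note is the widely published SHA3-256 digest of the empty message.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha3KatSketch {
    /** Lower-case hex encoding of a byte array. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    /** Hashes the message with the named algorithm, e.g. "SHA3-256". */
    static String digestHex(String algorithm, byte[] message) {
        try {
            return toHex(MessageDigest.getInstance(algorithm).digest(message));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Algorithm not available: " + algorithm, e);
        }
    }

    /** True if the computed digest equals the expected known answer. */
    static boolean matches(String algorithm, byte[] message, String expectedHex) {
        return digestHex(algorithm, message).equalsIgnoreCase(expectedHex);
    }
}
```

For example, `Sha3KatSketch.matches("SHA3-256", new byte[0], "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a")` should hold before and after any change to the implementation.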


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8275914: SHA3: changing java implementation to help C2 create high-performance code

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6847/head:pull/6847
$ git checkout pull/6847

Update a local copy of the PR:
$ git checkout pull/6847
$ git pull https://git.openjdk.java.net/jdk pull/6847/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6847

View PR using the GUI difftool:
$ git pr show -t 6847

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6847.diff

@bridgekeeper

bridgekeeper bot commented Dec 15, 2021

👋 Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 15, 2021
@openjdk

openjdk bot commented Dec 15, 2021

@bulasevich The following label will be automatically applied to this pull request:

  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the security security-dev@openjdk.org label Dec 15, 2021
@mlbridge

mlbridge bot commented Dec 15, 2021

Webrevs

@dchuyko
Member

dchuyko commented Dec 15, 2021

@kuksenko You used to improve this code before, could you share your opinion?

@kuksenko
Contributor

@kuksenko You used to improve this code before, could you share your opinion?

I think benchmark results table tells more. :)

@bridgekeeper

bridgekeeper bot commented Jan 12, 2022

@bulasevich This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@phohensee
Member

phohensee commented Jan 27, 2022

Hi, Boris,

I'd go with option 2 for this patch and option 1 in a separate JBS issue. Safer and more straightforward that way. The only change I'd make to the option 2 patch is to try replacing raw constants with derivations of DM. E.g., in smChi() replace 0, 5, 10, 15, 20, and 25 with 0*DM, 1*DM, 2*DM, 3*DM, 4*DM, and 5*DM. In smTheta, replace a[0] with a[0*DM], a[1] with a[0*DM+1], etc. But if doing so causes C2 to fail to constant fold, revert to raw constants.
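
A small sketch of what that suggestion could look like, assuming DM is declared as a static final int with the value 5 (the class and method names here are hypothetical). Under that assumption the index expressions are compile-time constant expressions in Java, so they should already be folded before C2 sees the bytecode.

```java
final class DmIndexSketch {
    // Assumption: the lane dimension is a compile-time constant equal to 5.
    static final int DM = 5;

    /** Parity of column 1 written with raw index constants. */
    static long column1ParityRaw(long[] a) {
        return a[1] ^ a[6] ^ a[11] ^ a[16] ^ a[21];
    }

    /** The same parity with indices derived from DM. */
    static long column1ParityDm(long[] a) {
        return a[0 * DM + 1] ^ a[1 * DM + 1] ^ a[2 * DM + 1]
             ^ a[3 * DM + 1] ^ a[4 * DM + 1];
    }
}
```

The DM-derived form documents where each index comes from while selecting exactly the same lanes as the raw constants.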

@phohensee
Member

Also, going with option 2 first leaves a progression record that might be useful to future compiler optimizer implementors.

The option 1 patch looks good too.

@ascarpino
Contributor

Sorry, this fell off my stack with the holidays and everything else going on.
I am fine with proceeding with the change in this PR, option 1 I believe. Have you run the tests to make sure it passes the Known Answer Tests?
Can you also include the before and after JMH benchmarks for SHA3 in the PR? They are under test/micro/org/openjdk/bench/java/security/MessageDigests

Thanks

@bulasevich
Contributor Author

Hi,

Thanks for looking at this!

Have you run the tests to make sure it passes the Known Answer Tests?

In my original post I included the test suite and KAT. I did not find a ready-to-use one, so I built it from scratch using the test vectors provided in the KECCAK Final Algorithm Package.

Can you also include the before and after JMH benchmarks for SHA3 in the PR? They are under test/micro/org/openjdk/bench/java/security/MessageDigests

There are before/after MessageDigests benchmark results for option 1/option 2 on ARM32/PPC/ARM64/AMD: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html - is that OK?

@ascarpino
Contributor

Ah yes. Sorry, I saw the replied email that left that out

@ascarpino
Contributor

Please update the copyright date from 2020 to 2022.
I'll run regression tests to make sure things are ok.

@bulasevich bulasevich force-pushed the sha3_inline_unroll_locals_8275914 branch from dd88946 to 4d15fea Compare February 1, 2022 08:41
Contributor

@ascarpino ascarpino left a comment

Looks good to me

@openjdk

openjdk bot commented Feb 1, 2022

@bulasevich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8275914: SHA3: changing java implementation to help C2 create high-performance code

Reviewed-by: ascarpino, phh

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 8 new commits pushed to the master branch:

  • 4532c3a: 8280554: resourcehogs/serviceability/sa/ClhsdbRegionDetailsScanOopsForG1.java can fail if GC is triggered
  • 5080e81: 8280770: serviceability/sa/ClhsdbThreadContext.java sometimes fails with 'Thread "SteadyStateThread"' missing from stdout/stderr
  • 1f6fcbe: 8278475: G1 dirty card refinement by Java threads may get unnecessarily paused
  • c5a8612: 8280458: G1: Remove G1BlockOffsetTablePart::_next_offset_threshold
  • 86debf4: 8280932: G1: Rename HeapRegionRemSet::_code_roots accessors
  • d37fb1d: 8280870: Parallel: Simplify CLD roots claim in Full GC cycle
  • 18a7dc8: 8279586: [macos] custom JCheckBox and JRadioBox with custom icon set: focus is still displayed after unchecking
  • 16ec47d: 8279856: Parallel: Use PreservedMarks to record promotion-failed objects

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 1, 2022
Member

@phohensee phohensee left a comment

Lgtm as well.

@bulasevich
Contributor Author

/integrate

@openjdk

openjdk bot commented Feb 1, 2022

Going to push as commit c74b8f4.
Since your change was applied there have been 13 commits pushed to the master branch:

  • a18beb4: 8280867: Cpuid1Ecx feature parsing is incorrect for AMD CPUs
  • fdd9ca7: 8280642: ObjectInputStream.readObject should throw InvalidClassException instead of IllegalAccessError
  • d95de5c: 8255495: Support CDS Archived Heap for uncompressed oops
  • bde2b37: 8279954: java/lang/StringBuffer(StringBuilder)/HugeCapacity.java intermittently fails
  • d1cc5fd: 8280941: os::print_memory_mappings() prints segment preceeding the inclusion range
  • 4532c3a: 8280554: resourcehogs/serviceability/sa/ClhsdbRegionDetailsScanOopsForG1.java can fail if GC is triggered
  • 5080e81: 8280770: serviceability/sa/ClhsdbThreadContext.java sometimes fails with 'Thread "SteadyStateThread"' missing from stdout/stderr
  • 1f6fcbe: 8278475: G1 dirty card refinement by Java threads may get unnecessarily paused
  • c5a8612: 8280458: G1: Remove G1BlockOffsetTablePart::_next_offset_threshold
  • 86debf4: 8280932: G1: Rename HeapRegionRemSet::_code_roots accessors
  • ... and 3 more: https://git.openjdk.java.net/jdk/compare/de3113b998550021bb502cd6f766036fb8351e7d...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 1, 2022
@openjdk openjdk bot closed this Feb 1, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 1, 2022
@openjdk

openjdk bot commented Feb 1, 2022

@bulasevich Pushed as commit c74b8f4.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels

integrated Pull request has been integrated security security-dev@openjdk.org

5 participants