- 
                Notifications
    
You must be signed in to change notification settings  - Fork 6.1k
 
8341527: AVX-512 intrinsic for SHA3 #21352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
          
 👋 Welcome back ferakocz! A progress list of the required criteria for merging this PR into   | 
    
| 
           @ferakocz This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 110 new commits pushed to the  
 As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@sviswa7) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type   | 
    
          
Webrevs
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed performance on my dev machine. Looks good!
Instruction selection: no complaints. vperm* instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation
I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.
I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run tests with -Xcomp -XX:-TieredCompilation, the test will finish before the code is even compiled. I've done that before, running test twice with either options; its 'better then nothing' (unless I am not seeing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.
Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.
For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at DigestBase.implCompressMultiBlock, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).
If DigestBase.implCompressMultiBlock were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.
| if (UseSHA3Intrinsics) { | ||
| warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU."); | ||
| FLAG_SET_DEFAULT(UseSHA3Intrinsics, false); | ||
| if (UseAVX > 2) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be #ifdef _LP64. (Similar format from above). Need to look up the cpu features required for the instructions in the intrinsic..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the #ifdef.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'rest' of the comment I owed you.. need AVX512F, AVX512DQ, AVX512BW.
So you will need supports_avx512bwdq() here
"Showing my (math) work.."
grep '__ ' src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp | sed -e 's| *__ *||' -e 's|(.*||' | sort -u
(only using the full 512 versions, no need for VL)
evmovdquq  AVX512F
evmovdquw  AVX512BW
evpermt2q  AVX512F
evprolq    AVX512F
evprolvq   AVX512F
evpxorq    AVX512F
vpternlogq AVX512F
kmovbl     AVX512DQ
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I will change the kmovbl()s to kmovwl() and then supports_avx512bw() will suffice.
| 
               | 
          ||
| void StubGenerator::generate_sha3_stubs() { | ||
| if (UseSHA3Intrinsics) { | ||
| if (VM_Version::supports_evex()) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really should be an assert. i.e. All cpu-flag checks should be done in vm_version_x86.cpp and by this point if UseSHA3Intrinsics is on, "we are good to go"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed as suggested.
| Assembler::evpsravw(dst, mask, nds, src, merge, vector_len); | ||
| } | ||
| } | ||
| void evpsrad(XMMRegister dst, KRegister mask, XMMRegister src, int shift, bool merge, int vector_len) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more compact way to 'unhide' function from Assembler.hpp is the using C++ feature : using Assembler::evpsrad;. (You can see it being used bit further below, line 1589)
Comment repeats in (several) changes in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed as suggested.
| __ kmovbl(k3, rax); | ||
| __ addl(rax, 8); | ||
| __ kmovbl(k4, rax); | ||
| __ addl(rax, 16); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you need k5 soonest, you could save a few cycles by removing the propagation dependency on rax and loading the immediate directly..
(If you really want to get clever,
  KRegister masks[] = {k1,k2,k3,k4,k5};
  for (long i=2; i<=32; i*=2) {
    __ mov64(rax, i-1);
    __ kmovbl(masks[i], rax);
  }
Highly debatable if its actually any more readable.. so up to you)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative that is closer to the structure of your code (And uses smaller instructions..).
- Start from the end, with 
k5, load0x1fconstant - Shift constant down by one and load into next KRegister
 - (still could be done with a loop. but you decide what you find more readable..)
 
This way k5 is available immediately for the evmovdquq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to start loading the mask registers from k5.
          
 Thanks for looking at it! 
 I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too. The algorithms for the hash (digest) functions are designed so that any programming error would lead to erroneous output on any input, so if your implementation produces the correct result on a few randomly chosen inputs of sizes varying from 0 bytes to several blocks then you can claim with high confidence that it is correct. 
 In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0 I also call the pure Java version (with a clone of the original inputs) and compare the results. 
  | 
    
          
 I suppose it works. When possible, I rather have a more granular unit test (and we don't have the code for those algorithms yet. erm, right?) 
 If you still have it and it can be 'made clean'.. I would love to see some of that 'scaffolding test code' kept for the final commit. (I like to imagine the 'final code cleanup' as 'removing scaffolding from a construction site' :) ) This will be especially useful if (when?) we revisit the intrinsic. (I can already see us also needing an AVX2 version.. someone will need to re-learn how to verify that intrinsic too)  | 
    
| 
           @ferakocz I think you need to enble SHA3 testing in jtreg tests we have by modifying: JDK-8252204 added several C2 tests for SHA3 intrinsics in   | 
    
| void Assembler::vpmuldq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) { | ||
| assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : | ||
| (vector_len == AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_evex()), ""); | ||
| // TODO check what legacy_mode needs to be set to | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by comment: There's a TODO left in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I was hoping that I would learn that from a reviewer @vpaprotsk or @vnkozlov , do you know? I was not able to figure it out from the manual what it should be. (with the current setting "false" at least my code works on the test machines that I tried, but I never tried with "true")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
legacy_mode should be false here. This instruction is promotable to evex encoding if higher bank registers (XMM16 and above) are used. It is not a legacy instruction so legacy_mode should be false. Examples of legacy instructions are vptest, vpblend*.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
| 
           This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug: Failure 1Tests: 
 Failure: Failure 2
 Output:  | 
    
          
 The scaffolding is really simple: instead of e.g. @IntrinsiCandidate you would have void foo(byte[] output, byte[] input) { @IntrinsicCandidate void fooImplJava(byte[] output, byte[] input) { Just a bit more complicated for non-void methods.  | 
    
          
 @chhagedorn could you send me the mach5 command line (or other means) to run these tests?  | 
    
          
 I think Christian missed additional important flag which limits avx512 features: -XX:+UseKNLSetting  | 
    
| 
           "Failure 2" is due to issue I pointed about compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L10  | 
    
          
 Right, you need -XX:+UseKNLSetting, thanks Vladimir for jumping in! Missed to mention that separately and was hard to spot in the posted command line.  | 
    
| 
           @ferakocz this pull request can not be integrated into  git checkout sha3-avx512-intrinsic
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push | 
    
          
 I did that, plus restored the error message, now all the tests suggested by @chhagedorn pass.  | 
    
| 
           @ferakocz Thanks for taking my inputs into consideration and the corresponding changes. Would it be also possible for you to add comments to the rounds24_loop code if you want us to review that in detail. Otherwise the PR looks good to me.  | 
    
| 
           Could someone approve my changes so that I can integrate?  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed the rounds_24 loop. Other than that the PR looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything from me was addressed too, thanks!
| 
           @sviswa7 and @vpaprotsk, I added comments to the algorithm implementation, could you take another look and approve again if you are satisfied with them? Thanks!  | 
    
| 
           Thanks for the comments. Still looks good to me (I haven't reviewed the core loop instruction-by-instruction either, I would need to spend a lot more time getting to know SHA3. But this is why I was asking about KAT/testing; This is 'condition-less' code, and no additions/carries-to-propagate. Testing should have 100% code-coverage with just a few tests., so no need to tests carries, input text value-independent. Do need to vary input length to test loop bounds).  | 
    
| 
           Thanks for the comments, very helpful. I have verified the theta mapping, chi step, and the xor step. They look good. Now making my way through the rho and sigma perms and rotates.  | 
    
          
 The rho and sigma perms/rotates also look good.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
| 
           /integrate  | 
    
| 
           /sponsor  | 
    
| 
          
 Going to push as commit 9cfb0f7. 
 Your commit was automatically rebased without conflicts.  | 
    
| 
           I think this broke the x86 assembler, I'm getting multiple failures that look like the following:  | 
    
| 0x8000000000008080L, 0x0000000080000001L, 0x8000000080008008L | ||
| }; | ||
| 
               | 
          ||
| ATTRIBUTE_ALIGNED(64) static const uint64_t permsAndRots[] = { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's too late for this now, but this should've used alignas directly instead of ATTRIBUTE_ALIGNED. No worries though, the macro ultimately expands to alignas
There is already an intrinsic for SHA-3 for aarch64, which gives significant speed improvement on that architecture, so this pull request is bringing similar improvement for tha x64 family of systems that have the AVX-512 extension. Rudimentary measurements show that 30-40% speed improvement can be achieved.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21352/head:pull/21352$ git checkout pull/21352Update a local copy of the PR:
$ git checkout pull/21352$ git pull https://git.openjdk.org/jdk.git pull/21352/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21352View PR using the GUI difftool:
$ git pr show -t 21352Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21352.diff
Webrev
Link to Webrev Comment