-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8329538: Accelerate P256 on x86_64 using Montgomery intrinsic #18583
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- MontgomeryIntegerPolynomialP256 - Cleaned up montgomery in conversion; now automatically via Field.getElement() - removed toMontgomery() (and its interface) - Rearange and comment code, remove debug traces - (TODO: fix getElement(byte[])) - ECOperations - No more hashmap lookup on every multiply (ifs instead), removed trampoline - Regression; put back lazy loading of static constant table for better VM startup - Cleaned up tomontgomery conversion to use Field.getElement() - Flattened multilayer PointMultiplication (was ECOperations.PointMultiplier.LargeTable.P256 made even worse by new montgomery multipliers) - removed (pre-existing incorrect) use of member classes - Added comments and more comments - Made ECOperations.montgomeryOps private - Patched tests to use reflection to access ECOperations.montgomeryOps - stubGenerator_x86)64_poly_mont.cpp - Comments were very wrong, fixed/rewrote comments - Added ascii art for vector operations - Removed montgomery percolation in crypto ParameterSpec APIs - Enabled montgomery intrinsics by default - Fixed test cases (ECOperationsFussTest.java, ECOperationsKAT.java, MontgomeryPolynomialFuzzTest.java) - Typos from context switch - Fixes needed after refactoring - Use reflection to disable montgomery - getElement() refactored, unfix uses - Fixed PolynomialP256Bench microbenchmark - Need to remeasure - Removed changes in SignatureBench and KeyAggreementBench - Will use original with baseline SDK
|
👋 Welcome back vpaprotsk! A progress list of the required criteria for merging this PR into |
|
@vpaprotsk This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 8 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@magicus, @jatin-bhateja, @ascarpino, @sviswa7) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
|
@vpaprotsk The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
| jdk.jfr, | ||
| jdk.unsupported; | ||
| jdk.unsupported, | ||
| jdk.crypto.ec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jdk.crypto.ec has been hollowed out since JDK 22, the sun.security.ec are in java.base. So I don't think you need this qualified export.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed. (Started this when jdk.crypto.ec was still in use.. missed a few spots during rebase I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build changes are trivially fine.
/reviewers 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few early comments.
Please update the copyright year of all the modified files.
You can even consider splitting this into two patches, Java side changes in one and x86 optimized intrinsic in next one.
| __ cmpl(length, 19); | ||
| __ jcc(Assembler::equal, L_Length19); | ||
|
|
||
| // Default copy loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add appropriate loop entry alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a 'switch statement default'. The default should never happen (See "Known Length comment on line 335"), but added because java code has that behavior. (i.e. in the unlikely case NIST adds a new elliptic curve to the existing standard?)
| ATTRIBUTE_ALIGNED(64) uint64_t MODULUS_P256[] = { | ||
| 0x000fffffffffffff, 0x00000fffffffffff, | ||
| 0x0000000000000000, 0x0000001000000000, | ||
| 0x0000ffffffff0000, 0x0000000000000000, | ||
| 0x0000000000000000, 0x0000000000000000 | ||
| }; | ||
| static address modulus_p256() { | ||
| return (address)MODULUS_P256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long constants should have UL suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Properly ULL, but good point, fixed
| __ bind(L_DefaultLoop); | ||
| __ cmpl(length, 0); | ||
| __ jcc(Assembler::less, L_Done); | ||
| assign_scalar(Address(aLimbs, 0), Address(bLimbs, 0), set, tmp, _masm); | ||
| __ subl(length, 16); | ||
| __ lea(aLimbs, Address(aLimbs,8)); | ||
| __ lea(bLimbs, Address(bLimbs,8)); | ||
| __ jmp(L_DefaultLoop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both sub and cmp are flag affecting instructions and are macro-fusible.
By doing a loop rotation i.e. moving the length <= 0 check outside the loop and pushing the loop exit check at bottom you can save additional compare checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per-above, this is a switch statement (UNLIKELY) fallback. I can still add alignment and loop rotation, but being a fallback figured its more important to keep it small&readable...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all part of intrinsic, no harm in polishing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (normalized loop/backedge). There was actually a problem in the loop counter.. (i-=1 instead of i-=16). Can't include a test since classes are sealed, but verified manually.
|
@ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would appreciate your feedback. |
|
In |
Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) "for One functional reason that might justify keeping it as-is, is fuzz-testing; with the fallback available, I am able to write the included Fuzz tests and have them check the values against the existing implementation. While I also included a few KAT tests using openssl-generated values, the fuzz tests check millions of values and it does add a lot more certainty about correctness of this code. Can it be removed? For the operations that do not involve multiplication (i.e. I tend to like Thanks @ascarpino PS: Perhaps there is some middle ground, remove the |
Thanks Jatin, will fix! |
I hadn't looked at your fuzz test until you mentioned it. I see you are using reflection to change the values. Is that what you mean by "fallback"? I'm assuming there is no to access the older implementation without reflection.
I wouldn't rip out the old implementation. I have been wondering if we should make the older implementation available, maybe by security property. I was looking at the static Maps at the top of
It would be nice to remove the nesting and it would be nice to be a singleton. Maybe some combination of what I mentioned above chance can help that. I haven't fully thought this out either. |
Fixed all copyright years Re splitting.. probably too late for that now.. (did consider it initially.. got hard to manage two changes while developing. And easier to justify the change when the entire patch is presented? but yes, far more code to review.. ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good and have passed testing
| } | ||
|
|
||
| #ifdef _LP64 | ||
| if (supports_avx512ifma() && supports_avx512vlbw() && MaxVectorSize >= 64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to tie the intrinsic to MaxVectorSize setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| stubName = "intpoly_assign"; | ||
|
|
||
| if (!stubAddr) return false; | ||
| if (stopped()) return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 7564 seems redundant here as there is no range check or anything like that before this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. That is what that is for... I thought it was some soft of 'VM quitting' short-circuit. Removed.
| // M = load(*modulus_p256) | ||
| __ evmovdquq(modulus, allLimbs, ExternalAddress(modulus_p256()), false, Assembler::AVX_512bit, rscratch); | ||
|
|
||
| // A = load(*aLimbs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little bit more description in comments on what the load step involves would be helpful. e.g. Load upper 4 limbs, shift left by 1 limb using perm, or in the lowest limb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| XMMRegister mask52 = xmm23; | ||
| XMMRegister broadcast5 = xmm24; | ||
| KRegister limb0 = k1; | ||
| KRegister limb5 = k2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
limb5 and select are not being used anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed (and also broadcast5)
| // Save all 'SOE' registers | ||
| __ push(rbx); | ||
| #ifdef _WIN64 | ||
| __ push(rsi); | ||
| __ push(rdi); | ||
| #endif | ||
| __ push(r12); | ||
| __ push(r13); | ||
| __ push(r14); | ||
| __ push(r15); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to save/restore rbx, r12, r14, r15. Only r13 is used as temp in montgomeryMultiply(aLimbs, bLimbs, rLimbs). That too could be easily changed to r8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems I forgot to completely cleanup, thanks! (Originally copied from poly1305 stub)
| __ mov(aLimbs, c_rarg0); | ||
| __ mov(bLimbs, c_rarg1); | ||
| __ mov(rLimbs, c_rarg2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could directly call montgomeryMultiply(c_rarg0, c_rarg1, c_rarg2) then these moves are not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave them symbolic names and passed the gpr temp and parameter. vector register map still in the montgomeryMultiply function, but gprs explicitly passed in. 'close enough'?
|
The intrinsics and the C2 changes look good to me. |
|
Thanks Sandhya! Now that I have @ascarpino approval as well, I plan to integrate next Tuesday. |
|
I'll send this through our testing and will report back once it passed. |
|
I'm getting some conflicts when trying to apply this to master. Could you please merge the PR? |
|
Hi @TobiHartmann , merged with no issues for me. Could you please run the tests again? (I think Tony did run them, but can't hurt verifying again). Thanks! |
|
Thanks! I submitted testing and will report back once it passed. |
|
All tests passed. |
|
Thanks Tobi! /integrate |
|
@vpaprotsk |
|
/sponsor |
|
Going to push as commit afed7d0.
Your commit was automatically rebased without conflicts. |
|
@sviswa7 @vpaprotsk Pushed as commit afed7d0. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
|
Unfortunately, this caused a performance regression, see JDK-8333583. @vpaprotsk, please have a look. |
Performance. Before:
Performance, no intrinsic:
Performance, with intrinsics
Summary on design (see code for 'ASCII art', references and details on math):
IntegerPolynomialfield (MontgomeryIntegerPolynomialP256) with 52-bit limbsgetElement(*)/fromMontgomery()to convert numbers into/out of the fieldECOperationsis the primary use of the new fieldforParameters()/multiply()/setSum()generates numbers in the new fieldProjectivePoint/Montgomery{Imm|M}utable.asAffine()to convert out of the new fieldProgress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583$ git checkout pull/18583Update a local copy of the PR:
$ git checkout pull/18583$ git pull https://git.openjdk.org/jdk.git pull/18583/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18583View PR using the GUI difftool:
$ git pr show -t 18583Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18583.diff
Webrev
Link to Webrev Comment