@ferakocz
Contributor

@ferakocz ferakocz commented Apr 29, 2025

Using the AVX-512 vector registers approximately doubles the speed of the ML-KEM operations (key generation, encapsulation, decapsulation).


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8351412: Add AVX-512 intrinsics for ML-KEM (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24953/head:pull/24953
$ git checkout pull/24953

Update a local copy of the PR:
$ git checkout pull/24953
$ git pull https://git.openjdk.org/jdk.git pull/24953/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24953

View PR using the GUI difftool:
$ git pr show -t 24953

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24953.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Apr 29, 2025

👋 Welcome back ferakocz! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Apr 29, 2025

@ferakocz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8351412: Add AVX-512 intrinsics for ML-KEM

Reviewed-by: sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 355 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@lmesnik, @sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr (Pull request is ready for review) label Apr 29, 2025
@openjdk

openjdk bot commented Apr 29, 2025

@ferakocz The following labels will be automatically applied to this pull request:

  • graal
  • hotspot
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the graal (graal-dev@openjdk.org), security (security-dev@openjdk.org), and hotspot (hotspot-dev@openjdk.org) labels Apr 29, 2025
@mlbridge

mlbridge bot commented Apr 29, 2025

Member

@lmesnik lmesnik left a comment

Can you please explain in comments how this fix has been tested? I would like to understand which tests are relevant and which flags need to be set to test this functionality.

@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved.
Member

Seems the file contains only copyright changes.

@sviswa7 sviswa7 left a comment

Only reviewed three intrinsics so far, more review to do.

// a (short[256]) = c_rarg1
// b (short[256]) = c_rarg2
// c (short[256]) = c_rarg3
// kyberConsts (short[40]) = c_rarg4

kyberConsts is not one of the arguments passed in.

Contributor Author

Fixed.

// result (short[256]) = c_rarg0
// a (short[256]) = c_rarg1
// b (short[256]) = c_rarg2
// kyberConsts (short[40]) = c_rarg3

kyberConsts is not one of the arguments passed in.

Contributor Author

Fixed.

Comment on lines +694 to +696
address generate_kyberAddPoly_2_avx512(StubGenerator *stubgen,
MacroAssembler *_masm) {

The Java code for "implKyberAddPoly(short[] result, short[] a, short[] b)" performs a Barrett reduction, but the intrinsic code here does not. Is that intentional, and how is the reduction handled?

Contributor Author

Actually, the Java version is the one that is too cautious. There is Barrett reduction after at most 4 consecutive uses of mlKemAddPoly(), so doing the reduction in implKyberAddPoly() is not necessary. Thanks for discovering this!
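
The argument above can be illustrated with a minimal Java sketch. It is not the JDK code: the class name, the helper names, and the Barrett constants (shift 26, multiplier 20159, chosen here as round(2^26 / q)) are assumptions made for the illustration. It assumes every term lies in [0, q], so a handful of unreduced 16-bit additions cannot overflow a short (5 * 3329 = 16645 < 32767), and a single reduction at the end gives the same residue as reducing after every addition.

```java
final class DeferredReduction {
    static final int Q = 3329;                   // ML-KEM modulus
    static final int BARRETT_SHIFT = 26;         // assumption for this sketch
    static final int BARRETT_MULTIPLIER = 20159; // round(2^26 / Q), assumption

    // Maps any 16-bit value to a representative in [0, Q] congruent mod Q.
    static short barrettReduce(short a) {
        int t = (a * BARRETT_MULTIPLIER) >> BARRETT_SHIFT; // about floor(a / Q)
        return (short) (a - t * Q);
    }

    // Plain 16-bit addition with no reduction.
    static short addNoReduce(short x, short y) {
        return (short) (x + y);
    }

    public static void main(String[] args) {
        short[] terms = {3328, 3329, 17, 3000, 2500}; // each term is in [0, Q]
        short lazy = 0;   // reduced once, after all additions
        short eager = 0;  // reduced after every addition
        for (short t : terms) {
            lazy = addNoReduce(lazy, t);
            eager = barrettReduce(addNoReduce(eager, t));
        }
        lazy = barrettReduce(lazy);
        // Both strategies must agree modulo Q.
        System.out.println(Math.floorMod(lazy, Q) == Math.floorMod(eager, Q));
    }
}
```

The sketch only checks the overflow argument for a few terms; how many additions the real call sites chain before reducing is as stated in the comment above, not derived here.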

Thanks. I have another question: is there a reason that the Java versions of AddPoly (for both 2 and 3 inputs) return 1, whereas the corresponding intrinsics return 0?

Contributor Author

I use that for debugging. E.g. it is fairly easy to change the Java code to call both the intrinsic and Java version and compare the results. I don't see any harm in leaving that in the production version, since it is always ignored.

Comment on lines +798 to +799
address generate_kyber12To16_avx512(StubGenerator *stubgen,
MacroAssembler *_masm) {

If AVX512_VBMI and AVX512_VBMI2 are available, it looks to me that the loop body of this algorithm can be implemented using more efficient instructions in 5 simple steps:

Step 1:
Load 0-47, 48-95, 96-143, 144-191 condensed bytes into xmm0, xmm1, xmm2, xmm3 respectively using masked load.

Step 2:
Use vpermb to arrange xmm0 such that bytes 1, 4, 7, ... are duplicated
xmm0 before b47, b46, ..., b0 where each b is a byte
xmm0 after b47 b46 b46 b45, ......., b5 b4 b4 b3 b2 b1 b1 b0
Repeat this for xmm1, xmm2, xmm3

Step 3:
Use vpshldvw to shift every word (16 bits) in the xmm0 appropriately with variable shift
Shift word 31 by 4, word 30 by 0, ... word 3 by 4, word 2 by 0, word 1 by 4, word 0 by 0
Repeat this for xmm1, xmm2, xmm3

Step 4:
Use vpand to "and" each word element in xmm0 by 0xfff.
Repeat this for xmm1, xmm2, xmm3

Step 5:
Store xmm0 into parsed
Store xmm1 into parsed + 64
Store xmm2 into parsed + 128
Store xmm3 into parsed + 192

If you think there is not sufficient time, we could look into it after the merge of this PR as well.
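
For reference, the per-element result that the steps above aim for can be written as a scalar Java sketch. This is not the vector code and not the JDK implementation; the class and method names are hypothetical. Every 3 condensed bytes hold two 12-bit values: the even word takes the low 12 bits of the byte pair (b1 b0), and the odd word takes the upper 12 bits of the pair (b2 b1).

```java
final class Unpack12To16 {
    // Expands packed 12-bit values into 16-bit shorts (3 bytes -> 2 shorts).
    static short[] unpack(byte[] condensed) {
        if (condensed.length % 3 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 3");
        }
        short[] parsed = new short[condensed.length / 3 * 2];
        for (int i = 0, j = 0; i < condensed.length; i += 3, j += 2) {
            int b0 = condensed[i] & 0xFF;
            int b1 = condensed[i + 1] & 0xFF;
            int b2 = condensed[i + 2] & 0xFF;
            // Even word: byte pair (b1 b0), no shift, keep the low 12 bits.
            parsed[j] = (short) (((b1 << 8) | b0) & 0xFFF);
            // Odd word: byte pair (b2 b1), drop the low 4 bits, keep 12 bits.
            parsed[j + 1] = (short) ((((b2 << 8) | b1) >>> 4) & 0xFFF);
        }
        return parsed;
    }

    public static void main(String[] args) {
        // 0xABC and 0xDEF packed into three bytes: BC FA DE.
        byte[] condensed = {(byte) 0xBC, (byte) 0xFA, (byte) 0xDE};
        short[] parsed = unpack(condensed);
        System.out.printf("%03X %03X%n", parsed[0], parsed[1]);
    }
}
```

The middle byte b1 is used by both output words, which is why the vector formulation duplicates bytes 1, 4, 7, ... before shifting and masking.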

Contributor Author

Yes, that would speed this function up (possibly significantly, taken by itself), but with the current intrinsics it accounts for only about 1.5% of the overall running time, so the overall gain would be small. On the other hand, not all AVX-512 capable processors have VBMI.
So I would rather not do it in this PR.

@sviswa7 sviswa7 left a comment

Another minor comment. Rest of the PR looks good to me.

// Kyber barrett reduce function.
//
// coeffs (short[256]) = c_rarg0
// kyberConsts (short[40]) = c_rarg1

kyberConsts is not an input parameter to implKyberBarrettReduce.

Contributor Author

Removed.

@ferakocz
Contributor Author

@sviswa7, thanks a lot for the review! If you agree with my changes to load the constants using broadcasting instructions instead of full AVX register loads, would you be so kind as to approve the PR and sponsor my integration?

Comment on lines +248 to +250
static void montmul(int outputRegs[], int inputRegs1[], int inputRegs2[],
int scratchRegs1[], int scratchRegs2[], MacroAssembler *_masm) {
for (int i = 0; i < 4; i++) {

@sviswa7 sviswa7 May 16, 2025

In the intrinsic for montMul we are treating it as if MONT_R_BITS is 16 and MONT_Q_INV_MOD_R is 0xF301, whereas in the Java code MONT_R_BITS is 20 and MONT_Q_INV_MOD_R is 0x8F301. Are these equivalent?

Contributor Author

@ferakocz ferakocz May 20, 2025

As used in this case, they are equivalent. For z = montmul(a,b), z will be between -q and q and congruent to a * b * R^-1 mod q, where R > 2 * q, R is a power of 2, -R/2 * q <= a * b < R/2 * q. For the Java code, we use R = 2^20 and for the intrinsic, R = 2^16. In our computations, b is always c * R mod q, so the montmul() really computes a * c mod q. In the Java code, we use 32-bit numbers for the computations, and we use R = 2^20 because that way the a * b numbers that occur during all computations stay in the required range (the inverse NTT computation is where they can grow the most), so we don't have to do Barrett reductions during that computation. For the intrinsics, we use R = 2^16, because this way we can do twice as much work in parallel, but we have to do Barrett reduction after levels 2 and 4 in the inverse NTT computation.
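
The equivalence described above can be checked with a small Java sketch. It is a generic signed Montgomery multiplication written for illustration, not the JDK's montMul or the intrinsic; the class name and the helper toMont are hypothetical. Only q = 3329 and the two constants 0xF301 (q^-1 mod 2^16) and 0x8F301 (q^-1 mod 2^20) come from the discussion.

```java
final class MontMulSketch {
    static final int Q = 3329;

    // Signed Montgomery multiplication with R = 2^rBits.
    // Requires Q * qInvModR == 1 (mod R) and |a * b| < (R / 2) * Q.
    // Returns z with -Q < z < Q and z congruent to a * b * R^-1 (mod Q).
    static int montMul(int a, int b, int rBits, int qInvModR) {
        long r = 1L << rBits;
        long prod = (long) a * b;
        long m = (prod * qInvModR) & (r - 1);   // prod * Q^-1 mod R, in [0, R)
        if (m >= r / 2) {
            m -= r;                             // centered in [-R/2, R/2)
        }
        return (int) ((prod - m * Q) >> rBits); // prod - m*Q is divisible by R
    }

    // c * R mod Q, i.e. the constant c premultiplied by R.
    static int toMont(int c, int rBits) {
        return (int) Math.floorMod((long) c << rBits, (long) Q);
    }

    public static void main(String[] args) {
        int a = -1234;
        int c = 567;
        int z16 = montMul(a, toMont(c, 16), 16, 0xF301);
        int z20 = montMul(a, toMont(c, 20), 20, 0x8F301);
        int expected = Math.floorMod(a * c, Q);
        // With b = c * R mod Q, both choices of R yield a * c mod Q.
        System.out.println(Math.floorMod(z16, Q) == expected
                && Math.floorMod(z20, Q) == expected);
    }
}
```

The two results may differ as integers in (-q, q), but they agree modulo q, which is what "equivalent as used in this case" means here.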

Thanks a lot for the explanation. It would be good to add it as a comment in the stubGenerator_x86_64_kyber.cpp.

@sviswa7

sviswa7 commented May 16, 2025

> @sviswa7, thanks a lot for the review! If you agree with my changes to load the constants using broadcasting instructions instead of full AVX register loads, would you be so kind as to approve the PR and sponsor my integration?

The broadcast instructions look good. I only have one query on montMul above that I have been wondering about.

@sviswa7 sviswa7 left a comment

Looks good to me.

@openjdk openjdk bot added the ready (Pull request is ready to be integrated) label May 20, 2025
@openjdk openjdk bot removed the ready (Pull request is ready to be integrated) label May 20, 2025

@sviswa7 sviswa7 left a comment

Thanks for adding the comment.

@openjdk openjdk bot added the ready (Pull request is ready to be integrated) label May 20, 2025
@ferakocz
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor (Pull request is ready to be sponsored) label May 20, 2025
@openjdk

openjdk bot commented May 20, 2025

@ferakocz
Your change (at version ea2152d) is now ready to be sponsored by a Committer.

@sviswa7

sviswa7 commented May 20, 2025

/sponsor

@openjdk

openjdk bot commented May 20, 2025

Going to push as commit 972f2eb.
Since your change was applied there have been 355 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated (Pull request has been integrated) label May 20, 2025
@openjdk openjdk bot closed this May 20, 2025
@openjdk openjdk bot removed the ready (Pull request is ready to be integrated), rfr (Pull request is ready for review), and sponsor (Pull request is ready to be sponsored) labels May 20, 2025
@openjdk

openjdk bot commented May 20, 2025

@sviswa7 @ferakocz Pushed as commit 972f2eb.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@seanjmullan
Member

Please also write a release note as the performance improvement is significant. Thanks!

@lmesnik
Member

lmesnik commented May 20, 2025

I haven't found an answer to my question about testing. How is this fix tested?

@ferakocz
Contributor Author

ferakocz commented May 21, 2025

> I haven't found an answer to my question about testing. How is this fix tested?

The change in the file test/jdk/sun/security/provider/acvp/Launcher.java in PR https://github.com/openjdk/jdk/pull/23860/files covers this as well.

@lmesnik
Member

lmesnik commented May 21, 2025

Thanks for pointing to the test.

@ferakocz
Contributor Author

> Please also write a release note as the performance improvement is significant. Thanks!

Done. https://bugs.openjdk.org/browse/JDK-8357741 Release Note: ML-KEM Performance Improved

Labels

graal (graal-dev@openjdk.org), hotspot (hotspot-dev@openjdk.org), integrated (Pull request has been integrated), security (security-dev@openjdk.org)

4 participants