8351034: Add AVX-512 intrinsics for ML-DSA #23860
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -289,6 +289,7 @@ address StubGenerator::generate_dilithiumAlmostNtt_avx512() { | |
| __ movl(iterations, 2); | ||
| __ align(OptoLoopAlignment); | ||
| __ BIND(L_loop); | ||
| __ subl(iterations, 1); | ||
@@ -611,6 +612,7 @@ address StubGenerator::generate_dilithiumAlmostInverseNtt_avx512() { | |
| __ movl(iterations, 2); | ||
| __ align(OptoLoopAlignment); | ||
| __ BIND(L_loop); | ||
| __ subl(iterations, 1); | ||
@@ -1009,6 +1011,7 @@ address StubGenerator::generate_dilithiumNttMult_avx512() { | |
| __ movl(len, 4); | ||
Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with
I have found that unrolling these loops actually hurts performance (probably an I-cache effect).
Interesting; I keep on having to re-train my intuition, thanks for the data.
| __ align(OptoLoopAlignment); | ||
| __ BIND(L_loop); | ||
| for (int i = 0; i < 4; i++) { | ||
@@ -1086,6 +1089,7 @@ address StubGenerator::generate_dilithiumMontMulByConstant_avx512() { | |
| __ movl(len, 2); | ||
Same comment here as the
Done.
| __ align(OptoLoopAlignment); | ||
| __ BIND(L_loop); | ||
| for (int i = 0; i < 8; i++) { | ||
@@ -1168,6 +1172,7 @@ address StubGenerator::generate_dilithiumDecomposePoly_avx512() { | |
| __ movl(len, 1024); | ||
| __ align(OptoLoopAlignment); | ||
| __ BIND(L_loop); | ||
| __ evmovdqul(xmm0, Address(input, 0), Assembler::AVX_512bit); | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -178,6 +178,7 @@ address StubGenerator::generate_sha3_implCompress(StubGenStubId stub_id) { | |
| __ evmovdquq(xmm30, Address(permsAndRots, 832), Assembler::AVX_512bit); | ||
| __ evmovdquq(xmm31, Address(permsAndRots, 896), Assembler::AVX_512bit); | ||
| __ align(OptoLoopAlignment); | ||
| __ BIND(sha3_loop); | ||
| // there will be 24 keccak rounds | ||
@@ -232,6 +233,7 @@ address StubGenerator::generate_sha3_implCompress(StubGenStubId stub_id) { | |
| // The implementation closely follows the Java version, with the state | ||
| // array "rows" in the lowest 5 64-bit slots of zmm0 - zmm4, i.e. | ||
| // each row of the SHA3 specification is located in one zmm register. | ||
| __ align(OptoLoopAlignment); | ||
| __ BIND(rounds24_loop); | ||
| __ subl(roundsLeft, 1); | ||
@@ -357,6 +359,7 @@ address StubGenerator::generate_double_keccak() { | |
| const Register constant2use = r10; | ||
| const Register roundsLeft = r11; | ||
| __ align(OptoLoopAlignment); | ||
| Label rounds24_loop; | ||
| __ enter(); | ||
@@ -417,6 +420,7 @@ address StubGenerator::generate_double_keccak() { | |
| // load round_constants base | ||
| __ movptr(constant2use, round_consts); | ||
| __ align(OptoLoopAlignment); | ||
| __ BIND(rounds24_loop); | ||
| __ subl( roundsLeft, 1); | ||
Hi @ferakocz, kindly align the loop entry address using __ align64() here and everywhere else before __ BIND(LOOP).
Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries.
Hi @ferakocz,
Thanks! For efficient utilization of the Decoded ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries. A 64-byte-aligned address is also 16-byte and 32-byte aligned, and it matches the cache line size. However, I noticed that we already use OptoLoopAlignment in places in AES-GCM as well.
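The alignment relationship above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the class and method names are hypothetical and are not HotSpot code, and it models addresses as plain long values.

```java
// Hypothetical helper, not part of HotSpot: illustrates power-of-two alignment
// of a code address such as a loop entry.
public class AlignmentSketch {

    // Rounds addr up to the next multiple of alignment.
    // Assumes alignment is a positive power of two.
    static long alignUp(long addr, long alignment) {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0) {
            throw new IllegalArgumentException("alignment must be a power of two");
        }
        return (addr + alignment - 1) & ~(alignment - 1);
    }

    // True if addr is a multiple of alignment (a power of two).
    static boolean isAligned(long addr, long alignment) {
        return (addr & (alignment - 1)) == 0;
    }

    public static void main(String[] args) {
        long raw = 0x1234;               // arbitrary example address
        long entry = alignUp(raw, 64);   // pad up to a 64-byte boundary
        System.out.printf("0x%x -> 0x%x%n", raw, entry);
        // Any 64-byte-aligned address is also 32-byte and 16-byte aligned.
        System.out.println(isAligned(entry, 32) && isAligned(entry, 16));
    }
}
```

The reverse does not hold: an address can be 16-byte aligned without being 32-byte or 64-byte aligned, which is why the choice of alignment value matters.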
I deliberately introduced some errors into the generate_dilithiumAlmostInverseNtt_avx512 implementation, expecting the existing ML_DSA_Tests under test/jdk/sun/security/provider/acvp to catch them, but all the tests passed for me:
java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java
Can you please point out a test I should use for validation?
I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.)
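As a rough illustration of that suggestion, the shape of the change might look like the sketch below. Everything here is a hypothetical stand-in: the real run() method, its parameters, and its switch cases are not reproduced, and the mode names are assumptions.

```java
// Hypothetical stand-in for the test's run() method; the real signature,
// switch cases, and test routines in ML_DSA_Test are not reproduced here.
public class WarmupLoopSketch {
    static int keyGenCalls, sigGenCalls, sigVerCalls;

    // Placeholders for the per-mode test routines selected by the switch.
    static void keyGenTest() { keyGenCalls++; }
    static void sigGenTest() { sigGenCalls++; }
    static void sigVerTest() { sigVerCalls++; }

    // Repeats the whole switch so the exercised methods are called often
    // enough for the JIT compiler to compile them and use the intrinsics.
    static void run(String mode, int repetitions) {
        for (int i = 0; i < repetitions; i++) {
            switch (mode) {
                case "keyGen": keyGenTest(); break;
                case "sigGen": sigGenTest(); break;
                case "sigVer": sigVerTest(); break;
                default: throw new IllegalArgumentException("unknown mode: " + mode);
            }
        }
    }

    public static void main(String[] args) {
        run("sigGen", 1000);
        System.out.println("sigGen invoked " + sigGenCalls + " times");
    }
}
```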
Hi @ferakocz, yes, we should modify the test or lower the compilation threshold with -Xbatch -XX:TieredCompileThreshold=0.1.
Alternatively, since the tests have a dependency on the Automatic Cryptographic Validation Test server, I have created a simplified test which covers all the security levels.
Kindly include test/hotspot/jtreg/compiler/intrinsics/signature/TestModuleLatticeDSA.java
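For readers without the ACVP server, a self-contained round-trip check could look roughly like the sketch below. This is not the contents of TestModuleLatticeDSA.java; the class name, method names, and iteration count are hypothetical. It assumes a JDK that ships ML-DSA (JDK 24 or later) and skips parameter sets that are not available.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.NoSuchAlgorithmException;
import java.security.Signature;

// Hypothetical sketch; NOT the test proposed in the review.
public class MlDsaRoundTripSketch {
    // The three ML-DSA parameter sets, one per security level.
    static final String[] PARAMETER_SETS = {"ML-DSA-44", "ML-DSA-65", "ML-DSA-87"};

    // True if the running JDK provides the given parameter set.
    static boolean isAvailable(String parameterSet) {
        try {
            KeyPairGenerator.getInstance(parameterSet);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    // Signs and verifies `iterations` messages; returns false on the first failure.
    static boolean roundTrip(String parameterSet, int iterations) {
        try {
            KeyPair kp = KeyPairGenerator.getInstance(parameterSet).generateKeyPair();
            Signature signer = Signature.getInstance("ML-DSA");
            Signature verifier = Signature.getInstance("ML-DSA");
            for (int i = 0; i < iterations; i++) {
                byte[] msg = ("message " + i).getBytes(StandardCharsets.UTF_8);
                signer.initSign(kp.getPrivate());
                signer.update(msg);
                byte[] sig = signer.sign();

                // The genuine message must verify.
                verifier.initVerify(kp.getPublic());
                verifier.update(msg);
                if (!verifier.verify(sig)) return false;

                // A message with one flipped bit must be rejected.
                msg[0] ^= 1;
                verifier.initVerify(kp.getPublic());
                verifier.update(msg);
                if (verifier.verify(sig)) return false;
            }
            return true;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 100;
        for (String ps : PARAMETER_SETS) {
            if (!isAvailable(ps)) {
                System.out.println(ps + ": not available on this JDK, skipped");
                continue;
            }
            System.out.println(ps + ": " + (roundTrip(ps, iterations) ? "ok" : "FAILED"));
        }
    }
}
```

A sign/verify round trip may not expose every fault in an intrinsic, so comparing against known-answer vectors remains the stronger check. The iteration count also has to be high enough, or the JVM flags chosen accordingly, for the compiled code paths to actually run.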
I have added a new command to the test test/jdk/sun/security/provider/acvp/Launcher.java. The line with -Xcomp makes the intrinsics kick in on the first call, so they will be tested.