Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -289,6 +289,7 @@ address StubGenerator::generate_dilithiumAlmostNtt_avx512() {

__ movl(iterations, 2);

__ align(OptoLoopAlignment);
__ BIND(L_loop);
Copy link
Member

@jatin-bhateja jatin-bhateja Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ferakocz , Kindly align loop entry address using __align64() here and at all the places before __BIND(LOOP)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @jatin-bhateja, thanks for the suggestion. I have added __ align(OptoLoopAlignment); before all loop entries.

Copy link
Member

@jatin-bhateja jatin-bhateja Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ferakocz ,

Thanks!, for efficient utilization of Decode ICache (please refer to Intel SDM section 3.4.2.5), code blocks should be aligned to 32-byte boundaries; a 64-byte aligned code is a superset of both 16 and 32 byte aligned addresses and also matches with the cacheline size. However, I can noticed that we have been using OptoLoopAlignment at places in AES-GCM also.

I introduced some errors in generate_dilithiumAlmostInverseNtt_avx512 implementation in anticipation of catching it through existing ML_DSA_Tests under
test/jdk/sun/security/provider/acvp

But all the tests passed for me.
java -jar /home/jatinbha/sandboxes/jtreg/build/images/jtreg/lib/jtreg.jar -jdk:$JAVA_HOME -Djdk.test.lib.artifacts.ACVP-Server=/home/jatinbha/softwares/v1.1.0.38.zip -va -timeout:4 Launcher.java

Can you please point out a test I need to use for validation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the easiest is to put a for (int i = 0; i < 1000; i++) loop around the switch statement in the run() method of the ML_DSA_Test class (test/jdk/sun/security/provider/acvp/ML_DSA_Test.java). (This is because the intrinsics kick in after a few thousand calls of the method.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ferakocz , Yes, we should modify the test or lower the compilation threshold with -Xbatch -XX:TieredCompileThreshold=0.1.

Alternatively, since the tests has a depedency on Automatic Cryptographic Validation Test server I have created a simplified test which cover all the security levels.

Kindly include test/hotspot/jtreg/compiler/intrinsics/signature/TestModuleLatticeDSA.java

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new command to the test test/jdk/sun/security/provider/acvp/Launcher.java. The line with the -Xcomp will invoke the intrinsics on the first call, so they will be tested.


__ subl(iterations, 1);
Expand Down Expand Up @@ -611,6 +612,7 @@ address StubGenerator::generate_dilithiumAlmostInverseNtt_avx512() {

__ movl(iterations, 2);

__ align(OptoLoopAlignment);
__ BIND(L_loop);

__ subl(iterations, 1);
Expand Down Expand Up @@ -1009,6 +1011,7 @@ address StubGenerator::generate_dilithiumNttMult_avx512() {

__ movl(len, 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compile-time constant, why not 'unroll at compile time'? i.e. wrap this loop with for (int len=0; len<4; len++) instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found that unrolling these loops actually hurts performance (probably an I-cache effect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting; I keep on having to re-train my intuition, thanks for the data


__ align(OptoLoopAlignment);
__ BIND(L_loop);

for (int i = 0; i < 4; i++) {
Expand Down Expand Up @@ -1086,6 +1089,7 @@ address StubGenerator::generate_dilithiumMontMulByConstant_avx512() {

__ movl(len, 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as the generate_dilithiumNttMult_avx512

  • constants can be loaded directly into XMM
  • len can be removed by unrolling at compile time
  • symbolic names could be used for registers
  • comments could be added

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


__ align(OptoLoopAlignment);
__ BIND(L_loop);

for (int i = 0; i < 8; i++) {
Expand Down Expand Up @@ -1168,6 +1172,7 @@ address StubGenerator::generate_dilithiumDecomposePoly_avx512() {

__ movl(len, 1024);

__ align(OptoLoopAlignment);
__ BIND(L_loop);

__ evmovdqul(xmm0, Address(input, 0), Assembler::AVX_512bit);
Expand Down
4 changes: 4 additions & 0 deletions src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,7 @@ address StubGenerator::generate_sha3_implCompress(StubGenStubId stub_id) {
__ evmovdquq(xmm30, Address(permsAndRots, 832), Assembler::AVX_512bit);
__ evmovdquq(xmm31, Address(permsAndRots, 896), Assembler::AVX_512bit);

__ align(OptoLoopAlignment);
__ BIND(sha3_loop);

// there will be 24 keccak rounds
Expand Down Expand Up @@ -232,6 +233,7 @@ address StubGenerator::generate_sha3_implCompress(StubGenStubId stub_id) {
// The implementation closely follows the Java version, with the state
// array "rows" in the lowest 5 64-bit slots of zmm0 - zmm4, i.e.
// each row of the SHA3 specification is located in one zmm register.
__ align(OptoLoopAlignment);
__ BIND(rounds24_loop);
__ subl(roundsLeft, 1);

Expand Down Expand Up @@ -357,6 +359,7 @@ address StubGenerator::generate_double_keccak() {
const Register constant2use = r10;
const Register roundsLeft = r11;

__ align(OptoLoopAlignment);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant alignment before label should be before it's bind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

Label rounds24_loop;

__ enter();
Expand Down Expand Up @@ -417,6 +420,7 @@ address StubGenerator::generate_double_keccak() {
// load round_constants base
__ movptr(constant2use, round_consts);

__ align(OptoLoopAlignment);
__ BIND(rounds24_loop);
__ subl( roundsLeft, 1);

Expand Down