8371864: GaloisCounterMode.implGCMCrypt0 AVX512/AVX2 intrinsics stubs cause AES-GCM encryption failure for certain payload sizes #28363

Closed
jianglizhou wants to merge 16 commits into openjdk:master from jianglizhou:JDK-8371864

Conversation

Contributor

@jianglizhou jianglizhou commented Nov 17, 2025

Please review the fix in StubGenerator::aesgcm_avx512 and StubGenerator::aesgcm_avx2 to handle some edge cases with input sizes that are not a multiple of the block size.

Thanks to Thomas Holenstein and Lukas Zobernig for analyzing the issue and providing the test case!


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8371864: GaloisCounterMode.implGCMCrypt0 AVX512/AVX2 intrinsics stubs cause AES-GCM encryption failure for certain payload sizes (Bug - P3)

Reviewers

Reviewers without OpenJDK IDs

Contributors

  • Thomas Holenstein <tholenst@google.com>
  • Lukas Zobernig <zlukas@google.com>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28363/head:pull/28363
$ git checkout pull/28363

Update a local copy of the PR:
$ git checkout pull/28363
$ git pull https://git.openjdk.org/jdk.git pull/28363/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28363

View PR using the GUI difftool:
$ git pr show -t 28363

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28363.diff

Using Webrev

Link to Webrev Comment

bridgekeeper bot commented Nov 17, 2025

👋 Welcome back jiangli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@jianglizhou
Contributor Author

/contributor tholenst@google.com
/contributor zlukas@google.com

openjdk bot commented Nov 17, 2025

@jianglizhou This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8371864: GaloisCounterMode.implGCMCrypt0 AVX512/AVX2 intrinsics stubs cause AES-GCM encryption failure for certain payload sizes

Co-authored-by: Thomas Holenstein <tholenst@google.com>
Co-authored-by: Lukas Zobernig <zlukas@google.com>
Reviewed-by: shade, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 254 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk bot commented Nov 17, 2025

@jianglizhou Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk openjdk bot added the security (security-dev@openjdk.org) and hotspot-compiler (hotspot-compiler-dev@openjdk.org) labels Nov 17, 2025
openjdk bot commented Nov 17, 2025

@jianglizhou The following labels will be automatically applied to this pull request:

  • hotspot-compiler
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@jianglizhou
Contributor Author

/contributor Thomas Holenstein tholenst@google.com
/contributor Lukas Zobernig zlukas@google.com

openjdk bot commented Nov 17, 2025

@jianglizhou Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk openjdk bot added the rfr (Pull request is ready for review) label Nov 17, 2025
@jianglizhou
Contributor Author

/contributor add Thomas Holenstein tholenst@google.com
/contributor add Lukas Zobernig zlukas@google.com

openjdk bot commented Nov 17, 2025

@jianglizhou
Contributor Thomas Holenstein <tholenst@google.com> successfully added.

openjdk bot commented Nov 17, 2025

@jianglizhou
Contributor Lukas Zobernig <zlukas@google.com> successfully added.

mlbridge bot commented Nov 17, 2025


public class TestAesGcmIntrinsic {

static final SecureRandom SECURE_RANDOM = newDefaultSecureRandom();
Member

Drive-by comment: Java code should use 4x whitespace indentation.

Contributor Author

@TobiHartmann, thanks! Fixed.

Member

@shipilev shipilev left a comment

Good catch! Some stylistic comments for the product fix, and suggestions for the test.


__ bind(MESG_BELOW_32_BLKS);
__ subl(len, 16 * 16);
__ cmpl(len, 256);
Member

For stylistic consistency, this should be written as 16 * 16, to match the surrounding subl and addl.

Contributor Author

Thanks for the detailed review, @shipilev! Fixed.

public static void main(String[] args) throws Exception {
TestAesGcmIntrinsic test = new TestAesGcmIntrinsic();
long startTime = System.currentTimeMillis();
while (System.currentTimeMillis() - startTime < 60 * 1000) {
Member

I get that you want a stress test. But time-limiting puts the test in a weird position: it can run a different number of iterations depending on auxiliary load on the machine. These tests run in parallel with lots of other tests, so that is not uncommon. Do you even need to repeat the jitFunc() call multiple times? It looks like it traverses the interesting configurations in one go?

Contributor Author

I did some testing today. With the time-limited loop removed, 89 out of 200 runs fail. So I changed the test to loop three times; with that, all 200 runs fail without the fix.

for (int messageSize = SPLIT_LEN; messageSize < SPLIT_LEN + 300; messageSize++) {
byte[] message = randBytes(messageSize);
try {
byte[] ciphertext = gcmEncrypt(key, message, aad);
Member

I believe it makes sense to check that the round-trip is successful, e.g. that decrypt(encrypt(message)) == message. Currently we implicitly rely on exceptions being thrown from the incorrectly executing code, which is IMO too weak: in boundary conditions like these, there might be bugs that do not manifest as visible exceptions and instead leave the encryption subtly broken.

Contributor Author

That's a good idea. I added the decrypt part and the check as suggested.

Contributor Author

With the changes, there were more common parts in the test. I moved common code into helper methods.
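
The round-trip check discussed in this thread can be sketched as below. This is an illustrative stand-alone sketch, not the TestGCMSplitBound code from the PR: the class name GcmRoundTripSketch, the roundTrips helper, the 16-byte key, the 20-byte AAD, and the 12-byte IV and 16-byte tag sizes are all assumptions. The helper names gcmEncrypt, gcmDecrypt, getCipher and randBytes echo the excerpts quoted in the review, but here getCipher takes the cipher mode as an explicit argument.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch only; names and sizes are assumptions, not the PR's test.
final class GcmRoundTripSketch {
    static final int IV_SIZE_IN_BYTES = 12;
    static final int TAG_SIZE_IN_BYTES = 16;
    static final SecureRandom SECURE_RANDOM = new SecureRandom();

    static byte[] randBytes(int size) {
        byte[] bytes = new byte[size];
        SECURE_RANDOM.nextBytes(bytes);
        return bytes;
    }

    // The mode is an explicit argument so decryption cannot silently reuse ENCRYPT_MODE.
    static Cipher getCipher(int mode, byte[] key, byte[] aad, byte[] nonce)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(mode, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(8 * TAG_SIZE_IN_BYTES, nonce));
        cipher.updateAAD(aad);
        return cipher;
    }

    // Output layout: nonce || ciphertext || tag.
    static byte[] gcmEncrypt(byte[] key, byte[] message, byte[] aad)
            throws GeneralSecurityException {
        byte[] nonce = randBytes(IV_SIZE_IN_BYTES);
        byte[] body = getCipher(Cipher.ENCRYPT_MODE, key, aad, nonce).doFinal(message);
        byte[] out = Arrays.copyOf(nonce, nonce.length + body.length);
        System.arraycopy(body, 0, out, nonce.length, body.length);
        return out;
    }

    static byte[] gcmDecrypt(byte[] key, byte[] ciphertext, byte[] aad)
            throws GeneralSecurityException {
        byte[] nonce = Arrays.copyOf(ciphertext, IV_SIZE_IN_BYTES);
        return getCipher(Cipher.DECRYPT_MODE, key, aad, nonce)
                .doFinal(ciphertext, IV_SIZE_IN_BYTES, ciphertext.length - IV_SIZE_IN_BYTES);
    }

    // True when decrypt(encrypt(message)) reproduces the original message.
    static boolean roundTrips(int messageSize) {
        try {
            byte[] key = randBytes(16);
            byte[] aad = randBytes(20);
            byte[] message = randBytes(messageSize);
            byte[] decrypted = gcmDecrypt(key, gcmEncrypt(key, message, aad), aad);
            return Arrays.equals(message, decrypted);
        } catch (GeneralSecurityException e) {
            throw new RuntimeException("Failed for messageSize " + messageSize, e);
        }
    }

    public static void main(String[] args) {
        for (int size : new int[] {1, 255, 256, 257, 4096}) {
            if (!roundTrips(size)) {
                throw new RuntimeException("Round trip mismatch for messageSize " + size);
            }
        }
        System.out.println("round trips OK");
    }
}
```

Comparing the decrypted bytes with the original message catches silent corruption as well as thrown exceptions, which is the stronger property the reviewer asked for.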

import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class TestAesGcmIntrinsic {
Member

This sounds like TestGCMSplitBound or some such; it is not a generic test for AES/GCM intrinsic.

Contributor Author

I had renamed it to TestAesGcmIntrinsic when converting the original test into a jtreg test. TestGCMSplitBound sounds good to me. Changed.

throw new RuntimeException("ciphertext is null");
}
}
for (int messageSize = SPLIT_LEN; messageSize < SPLIT_LEN + 300; messageSize++) {
Member

[SPLIT_LEN - 300; SPLIT_LEN + 300] for completeness, perhaps?

Contributor Author

Done.


public class TestAesGcmIntrinsic {

static final SecureRandom SECURE_RANDOM = newDefaultSecureRandom();
Member

Do you really need a SecureRandom here? Random RANDOM = Utils.getRandomInstance(); gets you the pre-seeded random instance, which can be used to repeatably reproduce failures.

Contributor Author

I kept the SecureRandom unchanged. I think that stays closer to the original reproducer.
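
The reproducibility argument in this thread can be illustrated with plain java.util.Random. The sketch deliberately does not use the JDK test library's Utils.getRandomInstance(), so that it stays self-contained; the class name, the seed value, and the randBytes signature are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch: a fixed seed makes the generated test inputs repeatable.
final class SeededInputsSketch {
    static byte[] randBytes(Random random, int size) {
        byte[] bytes = new byte[size];
        random.nextBytes(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        long seed = 42L; // hypothetical seed; a real test would log the seed it used
        byte[] first = randBytes(new Random(seed), 32);
        byte[] second = randBytes(new Random(seed), 32);
        // Same seed, same bytes: a failing input can be regenerated from the seed alone.
        System.out.println(Arrays.equals(first, second));
    }
}
```

A SecureRandom, by contrast, is normally seeded from system entropy, so a failing input cannot be regenerated unless the test records the input itself.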

- Rename test to TestGCMSplitBound.java
- Change test range to [SPLIT_LEN - 300; SPLIT_LEN + 300].
- Replace time-bound loop with an iteration of three runs.
- Add decrypt part and check to make sure the decrypted message is the same as the original.
@openjdk openjdk bot removed the rfr (Pull request is ready for review) label Nov 20, 2025
@openjdk openjdk bot added the rfr (Pull request is ready for review) label Nov 20, 2025
Comment on lines 86 to 89
byte[] nonce = randBytes(IV_SIZE_IN_BYTES);
System.arraycopy(ciphertext, 0, nonce, 0, IV_SIZE_IN_BYTES);
Cipher cipher = getCipher(key, aad, nonce);
return cipher.doFinal(ciphertext, IV_SIZE_IN_BYTES, ciphertext.length - IV_SIZE_IN_BYTES);
Member

Indenting is still 2-space here.

Contributor Author

Indeed. Fixed, thanks.

AlgorithmParameterSpec params =
new GCMParameterSpec(8 * TAG_SIZE_IN_BYTES, nonce, 0, nonce.length);
Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
cipher.init(Cipher.ENCRYPT_MODE, keySpec, params);
Member

Er. This is used from gcmDecrypt? How does it work without Cipher.DECRYPT_MODE?

Contributor Author

Good catch. Interestingly the test passed for me on my local machine. Fixed to use Cipher.DECRYPT_MODE when doing gcmDecrypt.

Also, an interesting new finding: with the decrypted message verification in place, I see 2 failures out of 200 runs with AVX512. I'm filing a new issue specifically for that, so it can be investigated.

/*
* @test
* @bug 8371864
Contributor

Does it make sense to just run the unit test on architectures with @requires vm.cpu.features ~= ".*avx512f.*" | vm.cpu.features ~= ".*avx2.*" annotation?

Contributor Author

Thanks for reviewing and testing!

Does it make sense to just run the unit test on architectures with @requires vm.cpu.features ~= ".*avx512f.*" | vm.cpu.features ~= ".*avx2.*" annotation?

Limiting test execution to the relevant machines is a good idea. We can also check for os.simpleArch == "x64". We could probably check for ".*avx512.*" instead of ".*avx512f.*", just to make sure we still get proper test coverage in case there are any future or hidden bugs in populating the CPU feature flags.

I just did a quick test:
On my local machine, these related CPU feature flags are set: avx, avx2.

On a machine enabled with the aesgcm_avx512 intrinsic, these are the related cpu feature flags:
avx, avx2, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, avx512_vbmi2, avx512_vbmi, avx512_bitalg, avx512_ifma

Contributor Author

Added @requires.
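
To illustrate why the broader pattern still selects the right machines, the sketch below matches the feature lists reported in this thread against the candidate patterns. It assumes that the jtreg ~= operator behaves like a full-string regular-expression match (which would explain the leading and trailing .* in the patterns); the class and method names are hypothetical, and the AVX-512 feature list is abbreviated from the one quoted above. It does not show the @requires line that was actually committed.

```java
import java.util.regex.Pattern;

// Illustrative sketch; assumes "~=" acts like a full-string regex match.
final class CpuFeatureMatchSketch {
    static final String AVX2_ONLY = "avx, avx2";
    static final String AVX512_MACHINE =
            "avx, avx2, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, avx512_vaes";

    static boolean matches(String features, String regex) {
        return Pattern.matches(regex, features);
    }

    // Mirrors the proposed condition: any avx512 flag, or avx2, is present.
    static boolean wouldRun(String features) {
        return matches(features, ".*avx512.*") || matches(features, ".*avx2.*");
    }

    public static void main(String[] args) {
        System.out.println(wouldRun(AVX2_ONLY));
        System.out.println(wouldRun(AVX512_MACHINE));
        System.out.println(wouldRun("sse4.2, avx"));
    }
}
```

Under that assumption, an AVX2-only machine is selected through the avx2 pattern alone, so the AVX2 stub path stays covered even where no avx512 flag is reported.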

/*
* @test
* @bug 8371864
* @run main/othervm/timeout=600 TestGCMSplitBound
Contributor

60 was sufficient for my test runs.

Contributor Author

It's probably better to give a larger timeout to prevent false failures when testing on slower machines.

- Add @requires
- Shorten long lines
@openjdk openjdk bot removed the ready (Pull request is ready to be integrated) label Nov 26, 2025
//process 8 16 byte blocks in initial_num_blocks.
//process 8 16 byte blocks at a time until all are done 'encrypt_by_8_new followed by ghash_last_8'
__ xorl(pos, pos);
__ cmpl(len, 128);
Contributor

Was this part of the original problem? I was trying to trace where this is called with < 128 bytes and couldn't find the path.

Contributor Author

As I documented in the JDK-8371864 description, there was also a bug in the AVX2 version of the intrinsic, StubGenerator::aesgcm_avx2. Hence the bug title mentions both the AVX512 and AVX2 intrinsic stubs.

The failure can be reproduced by running TestGCMSplitBound.java on an x64 machine that supports AVX2 but not AVX512. See StubGenerator::generate_aes_stubs() for how it decides which version of the stub is used.

On my local machine with AVX2 support, TestGCMSplitBound.java fails without the fix:

test result: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Failed for messageSize 100001

encryptAndDecrypt(key, aad, message, messageSize);
} catch (Exception e) {
throw new RuntimeException("Failed for messageSize " +
Integer.toHexString(messageSize), e);
Contributor

nit: + operator should be first and line indented >= 8 white-spaces.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these nits something a tool should check and in the best case also fix automatically?

Contributor Author

nit: + operator should be first and line indented >= 8 white-spaces.

Aren't these nits something a tool should check and in the best case also fix automatically?

Changed to break before the + operator.

AFAIK, existing JDK code mixes both styles for breaking long lines: the operator at the start of the new line and the operator at the end of the previous line. +1 on the suggestion to do auto-detection and auto-fix if we want to enforce the style more strictly.

Contributor

@smemery smemery left a comment

@jianglizhou thank you for the AVX2 related output from the unit test pre-fix. From this I was able to trace the point of failure and see that your proposed changes are good for approval. Thank you for your work on these issues!

Member

@shipilev shipilev left a comment

Oh man, the pos/len modifications in the current code are confusing. I scratched my head for quite a while trying to comprehend why __ bind(MESG_BELOW_32_BLKS) splits the pos += 16 and len -= 16. On the surface, that just looks like a bug.

But it looks that way because we do initial_blocks_16_avx512 twice, do pos += 16 twice, but only do the len -= 32 after the second call. That does not help if we shortcut after the first call. In fact, I am not at all sure that checking len < 32 without modifying len beforehand does not break the assumptions downstream:

  initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx,  ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
  __ addl(pos, 16 * 16);
  __ cmpl(len, 32 * 16);
  __ jcc(Assembler::below, MESG_BELOW_32_BLKS);

Really, in these kinds of intrinsics, you want to make sure pos and len updates are tightly bound together; otherwise these kinds of mistakes will keep happening. You will lose a bit of code density, but you would have more readable and robust code.

Shouldn't it be like this?

diff --git a/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp b/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
index 1e728ffa279..a16e25b075d 100644
--- a/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
+++ b/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
@@ -3475,12 +3475,14 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
 
   initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx,  ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
   __ addl(pos, 16 * 16);
+  __ subl(len, 16 * 16);
+
   __ cmpl(len, 32 * 16);
   __ jcc(Assembler::below, MESG_BELOW_32_BLKS);
 
   initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset + 16);
   __ addl(pos, 16 * 16);
-  __ subl(len, 32 * 16);
+  __ subl(len, 16 * 16);
 
   __ cmpl(len, 32 * 16);
   __ jcc(Assembler::below, NO_BIG_BLKS);
@@ -3491,24 +3493,27 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
   ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
                                     true, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32);
   __ addl(pos, 16 * 16);
+  __ subl(len, 16 * 16);
 
   ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
                                     true, false, true, false, true, ghashin_offset + 16, aesout_offset + 16, HashKey_16);
   __ evmovdquq(AAD_HASHx, ZTMP4, Assembler::AVX_512bit);
   __ addl(pos, 16 * 16);
-  __ subl(len, 32 * 16);
+  __ subl(len, 16 * 16);
   __ jmp(ENCRYPT_BIG_BLKS_NO_HXOR);
 
   __ bind(ENCRYPT_BIG_NBLKS);
   ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
                                     false, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32);
   __ addl(pos, 16 * 16);
+  __ subl(len, 16 * 16);
+
   ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
                                     false, false, true, true, true, ghashin_offset + 16, aesout_offset + 16, HashKey_16);
 
   __ movdqu(AAD_HASHx, ZTMP4);
   __ addl(pos, 16 * 16);
-  __ subl(len, 32 * 16);
+  __ subl(len, 16 * 16);
 
   __ bind(NO_BIG_BLKS);
   __ cmpl(len, 16 * 16);
@@ -3525,9 +3530,9 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
 
   ghash16_avx512(false, true, false, false, true, in, pos, avx512_subkeyHtbl, AAD_HASHx, SHUF_MASK, stack_offset, 16 * 16, 0, HashKey_16);
   __ addl(pos, 16 * 16);
+  __ subl(len, 16 * 16);
 
   __ bind(MESG_BELOW_32_BLKS);
-  __ subl(len, 16 * 16);
   gcm_enc_dec_last_avx512(len, in, pos, AAD_HASHx, SHUF_MASK, avx512_subkeyHtbl, ghashin_offset, HashKey_16, true, true);
 
   __ bind(GHASH_DONE);

@jianglizhou
Contributor Author

Oh man, the pos/len modifications in the current code are confusing. I scratched my head for quite a while trying to comprehend why __ bind(MESG_BELOW_32_BLKS) splits the pos += 16 and len -= 16. On the surface, that just looks like a bug.

The combination of handling the fall-through from ENCRYPT_16_BLKS and the conditional entry to MESG_BELOW_32_BLKS is subtle.

I had also missed the fall-through case in my initial proposed fix (with cmp/jcc) until @sviswa7 pointed it out and suggested the current fix. With __ addl(pos, 16 * 16) moved before __ bind(MESG_BELOW_32_BLKS), the fix for StubGenerator::aesgcm_avx512 now works correctly for both the fall-through and the conditional-jump cases.

But it looks that way because we do initial_blocks_16_avx512 twice, do pos += 16 twice, but only do the len -= 32 after the second call. That does not help if we shortcut after the first call. In fact, I am not at all sure that checking len < 32 without modifying len beforehand does not break the assumptions downstream:

  initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx,  ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
  __ addl(pos, 16 * 16);
  __ cmpl(len, 32 * 16);
  __ jcc(Assembler::below, MESG_BELOW_32_BLKS);

Really, in these kinds of intrinsics, you want to make sure pos and len updates are tightly bound together; otherwise these kinds of mistakes will keep happening. You will lose a bit of code density, but you would have more readable and robust code.

Improving readability is a good idea; hand-rolled assembly, however, is mostly motivated by performance. While a sub with an immediate operand is fast and takes one CPU cycle, I agree with the original author of aesgcm_avx512 on combining two sub instructions into one whenever possible.
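
The bookkeeping concern raised in this exchange can be shown with a toy model. The sketch below is plain Java arithmetic, not the stub's control flow: the class and method names are hypothetical, and it only tracks whether pos + len still equals the total input size at the point where an early exit would be taken after the first 16-block chunk.

```java
// Toy model of the pos/len bookkeeping; it is not the real stub logic.
final class PosLenModelSketch {
    static final int CHUNK = 16 * 16; // bytes consumed by one 16-block step

    // Paired style: pos and len are adjusted together after each chunk.
    static int pairedSumAtEarlyExit(int total) {
        int pos = 0;
        int len = total;
        pos += CHUNK;
        len -= CHUNK;
        return pos + len; // invariant pos + len == total holds at the exit
    }

    // Combined style: len is adjusted once, by two chunks, after the second step,
    // so an exit taken after the first step sees a len that is one chunk too large.
    static int combinedSumAtEarlyExit(int total) {
        int pos = 0;
        int len = total;
        pos += CHUNK;
        // The early exit happens here, before "len -= 2 * CHUNK" would run.
        return pos + len;
    }

    public static void main(String[] args) {
        int total = 1000; // hypothetical payload size in bytes
        System.out.println(pairedSumAtEarlyExit(total) - total);
        System.out.println(combinedSumAtEarlyExit(total) - total);
    }
}
```

In the combined style, the exit target has to make up for the pending adjustment itself, which appears to be the role of the subl(len, 16 * 16) placed after __ bind(MESG_BELOW_32_BLKS) in the code under review. That saves an instruction on the hot path at the cost of an invariant that only holds at certain labels.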

@jianglizhou
Contributor Author

@jianglizhou thank you for the AVX2 related output from the unit test pre-fix. From this I was able to trace the point of failure and see that your proposed changes are good for approval. Thank you for your work on these issues!

@smemery Thanks for carefully testing the changes!

@jianglizhou
Contributor Author

/integrate

openjdk bot commented Nov 30, 2025

@jianglizhou This pull request has not yet been marked as ready for integration.

Contributor

smemery commented Dec 1, 2025

@sviswa7 or @shipilev, if the updated changes look good to you then could you please reapprove/approve the PR as I don't have Reviewer privileges at this point.

@openjdk openjdk bot added the ready (Pull request is ready to be integrated) label Dec 1, 2025
@jianglizhou
Contributor Author

/integrate

openjdk bot commented Dec 1, 2025

Going to push as commit 6cb1c8f.
Since your change was applied there have been 255 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated (Pull request has been integrated) label Dec 1, 2025
@openjdk openjdk bot closed this Dec 1, 2025
@openjdk openjdk bot removed the ready (Pull request is ready to be integrated) and rfr (Pull request is ready for review) labels Dec 1, 2025
openjdk bot commented Dec 1, 2025

@jianglizhou Pushed as commit 6cb1c8f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
