-
Notifications
You must be signed in to change notification settings - Fork 6.2k
JDK-8216437 : PPC64: Add intrinsic for GHASH algorithm #20235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JDK-8216437 : PPC64: Add intrinsic for GHASH algorithm #20235
Conversation
|
👋 Welcome back sroy! A progress list of the required criteria for merging this PR into |
|
@suchismith1993 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 128 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@TheRealMDoerr, @offamitkumar) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
|
@suchismith1993 The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
|
@suchismith1993 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
|
@suchismith1993 This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
|
/open |
|
@suchismith1993 This pull request is now open |
59f5c2a to
913be49
Compare
Webrevs
|
| if (UseGHASHIntrinsics) { | ||
| warning("GHASH intrinsics are not available on this CPU"); | ||
| FLAG_SET_DEFAULT(UseGHASHIntrinsics, false); | ||
| if (FLAG_IS_DEFAULT(UseGHASHIntrinsics)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a passing comment: I guess there should be a check about whether underlying architecture supports vector instruction or not. If it does then only enable intrinsic.
TheRealMDoerr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing it! Reviewing the algorithm will take more time. I already have some comments and suggestions.
| return start; | ||
| } | ||
|
|
||
| // Generate stub for ghash process blocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple double-whitespaces in the new comments. Please clean them up!
| VectorRegister vMask = VR24; | ||
| VectorRegister vS = VR25; | ||
| VectorSRegister vXS = VSR33; | ||
| Label L_end, L_aligned; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to declare VectorRegisters only and using ->to_vsr() below. This should improve readability.
Non-volatile VectorRegisters need to be preserved. See
jdk/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
Line 3866 in bcb1bda
| // Save non-volatile vector registers (frameless). |
| VectorSRegister vXS = VSR33; | ||
| Label L_end, L_aligned; | ||
|
|
||
| static const unsigned char perm_pattern[16] __attribute__((aligned(16))) = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern can be produced by lvsl. Loading it from memory is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had tried something like
__ lvsl(loadOrder, 0);
This generated a pattern as below
{0xf, 0xe, 0xd, 0xc, 0xb, 0xa, 0x9, 0x8, 0x7, 0x6, 0x5, 0x4, 0x3,
0x2, 0x1, 0x0}}
This causes the the data to be loaded into vector in wrong order.
The desired pattern is {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
Since the data is stored in bytes and we don't have lxvb16x in power8, the pattern has to be enforced.
Is there a better way to do this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this code has been changed already, but I would like to point out that you should use alignas(alignment) for alignment purposes and not __attribute__((aligned(alignment))) like the HotSpot Style Guide recommends for any future changes
| // Arguments for generated stub: | ||
| // state: R3_ARG1 | ||
| // subkeyH: R4_ARG2 | ||
| // data: R5_ARG3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argument "blocks" missing.
| __ vsldoi(vHigherH, vSwappedH, vZero, 8); // H.H | ||
| __ vxor(vTmp1, vTmp1, vTmp1); | ||
| __ vxor(vZero, vZero, vZero); | ||
| __ mtctr(blocks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can blocks be 0?
blocks is an int. The higher half of the register may possibly contain garbage and should be cleared. (Can be combined with 0 check if needed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(
| while (blocks > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C2 compiler replaces this Java code by the intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len is passed to the stub without null check:
jdk/src/hotspot/share/opto/library_call.cpp
Line 7599 in f9b1133
| state_start, subkeyH_start, data_start, len); |
But I can see null checks in GHASH.java for all callers of
processBlocks. So, your assertion should be fine.
| __ addi(data, data, 16); | ||
| __ bdnz(loop); | ||
| __ stxvd2x(vZero->to_vsr(), state); | ||
| __ blr(); // Return from function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some empty lines would improve readability.
| __ vxor(vConstC2, vConstC2, vConstC2); | ||
| __ mtvrd(vConstC2, temp1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that vxor instruction really necessary? mtvrd will do overwrite any way. So why do we want to be sure that there is 0 in vConstC2 ?
| __ vsldoi(vSwappedH, vTmp2, vTmp2, 8); // swap Lower and Higher Halves of subkey H | ||
| __ vsldoi(vLowerH, vZero, vSwappedH, 8); // H.L | ||
| __ vsldoi(vHigherH, vSwappedH, vZero, 8); // H.H | ||
| __ vxor(vTmp1, vTmp1, vTmp1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this vTmp1 is being used in the loop and vpmsumd (line 741) should overwrite whatever it is containing. So was this xor necessary ?
| __ vxor(vTmp1, vTmp1, vTmp1); | ||
| __ vxor(vZero, vZero, vZero); | ||
| __ mtctr(blocks); | ||
| __ li(temp1, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this load redundant as well. We are loading 0 again at line 722 in the loop.
| __ vsldoi(vLowerH, vZero, vSwappedH, 8); // H.L | ||
| __ vsldoi(vHigherH, vSwappedH, vZero, 8); // H.H | ||
| __ vxor(vTmp1, vTmp1, vTmp1); | ||
| __ vxor(vZero, vZero, vZero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are doing same xor operation on vZero at 721 in the loop, before that It is not being used. So can we get rid of this xor-instruction as well ?
|
@suchismith1993 this pull request can not be integrated into git checkout ghash_processblocks
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
|
The commenting here is poor. GHASH uses little-endian for the byte order, but big-endian for the bit order. For example, the polynomial 1 is represented as the 16-byte string 80 00 00 00 | 12 bytes of 00. So, we must either reverse the bytes in each word and do everything big-endian or reverse the bits in each byte and do it little-endian. Which do you do? Sure, I could figure it out by reading the code, but please say. |
Hi Andrew I would like to understand if I have fully understood your comment. Currently the load instruction takes care of the endianness ,for subkey and state. For loading the data, we enforce the endianness and reorder the bytes order using vec_perm. I am assuming the inputs for GHASH follows the endianness as per the algorithm, as you have mentioned. I have made sure they are in the appropriate intended representation for both LE and BE platforms(using vec_perm and appropriate load instructions) In the algorithm that I have used , 0xC2 is the polynomial for reduction. It is shifted by 56 bits to make It the most significant byte. I think this is little endian byte order ? I did not do any extra swapping for the subkey ,state vector and input. Is this what you are looking for ? |
Right, so in this implementation the low-order bits of the field polynomial (i.e. p = z^7+z^2+z+1) are represented as 0xC2, or 11000010. But you will note that there is a bit missing here. the low-order bits of the field polynomial should have four bits set. And in GHASH.java in the JDK, 0xe100000000000000 is used, which is a bit more obvious. I think you're using the trick described in Intel's Optimized Galois-Counter-Mode Implementation on Intel® Architecture Processors, which represents the polynomial in a shifted form as, in effect, The main problem is, though, that there is little commentary in the code which explains how things are encoded. If you're using a bit-reversed and shifted representation of a polynomial, you have to say that. If youre using the algorithm described in the Intel paper, you have to say that too. Have pity on the reader. |
|
Please run AESGCMByteBuffer.encrypt and provide some before and after figures. |
|
/integrate |
|
@suchismith1993 This pull request has not yet been marked as ready for integration. |
Hi @TheRealMDoerr Was this suite run from your end ? Was the TestAESMain that you had checked the run times on ? @theRealAph From my end, we had improvement of around 3 times after running TestAESMain. Is that not valid test suite ? If the improvement with this version is satisfactory , can we have this integrated and then pursue further improvements on it in separate PR ? Will open a JBS issue for the same |
I've run tier1-4 on linux ppc64le and AIX for stability testing and only used TestAESMain to check the performance. I agree with Andrew that some performance numbers should be published. I think you can also report the performance results of TestAESMain, here. |
|
Runtime without my changes ~/jdkHead/jdk/build/linux-ppc64le-server-fastdebug/jdk/bin/java -Xbatch -DcheckOutput=true -Dmode=GCM -DencInputOffset=1 -DencOutputOffset=1 -XX:DisableIntrinsic=_ghash_processBlocks -XX:+UnlockDiagnosticVMOptions -Xbootclasspath/a:. -cp . compiler.codegen.aes.TestAESMain 100000 100000 The output is as belows 100000 iterations algorithm=AES, mode=GCM, paddingStr=NoPadding, msgSize=646, keySize=128, noReinit=false, checkOutput=true, encInputOffset=1, encOutputOffset=1, decOutputOffset=0, lastChunkSize=32 algorithm=AES, mode=GCM, paddingStr=NoPadding, msgSize=646, keySize=128, noReinit=false, checkOutput=true, encInputOffset=1, encOutputOffset=1, decOutputOffset=0, lastChunkSize=32 |
|
Runtime with changes ~/jdkHead/jdk/build/linux-ppc64le-server-fastdebug/jdk/bin/java -Xbatch -DcheckOutput=true -Dmode=GCM -DencInputOffset=1 -DencOutputOffset=1 -XX:+UnlockDiagnosticVMOptions -Xbootclasspath/a:. -cp . compiler.codegen.aes.TestAESMain 100000 100000 The output is a below algorithm=AES, mode=GCM, paddingStr=NoPadding, msgSize=646, keySize=128, noReinit=false, checkOutput=true, encInputOffset=1, encOutputOffset=1, decOutputOffset=0, lastChunkSize=32 algorithm=AES, mode=GCM, paddingStr=NoPadding, msgSize=646, keySize=128, noReinit=false, checkOutput=true, encInputOffset=1, encOutputOffset=1, decOutputOffset=0, lastChunkSize=32 |
You should run the JMH test, like so: |
|
Without GHASH change B
|
|
With my changes
|
|
Hi @theRealAph Can you help understand this result ? since op/s is increasing for ghash code , does it suggest a speedup ? |
Yes, an increase in ops/s is what we want. |
|
Thank you everyone. |
|
/integrate |
|
@suchismith1993 |
|
/sponsor |
|
Going to push as commit cdad6d7.
Your commit was automatically rebased without conflicts. |
|
@TheRealMDoerr @suchismith1993 Pushed as commit cdad6d7. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
|
/backport jdk21u-dev |
|
/backport jdk24u-dev |
|
@suchismith1993 the backport was successfully created on the branch backport-suchismith1993-cdad6d78-master in my personal fork of openjdk/jdk21u-dev. To create a pull request with this backport targeting openjdk/jdk21u-dev:master, just click the following link: The title of the pull request is automatically filled in correctly and below you find a suggestion for the pull request body:
If you need to update the source branch of the pull then run the following commands in a local clone of your personal fork of openjdk/jdk21u-dev: |
|
@suchismith1993 The target repository |
JBS Issue : JDK-8216437
Currently acceleration code for GHASH is missing for PPC64.
The current implementation utlilises SIMD instructions on Power and uses Karatsuba multiplication for obtaining the final result.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20235/head:pull/20235$ git checkout pull/20235Update a local copy of the PR:
$ git checkout pull/20235$ git pull https://git.openjdk.org/jdk.git pull/20235/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20235View PR using the GUI difftool:
$ git pr show -t 20235Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20235.diff
Using Webrev
Link to Webrev Comment