-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions #10582
Conversation
Hi @vpaprotsk, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user vpaprotsk" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing |
/covered |
I am part of Intel Java Team |
@vpaprotsk The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated! |
/label hotspot-compiler |
@vpaprotsk |
Webrevs
|
/** | ||
* Partition the authentication key into the R and S components, clamp | ||
* the R value, and instantiate IntegerModuloP objects to R and S's | ||
* numeric values. | ||
*/ | ||
private void setRSVals() { | ||
private void setRSVals() { //throws InvalidKeyException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The R and S check for invalid key (all bytes zero) could be submitted as a separate PR.
It is not related to the Poly1305 acceleration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, added a flag
@IntrinsicCandidate | ||
private static void processMultipleBlocks(byte[] input, int offset, int length, byte[] aBytes, byte[] rBytes) { | ||
MutableIntegerModuloP A = ipl1305.getElement(aBytes).mutable(); | ||
MutableIntegerModuloP R = ipl1305.getElement(rBytes).mutable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R doesn't need to be mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public class Poly1305IntrinsicFuzzTest { | ||
public static void main(String[] args) throws Exception { | ||
//Note: it might be useful to increase this number during development of new Poly1305 intrinsics | ||
final int repeat = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we increase this repeat count for the c2 compiler to kick in for compiling engineUpdate() and have the call to stub in place from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did it with @run main/othervm -Xcomp -XX:-TieredCompilation com.sun.crypto.provider.Cipher.ChaCha20.Poly1305UnitTestDriver
for (TestData test : testList) { | ||
System.out.println("*** Test " + ++testNumber + ": " + | ||
test.testName); | ||
if (runSingleTest(test)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runSingleTest may need to be called enough number of times for the engineUpdate to be compiled by c2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a second copy with @run main/othervm -Xcomp -XX:-TieredCompilation com.sun.crypto.provider.Cipher.ChaCha20.Poly1305UnitTestDriver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial assembler level comments.
@@ -1955,6 +1955,90 @@ address StubGenerator::generate_base64_encodeBlock() | |||
return start; | |||
} | |||
|
|||
address StubGenerator::generate_poly1305_masksCP() { | |||
StubCodeMark mark(this, "StubRoutines", "generate_poly1305_masksCP"); | |||
address start = __ pc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may use align64 here, like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
void Assembler::evpunpckhqdq(XMMRegister dst, KRegister mask, XMMRegister src1, XMMRegister src2, bool merge, int vector_len) { | ||
assert(UseAVX > 2, "requires AVX512F"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace flag with feature EVEX check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -7747,6 +7827,16 @@ void Assembler::vpandq(XMMRegister dst, XMMRegister nds, XMMRegister src, int ve | |||
emit_int16((unsigned char)0xDB, (0xC0 | encode)); | |||
} | |||
|
|||
void Assembler::vpandq(XMMRegister dst, XMMRegister nds, Address src, int vector_len) { | |||
assert(VM_Version::supports_evex(), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertion should check existence of AVX512VL for non 512 but vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -7864,6 +7954,15 @@ void Assembler::vporq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vec | |||
emit_int16((unsigned char)0xEB, (0xC0 | encode)); | |||
} | |||
|
|||
void Assembler::vporq(XMMRegister dst, XMMRegister nds, Address src, int vector_len) { | |||
assert(VM_Version::supports_evex(), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
I executed some quick testing and this fails with:
|
Hi @TobiHartmann , thanks for looking. Could you share CPU Model and flags from |
Test: jdk/incubator/vector/VectorMaxConversionTests.java#id1 I think the problem is this subtest runs with Call stack:
|
(Apologies, ignore the |
6a60c12
to
f048f93
Compare
@iwanowww Answered your review comments, please take a look again? Thanks again! |
__ vpxorq(xmm0, xmm0, xmm0, Assembler::AVX_512bit); | ||
__ vpxorq(xmm1, xmm1, xmm1, Assembler::AVX_512bit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use T0, T1 in place of xmm0, xmm1 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or simply switch to vzeroall
for xmm0
- xmm15
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah.. I remember thinking about doing that.. vzeroall
isnt encoded yet and I figured since I already have to do the xmm16-29, might as well do them all.. should I add that instruction too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please. And for the upper half of register file, just code it as a loop over register range:
for (int rxmm_num = 16; rxmm_num < 30; rxmm_num++) {
XMMRegister rxmm = as_XMMRegister(rxmm_num);
__ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit);
}
or even
// Zeroes zmm16-zmm31.
for (XMMRegister rxmm = xmm16; rxmm->is_valid(); rxmm = rxmm->successor()) {
__ vpxorq(rxmm, rxmm, rxmm, Assembler::AVX_512bit);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.. ("loop" erm.. wow.. "duh, this isn't assembler!") Thanks!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
(Note: disassembler proof for vzeroall encoding
0x7fffed0022f8: vzeroall
0x7fffed0022fb: vpxorq zmm16,zmm16,zmm16
0x7fffed002301: vpxorq zmm17,zmm17,zmm17
0x7fffed002307: vpxorq zmm18,zmm18,zmm18
0x7fffed00230d: vpxorq zmm19,zmm19,zmm19
0x7fffed002313: vpxorq zmm20,zmm20,zmm20
0x7fffed002319: vpxorq zmm21,zmm21,zmm21
0x7fffed00231f: vpxorq zmm22,zmm22,zmm22
0x7fffed002325: vpxorq zmm23,zmm23,zmm23
0x7fffed00232b: vpxorq zmm24,zmm24,zmm24
0x7fffed002331: vpxorq zmm25,zmm25,zmm25
0x7fffed002337: vpxorq zmm26,zmm26,zmm26
0x7fffed00233d: vpxorq zmm27,zmm27,zmm27
0x7fffed002343: vpxorq zmm28,zmm28,zmm28
0x7fffed002349: vpxorq zmm29,zmm29,zmm29
0x7fffed00234f: vpxorq zmm30,zmm30,zmm30
0x7fffed002355: vpxorq zmm31,zmm31,zmm31
0x7fffed00235b: cmp ebx,0x10
0x7fffed00235e: jl 0x7fffed0023e6
)
__ vpsllq(R2P, R2P, 2, Assembler::AVX_512bit); | ||
|
||
// Store R^8-R for later use | ||
__ evmovdquq(Address(rsp, 64*0), B0, Assembler::AVX_512bit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these vector spills be eliminated? I counted 8 spare zmm registers available across the vector loop (xmm7-xmm12, xmm30, xmm31).
And here's what is explicitly used in process256Loop
:
D0 D1 = xmm2-xmm3
B0 B1 B2 B3 B4 B5 = xmm19-xmm24
TMP = xmm6
A0 A1 A2 A3 A4 A5 = xmm13-xmm18
R0 R1 R2 R1P R2P = xmm25-xmm29
T0 T1 T2 T3 T4 T5 = xmm0-xmm5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting!! Let me try that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
PS: This find really was great!
PPS: I also reordered the map alphabetically and counted in-order... it was just really bugging me!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iwanowww Another round ready your way :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good. Just one minor cleanup suggestion.
I've submitted the latest patch for testing (hs-tier1 - hs-tier4).
} | ||
__ shlq(t0, 40); | ||
__ addq(a1, t0); | ||
if (a2 == noreg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, get rid of early return and turn the check into if (a2 != noreg) { ... }
which guards the following code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done (some golang-ism slipped in.. rewiring habits again)
@iwanowww Hope the extra tests passed? (Or do you have to re-run them on the latest patch again?) |
The test results look good. (Had to wait until testing is complete.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JVM part looks good.
/integrate |
@vpaprotsk |
/sponsor |
Going to push as commit f12710e.
Your commit was automatically rebased without conflicts. |
@sviswa7 @vpaprotsk Pushed as commit f12710e. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Testing is broken: test/jdk/sun/security/util/math/BigIntegerModuloP.java:160: error: BigIntegerModuloP.ImmutableElement is not abstract and does not override abstract method getLimbs() in IntegerModuloP Did you forget to commit a test file? I will file a new bug for this. |
I fixed the test issue with JDK-8297382 but this also caused a regression with one of the crypto tests: JDK-8297417. @vpaprotsk, @sviswa7 could you please have a look at this? |
@TobiHartmann @dholmes-ora Sorry about that, looking |
@robcasloz Update to JDK-8297417 (since I don't have an account on the bugtracker yet to update there) Not able to reproduce it on Linux yet. The seed should make it deterministic.. but nothing. Resurrecting's my windows sandbox to see if I can reproduce on windows (only difference on windows is the intrinsic function register linkage. However problem there would make the problem very deterministic.. I think) |
/backport jdk17u-dev |
@asgibbons Could not automatically backport
Please fetch the appropriate branch/commit and manually resolve these conflicts by using the following commands in your personal fork of openjdk/jdk17u-dev. Note: these commands are just some suggestions and you can use other equivalent commands you know.
Once you have resolved the conflicts as explained above continue with creating a pull request towards the openjdk/jdk17u-dev with the title |
Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in
macroAssembler_x86_poly.cpp
.InvalidKeyException
inPoly1305.java
(see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please.MacBench.java
), since Poly1305 is not 'properly' registered with the provider.Perf before:
and after:
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582
$ git checkout pull/10582
Update a local copy of the PR:
$ git checkout pull/10582
$ git pull https://git.openjdk.org/jdk pull/10582/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10582
View PR using the GUI difftool:
$ git pr show -t 10582
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10582.diff