8341527: AVX-512 intrinsic for SHA3 #21352

ferakocz · 2024-10-04T10:15:30Z

There is already an intrinsic for SHA-3 on aarch64, which gives a significant speed improvement on that architecture, so this pull request brings a similar improvement to x64 systems that have the AVX-512 extension. Rudimentary measurements show a speed improvement of 30-40%.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8341527: AVX-512 intrinsic for SHA3 (Enhancement - P3)

Reviewers

@vpaprotsk (no known openjdk.org user name / role) Review applies to e4979df2
Sandhya Viswanathan (@sviswa7 - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21352/head:pull/21352
$ git checkout pull/21352

Update a local copy of the PR:
$ git checkout pull/21352
$ git pull https://git.openjdk.org/jdk.git pull/21352/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21352

View PR using the GUI difftool:
$ git pr show -t 21352

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21352.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2024-10-04T10:16:08Z

👋 Welcome back ferakocz! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-10-04T10:17:34Z

@ferakocz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8341527: AVX-512 intrinsic for SHA3

Reviewed-by: sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 110 new commits pushed to the master branch:

4ce19ca: 8343190: GHA: Try building JTReg several times
7c800e6: 8343026: JFR: Index into fields in the topFrame
d8b3685: 8342607: Enhance register printing on x86_64 platforms
d8430ef: 8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM
6332e25: 8343183: [s390x]: Problemlist runtime/Monitor/SyncOnValueBasedClassTest.java Failure
79a07ad: 8343149: Cleanup os::print_tos_pc on AIX
beff8bf: 8342823: Ubsan: ciEnv.cpp:1614:65: runtime error: member call on null pointer of type 'struct CompileTask'
e389f82: 8343137: C2: VerifyLoopOptimizations fails with "Was reachable in only one"
0abfa3b: 8304824: NMT should not use ThreadCritical
88dc655: 8342988: GHA: Build JTReg in single step
... and 100 more: https://git.openjdk.org/jdk/compare/18bcbf7941f7567449983b3f317401efb3e34d39...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk · 2024-10-04T10:18:18Z

@ferakocz The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2024-10-04T10:20:47Z

Webrevs

vpaprotsk

Confirmed performance on my dev machine. Looks good!

Instruction selection: no complaints. vperm* instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation

I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.

I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run the tests with -Xcomp -XX:-TieredCompilation, the test will finish before the code is even compiled. I've done that before, running the test twice, with and without those options; it's 'better than nothing' (unless I am missing some other tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.

Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.

For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at DigestBase.implCompressMultiBlock, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).

If DigestBase.implCompressMultiBlock were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.

vpaprotsk · 2024-10-07T19:39:00Z

src/hotspot/cpu/x86/vm_version_x86.cpp

-  if (UseSHA3Intrinsics) {
-    warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.");
-    FLAG_SET_DEFAULT(UseSHA3Intrinsics, false);
+  if (UseAVX > 2) {


Should be #ifdef _LP64. (Similar format from above). Need to look up the cpu features required for the instructions in the intrinsic..

Added the #ifdef.

The 'rest' of the comment I owed you.. need AVX512F, AVX512DQ, AVX512BW.
So you will need supports_avx512bwdq() here

"Showing my (math) work.."

grep '__ ' src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp | sed -e 's| *__ *||' -e 's|(.*||' | sort -u

(only using the full 512 versions, no need for VL)

evmovdquq   AVX512F
evmovdquw   AVX512BW
evpermt2q   AVX512F
evprolq     AVX512F
evprolvq    AVX512F
evpxorq     AVX512F
vpternlogq  AVX512F
kmovbl      AVX512DQ
...

Thanks! I will change the kmovbl()s to kmovwl() and then supports_avx512bw() will suffice.
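The feature arithmetic in this thread can be sketched as a small set union. This is only an illustration: the class and method names are made up, the table contains just the instructions quoted in the comment above (the original list is truncated), and the kmovwl entry assumes that KMOVW belongs to AVX512F, which is what makes the proposed swap work.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the PR: derives the CPU features a stub needs
// from the instructions it emits.
final class Sha3StubFeatures {
    enum Feature { AVX512F, AVX512BW, AVX512DQ }

    // Instruction -> required extension, as quoted in the review comment.
    static Map<String, Feature> listedInstructions() {
        Map<String, Feature> m = new LinkedHashMap<>();
        m.put("evmovdquq", Feature.AVX512F);
        m.put("evmovdquw", Feature.AVX512BW);
        m.put("evpermt2q", Feature.AVX512F);
        m.put("evprolq", Feature.AVX512F);
        m.put("evprolvq", Feature.AVX512F);
        m.put("evpxorq", Feature.AVX512F);
        m.put("vpternlogq", Feature.AVX512F);
        m.put("kmovbl", Feature.AVX512DQ);
        return m;
    }

    // Same table after replacing kmovbl with kmovwl (assumed to need only AVX512F).
    static Map<String, Feature> withKmovwl() {
        Map<String, Feature> m = listedInstructions();
        m.remove("kmovbl");
        m.put("kmovwl", Feature.AVX512F);
        return m;
    }

    // Union of all features required by the given instructions.
    static Set<Feature> required(Map<String, Feature> instructions) {
        Set<Feature> needed = EnumSet.noneOf(Feature.class);
        needed.addAll(instructions.values());
        return needed;
    }

    public static void main(String[] args) {
        System.out.println("as listed:   " + required(listedInstructions()));
        System.out.println("with kmovwl: " + required(withKmovwl()));
    }
}
```

With the listed instructions the union contains all three extensions; once kmovbl is gone, only AVX512F and AVX512BW remain, which matches the conclusion that supports_avx512bw() is enough.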

vpaprotsk · 2024-10-07T19:39:51Z

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp

+
+void StubGenerator::generate_sha3_stubs() {
+  if (UseSHA3Intrinsics) {
+    if (VM_Version::supports_evex()) {


This really should be an assert. i.e. All cpu-flag checks should be done in vm_version_x86.cpp and by this point if UseSHA3Intrinsics is on, "we are good to go"

Changed as suggested.

vpaprotsk · 2024-10-07T19:44:56Z

src/hotspot/cpu/x86/macroAssembler_x86.hpp

      Assembler::evpsravw(dst, mask, nds, src, merge, vector_len);
    }
  }
+  void evpsrad(XMMRegister dst, KRegister mask, XMMRegister src, int shift, bool merge, int vector_len) {


A more compact way to 'unhide' a function from Assembler.hpp is the C++ using-declaration: using Assembler::evpsrad;. (You can see it being used a bit further below, at line 1589.)

Comment repeats in (several) changes in this file.

Changed as suggested.

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp

vpaprotsk · 2024-10-08T23:57:15Z

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp

+  __ kmovbl(k3, rax);
+  __ addl(rax, 8);
+  __ kmovbl(k4, rax);
+  __ addl(rax, 16);


Since you need k5 soonest, you could save a few cycles by removing the propagation dependency on rax and loading the immediate directly..

(If you really want to get clever,

KRegister masks[] = {k1, k2, k3, k4, k5};
for (int i = 0; i < 5; i++) {
  __ mov64(rax, (1LL << (i + 1)) - 1);  // 0x01, 0x03, 0x07, 0x0f, 0x1f
  __ kmovbl(masks[i], rax);
}

Highly debatable if it's actually any more readable.. so up to you)

Another alternative that is closer to the structure of your code (And uses smaller instructions..).

Start from the end, with k5, load 0x1f constant

Shift constant down by one and load into next KRegister

(still could be done with a loop. but you decide what you find more readable..)

This way k5 is available immediately for the evmovdquq

Changed to start loading the mask registers from k5.
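For readers who want to see the mask values, here is a small Java model of the "start from k5 and shift down" idea. It only computes the constants; it is not the stub code. The class name is hypothetical, and the mapping of values to k1..k5 is inferred from the snippets quoted above.

```java
// Hypothetical model of the mask constants, not HotSpot code.
final class LaneMasks {
    // Returns the values for k1..k5 (index 0 holds k1), computed from k5 downward.
    static int[] fromK5Down() {
        int[] masks = new int[5];
        int value = 0x1f;               // k5: the five low bits set
        for (int k = 5; k >= 1; k--) {
            masks[k - 1] = value;
            value >>>= 1;               // drop the top bit for the next lower register
        }
        return masks;
    }

    public static void main(String[] args) {
        int[] m = fromK5Down();
        for (int k = 5; k >= 1; k--) {
            System.out.printf("k%d = 0x%02x%n", k, m[k - 1]);
        }
    }
}
```

Because the first value produced is the one for k5, nothing that needs k5 has to wait for the other four masks.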

ferakocz · 2024-10-15T15:51:59Z

Confirmed performance on my dev machine. Looks good!

Thanks for looking at it!

Instruction selection: no complaints. vperm* instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation

I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.

I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run the tests with -Xcomp -XX:-TieredCompilation, the test will finish before the code is even compiled. I've done that before, running the test twice, with and without those options; it's 'better than nothing' (unless I am missing some other tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.

I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too.

The algorithms for the hash (digest) functions are designed so that any programming error would lead to erroneous output on any input, so if your implementation produces the correct result on a few randomly chosen inputs of sizes varying from 0 bytes to several blocks then you can claim with high confidence that it is correct.

Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.

For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at DigestBase.implCompressMultiBlock, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).

In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0 I also call the pure Java version (with a clone of the original inputs) and compare the results.
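As a rough sketch of the "copy a pure Java implementation into the test and compare" approach: the block below carries its own compact Keccak-f[1600] and SHA3-256 sponge and compares it with the JDK's MessageDigest on random inputs. It is not the JDK's implementation and not a test from this PR; the class and method names are hypothetical. Whether the JDK side actually runs the intrinsic depends on the JIT, so the flags mentioned earlier (-Xcomp -XX:-TieredCompilation) would still matter.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

// Hypothetical differential check: self-contained SHA3-256 reference vs. the JDK provider.
final class Sha3ReferenceCheck {
    static final int RATE = 136; // SHA3-256 rate in bytes

    // Keccak-f[1600]; lane (x, y) lives at index x + 5 * y.
    static void keccakF(long[] a) {
        int lfsr = 1; // 8-bit LFSR that yields the round-constant bits
        for (int round = 0; round < 24; round++) {
            // theta: XOR each lane with the parities of two neighbouring columns
            long[] c = new long[5];
            for (int x = 0; x < 5; x++) {
                c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];
            }
            for (int x = 0; x < 5; x++) {
                long d = c[(x + 4) % 5] ^ Long.rotateLeft(c[(x + 1) % 5], 1);
                for (int row = 0; row < 25; row += 5) {
                    a[x + row] ^= d;
                }
            }
            // rho and pi: rotate each lane and move it to its new position
            int px = 1, py = 0;
            long current = a[1];
            for (int t = 0; t < 24; t++) {
                int r = ((t + 1) * (t + 2) / 2) % 64;
                int ny = (2 * px + 3 * py) % 5;
                px = py;
                py = ny;
                long tmp = a[px + 5 * py];
                a[px + 5 * py] = Long.rotateLeft(current, r);
                current = tmp;
            }
            // chi: non-linear step, applied row by row
            for (int row = 0; row < 25; row += 5) {
                long[] t = Arrays.copyOfRange(a, row, row + 5);
                for (int i = 0; i < 5; i++) {
                    a[row + i] = t[i] ^ (~t[(i + 1) % 5] & t[(i + 2) % 5]);
                }
            }
            // iota: XOR the round constant into lane (0, 0)
            for (int j = 0; j < 7; j++) {
                if ((lfsr & 1) != 0) {
                    a[0] ^= 1L << ((1 << j) - 1);
                }
                lfsr = (lfsr & 0x80) != 0 ? ((lfsr << 1) ^ 0x71) & 0xff : lfsr << 1;
            }
        }
    }

    static byte[] referenceSha3_256(byte[] msg) {
        byte[] padded = Arrays.copyOf(msg, (msg.length / RATE + 1) * RATE);
        padded[msg.length] ^= 0x06;          // SHA-3 domain bits plus first pad bit
        padded[padded.length - 1] ^= 0x80;   // final pad bit
        ByteBuffer in = ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN);
        long[] state = new long[25];
        for (int off = 0; off < padded.length; off += RATE) {
            for (int i = 0; i < RATE / 8; i++) {
                state[i] ^= in.getLong(off + 8 * i);
            }
            keccakF(state);
        }
        ByteBuffer out = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) {
            out.putLong(state[i]);
        }
        return out.array();
    }

    static byte[] jdkSha3_256(byte[] msg) {
        try {
            return MessageDigest.getInstance("SHA3-256").digest(msg);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Compares both implementations on random inputs of 0..maxLen bytes.
    static boolean agree(long seed, int maxLen) {
        Random rnd = new Random(seed);
        for (int len = 0; len <= maxLen; len++) {
            byte[] msg = new byte[len];
            rnd.nextBytes(msg);
            if (!Arrays.equals(referenceSha3_256(msg), jdkSha3_256(msg))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree(1L, 3 * RATE + 1) ? "match" : "MISMATCH");
    }
}
```

Sweeping every length from 0 up to a little over three blocks covers the empty input, partial blocks, exact block boundaries and multi-block inputs in one loop.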

If DigestBase.implCompressMultiBlock were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.

vpaprotsk · 2024-10-15T22:56:56Z

I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too.

I suppose it works. When possible, I'd rather have a more granular unit test (and we don't have the code for those algorithms yet, erm, right?).

In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0 I also call the pure Java version (with a clone of the original inputs) and compare the results.

If you still have it and it can be 'made clean'.. I would love to see some of that 'scaffolding test code' kept for the final commit. (I like to imagine the 'final code cleanup' as 'removing scaffolding from a construction site' :) ) This will be especially useful if (when?) we revisit the intrinsic. (I can already see us also needing an AVX2 version.. someone will need to re-learn how to verify that intrinsic too)

vnkozlov · 2024-10-15T23:01:51Z

@ferakocz I think you need to enable SHA3 testing in the jtreg tests we have by modifying:
https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L106

JDK-8252204 added several C2 tests for SHA3 intrinsics in test/hotspot/jtreg/compiler/intrinsics/sha. Please make sure your changes pass those tests.

TobiHartmann · 2024-10-16T07:27:02Z

src/hotspot/cpu/x86/assembler_x86.cpp

+void Assembler::vpmuldq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
+  assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
+        (vector_len == AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_evex()), "");
+  // TODO check what legacy_mode needs to be set to


Drive-by comment: There's a TODO left in here.

Actually, I was hoping that I would learn that from a reviewer @vpaprotsk or @vnkozlov , do you know? I was not able to figure it out from the manual what it should be. (with the current setting "false" at least my code works on the test machines that I tried, but I never tried with "true")

legacy_mode should be false here. This instruction is promotable to evex encoding if higher bank registers (XMM16 and above) are used. It is not a legacy instruction so legacy_mode should be false. Examples of legacy instructions are vptest, vpblend*.

chhagedorn · 2024-10-16T15:15:07Z

This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

Failure 1

Tests:

testlibrary_tests/ir_framework/tests/TestCPUFeatureCheck.java
- Additional flags: -XX:+UseParallelGC -XX:+UseNUMA
compiler/loopopts/superword/TestDependencyOffsets.java
- Additional flags: -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation

CPU: total 12 (initial active 12) (6 cores per cpu, 2 threads per core) family 6 model 106 stepping 6 microcode 0x1, cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, lzcnt, tsc, avx, avx2, aes, erms, clmul, bmi1, bmi2, adx, avx512f, avx512cd, sha, fma, clflush, hv, rdtscp, rdpid, fsrm, f16c, pku, ospke
CPU Model and flags from /proc/cpuinfo:
model name	: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt_good wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities

Failure:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/a4a7850a-7c35-410a-b879-d77fbb2f6087-S144302/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d6cec7c3-7401-41e9-aaad-f45b38c7a9e7/runs/9e85fc0d-9d6b-426f-b5d8-e84e2daa4c8c/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:2979), pid=1550324, tid=1550336
#  Error: assert(VM_Version::supports_avx512dq()) failed
.....
Command Line: -Djava.library.path=/opt/mach5/mesos/work_dir/jib-master/install/2024-10-15-1659164.christian.hagedorn.jdk-test/linux-x64-debug.test/hotspot/jtreg/native -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -XX:MaxRAMPercentage=4.16667 -Dtest.boot.jdk=/opt/mach5/mesos/work_dir/jib-master/install/jdk/23/37/bundles/linux-x64/jdk-23_linux-x64_bin.tar.gz/jdk-23 -Djava.io.tmpdir=/opt/mach5/mesos/work_dir/slaves/a4a7850a-7c35-410a-b879-d77fbb2f6087-S151463/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/b87f227c-12f2-4145-8d72-0ba96c4ef814/runs/6cf6b0a6-7bb2-4e1d-97a8-2ad532639bfd/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_misc/tmp -XX:+UseParallelGC -XX:+UseNUMA -Dir.framework.server.port=42709 -XX:+UseKNLSetting -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM ir_framework.tests.TestCPUFeatureCheck
.....
Stack: [0x00007f4c047e9000,0x00007f4c048e9000],  sp=0x00007f4c048e57e0,  free space=1009k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x63bd61]  Assembler::kmovbl(KRegister, Register)+0x181  (assembler_x86.cpp:2979)
V  [libjvm.so+0x17615d4]  StubGenerator::generate_sha3_implCompress(bool, char const*)+0x274  (stubGenerator_x86_64_sha3.cpp:141)
V  [libjvm.so+0x176339a]  StubGenerator::generate_sha3_stubs()+0x3a  (stubGenerator_x86_64_sha3.cpp:86)
V  [libjvm.so+0x1711217]  StubGenerator::generate_compiler_stubs()+0x387  (stubGenerator_x86_64.cpp:4033)
V  [libjvm.so+0x1712ff8]  StubGenerator_generate(CodeBuffer*, StubCodeGenerator::StubsKind)+0xd8  (stubGenerator_x86_64.cpp:4242)
V  [libjvm.so+0x176d5af]  initialize_stubs(StubCodeGenerator::StubsKind, int, int, char const*, char const*, char const*)+0xef  (stubRoutines.cpp:245)
V  [libjvm.so+0x176f791]  compiler_stubs_init(bool)+0xa1  (stubRoutines.cpp:282)
V  [libjvm.so+0x87e5ae]  C2Compiler::init_c2_runtime()+0xbe  (c2compiler.cpp:92)
V  [libjvm.so+0x87e795]  C2Compiler::initialize()+0x35  (c2compiler.cpp:112)
V  [libjvm.so+0xa39716]  CompileBroker::init_compiler_runtime()+0xd6  (compileBroker.cpp:1771)
V  [libjvm.so+0xa3ff01]  CompileBroker::compiler_thread_loop()+0x121  (compileBroker.cpp:1913)
V  [libjvm.so+0xef158c]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:759)
V  [libjvm.so+0x18199f6]  Thread::call_run()+0xb6  (thread.cpp:234)
V  [libjvm.so+0x14fc288]  thread_native_entry(Thread*)+0x128  (os_linux.cpp:858)

Failure 2

compiler/intrinsics/sha/cli/TestUseSHA3IntrinsicsOptionOnUnsupportedCPU.java
- Additional flags: -server -Xmixed

Output:

stderr: [Java HotSpot(TM) 64-Bit Server VM warning: SHA3 intrinsics require AVX512 instructions
java version "24-internal" 2025-03-18
Java(TM) SE Runtime Environment (fastdebug build 24-internal-2024-10-15-1659164.christian.hagedorn.jdk-test)
Java HotSpot(TM) 64-Bit Server VM (fastdebug build 24-internal-2024-10-15-1659164.christian.hagedorn.jdk-test, mixed mode, sharing)
]
 exitValue = 0

java.lang.AssertionError: Expected message not found: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.'.
JVM should start with '-XX:-UseSHA3Intrinsics' flag without any warnings
	at jdk.test.lib.cli.CommandLineOptionTest.verifyOutput(CommandLineOptionTest.java:159)
	at jdk.test.lib.cli.CommandLineOptionTest.verifyJVMStartup(CommandLineOptionTest.java:130)
	at jdk.test.lib.cli.CommandLineOptionTest.verifySameJVMStartup(CommandLineOptionTest.java:209)
	at compiler.intrinsics.sha.cli.testcases.GenericTestCaseForUnsupportedX86CPU.verifyWarnings(GenericTestCaseForUnsupportedX86CPU.java:70)
	at compiler.intrinsics.sha.cli.DigestOptionsBase$TestCase.test(DigestOptionsBase.java:162)
	at compiler.intrinsics.sha.cli.DigestOptionsBase.runTestCases(DigestOptionsBase.java:139)
	at jdk.test.lib.cli.CommandLineOptionTest.test(CommandLineOptionTest.java:537)
	at compiler.intrinsics.sha.cli.TestUseSHA3IntrinsicsOptionOnUnsupportedCPU.main(TestUseSHA3IntrinsicsOptionOnUnsupportedCPU.java:56)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:573)
	at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
	at java.base/java.lang.Thread.run(Thread.java:1576)
Caused by: java.lang.RuntimeException: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.' missing from stdout/stderr
	at jdk.test.lib.process.OutputAnalyzer.shouldMatch(OutputAnalyzer.java:372)
	at jdk.test.lib.cli.CommandLineOptionTest.verifyOutput(CommandLineOptionTest.java:154)
	... 11 more

JavaTest Message: Test threw exception: java.lang.AssertionError: Expected message not found: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.'.
JVM should start with '-XX:-UseSHA3Intrinsics' flag without any warnings
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.AssertionError: Expected message not found: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.'. JVM should start with '-XX:-UseSHA3Intrinsics' flag without any warnings

ferakocz · 2024-10-16T16:06:29Z

If you still have it and it can be 'made clean'.. I would love to see some of that 'scaffolding test code' kept for the final commit. (I like to imagine the 'final code cleanup' as 'removing scaffolding from a construction site' :) ) This will be especially useful if (when?) we revisit the intrinsic. (I can already see us also needing an AVX2 version.. someone will need to re-learn how to verify that intrinsic too)

The scaffolding is really simple: instead of e.g.

@IntrinsicCandidate
void foo(byte[] output, byte[] input) {
    // do some computation
}

you would have

void foo(byte[] output, byte[] input) {
    byte[] inputCopy = input.clone();
    int x = fooImpl(output, input);
    if (x == 0) {
        // it was the intrinsic, so e.g. call fooImplJava() on inputCopy and compare the result
    }
}

@IntrinsicCandidate
int fooImpl(byte[] output, byte[] input) {
    fooImplJava(output, input);
    return 1;
}

void fooImplJava(byte[] output, byte[] input) {
    // do some computation
}

Just a bit more complicated for non-void methods.
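A runnable variant of the same scaffolding, as a sketch only. Outside HotSpot there is no real intrinsic to swap in, so the "possibly intrinsified" method is passed in through a hypothetical Impl interface, and stand-ins that return 0 play the role of the intrinsic. The XOR "computation" is a placeholder and every name here is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo of the return-0/return-1 scaffolding; not code from the PR.
final class ScaffoldSketch {
    /** Stand-in for a method that HotSpot may replace with an intrinsic. */
    interface Impl {
        int run(byte[] output, byte[] input);
    }

    // Placeholder computation: XOR every byte with 0x5a.
    static void fooImplJava(byte[] output, byte[] input) {
        for (int i = 0; i < input.length; i++) {
            output[i] = (byte) (input[i] ^ 0x5a);
        }
    }

    // Java body of the would-be @IntrinsicCandidate method: always returns 1.
    static int fooImpl(byte[] output, byte[] input) {
        fooImplJava(output, input);
        return 1;
    }

    // Returns false only when impl reported 0 ("intrinsic ran") and its output
    // differs from what the pure Java path computes on a copy of the input.
    static boolean fooChecked(Impl impl, byte[] output, byte[] input) {
        byte[] inputCopy = input.clone();
        int x = impl.run(output, input);
        if (x == 0) {
            byte[] expected = new byte[output.length];
            fooImplJava(expected, inputCopy);
            return Arrays.equals(expected, output);
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        Impl fakeGood = (o, i) -> { fooImplJava(o, i); return 0; };
        Impl fakeBad = (o, i) -> { Arrays.fill(o, (byte) 0); return 0; };
        System.out.println(fooChecked(ScaffoldSketch::fooImpl, new byte[3], data));
        System.out.println(fooChecked(fakeGood, new byte[3], data));
        System.out.println(fooChecked(fakeBad, new byte[3], data));
    }
}
```

The input is cloned before the call so that an implementation which modifies its input cannot hide a mismatch.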

ferakocz · 2024-10-16T16:58:07Z

This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

@chhagedorn could you send me the mach5 command line (or other means) to run these tests?

vnkozlov · 2024-10-16T18:10:45Z

This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

@chhagedorn could you send me the mach5 command line (or other means) to run these tests?

I think Christian missed additional important flag which limits avx512 features: -XX:+UseKNLSetting
I think you can run any test with it to trigger error because it happens during stub generation.

vnkozlov · 2024-10-16T18:14:22Z

"Failure 2" is due to issue I pointed about compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L10

chhagedorn · 2024-10-16T18:26:06Z

This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

@chhagedorn could you send me the mach5 command line (or other means) to run these tests?

I think Christian missed additional important flag which limits avx512 features: -XX:+UseKNLSetting I think you can run any test with it to trigger error because it happens during stub generation.

Right, you need -XX:+UseKNLSetting, thanks Vladimir for jumping in! I forgot to mention that separately, and it was hard to spot in the posted command line.

openjdk · 2024-10-21T15:41:54Z

@ferakocz this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout sha3-avx512-intrinsic
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

src/hotspot/cpu/x86/assembler_x86.cpp

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp

ferakocz · 2024-10-22T11:44:30Z

@ferakocz I think you need to enable SHA3 testing in the jtreg tests we have by modifying: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L106

JDK-8252204 added several C2 tests for SHA3 intrinsics in test/hotspot/jtreg/compiler/intrinsics/sha. Please make sure your changes pass those tests.

I did that, plus restored the error message, now all the tests suggested by @chhagedorn pass.

sviswa7 · 2024-10-24T15:54:27Z

@ferakocz Thanks for taking my inputs into consideration and making the corresponding changes. Would it also be possible for you to add comments to the rounds24_loop code, if you want us to review that in detail? Otherwise the PR looks good to me.

ferakocz · 2024-10-24T17:37:02Z

Could someone approve my changes so that I can integrate?

sviswa7

I haven't reviewed the rounds_24 loop. Other than that the PR looks good to me.

vpaprotsk

Everything from me was addressed too, thanks!

ferakocz · 2024-10-28T09:18:03Z

@sviswa7 and @vpaprotsk, I added comments to the algorithm implementation, could you take another look and approve again if you are satisfied with them? Thanks!

vpaprotsk · 2024-10-28T16:52:32Z

Thanks for the comments. Still looks good to me

(I haven't reviewed the core loop instruction-by-instruction either; I would need to spend a lot more time getting to know SHA3. But this is why I was asking about KAT/testing: this is 'condition-less' code, with no additions or carries to propagate. Testing should reach 100% code coverage with just a few tests, so there is no need to test carries, and the executed path is independent of the input values. The input length does need to vary to test the loop bounds.)
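A minimal sketch of such a length sweep, using only the public MessageDigest API. It hashes inputs whose lengths sit just below, at and just above multiples of the SHA3-256 block size, once in a single call and once byte by byte, and checks that both digests agree. The class name is hypothetical. That the two calling styles reach different internal paths (multi-block versus single-block compression) is an assumption about DigestBase buffering, not something this sketch verifies.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

// Hypothetical length sweep around block boundaries; not a test from the PR.
final class Sha3LengthSweep {
    static final int RATE = 136; // SHA3-256 block size in bytes

    static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA3-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Whole message handed over in one call.
    static byte[] oneShot(byte[] msg) {
        return newDigest().digest(msg);
    }

    // Same message fed one byte at a time.
    static byte[] byteAtATime(byte[] msg) {
        MessageDigest md = newDigest();
        for (byte b : msg) {
            md.update(b);
        }
        return md.digest();
    }

    // 0, 1, and n*RATE-1, n*RATE, n*RATE+1 for n = 1..3.
    static int[] lengths() {
        int[] out = new int[2 + 3 * 3];
        out[0] = 0;
        out[1] = 1;
        int idx = 2;
        for (int n = 1; n <= 3; n++) {
            out[idx++] = n * RATE - 1;
            out[idx++] = n * RATE;
            out[idx++] = n * RATE + 1;
        }
        return out;
    }

    static boolean sweep(long seed) {
        Random rnd = new Random(seed);
        for (int len : lengths()) {
            byte[] msg = new byte[len];
            rnd.nextBytes(msg);
            if (!Arrays.equals(oneShot(msg), byteAtATime(msg))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sweep(42L) ? "all lengths agree" : "MISMATCH");
    }
}
```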

sviswa7 · 2024-10-28T19:25:25Z

Thanks for the comments, very helpful. I have verified the theta mapping, chi step, and the xor step. They look good. Now making my way through the rho and sigma perms and rotates.

sviswa7 · 2024-10-28T22:08:58Z

test/hotspot/jtreg/compiler/intrinsics/sha

The rho and sigma perms/rotates also look good.

sviswa7

Looks good to me.

ferakocz · 2024-10-29T15:11:30Z

/integrate

openjdk · 2024-10-29T15:12:16Z

@ferakocz
Your change (at version b9cc7db) is now ready to be sponsored by a Committer.

wangweij · 2024-10-29T15:16:57Z

/sponsor

openjdk · 2024-10-29T15:18:28Z

Going to push as commit 9cfb0f7.
Since your change was applied there have been 110 commits pushed to the master branch:

4ce19ca: 8343190: GHA: Try building JTReg several times
7c800e6: 8343026: JFR: Index into fields in the topFrame
d8b3685: 8342607: Enhance register printing on x86_64 platforms
d8430ef: 8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM
6332e25: 8343183: [s390x]: Problemlist runtime/Monitor/SyncOnValueBasedClassTest.java Failure
79a07ad: 8343149: Cleanup os::print_tos_pc on AIX
beff8bf: 8342823: Ubsan: ciEnv.cpp:1614:65: runtime error: member call on null pointer of type 'struct CompileTask'
e389f82: 8343137: C2: VerifyLoopOptimizations fails with "Was reachable in only one"
0abfa3b: 8304824: NMT should not use ThreadCritical
88dc655: 8342988: GHA: Build JTReg in single step
... and 100 more: https://git.openjdk.org/jdk/compare/18bcbf7941f7567449983b3f317401efb3e34d39...master

Your commit was automatically rebased without conflicts.

openjdk · 2024-10-29T15:18:44Z

@wangweij @ferakocz Pushed as commit 9cfb0f7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

TheShermanTanker · 2024-10-29T15:54:23Z

I think this broke the x86 assembler, I'm getting multiple failures that look like the following:

/home/runner/work/jdk/jdk/src/hotspot/cpu/x86/assembler_x86.cpp:3646:6: error: redefinition of ‘void Assembler::evmovdquw(XMMRegister, KRegister, XMMRegister, bool, int)’
 3646 | void Assembler::evmovdquw(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
      |      ^~~~~~~~~
/home/runner/work/jdk/jdk/src/hotspot/cpu/x86/assembler_x86.cpp:3593:6: note: ‘void Assembler::evmovdquw(XMMRegister, KRegister, XMMRegister, bool, int)’ previously defined here
 3593 | void Assembler::evmovdquw(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
      |      ^~~~~~~~~

TheShermanTanker · 2024-10-29T15:56:00Z

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp

+      0x8000000000008080L, 0x0000000080000001L, 0x8000000080008008L
+    };
+
+ATTRIBUTE_ALIGNED(64) static const uint64_t permsAndRots[] = {


It's too late for this now, but this should've used alignas directly instead of ATTRIBUTE_ALIGNED. No worries though, the macro ultimately expands to alignas

8341527: AVX-512 intrinsic for SHA3

37e1058

openjdk bot added the rfr Pull request is ready for review label Oct 4, 2024

openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 4, 2024

ferakocz added 3 commits October 4, 2024 12:50

fix debug build

c91b80d

fix windows build

e48dd67

Merge branch 'master' into sha3-avx512-intrinsic

1b5b71f

vpaprotsk reviewed Oct 10, 2024

TobiHartmann reviewed Oct 16, 2024

accepting review suggestions from Volodymyr and Vladimir

f007a5b

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 21, 2024

ferakocz added 2 commits October 21, 2024 18:09

Merge master

75801e0

fix mismerge

52d2fba

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Oct 21, 2024

sviswa7 reviewed Oct 21, 2024

src/hotspot/cpu/x86/assembler_x86.cpp (outdated)

src/hotspot/cpu/x86/assembler_x86.cpp (outdated)

src/hotspot/cpu/x86/assembler_x86.cpp (outdated)

sviswa7 reviewed Oct 22, 2024

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp (outdated)

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp (outdated)

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp (outdated)

assembly changes suggested by sviswa7

e4979df

sviswa7 approved these changes Oct 24, 2024

openjdk bot added the ready Pull request is ready to be integrated label Oct 24, 2024

vpaprotsk approved these changes Oct 24, 2024

added comments

b9cc7db

openjdk bot removed the ready Pull request is ready to be integrated label Oct 25, 2024

sviswa7 approved these changes Oct 28, 2024

openjdk bot added the ready Pull request is ready to be integrated label Oct 28, 2024

openjdk bot added the sponsor Pull request is ready to be sponsored label Oct 29, 2024

openjdk bot added the integrated Pull request has been integrated label Oct 29, 2024

openjdk bot closed this Oct 29, 2024

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 29, 2024

TheShermanTanker reviewed Oct 29, 2024

graalvmbot mentioned this pull request Nov 29, 2024

[GR-59932] Port JDK-8341527: AVX-512 intrinsic for SHA3 oracle/graal#10192

Closed
