@ferakocz
Contributor

@ferakocz ferakocz commented Oct 4, 2024

There is already an intrinsic for SHA-3 on aarch64, which gives a significant speed improvement on that architecture. This pull request brings a similar improvement to the x64 family of systems that have the AVX-512 extension. Rudimentary measurements show that a 30-40% speed improvement can be achieved.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8341527: AVX-512 intrinsic for SHA3 (Enhancement - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21352/head:pull/21352
$ git checkout pull/21352

Update a local copy of the PR:
$ git checkout pull/21352
$ git pull https://git.openjdk.org/jdk.git pull/21352/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21352

View PR using the GUI difftool:
$ git pr show -t 21352

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21352.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Oct 4, 2024

👋 Welcome back ferakocz! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 4, 2024

@ferakocz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8341527: AVX-512 intrinsic for SHA3

Reviewed-by: sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 110 new commits pushed to the master branch:

  • 4ce19ca: 8343190: GHA: Try building JTReg several times
  • 7c800e6: 8343026: JFR: Index into fields in the topFrame
  • d8b3685: 8342607: Enhance register printing on x86_64 platforms
  • d8430ef: 8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM
  • 6332e25: 8343183: [s390x]: Problemlist runtime/Monitor/SyncOnValueBasedClassTest.java Failure
  • 79a07ad: 8343149: Cleanup os::print_tos_pc on AIX
  • beff8bf: 8342823: Ubsan: ciEnv.cpp:1614:65: runtime error: member call on null pointer of type 'struct CompileTask'
  • e389f82: 8343137: C2: VerifyLoopOptimizations fails with "Was reachable in only one"
  • 0abfa3b: 8304824: NMT should not use ThreadCritical
  • 88dc655: 8342988: GHA: Build JTReg in single step
  • ... and 100 more: https://git.openjdk.org/jdk/compare/18bcbf7941f7567449983b3f317401efb3e34d39...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 4, 2024
@openjdk

openjdk bot commented Oct 4, 2024

@ferakocz The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 4, 2024
@mlbridge

mlbridge bot commented Oct 4, 2024

Contributor

@vpaprotsk vpaprotsk left a comment

Confirmed performance on my dev machine. Looks good!

Instruction selection: no complaints. vperm* instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation

I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.

I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run the tests with -Xcomp -XX:-TieredCompilation, the test will finish before the code is even compiled. I've done that before, running the test twice, with and without those options; it's 'better than nothing' (unless I am not seeing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.
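As an illustration of the concern above, here is a minimal sketch of a known-answer check that is simply repeated many times, so that the JIT at least has a chance to compile the digest path. The class name `Sha3KatLoop` and the iteration count are hypothetical; the two expected values are the widely published SHA3-256 digests of the empty message and of "abc". Repetition alone does not guarantee that an intrinsic is used, so flags such as -Xcomp -XX:-TieredCompilation remain relevant.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha3KatLoop {
    // Published SHA3-256 digests of the empty message and of "abc".
    static final String EMPTY_DIGEST =
        "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a";
    static final String ABC_DIGEST =
        "3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532";

    // Lower-case hex encoding without relying on newer JDK helpers.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xf, 16));
            sb.append(Character.forDigit(b & 0xf, 16));
        }
        return sb.toString();
    }

    static String sha3_256(byte[] message) {
        try {
            return hex(MessageDigest.getInstance("SHA3-256").digest(message));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA3-256 not available", e);
        }
    }

    // Repeats the two known-answer checks and returns the number of mismatches.
    static int runKats(int iterations) {
        byte[] abc = "abc".getBytes(StandardCharsets.US_ASCII);
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            if (!EMPTY_DIGEST.equals(sha3_256(new byte[0]))) failures++;
            if (!ABC_DIGEST.equals(sha3_256(abc))) failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        // 20_000 is an arbitrary (hypothetical) repetition count.
        System.out.println("mismatches: " + runKats(20_000));
    }
}
```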

Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.

For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at DigestBase.implCompressMultiBlock, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).

If DigestBase.implCompressMultiBlock were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.

if (UseSHA3Intrinsics) {
warning("Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.");
FLAG_SET_DEFAULT(UseSHA3Intrinsics, false);
if (UseAVX > 2) {
Contributor

This should be inside #ifdef _LP64 (same format as above). I still need to look up the CPU features required by the instructions in the intrinsic.

Contributor Author

Added the #ifdef.

Contributor

Here is the 'rest' of the comment I owed you: the intrinsic needs AVX512F, AVX512DQ and AVX512BW.
So you will need supports_avx512bwdq() here.

"Showing my (math) work.."

grep '__ ' src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp | sed -e 's| *__ *||' -e 's|(.*||' | sort -u
(only using the full 512 versions, no need for VL)
evmovdquq  AVX512F
evmovdquw  AVX512BW
evpermt2q  AVX512F
evprolq    AVX512F
evprolvq   AVX512F
evpxorq    AVX512F
vpternlogq AVX512F
kmovbl     AVX512DQ
...

Contributor Author

Thanks! I will change the kmovbl()s to kmovwl() and then supports_avx512bw() will suffice.
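The feature bookkeeping in this exchange can be sketched as a set union over the instruction list. This is only an illustration: the class `RequiredFeatures` is hypothetical, it covers just the instructions listed above (the original list is truncated), and the entry for kmovwl relies on the assumption, taken from the Intel instruction set reference rather than from this thread, that KMOVW belongs to AVX512F.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RequiredFeatures {
    // Instruction -> CPU feature, copied from the list in the review comment.
    static Map<String, String> listedInstructions() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("evmovdquq", "AVX512F");
        m.put("evmovdquw", "AVX512BW");
        m.put("evpermt2q", "AVX512F");
        m.put("evprolq", "AVX512F");
        m.put("evprolvq", "AVX512F");
        m.put("evpxorq", "AVX512F");
        m.put("vpternlogq", "AVX512F");
        m.put("kmovbl", "AVX512DQ");
        return m;
    }

    // The required feature set is the union of the per-instruction features.
    static Set<String> requiredFeatures(Map<String, String> featureByInstruction) {
        return new TreeSet<>(featureByInstruction.values());
    }

    // Same list with kmovbl replaced by kmovwl, as proposed in the reply.
    static Set<String> afterSwitchingToKmovwl() {
        Map<String, String> m = listedInstructions();
        m.remove("kmovbl");
        m.put("kmovwl", "AVX512F"); // assumption: KMOVW is part of AVX512F
        return requiredFeatures(m);
    }

    public static void main(String[] args) {
        System.out.println(requiredFeatures(listedInstructions()));
        System.out.println(afterSwitchingToKmovwl());
    }
}
```

With the listed instructions only, dropping kmovbl removes the single AVX512DQ dependency, which is consistent with the remark that supports_avx512bw() then suffices.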


void StubGenerator::generate_sha3_stubs() {
if (UseSHA3Intrinsics) {
if (VM_Version::supports_evex()) {
Contributor

This really should be an assert, i.e. all CPU-flag checks should be done in vm_version_x86.cpp, and by this point, if UseSHA3Intrinsics is on, "we are good to go".

Contributor Author

Changed as suggested.

Assembler::evpsravw(dst, mask, nds, src, merge, vector_len);
}
}
void evpsrad(XMMRegister dst, KRegister mask, XMMRegister src, int shift, bool merge, int vector_len) {
Contributor

A more compact way to 'unhide' a function from Assembler.hpp is the C++ using-declaration: using Assembler::evpsrad;. (You can see it being used a bit further below, line 1589.)

The same comment applies to several other changes in this file.

Contributor Author

Changed as suggested.

__ kmovbl(k3, rax);
__ addl(rax, 8);
__ kmovbl(k4, rax);
__ addl(rax, 16);
Contributor

Since you need k5 soonest, you could save a few cycles by removing the propagation dependency on rax and loading the immediate directly..

(If you really want to get clever,

  KRegister masks[] = {k1, k2, k3, k4, k5};
  for (int j = 0; j < 5; j++) {
    __ mov64(rax, (1L << (j + 1)) - 1);  // 0x01, 0x03, 0x07, 0x0f, 0x1f
    __ kmovbl(masks[j], rax);
  }

It's highly debatable whether that is actually any more readable, so it's up to you.)

Contributor

Another alternative that is closer to the structure of your code (And uses smaller instructions..).

  • Start from the end, with k5, load 0x1f constant
  • Shift constant down by one and load into next KRegister
  • (still could be done with a loop. but you decide what you find more readable..)

This way k5 is available immediately for the evmovdquq
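The "start from k5 and shift down" ordering can be checked with a small sketch. It only computes the constants; the class `MaskConstants` is hypothetical, and the mapping of values to k5..k1 is inferred from the snippets in this thread rather than taken from the final patch.

```java
class MaskConstants {
    // Returns the mask constants in load order: index 0 is for k5, index 4 for k1.
    static int[] masksFromK5() {
        int[] masks = new int[5];
        int value = 0x1f;           // five low bits set, needed first (k5)
        for (int i = 0; i < masks.length; i++) {
            masks[i] = value;
            value >>>= 1;           // drop the top bit for the next register
        }
        return masks;
    }

    public static void main(String[] args) {
        int[] masks = masksFromK5();
        for (int i = 0; i < masks.length; i++) {
            System.out.println("k" + (5 - i) + " = 0x" + Integer.toHexString(masks[i]));
        }
    }
}
```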

Contributor Author

Changed to start loading the mask registers from k5.

@ferakocz
Contributor Author

> Confirmed performance on my dev machine. Looks good!

Thanks for looking at it!

> Instruction selection: no complaints. vperm* instructions tend to be slower on AVX2, but work great here. Clean, compact and easy-to-read implementation

> I don't know enough about SHA3 to do a line-by-line asm review, but that leads me to 'experimentally confirm correctness': testing.

> I am wondering how you verified your code. I did spot the existing SHA3 KAT tests from the NIST PDF. The problem with those is that unless you run the tests with -Xcomp -XX:-TieredCompilation, the test will finish before the code is even compiled. I've done that before, running the test twice, with and without those options; it's 'better than nothing' (unless I am not seeing some more tests?). I much prefer some sort of fuzzing; one great thing about working on JCE intrinsics is having a ready-made 'reference implementation' to verify things against.

I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too.

The algorithms for the hash (digest) functions are designed so that any programming error would lead to erroneous output on any input, so if your implementation produces the correct result on a few randomly chosen inputs of sizes varying from 0 bytes to several blocks then you can claim with high confidence that it is correct.
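A minimal sketch of a length sweep in that spirit, assuming the standard `SHA3-256` MessageDigest provider. The class `Sha3LengthSweep`, the seed and the maximum length are hypothetical. Note the limitation: this compares one-shot digesting against randomly chunked updates, so it is a consistency check across buffering paths, not a comparison against an independent reference implementation.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

class Sha3LengthSweep {
    static final int RATE_BYTES = 136; // SHA3-256 block (rate) size in bytes

    static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA3-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA3-256 not available", e);
        }
    }

    // For every length in [0, maxLength], digests random data once in a single
    // call and once in randomly sized chunks; returns the number of mismatches.
    static int countMismatches(long seed, int maxLength) {
        Random random = new Random(seed);
        int mismatches = 0;
        for (int length = 0; length <= maxLength; length++) {
            byte[] data = new byte[length];
            random.nextBytes(data);

            byte[] oneShot = newDigest().digest(data);

            MessageDigest chunked = newDigest();
            int offset = 0;
            while (offset < length) {
                int chunk = 1 + random.nextInt(length - offset); // 1..remaining
                chunked.update(data, offset, chunk);
                offset += chunk;
            }
            if (!MessageDigest.isEqual(oneShot, chunked.digest())) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Sweep from 0 bytes up to four full blocks.
        System.out.println("mismatches: " + countMismatches(42L, 4 * RATE_BYTES));
    }
}
```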

> Except I am not sure how one would implement fuzzing for SHA3, perhaps you have some thoughts. It seems impossible to have both intrinsic and java/interpreter running concurrently. For Poly1305IntrinsicFuzzTest, I used the fact that single-block digest is not intrinsified. For MontgomeryPolynomialFuzzTest, I used the fact that we have a residue-domain implementation to compare against.

> For SHA3, all roads lead to the intrinsic (which is a good thing.. except for testing). No DirectByteBuffer, nor single-block bypass.. The only potential thought is the fact that single-block intrinsic appears unreachable. Looking at DigestBase.implCompressMultiBlock, it will always call the multi-block intrinsic (unless I am missing some fancy predicate-generation by the JIT).

In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0 I also call the pure Java version (with a clone of the original inputs) and compare the results.

> If DigestBase.implCompressMultiBlock were 'fixed' to require at least 2 full blocks, before calling the multiblock intrinsic, then one could implement fuzzing by alternatively disabling one of the non-/multi-block intrinsics.

@vpaprotsk
Contributor

> I was developing this as part of the ML-KEM and ML-DSA implementations, and there SHA3 is called quite frequently, so the test for those will test the SHA3 intrinsics, too.

I suppose it works. When possible, I'd rather have a more granular unit test (and we don't have the code for those algorithms yet, right?).

> In a test, you can always just copy the pure Java implementation into the test and compare the results. During development of the intrinsics I like to use methods that return 0 from the intrinsic and 1 from the pure Java implementation and at the call sites, if the method returns 0 I also call the pure Java version (with a clone of the original inputs) and compare the results.

If you still have it and it can be 'made clean'.. I would love to see some of that 'scaffolding test code' kept for the final commit. (I like to imagine the 'final code cleanup' as 'removing scaffolding from a construction site' :) ) This will be especially useful if (when?) we revisit the intrinsic. (I can already see us also needing an AVX2 version.. someone will need to re-learn how to verify that intrinsic too)

@vnkozlov
Contributor

vnkozlov commented Oct 15, 2024

@ferakocz I think you need to enable SHA3 testing in the jtreg tests we have by modifying:
https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L106

JDK-8252204 added several C2 tests for SHA3 intrinsics in test/hotspot/jtreg/compiler/intrinsics/sha. Please make sure your changes pass those tests.

void Assembler::vpmuldq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
(vector_len == AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_evex()), "");
// TODO check what legacy_mode needs to be set to
Member

Drive-by comment: There's a TODO left in here.

Contributor Author

@ferakocz ferakocz Oct 16, 2024

Actually, I was hoping to learn that from a reviewer. @vpaprotsk or @vnkozlov, do you know? I was not able to figure out from the manual what it should be. (With the current setting, "false", my code at least works on the test machines that I tried, but I never tried it with "true".)

legacy_mode should be false here. This instruction is promotable to evex encoding if higher bank registers (XMM16 and above) are used. It is not a legacy instruction so legacy_mode should be false. Examples of legacy instructions are vptest, vpblend*.

Contributor Author

Thank you!

@chhagedorn
Member

This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

Failure 1

Tests:

  • testlibrary_tests/ir_framework/tests/TestCPUFeatureCheck.java
    • Additional flags: -XX:+UseParallelGC -XX:+UseNUMA
  • compiler/loopopts/superword/TestDependencyOffsets.java
    • Additional flags: -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation
CPU: total 12 (initial active 12) (6 cores per cpu, 2 threads per core) family 6 model 106 stepping 6 microcode 0x1, cx8, cmov, fxsr, ht, mmx, 3dnowpref, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, lzcnt, tsc, avx, avx2, aes, erms, clmul, bmi1, bmi2, adx, avx512f, avx512cd, sha, fma, clflush, hv, rdtscp, rdpid, fsrm, f16c, pku, ospke
CPU Model and flags from /proc/cpuinfo:
model name	: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt_good wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities

Failure:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/a4a7850a-7c35-410a-b879-d77fbb2f6087-S144302/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d6cec7c3-7401-41e9-aaad-f45b38c7a9e7/runs/9e85fc0d-9d6b-426f-b5d8-e84e2daa4c8c/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:2979), pid=1550324, tid=1550336
#  Error: assert(VM_Version::supports_avx512dq()) failed
.....
Command Line: -Djava.library.path=/opt/mach5/mesos/work_dir/jib-master/install/2024-10-15-1659164.christian.hagedorn.jdk-test/linux-x64-debug.test/hotspot/jtreg/native -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -XX:MaxRAMPercentage=4.16667 -Dtest.boot.jdk=/opt/mach5/mesos/work_dir/jib-master/install/jdk/23/37/bundles/linux-x64/jdk-23_linux-x64_bin.tar.gz/jdk-23 -Djava.io.tmpdir=/opt/mach5/mesos/work_dir/slaves/a4a7850a-7c35-410a-b879-d77fbb2f6087-S151463/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/b87f227c-12f2-4145-8d72-0ba96c4ef814/runs/6cf6b0a6-7bb2-4e1d-97a8-2ad532639bfd/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_misc/tmp -XX:+UseParallelGC -XX:+UseNUMA -Dir.framework.server.port=42709 -XX:+UseKNLSetting -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM ir_framework.tests.TestCPUFeatureCheck
.....
Stack: [0x00007f4c047e9000,0x00007f4c048e9000],  sp=0x00007f4c048e57e0,  free space=1009k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x63bd61]  Assembler::kmovbl(KRegister, Register)+0x181  (assembler_x86.cpp:2979)
V  [libjvm.so+0x17615d4]  StubGenerator::generate_sha3_implCompress(bool, char const*)+0x274  (stubGenerator_x86_64_sha3.cpp:141)
V  [libjvm.so+0x176339a]  StubGenerator::generate_sha3_stubs()+0x3a  (stubGenerator_x86_64_sha3.cpp:86)
V  [libjvm.so+0x1711217]  StubGenerator::generate_compiler_stubs()+0x387  (stubGenerator_x86_64.cpp:4033)
V  [libjvm.so+0x1712ff8]  StubGenerator_generate(CodeBuffer*, StubCodeGenerator::StubsKind)+0xd8  (stubGenerator_x86_64.cpp:4242)
V  [libjvm.so+0x176d5af]  initialize_stubs(StubCodeGenerator::StubsKind, int, int, char const*, char const*, char const*)+0xef  (stubRoutines.cpp:245)
V  [libjvm.so+0x176f791]  compiler_stubs_init(bool)+0xa1  (stubRoutines.cpp:282)
V  [libjvm.so+0x87e5ae]  C2Compiler::init_c2_runtime()+0xbe  (c2compiler.cpp:92)
V  [libjvm.so+0x87e795]  C2Compiler::initialize()+0x35  (c2compiler.cpp:112)
V  [libjvm.so+0xa39716]  CompileBroker::init_compiler_runtime()+0xd6  (compileBroker.cpp:1771)
V  [libjvm.so+0xa3ff01]  CompileBroker::compiler_thread_loop()+0x121  (compileBroker.cpp:1913)
V  [libjvm.so+0xef158c]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:759)
V  [libjvm.so+0x18199f6]  Thread::call_run()+0xb6  (thread.cpp:234)
V  [libjvm.so+0x14fc288]  thread_native_entry(Thread*)+0x128  (os_linux.cpp:858)

Failure 2

  • compiler/intrinsics/sha/cli/TestUseSHA3IntrinsicsOptionOnUnsupportedCPU.java
    • Additional flags: -server -Xmixed

Output:

stderr: [Java HotSpot(TM) 64-Bit Server VM warning: SHA3 intrinsics require AVX512 instructions
java version "24-internal" 2025-03-18
Java(TM) SE Runtime Environment (fastdebug build 24-internal-2024-10-15-1659164.christian.hagedorn.jdk-test)
Java HotSpot(TM) 64-Bit Server VM (fastdebug build 24-internal-2024-10-15-1659164.christian.hagedorn.jdk-test, mixed mode, sharing)
]
 exitValue = 0

java.lang.AssertionError: Expected message not found: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.'.
JVM should start with '-XX:-UseSHA3Intrinsics' flag without any warnings
	at jdk.test.lib.cli.CommandLineOptionTest.verifyOutput(CommandLineOptionTest.java:159)
	at jdk.test.lib.cli.CommandLineOptionTest.verifyJVMStartup(CommandLineOptionTest.java:130)
	at jdk.test.lib.cli.CommandLineOptionTest.verifySameJVMStartup(CommandLineOptionTest.java:209)
	at compiler.intrinsics.sha.cli.testcases.GenericTestCaseForUnsupportedX86CPU.verifyWarnings(GenericTestCaseForUnsupportedX86CPU.java:70)
	at compiler.intrinsics.sha.cli.DigestOptionsBase$TestCase.test(DigestOptionsBase.java:162)
	at compiler.intrinsics.sha.cli.DigestOptionsBase.runTestCases(DigestOptionsBase.java:139)
	at jdk.test.lib.cli.CommandLineOptionTest.test(CommandLineOptionTest.java:537)
	at compiler.intrinsics.sha.cli.TestUseSHA3IntrinsicsOptionOnUnsupportedCPU.main(TestUseSHA3IntrinsicsOptionOnUnsupportedCPU.java:56)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:573)
	at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
	at java.base/java.lang.Thread.run(Thread.java:1576)
Caused by: java.lang.RuntimeException: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.' missing from stdout/stderr
	at jdk.test.lib.process.OutputAnalyzer.shouldMatch(OutputAnalyzer.java:372)
	at jdk.test.lib.cli.CommandLineOptionTest.verifyOutput(CommandLineOptionTest.java:154)
	... 11 more

JavaTest Message: Test threw exception: java.lang.AssertionError: Expected message not found: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.'.
JVM should start with '-XX:-UseSHA3Intrinsics' flag without any warnings
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.AssertionError: Expected message not found: 'Intrinsics for SHA3-224, SHA3-256, SHA3-384 and SHA3-512 crypto hash functions not available on this CPU.'. JVM should start with '-XX:-UseSHA3Intrinsics' flag without any warnings

@ferakocz
Contributor Author

> If you still have it and it can be 'made clean'.. I would love to see some of that 'scaffolding test code' kept for the final commit. (I like to imagine the 'final code cleanup' as 'removing scaffolding from a construction site' :) ) This will be especially useful if (when?) we revisit the intrinsic. (I can already see us also needing an AVX2 version.. someone will need to re-learn how to verify that intrinsic too)

The scaffolding is really simple: instead of e.g.

@IntrinsicCandidate
void foo(byte[] output, byte[] input) {
    // do some computation
}

you would have

void foo(byte[] output, byte[] input) {
    byte[] inputCopy = input.clone();
    int x = fooImpl(output, input);
    if (x == 0) {
        // it was the intrinsic, so e.g. call fooImplJava() on inputCopy and compare the result
    }
}

@IntrinsicCandidate
int fooImpl(byte[] output, byte[] input) {
    fooImplJava(output, input);
    return 1;
}

void fooImplJava(byte[] output, byte[] input) {
    // do some computation
}

Just a bit more complicated for non-void methods.
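For the non-void case, one possible shape is sketched below. This is an assumption, not the author's actual scaffolding: the marker is moved into a one-element `int[]` out-parameter so that the method can still return its real result. All names (`NonVoidScaffold`, `simulateIntrinsic`, `comparisons`) are hypothetical, the "computation" is a placeholder, and the `simulateIntrinsic` flag merely stands in for a real stub so that the comparison path can be exercised in plain Java.

```java
import java.util.Arrays;

class NonVoidScaffold {
    static boolean simulateIntrinsic = false; // hypothetical stand-in for a real stub
    static int comparisons = 0;               // counts how often the check ran

    // Placeholder computation: XOR every byte with 0x5a and return the byte sum.
    static int fooImplJava(byte[] output, byte[] input) {
        int sum = 0;
        for (int i = 0; i < input.length; i++) {
            output[i] = (byte) (input[i] ^ 0x5a);
            sum += output[i];
        }
        return sum;
    }

    // In real scaffolding this would be the @IntrinsicCandidate method; an
    // intrinsic would leave marker[0] at 0, while the Java body sets it to 1.
    static int fooImpl(byte[] output, byte[] input, int[] marker) {
        if (simulateIntrinsic) {
            return fooImplJava(output, input); // marker intentionally untouched
        }
        marker[0] = 1;
        return fooImplJava(output, input);
    }

    static int foo(byte[] output, byte[] input) {
        byte[] inputCopy = input.clone();
        int[] marker = {0};
        int result = fooImpl(output, input, marker);
        if (marker[0] == 0) {
            // The "intrinsic" ran: recompute in pure Java and compare both the
            // return value and the output buffer.
            byte[] expectedOutput = new byte[output.length];
            int expected = fooImplJava(expectedOutput, inputCopy);
            comparisons++;
            if (expected != result || !Arrays.equals(expectedOutput, output)) {
                throw new AssertionError("intrinsic and Java results differ");
            }
        }
        return result;
    }
}
```

The out-parameter keeps the call site to a single invocation of the candidate method; other encodings of the marker are of course possible.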

@ferakocz
Contributor Author

> This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

@chhagedorn could you send me the mach5 command line (or other means) to run these tests?

@vnkozlov
Contributor

> > This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

> @chhagedorn could you send me the mach5 command line (or other means) to run these tests?

I think Christian missed mentioning an additional important flag which limits the AVX-512 features: -XX:+UseKNLSetting
I think you can run any test with it to trigger the error, because it happens during stub generation.

@vnkozlov
Contributor

"Failure 2" is due to the issue I pointed out about compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L10

@chhagedorn
Member

> > > This is not a review but I've run some testing with the current patch and found the following two failures on linux-x64-debug:

> > @chhagedorn could you send me the mach5 command line (or other means) to run these tests?

> I think Christian missed mentioning an additional important flag which limits the AVX-512 features: -XX:+UseKNLSetting I think you can run any test with it to trigger the error, because it happens during stub generation.

Right, you need -XX:+UseKNLSetting, thanks Vladimir for jumping in! I forgot to mention that separately, and it was hard to spot in the posted command line.

@openjdk

openjdk bot commented Oct 21, 2024

@ferakocz this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout sha3-avx512-intrinsic
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 21, 2024
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Oct 21, 2024
@ferakocz
Contributor Author

> @ferakocz I think you need to enable SHA3 testing in the jtreg tests we have by modifying: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/testlibrary/sha/predicate/IntrinsicPredicates.java#L106

> JDK-8252204 added several C2 tests for SHA3 intrinsics in test/hotspot/jtreg/compiler/intrinsics/sha. Please make sure your changes pass those tests.

I did that and restored the error message; now all the tests suggested by @chhagedorn pass.

@sviswa7

sviswa7 commented Oct 24, 2024

@ferakocz Thanks for taking my input into consideration and making the corresponding changes. Would it also be possible for you to add comments to the rounds24_loop code, if you want us to review that in detail? Otherwise the PR looks good to me.

@ferakocz
Contributor Author

Could someone approve my changes so that I can integrate?

@sviswa7 sviswa7 left a comment

I haven't reviewed the rounds_24 loop. Other than that the PR looks good to me.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 24, 2024
Contributor

@vpaprotsk vpaprotsk left a comment

Everything from me was addressed too, thanks!

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Oct 25, 2024
@ferakocz
Contributor Author

@sviswa7 and @vpaprotsk, I added comments to the algorithm implementation, could you take another look and approve again if you are satisfied with them? Thanks!

@vpaprotsk
Contributor

Thanks for the comments. Still looks good to me

(I haven't reviewed the core loop instruction-by-instruction either; I would need to spend a lot more time getting to know SHA3. But this is why I was asking about KAT/testing: this is 'condition-less' code with no additions or carries to propagate. Testing should reach 100% code coverage with just a few tests, since there are no carries to test and the control flow is independent of the input values. The input length does need to vary to test the loop bounds.)

@sviswa7

sviswa7 commented Oct 28, 2024

Thanks for the comments, very helpful. I have verified the theta mapping, chi step, and the xor step. They look good. Now making my way through the rho and sigma perms and rotates.

@sviswa7

sviswa7 commented Oct 28, 2024

test/hotspot/jtreg/compiler/intrinsics/sha

The rho and sigma perms/rotates also look good.

@sviswa7 sviswa7 left a comment

Looks good to me.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 28, 2024
@ferakocz
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Oct 29, 2024
@openjdk

openjdk bot commented Oct 29, 2024

@ferakocz
Your change (at version b9cc7db) is now ready to be sponsored by a Committer.

@wangweij
Contributor

/sponsor

@openjdk

openjdk bot commented Oct 29, 2024

Going to push as commit 9cfb0f7.
Since your change was applied there have been 110 commits pushed to the master branch:

  • 4ce19ca: 8343190: GHA: Try building JTReg several times
  • 7c800e6: 8343026: JFR: Index into fields in the topFrame
  • d8b3685: 8342607: Enhance register printing on x86_64 platforms
  • d8430ef: 8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM
  • 6332e25: 8343183: [s390x]: Problemlist runtime/Monitor/SyncOnValueBasedClassTest.java Failure
  • 79a07ad: 8343149: Cleanup os::print_tos_pc on AIX
  • beff8bf: 8342823: Ubsan: ciEnv.cpp:1614:65: runtime error: member call on null pointer of type 'struct CompileTask'
  • e389f82: 8343137: C2: VerifyLoopOptimizations fails with "Was reachable in only one"
  • 0abfa3b: 8304824: NMT should not use ThreadCritical
  • 88dc655: 8342988: GHA: Build JTReg in single step
  • ... and 100 more: https://git.openjdk.org/jdk/compare/18bcbf7941f7567449983b3f317401efb3e34d39...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 29, 2024
@openjdk openjdk bot closed this Oct 29, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 29, 2024
@openjdk

openjdk bot commented Oct 29, 2024

@wangweij @ferakocz Pushed as commit 9cfb0f7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@TheShermanTanker
Contributor

I think this broke the x86 assembler. I'm getting multiple failures that look like the following:

/home/runner/work/jdk/jdk/src/hotspot/cpu/x86/assembler_x86.cpp:3646:6: error: redefinition of ‘void Assembler::evmovdquw(XMMRegister, KRegister, XMMRegister, bool, int)’
 3646 | void Assembler::evmovdquw(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
      |      ^~~~~~~~~
/home/runner/work/jdk/jdk/src/hotspot/cpu/x86/assembler_x86.cpp:3593:6: note: ‘void Assembler::evmovdquw(XMMRegister, KRegister, XMMRegister, bool, int)’ previously defined here
 3593 | void Assembler::evmovdquw(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
      |      ^~~~~~~~~

0x8000000000008080L, 0x0000000080000001L, 0x8000000080008008L
};

ATTRIBUTE_ALIGNED(64) static const uint64_t permsAndRots[] = {
Contributor

It's too late for this now, but this should have used alignas directly instead of ATTRIBUTE_ALIGNED. No worries though, the macro ultimately expands to alignas.

Labels

hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated

8 participants