Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398
Sorry I didn't explain it clearly. It seems `crypto/armcap.c` provides the `ARMV8_SHA3` bit in `OPENSSL_armcap_P` across all systems. But the `ARMV8_SHA3` bit alone is not enough, because the author of `keccak1600-armv8.pl` said it won't be an improvement on all CPUs. We have to restrict the CPU models. On Apple systems this will be easy, because if the `ARMV8_SHA3` bit is defined, it must be an A13 or later (LLVM llvm/lib/Target/AArch64/AArch64.td). But on other systems it could be any CPU... The method here is similar to line 417 in `crypto/armcap.c`.
On Jul 8, 2023, at 00:11, Tom Cosgrove ***@***.***> wrote:
@tom-cosgrove-arm commented on this pull request.
In providers/implementations/digests/sha3_prov.c:
+ }
+# define KMAC_SET_MD(bitlen) \
+ if (ARM_SHA3_CAPABLE) { \
+ ctx->meth = sha3_ARMSHA3_md; \
+ } else { \
+ ctx->meth = sha3_generic_md; \
+ }
+/* Detection on other operating systems */
+# else
+# define ARM_HAS_FASTER_SHA3 \
+ (MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM) ||\
+ MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM_PRO) ||\
+ MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM_MAX) ||\
+ MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M2_AVALANCHE) ||\
+ MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M2_AVALANCHE_PRO) ||\
+ MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M2_AVALANCHE_MAX))
Look at how it's done in crypto/armcap.c, which abstracts all this away - and yes, it is done differently on Linux, macOS, Windows, etc
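For readers unfamiliar with the MIDR check in the quoted diff, here is a minimal sketch of what a macro in the style of `MIDR_IS_CPU_MODEL` does: it compares only the implementer and part-number fields of the MIDR_EL1 value and ignores variant and revision. All `sketch_`/`SKETCH_` names are hypothetical and are not OpenSSL identifiers; the Apple implementer code and the M1 part numbers are illustrative values assumed for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 layout per the Arm architecture:
 * implementer in bits [31:24], part number in bits [15:4]. */
#define SKETCH_MIDR_IMPLEMENTER_SHIFT 24
#define SKETCH_MIDR_IMPLEMENTER_MASK  0xffu
#define SKETCH_MIDR_PARTNUM_SHIFT     4
#define SKETCH_MIDR_PARTNUM_MASK      0xfffu

/* Illustrative values, assumed for this sketch. */
#define SKETCH_IMP_APPLE              0x61u
#define SKETCH_PART_M1_ICESTORM       0x022u
#define SKETCH_PART_M1_FIRESTORM      0x023u

/* Build the comparable "model" value from implementer and part number. */
static uint32_t sketch_midr_model(uint32_t imp, uint32_t part)
{
    return ((imp & SKETCH_MIDR_IMPLEMENTER_MASK) << SKETCH_MIDR_IMPLEMENTER_SHIFT)
         | ((part & SKETCH_MIDR_PARTNUM_MASK) << SKETCH_MIDR_PARTNUM_SHIFT);
}

/* True when midr names the given implementer and part number;
 * the variant, architecture and revision fields are masked out. */
static bool sketch_midr_is_model(uint32_t midr, uint32_t imp, uint32_t part)
{
    const uint32_t mask =
        (SKETCH_MIDR_IMPLEMENTER_MASK << SKETCH_MIDR_IMPLEMENTER_SHIFT)
      | (SKETCH_MIDR_PARTNUM_MASK << SKETCH_MIDR_PARTNUM_SHIFT);

    return (midr & mask) == sketch_midr_model(imp, part);
}
```

A per-model list like the one in the diff has to be extended for every new core, which is part of why the review asks for the decision to be made in `crypto/armcap.c` instead.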
Look at how it's done for other optimisations, such as This makes it simple to test optimisations on new hardware: just set It also means that when new CPU microarchitectures are released, with different And also note that So |
Please note that this is definitely outside of what would be acceptable with CLA: trivial. Could you please sign a regular CLA and remove the CLA: trivial annotation from the commit? |
The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family. Fixes openssl#21380
OK, CLA sent and a new commit pushed. |
What happens if the process is migrated from a big to a little core? |
During the test, I found that except when explicitly binding the process to a big core using taskset, the MIDR was always for the Icestorm core. I'm not sure what the Linux kernel is doing here on heterogeneous CPUs. If the kernel consistently behaves this way, and if a process can be migrated from a big to a little core, the MIDR returned will be for the little core and the generic code path will always be executed.
Thinking further about this, there's no guarantee in the general case that a process won't be migrated from big to little or vice versa, on either Linux or macOS. If there's a significant amount of work to be done, I would expect that a reasonable O/S would migrate the process to a big core to get the work done faster. In this case, if we've determined to use the generic code, we won't do as well as we could. If there's not much hashing to be done, and we chose the accelerated code rather than the generic code, where the generic code would be faster, we haven't lost much. So it actually seems that it would be better to always just use the SHA-3 accelerated cores on systems where it would be faster on the performance cores, even if it wouldn't be faster on the efficiency cores. And I would still rather see this decision taken in |
Wouldn't always using the performance core path cause problems? |
Sorry, I wasn't clear: this is only going to be enabled for Apple Silicon, under macOS and Linux. And under macOS, On Linux, Those tests should be done in It's not worth distinguishing between big and little cores, since processes can be migrated between them. And it's strongly recommended against building systems where big and little cores support different sets of features, precisely because of process migration. (And we know that Apple Silicon doesn't do that, and this is purely an optimisation for Apple Silicon so far) |
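The approach argued for above can be sketched roughly as follows: make the choice once from the capability bit plus a per-family check, and do not look at which core the process is running on at that moment. This is only an illustration under stated assumptions. The struct, the enum, the capability bit and the function name are all hypothetical, and this is not the code that was merged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit standing in for a SHA3 feature flag. */
#define SKETCH_CAP_SHA3 (1u << 0)

enum sketch_keccak_path {
    SKETCH_KECCAK_GENERIC,
    SKETCH_KECCAK_ARM_SHA3
};

/* Hypothetical snapshot of what start-up detection learned. */
struct sketch_cpu_info {
    uint32_t caps;          /* feature bits reported by the platform */
    bool is_apple_silicon;  /* family-level fact, same for big and little cores */
};

/* Decide once at start-up. The result does not depend on the core the
 * process happens to run on, so a later migration between big and
 * little cores cannot leave the process on a path it should not use. */
static enum sketch_keccak_path
sketch_select_keccak(const struct sketch_cpu_info *cpu)
{
    if ((cpu->caps & SKETCH_CAP_SHA3) != 0 && cpu->is_apple_silicon)
        return SKETCH_KECCAK_ARM_SHA3;
    return SKETCH_KECCAK_GENERIC;
}
```

The trade-off is the one described in the thread: the accelerated path may be somewhat slower than the generic one while the process sits on an efficiency core, in exchange for being faster whenever heavy hashing runs on a performance core.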
Thanks for the instructions! I pushed a new commit :) |
LGTM - thanks
This pull request is ready to merge |
Merged, thanks for the contribution. |
The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family. Fixes #21380 Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #21398)
The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family.
M1 Firestorm master
M1 Firestorm ARM SHA3 extension enabled
M1 Icestorm master
M1 Icestorm ARM SHA3 extension enabled
The ARM SHA3 extension version does not work well on M1 Icestorm, so only enable the code on Apple's big cores, i.e. Firestorm and Avalanche. See below.
Fixes #21380
Is it OK without a CLA?
CLA: trivial