Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398

sdlyyxy · 2023-07-07T15:39:40Z

The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family.

M1 Firestorm master

$ DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256                          
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256         48099.35k   192855.72k   479891.18k   566828.18k   643850.10k   651480.44k

M1 Firestorm ARM SHA3 extension enabled

$ DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256         47984.92k   192821.83k   527154.17k   738713.60k   874018.13k   890705.24k

M1 Icestorm master

$ taskpolicy -c background bash -c "DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256"
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256         11461.94k    47571.66k   121608.50k   144754.32k   126859.74k   164252.42k

M1 Icestorm ARM SHA3 extension enabled

$ taskpolicy -c background bash -c "DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256"
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256          8561.96k    34054.53k    93464.84k   133518.93k   159627.74k   164587.15k

~~The ARM SHA3 extension version does not work well on M1 Icestorm, so only~~
~~enable the code on Apple's big cores, i.e. Firestorm and Avalanche.~~

It's not worth distinguishing between big and little cores, since processes can be migrated between them.

See below.

Fixes #21380

~~Is it OK without a CLA?~~

~~CLA: trivial~~

providers/implementations/digests/sha3_prov.c

sdlyyxy · 2023-07-07T16:40:36Z

Sorry I didn't explain it clearly. It seems `crypto/armcap.c` provides `ARMV8_SHA3` bit in `OPENSSL_armcap_P` across all the systems. But `ARMV8_SHA3` bit is not enough because the author of `keccak1600-armv8.pl` said it won't be an improvement for all CPUs. We have to restrict the CPU models.On Apple systems this will be easy, because if `ARMV8_SHA3` bit is defined, it must be an A13 or later (LLVM llvm/lib/Target/AArch64/AArch64.td). But on other systems it could be any CPU... The method here is similar to line 417 in `crypto/armcap.c`.On Jul 8, 2023, at 00:11, Tom Cosgrove ***@***.***> wrote: @tom-cosgrove-arm commented on this pull request. In providers/implementations/digests/sha3_prov.c:

+ }

+# define KMAC_SET_MD(bitlen) \ + if (ARM_SHA3_CAPABLE) { \ + ctx->meth = sha3_ARMSHA3_md; \ + } else { \ + ctx->meth = sha3_generic_md; \ + } +/* Detection on other operating systems */ +# else +# define ARM_HAS_FASTER_SHA3 \ + (MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM) ||\ + MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM_PRO) ||\ + MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM_MAX) ||\ + MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M2_AVALANCHE) ||\ + MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M2_AVALANCHE_PRO) ||\ + MIDR_IS_CPU_MODEL(OPENSSL_arm_midr, ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M2_AVALANCHE_MAX)) Look at how it's done in crypto/armcap.c, which abstracts all this away - and yes, it is done differently on Linux, macOS, Windows, etc —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: ***@***.***>

tom-cosgrove-arm · 2023-07-07T21:01:28Z

Look at how it's done for other optimisations, such as ARMV8_UNROLL8_EOR3: the decision is taken in armcap.c, and the implementation only looks at OPENSSL_armcap_P to know whether it can use a particular optimisation or not.

This makes it simple to test optimisations on new hardware: just set OPENSSL_armcap_P in the process environment and run openssl speed.

It also means that when new CPU microarchitectures are released, with different MIDR values, there's only one place in the code to change, instead of having to grep throughout all the algorithms to see which ones might need tweaking.

And also note that MIDR, unlike Intel's CPUID, is not readable by user code (the official name is MIDR_EL1, which means it can only be read at EL1 or above, which is the kernel). This allows the kernel to handle differences in CPUs on heterogenous systems (e.g. big.LITTLE where the different CPU types aren't compatible).

So armcap.c abstracts away all the different ways of reading the CPU type/microarchitecture/capabilities.

t8m · 2023-07-10T13:22:29Z

Please note that this is definitely outside of what would be acceptable with CLA: trivial. Could you please sign a regular CLA and remove the CLA: trivial annotation from the commit?

The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family. Fixes openssl#21380

sdlyyxy · 2023-07-10T18:03:44Z

OK, CLA sent and a new commit pushed.

tom-cosgrove-arm · 2023-07-11T11:53:47Z

only enable the code on Apple's big cores, i.e. Firestorm and Avalanche

What happens if the process is migrated from a big to a little core?

sdlyyxy · 2023-07-11T12:52:12Z

Message ID: ***@***.***>During the test, I found that except when explicitly binding the process to a big core using taskset, the MIDR was always for the Icestorm core. I'm not sure what the Linux kernel is doing here on heterogeneous CPUs. If the kernel consistently behaves this way, and if a process can be migrated from a big to a little core, the MIDR returned will be for the little core and the generic code path will always be executed.

tom-cosgrove-arm · 2023-07-13T15:11:36Z

Thinking further about this, there's no guarantee in the general case that a process won't be migrated from big to little or vice versa, on either Linux or macOS.

If there's a significant amount of work to be done, I would expect that a reasonable O/S would migrate the process to a big core to get the work done faster. In this case, if we've determined to use the generic code, we won't do as well as we could.

If there's not much hashing to be done, and we chose the accelerated code rather than the generic code, where the generic code would be faster, we haven't lost much.

So it actually seems that it would be better to always just use the SHA-3 accelerated cores on systems where it would be faster on the performance cores, even if it wouldn't be faster on the efficiency cores.

And I would still rather see this decision taken in armcap.c communicating with sha3_prov.c via either (preferred) a bit in OPENSSL_armcap_P or a separate variable. And yes, something like ARMV8_WORTH_USING_SHA3

paulidale · 2023-07-14T02:31:01Z

Wouldn't always using the performance core path cause problems?
Or are the efficiency cores able to emulate instructions they don't support?

tom-cosgrove-arm · 2023-07-14T06:19:01Z

Sorry, I wasn't clear: this is only going to be enabled for Apple Silicon, under macOS and Linux. And under macOS, ARMV8_WORTH_USING_SHA3 should be set under the exact same conditions that ARMV8_UNROLL8_EOR3 is set.

On Linux, ARMV8_WORTH_USING_SHA3 will have to be set based on MIDR identifying the Apple Silicon cores together with ARMV8_SHA3 being set.

Those tests should be done in armcap.c, and prov_sha3.c should just use that result to choose the code path.

It's not worth distinguishing between big and little cores, since processes can be migrated between them.

And it's strongly recommended against building systems where big and little cores support different sets of features, precisely because of process migration.

(And we know that Apple Silicon doesn't do that, and this is purely an optimisation for Apple Silicon so far)

sdlyyxy · 2023-07-14T11:11:39Z

Thanks for the instructions! I pushed a new commit :)

providers/implementations/digests/sha3_prov.c

tom-cosgrove-arm

LGTM - thanks

openssl-machine · 2023-07-21T00:00:23Z

This pull request is ready to merge

paulidale · 2023-07-21T00:19:41Z

Merged, thanks for the contribution.

The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family. Fixes #21380 Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #21398)

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #21398)

tom-cosgrove-arm reviewed Jul 7, 2023

View reviewed changes

providers/implementations/digests/sha3_prov.c Outdated Show resolved Hide resolved

t8m added hold: cla required The contributor needs to submit a license agreement cla: trivial One of the commits is marked as 'CLA: trivial' labels Jul 10, 2023

github-actions bot added the severity: fips change The pull request changes FIPS provider sources label Jul 10, 2023

sdlyyxy force-pushed the armsha3 branch from 93a3376 to afc33f8 Compare July 10, 2023 18:02

sdlyyxy closed this Jul 11, 2023

sdlyyxy reopened this Jul 11, 2023

openssl-machine removed the hold: cla required The contributor needs to submit a license agreement label Jul 11, 2023

Move CPU detection to armcap.c

e721b2d

tom-cosgrove-arm reviewed Jul 15, 2023

View reviewed changes

providers/implementations/digests/sha3_prov.c Outdated Show resolved Hide resolved

Update with ARMV8_HAVE_SHA3_AND_WORTH_USING

af71102

tom-cosgrove-arm approved these changes Jul 19, 2023

View reviewed changes

paulidale added branch: master Merge to master branch approval: review pending This pull request needs review by a committer triaged: feature The issue/pr requests/adds a feature and removed cla: trivial One of the commits is marked as 'CLA: trivial' labels Jul 19, 2023

paulidale approved these changes Jul 19, 2023

View reviewed changes

paulidale added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Jul 19, 2023

paulidale added the tests: exempted The PR is exempt from requirements for testing label Jul 19, 2023

openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Jul 21, 2023

paulidale closed this Jul 21, 2023

openssl-machine pushed a commit that referenced this pull request Jul 21, 2023

Move CPU detection to armcap.c

08e6eb2

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #21398)

openssl-machine pushed a commit that referenced this pull request Jul 21, 2023

Update with ARMV8_HAVE_SHA3_AND_WORTH_USING

ba9472c

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #21398)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398

Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398

sdlyyxy commented Jul 7, 2023 •

edited

sdlyyxy commented Jul 7, 2023 via email

tom-cosgrove-arm commented Jul 7, 2023 •

edited

t8m commented Jul 10, 2023

sdlyyxy commented Jul 10, 2023

tom-cosgrove-arm commented Jul 11, 2023

sdlyyxy commented Jul 11, 2023 via email

tom-cosgrove-arm commented Jul 13, 2023

paulidale commented Jul 14, 2023

tom-cosgrove-arm commented Jul 14, 2023

sdlyyxy commented Jul 14, 2023

tom-cosgrove-arm left a comment

openssl-machine commented Jul 21, 2023

paulidale commented Jul 21, 2023

Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398

Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398

Conversation

sdlyyxy commented Jul 7, 2023 • edited

sdlyyxy commented Jul 7, 2023 via email

tom-cosgrove-arm commented Jul 7, 2023 • edited

t8m commented Jul 10, 2023

sdlyyxy commented Jul 10, 2023

tom-cosgrove-arm commented Jul 11, 2023

sdlyyxy commented Jul 11, 2023 via email

tom-cosgrove-arm commented Jul 13, 2023

paulidale commented Jul 14, 2023

tom-cosgrove-arm commented Jul 14, 2023

sdlyyxy commented Jul 14, 2023

tom-cosgrove-arm left a comment

Choose a reason for hiding this comment

openssl-machine commented Jul 21, 2023

paulidale commented Jul 21, 2023

sdlyyxy commented Jul 7, 2023 •

edited

tom-cosgrove-arm commented Jul 7, 2023 •

edited