Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable ARMv8.2 accelerated SHA3 on compatible Apple CPUs #21398

Closed
wants to merge 3 commits into from

Conversation

sdlyyxy
Copy link
Contributor

@sdlyyxy sdlyyxy commented Jul 7, 2023

The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl. It is not called because the author mentioned that it's not actually obvious that it will provide performance improvements. The test on Apple M1 Firestorm shows that the ARMv8.2 implementation could improve about 36% for large blocks. So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family.

M1 Firestorm master

$ DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256                          
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256         48099.35k   192855.72k   479891.18k   566828.18k   643850.10k   651480.44k

M1 Firestorm ARM SHA3 extension enabled

$ DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256         47984.92k   192821.83k   527154.17k   738713.60k   874018.13k   890705.24k

M1 Icestorm master

$ taskpolicy -c background bash -c "DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256"
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256         11461.94k    47571.66k   121608.50k   144754.32k   126859.74k   164252.42k

M1 Icestorm ARM SHA3 extension enabled

$ taskpolicy -c background bash -c "DYLD_LIBRARY_PATH=. apps/openssl speed -evp sha3-256"
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha3-256          8561.96k    34054.53k    93464.84k   133518.93k   159627.74k   164587.15k

The ARM SHA3 extension version does not work well on M1 Icestorm, so only
enable the code on Apple's big cores, i.e. Firestorm and Avalanche.

It's not worth distinguishing between big and little cores, since processes can be migrated between them.

See below.

Fixes #21380

Is it OK without a CLA?

CLA: trivial

@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jul 7, 2023 via email

@tom-cosgrove-arm
Copy link
Contributor

tom-cosgrove-arm commented Jul 7, 2023

Look at how it's done for other optimisations, such as ARMV8_UNROLL8_EOR3: the decision is taken in armcap.c, and the implementation only looks at OPENSSL_armcap_P to know whether it can use a particular optimisation or not.

This makes it simple to test optimisations on new hardware: just set OPENSSL_armcap_P in the process environment and run openssl speed.

It also means that when new CPU microarchitectures are released, with different MIDR values, there's only one place in the code to change, instead of having to grep throughout all the algorithms to see which ones might need tweaking.

And also note that MIDR, unlike Intel's CPUID, is not readable by user code (the official name is MIDR_EL1, which means it can only be read at EL1 or above, which is the kernel). This allows the kernel to handle differences in CPUs on heterogenous systems (e.g. big.LITTLE where the different CPU types aren't compatible).

So armcap.c abstracts away all the different ways of reading the CPU type/microarchitecture/capabilities.

@t8m t8m added hold: cla required The contributor needs to submit a license agreement cla: trivial One of the commits is marked as 'CLA: trivial' labels Jul 10, 2023
@t8m
Copy link
Member

t8m commented Jul 10, 2023

Please note that this is definitely outside of what would be acceptable with CLA: trivial. Could you please sign a regular CLA and remove the CLA: trivial annotation from the commit?

@github-actions github-actions bot added the severity: fips change The pull request changes FIPS provider sources label Jul 10, 2023
The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl.
It is not called because the author mentioned that it's not actually obvious
that it will provide performance improvements. The test on Apple M1 Firestorm
shows that the ARMv8.2 implementation could improve about 36% for large blocks.
So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family.

Fixes openssl#21380
@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jul 10, 2023

OK, CLA sent and a new commit pushed.

@sdlyyxy sdlyyxy closed this Jul 11, 2023
@sdlyyxy sdlyyxy reopened this Jul 11, 2023
@openssl-machine openssl-machine removed the hold: cla required The contributor needs to submit a license agreement label Jul 11, 2023
@tom-cosgrove-arm
Copy link
Contributor

only enable the code on Apple's big cores, i.e. Firestorm and Avalanche

What happens if the process is migrated from a big to a little core?

@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jul 11, 2023 via email

@tom-cosgrove-arm
Copy link
Contributor

Thinking further about this, there's no guarantee in the general case that a process won't be migrated from big to little or vice versa, on either Linux or macOS.

If there's a significant amount of work to be done, I would expect that a reasonable O/S would migrate the process to a big core to get the work done faster. In this case, if we've determined to use the generic code, we won't do as well as we could.

If there's not much hashing to be done, and we chose the accelerated code rather than the generic code, where the generic code would be faster, we haven't lost much.

So it actually seems that it would be better to always just use the SHA-3 accelerated cores on systems where it would be faster on the performance cores, even if it wouldn't be faster on the efficiency cores.

And I would still rather see this decision taken in armcap.c communicating with sha3_prov.c via either (preferred) a bit in OPENSSL_armcap_P or a separate variable. And yes, something like ARMV8_WORTH_USING_SHA3

@paulidale
Copy link
Contributor

Wouldn't always using the performance core path cause problems?
Or are the efficiency cores able to emulate instructions they don't support?

@tom-cosgrove-arm
Copy link
Contributor

Sorry, I wasn't clear: this is only going to be enabled for Apple Silicon, under macOS and Linux. And under macOS, ARMV8_WORTH_USING_SHA3 should be set under the exact same conditions that ARMV8_UNROLL8_EOR3 is set.

On Linux, ARMV8_WORTH_USING_SHA3 will have to be set based on MIDR identifying the Apple Silicon cores together with ARMV8_SHA3 being set.

Those tests should be done in armcap.c, and prov_sha3.c should just use that result to choose the code path.

It's not worth distinguishing between big and little cores, since processes can be migrated between them.

And it's strongly recommended against building systems where big and little cores support different sets of features, precisely because of process migration.

(And we know that Apple Silicon doesn't do that, and this is purely an optimisation for Apple Silicon so far)

@sdlyyxy
Copy link
Contributor Author

sdlyyxy commented Jul 14, 2023

Thanks for the instructions! I pushed a new commit :)

Copy link
Contributor

@tom-cosgrove-arm tom-cosgrove-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks

@paulidale paulidale added branch: master Merge to master branch approval: review pending This pull request needs review by a committer triaged: feature The issue/pr requests/adds a feature and removed cla: trivial One of the commits is marked as 'CLA: trivial' labels Jul 19, 2023
@paulidale paulidale added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Jul 19, 2023
@paulidale paulidale added the tests: exempted The PR is exempt from requirements for testing label Jul 19, 2023
@openssl-machine openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Jul 21, 2023
@openssl-machine
Copy link
Collaborator

This pull request is ready to merge

@paulidale
Copy link
Contributor

Merged, thanks for the contribution.

@paulidale paulidale closed this Jul 21, 2023
openssl-machine pushed a commit that referenced this pull request Jul 21, 2023
The hardware-assisted ARMv8.2 implementation is already in keccak1600-armv8.pl.
It is not called because the author mentioned that it's not actually obvious
that it will provide performance improvements. The test on Apple M1 Firestorm
shows that the ARMv8.2 implementation could improve about 36% for large blocks.
So let's enable ARMv8.2 accelerated SHA3 on Apple CPU family.

Fixes #21380

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #21398)
openssl-machine pushed a commit that referenced this pull request Jul 21, 2023
Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #21398)
openssl-machine pushed a commit that referenced this pull request Jul 21, 2023
Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #21398)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Merge to master branch severity: fips change The pull request changes FIPS provider sources tests: exempted The PR is exempt from requirements for testing triaged: feature The issue/pr requests/adds a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable ARM SHA3 instruction accelerated implementation of SHA3
5 participants