New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Acceleration of chacha20 on aarch64 by SVE #17916
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance of an independent review of the assembly file?
crypto/arm64cpuid.pl
Outdated
.type _armv8_sve_probe,%function | ||
_armv8_sve_probe: | ||
AARCH64_VALID_CALL_TARGET | ||
.long 0x04a03000 // eor z0.d,z0.d,z0.d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is eor
correct here? If so it would be assigning zero wouldn't it? Which would be kind of pointless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not matter what the instruction does here for the probe. The code that calls it just sets the bit if SIGILL is not emitted by the instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the only purpose of this instruction is to test if SVE feature is supported by CPU (z0 register only available with SVE). if not, the instruction will cause SIGKILL sent to user process. if yes, it will set a scratch register (z0) to zero with no side-effect to caller.
We have the assembly code internally reviewed in Arm. but not sure if we can find reviewer from other perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that could help would be if you (or someone from Arm) were willing to become community maintainer for one of the aarch64 configuration targets that are currently unsupported:
https://www.openssl.org/policies/platformpolicy.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@t8m we are keen to support the community. can you let us know what you expect of the role? in terms of people, equipment, process, effort, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be really up to you to choose which one you're able to support. And of course you can support also multiple ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@t8m, could we identify 4 arm engineers contributing to OpenSSL as community maintainers? We expect there are some backups and supports from different time zone. Besides individual identity is provided, maybe an email alias like openssl-arm-support.arm.com can be set up to reach all of us. Is there any formal process or guide we can follow to apply for being the community maintainer?
Another question about platform policy. Reading https://www.openssl.org/blog/blog/2021/09/22/OpenSSL3-fips-submission, there are some platforms that were tested before OpenSSL 3.0 FIPS module was submitted for FIPS validation, are these tested platforms chosen from primary and community platforms? If not, how a platform can be involved to be tested when validate FIPS module?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have multiple people as the contact persons for the platform.
Then one of them simply has to make a PR against the openssl/web repo to make the necessary modifications to the platform policy page and have that PR approved. Once approved and merged its official.
There are no more requirements than what I wrote above in this comment thread.
The FIPS module validation is a separate topic - perhaps @mattcaswell can expand on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these tested platforms chosen from primary and community platforms
Actually no. The FIPS work was sponsored by a number of companies (see https://www.openssl.org/blog/blog/2018/09/25/fips/). We included platforms nominated by our sponsors as well as a couple that the project itself nominated.
Moving forward the OpenSSL Management Committee (OMC) is responsible for maintaining the validation. It is likely that the OMC will want to keep the list of platforms tested to some minimal set. Our primary and secondary lists (currently no platforms are secondary) are likely to influence decisions about any additional platforms to be added if any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @t8m for the guide, we will make that PR to apply for the community maintainer. And thanks @mattcaswell for the explanation about FIPS platform.
With the simplification to remove handling tail cases in SVE I think the code is a lot cleaner - LGTM overall |
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension (SVE) is supported by CPU. Tested on modern micro-architecture with 256-bit SVE, it has the potential to improve performance up to 20% The solution takes a hybrid approach. SVE will handle multi-blocks that fit the SVE vector length, with Neon/Scalar to process any tail data Test result: With SVE type 1024 bytes 8192 bytes 16384 bytes ChaCha20 1596208.13k 1650010.79k 1653151.06k Without SVE (by Neon/Scalar) type 1024 bytes 8192 bytes 16384 bytes chacha20 1355487.91k 1372678.83k 1372662.44k The assembly code has been reviewed internally by ARM engineer Fangming.Fang@arm.com Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>
This pull request is ready to merge |
Merged, thanks for the contribution. |
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension (SVE) is supported by CPU. Tested on modern micro-architecture with 256-bit SVE, it has the potential to improve performance up to 20% The solution takes a hybrid approach. SVE will handle multi-blocks that fit the SVE vector length, with Neon/Scalar to process any tail data Test result: With SVE type 1024 bytes 8192 bytes 16384 bytes ChaCha20 1596208.13k 1650010.79k 1653151.06k Without SVE (by Neon/Scalar) type 1024 bytes 8192 bytes 16384 bytes chacha20 1355487.91k 1372678.83k 1372662.44k The assembly code has been reviewed internally by ARM engineer Fangming.Fang@arm.com Signed-off-by: Daniel Hu <Daniel.Hu@arm.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #17916)
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension (SVE) is supported by CPU. Tested on modern micro-architecture with 256-bit SVE, it has the potential to improve performance up to 20% The solution takes a hybrid approach. SVE will handle multi-blocks that fit the SVE vector length, with Neon/Scalar to process any tail data Test result: With SVE type 1024 bytes 8192 bytes 16384 bytes ChaCha20 1596208.13k 1650010.79k 1653151.06k Without SVE (by Neon/Scalar) type 1024 bytes 8192 bytes 16384 bytes chacha20 1355487.91k 1372678.83k 1372662.44k The assembly code has been reviewed internally by ARM engineer Fangming.Fang@arm.com Signed-off-by: Daniel Hu <Daniel.Hu@arm.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from openssl#17916) (cherry picked from commit b1b2146)
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension (SVE) is supported by CPU. Tested on modern micro-architecture with 256-bit SVE, it has the potential to improve performance up to 20% The solution takes a hybrid approach. SVE will handle multi-blocks that fit the SVE vector length, with Neon/Scalar to process any tail data Test result: With SVE type 1024 bytes 8192 bytes 16384 bytes ChaCha20 1596208.13k 1650010.79k 1653151.06k Without SVE (by Neon/Scalar) type 1024 bytes 8192 bytes 16384 bytes chacha20 1355487.91k 1372678.83k 1372662.44k The assembly code has been reviewed internally by ARM engineer Fangming.Fang@arm.com Signed-off-by: Daniel Hu <Daniel.Hu@arm.com> Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from #17916) (cherry picked from commit b1b2146)
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%.
Optimization is only applied to the scenario where input data is over
256 bytes
Test result:
With SVE
type 1024 bytes 8192 bytes 16384 bytes
ChaCha20 1596373.33k 1639167.32k 1643173.21k
Without SVE (by Neon/Scalar)
type 1024 bytes 8192 bytes 16384 bytes
chacha20 1355487.91k 1372678.83k 1372662.44k
The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com
Signed-off-by: Daniel Hu Daniel.Hu@arm.com
Checklist