Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Acceleration of chacha20 on aarch64 by SVE #17916

Closed
wants to merge 1 commit into from
Closed

Acceleration of chacha20 on aarch64 by SVE #17916

wants to merge 1 commit into from

Conversation

daniel-hu-arm
Copy link
Contributor

This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%.
Optimization is only applied to the scenario where input data is over
256 bytes

Test result:
With SVE
type 1024 bytes 8192 bytes 16384 bytes
ChaCha20 1596373.33k 1639167.32k 1643173.21k

Without SVE (by Neon/Scalar)
type 1024 bytes 8192 bytes 16384 bytes
chacha20 1355487.91k 1372678.83k 1372662.44k

The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com

Signed-off-by: Daniel Hu Daniel.Hu@arm.com

Checklist
  • documentation is added or updated
  • tests are added or updated

Copy link
Contributor

@paulidale paulidale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance of an independent review of the assembly file?

.type _armv8_sve_probe,%function
_armv8_sve_probe:
AARCH64_VALID_CALL_TARGET
.long 0x04a03000 // eor z0.d,z0.d,z0.d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is eor correct here? If so it would be assigning zero wouldn't it? Which would be kind of pointless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not matter what the instruction does here for the probe. The code that calls it just sets the bit if SIGILL is not emitted by the instruction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the only purpose of this instruction is to test if SVE feature is supported by CPU (z0 register only available with SVE). if not, the instruction will cause SIGKILL sent to user process. if yes, it will set a scratch register (z0) to zero with no side-effect to caller.

We have the assembly code internally reviewed in Arm. but not sure if we can find reviewer from other perspective.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that could help would be if you (or someone from Arm) were willing to become community maintainer for one of the aarch64 configuration targets that are currently unsupported:
https://www.openssl.org/policies/platformpolicy.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t8m we are keen to support the community. can you let us know what you expect of the role? in terms of people, equipment, process, effort, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really up to you to choose which one you're able to support. And of course you can support also multiple ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t8m, could we identify 4 arm engineers contributing to OpenSSL as community maintainers? We expect there are some backups and supports from different time zone. Besides individual identity is provided, maybe an email alias like openssl-arm-support.arm.com can be set up to reach all of us. Is there any formal process or guide we can follow to apply for being the community maintainer?

Another question about platform policy. Reading https://www.openssl.org/blog/blog/2021/09/22/OpenSSL3-fips-submission, there are some platforms that were tested before OpenSSL 3.0 FIPS module was submitted for FIPS validation, are these tested platforms chosen from primary and community platforms? If not, how a platform can be involved to be tested when validate FIPS module?

Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have multiple people as the contact persons for the platform.

Then one of them simply has to make a PR against the openssl/web repo to make the necessary modifications to the platform policy page and have that PR approved. Once approved and merged its official.

There are no more requirements than what I wrote above in this comment thread.

The FIPS module validation is a separate topic - perhaps @mattcaswell can expand on that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these tested platforms chosen from primary and community platforms

Actually no. The FIPS work was sponsored by a number of companies (see https://www.openssl.org/blog/blog/2018/09/25/fips/). We included platforms nominated by our sponsors as well as a couple that the project itself nominated.

Moving forward the OpenSSL Management Committee (OMC) is responsible for maintaining the validation. It is likely that the OMC will want to keep the list of platforms tested to some minimal set. Our primary and secondary lists (currently no platforms are secondary) are likely to influence decisions about any additional platforms to be added if any.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @t8m for the guide, we will make that PR to apply for the community maintainer. And thanks @mattcaswell for the explanation about FIPS platform.

@t8m t8m added branch: master Merge to master branch approval: review pending This pull request needs review by a committer approval: otc review pending This pull request needs review by an OTC member triaged: feature The issue/pr requests/adds a feature labels Mar 17, 2022
@paulidale paulidale removed the approval: otc review pending This pull request needs review by an OTC member label Mar 31, 2022
@samuel-lee-msft
Copy link
Contributor

With the simplification to remove handling tail cases in SVE I think the code is a lot cleaner - LGTM overall

This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%

The solution takes a hybrid approach. SVE will handle multi-blocks that fit
the SVE vector length, with Neon/Scalar to process any tail data

Test result:
With SVE
type            1024 bytes   8192 bytes  16384 bytes
ChaCha20        1596208.13k  1650010.79k  1653151.06k

Without SVE (by Neon/Scalar)
type            1024 bytes   8192 bytes  16384 bytes
chacha20        1355487.91k  1372678.83k  1372662.44k

The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>
@t8m t8m added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Apr 28, 2022
@openssl-machine openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Apr 29, 2022
@openssl-machine
Copy link
Collaborator

This pull request is ready to merge

@paulidale
Copy link
Contributor

Merged, thanks for the contribution.

@paulidale paulidale closed this May 3, 2022
openssl-machine pushed a commit that referenced this pull request May 3, 2022
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%

The solution takes a hybrid approach. SVE will handle multi-blocks that fit
the SVE vector length, with Neon/Scalar to process any tail data

Test result:
With SVE
type            1024 bytes   8192 bytes  16384 bytes
ChaCha20        1596208.13k  1650010.79k  1653151.06k

Without SVE (by Neon/Scalar)
type            1024 bytes   8192 bytes  16384 bytes
chacha20        1355487.91k  1372678.83k  1372662.44k

The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #17916)
t8m pushed a commit to t8m/openssl that referenced this pull request Nov 14, 2022
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%

The solution takes a hybrid approach. SVE will handle multi-blocks that fit
the SVE vector length, with Neon/Scalar to process any tail data

Test result:
With SVE
type            1024 bytes   8192 bytes  16384 bytes
ChaCha20        1596208.13k  1650010.79k  1653151.06k

Without SVE (by Neon/Scalar)
type            1024 bytes   8192 bytes  16384 bytes
chacha20        1355487.91k  1372678.83k  1372662.44k

The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from openssl#17916)

(cherry picked from commit b1b2146)
openssl-machine pushed a commit that referenced this pull request Nov 21, 2022
This patch accelerates chacha20 on aarch64 when Scalable Vector Extension
(SVE) is supported by CPU. Tested on modern micro-architecture with
256-bit SVE, it has the potential to improve performance up to 20%

The solution takes a hybrid approach. SVE will handle multi-blocks that fit
the SVE vector length, with Neon/Scalar to process any tail data

Test result:
With SVE
type            1024 bytes   8192 bytes  16384 bytes
ChaCha20        1596208.13k  1650010.79k  1653151.06k

Without SVE (by Neon/Scalar)
type            1024 bytes   8192 bytes  16384 bytes
chacha20        1355487.91k  1372678.83k  1372662.44k

The assembly code has been reviewed internally by
ARM engineer Fangming.Fang@arm.com

Signed-off-by: Daniel Hu <Daniel.Hu@arm.com>

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #17916)

(cherry picked from commit b1b2146)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Merge to master branch triaged: feature The issue/pr requests/adds a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants