Optimize chacha20 for aarch64 #12397

Closed

Contributor

@xffbai xffbai commented Jul 9, 2020

Optimize chacha20 for aarch64.

The previous 6 * NEON + 2 * ALU code path is optimized for ThunderX2 and is suboptimal on most other platforms. Detecting the microarchitecture at runtime and choosing a suitable code path helps achieve the best performance.

This PR switches the code path to 4 * NEON + 1 * ALU for A53, A55, A57 and A72 cores (the A72 is also commonly used in Arm servers). chacha20_neon then processes 320 bytes of data at a time and has better overall performance.
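
The 320-byte figure follows from the ChaCha20 block size. Below is a minimal sketch, assuming each NEON or ALU stream in the interleaved loop produces one 64-byte block per iteration. The helper name `bytes_per_iteration` is hypothetical and not part of the patch, and the 512-byte result for the 6 + 2 layout is derived under the same assumption rather than stated in this PR.

```c
#include <assert.h>
#include <stddef.h>

/* ChaCha20 generates keystream in 64-byte blocks. */
enum { CHACHA_BLOCK_BYTES = 64 };

/* Compile-time check of the figure quoted in the description:
 * 4 NEON blocks + 1 ALU block per iteration covers 320 bytes. */
static_assert((4 + 1) * CHACHA_BLOCK_BYTES == 320,
              "4 x NEON + 1 x ALU should cover 320 bytes");

/*
 * Bytes consumed per loop iteration when `neon_blocks` blocks are computed
 * in NEON registers and `alu_blocks` blocks in general-purpose registers.
 * Assumption (not from the patch): every stream contributes exactly one block.
 */
static size_t bytes_per_iteration(size_t neon_blocks, size_t alu_blocks)
{
    return (neon_blocks + alu_blocks) * CHACHA_BLOCK_BYTES;
}
```

Under that assumption, `bytes_per_iteration(4, 1)` gives the 320 bytes mentioned above, while the 6 + 2 layout would consume 512 bytes per iteration.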

Use the MIDR_EL1 system register to determine the CPU core at runtime. Based on PR #11744.
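
The selection logic could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this patch: the macro, enum and function names are hypothetical, and the field positions (implementer in bits [31:24], part number in bits [15:4]) and the Cortex part numbers are taken from Arm's public register documentation rather than from this PR. How the MIDR_EL1 value is obtained is platform specific and left out here.

```c
#include <assert.h>
#include <stdint.h>

/* MIDR_EL1 fields: implementer in bits [31:24], part number in bits [15:4]. */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)

#define IMPL_ARM         0x41u   /* Arm Limited */
#define PART_CORTEX_A53  0xd03u
#define PART_CORTEX_A55  0xd05u
#define PART_CORTEX_A57  0xd07u
#define PART_CORTEX_A72  0xd08u

/* Sanity check of the field extraction on a Cortex-A72 style value. */
static_assert(MIDR_PARTNUM(0x410fd083u) == PART_CORTEX_A72,
              "part number should decode to Cortex-A72");

/* Hypothetical labels for the two code paths discussed in this PR. */
enum chacha_path {
    CHACHA_PATH_6NEON_2ALU,  /* existing default, tuned for ThunderX2 */
    CHACHA_PATH_4NEON_1ALU   /* preferred on A53, A55, A57 and A72 */
};

/* Pick a code path from a raw MIDR_EL1 value. */
static enum chacha_path select_chacha_path(uint64_t midr)
{
    if (MIDR_IMPLEMENTER(midr) != IMPL_ARM)
        return CHACHA_PATH_6NEON_2ALU;   /* other vendors keep the default */

    switch (MIDR_PARTNUM(midr)) {
    case PART_CORTEX_A53:
    case PART_CORTEX_A55:
    case PART_CORTEX_A57:
    case PART_CORTEX_A72:
        return CHACHA_PATH_4NEON_1ALU;
    default:
        return CHACHA_PATH_6NEON_2ALU;   /* unlisted cores are unchanged */
    }
}
```

Because the selector takes the register value as a parameter, the decision itself can be checked with synthetic MIDR values on any machine, although the chosen assembly path still only runs on matching hardware.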

Performance changes after applying the optimization:

                          A55     A53    A57    A72
chacha20@8192             +10.2%  +9.1%  +5.8%  +4.3%
chacha20@16384            +10.4%  +9.2%  +5.7%  +4.3%
chacha20-poly1305@8192    +7.4%   +6.7%  +4.6%  +3.3%
chacha20-poly1305@16384   +7.5%   +6.9%  +4.8%  +3.6%

Other cores keep their existing code path, and their performance remains the same (tested on Qualcomm SDA845).

--

Checklist
  • documentation is added or updated
  • tests are added or updated

@xffbai xffbai closed this Jul 10, 2020
@xffbai xffbai reopened this Jul 10, 2020
@xffbai xffbai force-pushed the chacha20-a72 branch 2 times, most recently from 7ba6b6c to 52fa99b Compare September 28, 2020 08:19
@ardbiesheuvel
Contributor

Given that we are not adding an entirely new code path here, but simply jumping to an existing one based on the MIDR, I think this approach is reasonable.

@dot-asm: any thoughts?

Contributor

@paulidale paulidale left a comment

Looks okay but I'm not very fluent with aarch64.

@t8m
Member

t8m commented Dec 9, 2020

Please rebase to resolve conflicts.

@dot-asm
Contributor

dot-asm commented Dec 9, 2020

> @dot-asm: any thoughts?

I think you know my thoughts, as they are reflected in the commentary: "penalties are considered tolerable." The question is whether the extra joint in the complexity chain is worth the trouble. And it has to be considered in a broader context. Yes, one can say "look, 10%", but when would you find yourself in such a situation? You'd always use it in an AEAD scheme, and that's what really counts. In other words, it's not 10%, but 7%. Then you also have to recognize that a benchmark does not necessarily represent a real-life environment. I mean, in real life you'll be competing for resources, such as cache at least, with other components, so it won't even be 7%... So where would one draw the line? I for one draw it higher than that...

@ardbiesheuvel
Contributor

> > @dot-asm: any thoughts?
>
> I think you know my thoughts, as they are reflected in the commentary: "penalties are considered tolerable." The question is whether the extra joint in the complexity chain is worth the trouble. And it has to be considered in a broader context. Yes, one can say "look, 10%", but when would you find yourself in such a situation? You'd always use it in an AEAD scheme, and that's what really counts. In other words, it's not 10%, but 7%. Then you also have to recognize that a benchmark does not necessarily represent a real-life environment. I mean, in real life you'll be competing for resources, such as cache at least, with other components, so it won't even be 7%... So where would one draw the line? I for one draw it higher than that...

I agree. In this case, it is simply a conditional branch, but it also means you can only exercise this code path (e.g., for testing/validation purposes) if you run on this exact micro-architecture.

@xffbai xffbai closed this Dec 10, 2020
@xffbai xffbai deleted the chacha20-a72 branch December 10, 2020 06:41
@xffbai xffbai reopened this Dec 10, 2020
@xffbai
Contributor Author

xffbai commented Dec 22, 2020

166f370 just rebased the code.

Although 10% is not a significant improvement for chacha20, I still think this patch can be considered acceptable, as the code path used in this patch already existed and the changes are small. Anyway, you can decide whether this patch can be merged. :)

@xffbai xffbai closed this Apr 9, 2021