Optimize chacha20 for aarch64 #12397

xffbai · 2020-07-09T04:07:17Z

Optimize chacha20 for aarch64.

The previous 6 * NEON + 2 * ALU code path is optimized for ThunderX2 and is suboptimal on most other platforms. Detecting the microarchitecture at runtime and choosing a suitable code path helps achieve the best performance.

This PR switches the code path to 4 * NEON + 1 * ALU for A53, A55, A57 and A72 cores (the A72 is also commonly used in Arm servers). chacha20_neon then processes 320 bytes of data at a time and has better overall performance.
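The 320-byte figure follows from the block count, assuming each unit in the "N * NEON + M * ALU" notation handles one 64-byte ChaCha20 block per iteration. A minimal sketch of that arithmetic (the helper name is hypothetical and is not taken from the patch):

```c
#include <stddef.h>

/* A ChaCha20 block is 64 bytes. */
enum { CHACHA_BLOCK_BYTES = 64 };

/*
 * Hypothetical helper: bytes consumed per loop iteration when
 * neon_blocks blocks are computed in NEON registers and alu_blocks
 * blocks are computed in scalar (ALU) registers in parallel.
 */
static size_t bytes_per_iteration(size_t neon_blocks, size_t alu_blocks)
{
    return (neon_blocks + alu_blocks) * CHACHA_BLOCK_BYTES;
}
```

Under that assumption, `bytes_per_iteration(4, 1)` gives the 320 bytes mentioned above, while the 6 * NEON + 2 * ALU path would consume 512 bytes per iteration.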

The MIDR_EL1 system register is used to determine the CPU core at runtime. Based on PR #11744.
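As an illustration of the selection logic, here is a minimal C sketch that decodes an already-read MIDR_EL1 value and decides whether the core is one of the four listed above. The field positions (implementer in bits [31:24], part number in bits [15:4]) and the Arm part numbers are architectural; the macro and function names are hypothetical, and the actual patch performs this check in assembly rather than C. How the register value is obtained is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 fields: implementer in bits [31:24], part number in bits [15:4]. */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)

#define IMPL_ARM         0x41u  /* Arm Ltd. */
#define PART_CORTEX_A53  0xd03u
#define PART_CORTEX_A55  0xd05u
#define PART_CORTEX_A57  0xd07u
#define PART_CORTEX_A72  0xd08u

/*
 * Hypothetical helper: returns true when the core identified by midr
 * is one of the cores this PR routes to the 4 * NEON + 1 * ALU path.
 * Every other core keeps its existing code path.
 */
static bool prefers_4neon_1alu(uint64_t midr)
{
    if (MIDR_IMPLEMENTER(midr) != IMPL_ARM)
        return false;

    switch (MIDR_PARTNUM(midr)) {
    case PART_CORTEX_A53:
    case PART_CORTEX_A55:
    case PART_CORTEX_A57:
    case PART_CORTEX_A72:
        return true;
    default:
        return false;
    }
}
```

For example, a MIDR value of `0x410fd034` (Arm implementer, part 0xd03) would select the 4 * NEON + 1 * ALU path, while a value with a non-Arm implementer code would not.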

Performance changes after applying the optimization:

                          A55     A53    A57    A72
chacha20@8192             +10.2%  +9.1%  +5.8%  +4.3%
chacha20@16384            +10.4%  +9.2%  +5.7%  +4.3%
chacha20-poly1305@8192    +7.4%   +6.7%  +4.6%  +3.3%
chacha20-poly1305@16384   +7.5%   +6.9%  +4.8%  +3.6%

Other cores keep their existing code path, and their performance remains the same (tested on Qualcomm SDA845).

--

Checklist

documentation is added or updated
tests are added or updated

ardbiesheuvel · 2020-12-07T08:05:24Z

Given that we are not adding an entirely new code path here, but simply jumping to an existing one based on the MIDR, I think this approach is reasonable.

@dot-asm: any thoughts?

paulidale

Looks okay but I'm not very fluent with aarch64.

t8m · 2020-12-09T09:12:14Z

Please rebase to resolve conflicts.

dot-asm · 2020-12-09T13:03:51Z

> @dot-asm: any thoughts?

I think you know my thoughts, as they are reflected in the commentary: "penalties are considered tolerable." The question is whether the extra joint in the complexity chain is worth the trouble. And it has to be considered in a broader context. Yes, one can say "look, 10%", but when would you find yourself in such a situation? You'd always use it in an AEAD scheme, and that's what really counts. In other words, it's not 10%, but 7%. Then you also have to recognize that a benchmark does not necessarily represent a real-life environment. I mean, in real life you'll be competing for resources, such as cache at least, with other components, so it won't even be 7%... So where would one draw the line? I for one draw it higher than that...

ardbiesheuvel · 2020-12-09T14:49:24Z

> > @dot-asm: any thoughts?
>
> I think you know my thoughts, as they are reflected in the commentary: "penalties are considered tolerable." The question is whether the extra joint in the complexity chain is worth the trouble. And it has to be considered in a broader context. Yes, one can say "look, 10%", but when would you find yourself in such a situation? You'd always use it in an AEAD scheme, and that's what really counts. In other words, it's not 10%, but 7%. Then you also have to recognize that a benchmark does not necessarily represent a real-life environment. I mean, in real life you'll be competing for resources, such as cache at least, with other components, so it won't even be 7%... So where would one draw the line? I for one draw it higher than that...

I agree. In this case, it is simply a conditional branch, but it also means you can only exercise this code path (e.g., for testing/validation purposes) if you run on this exact micro-architecture.


xffbai · 2020-12-22T08:55:22Z

166f370 just rebased the code.

Although 10% is not a significant improvement for chacha20, I still think this patch can be considered acceptable (the code path it selects already existed, and the changes are small). Anyway, you can decide whether this patch should be merged. :)

xffbai closed this Jul 10, 2020

xffbai reopened this Jul 10, 2020

zorrorffm mentioned this pull request Jul 23, 2020

Read MIDR_EL1 system register on aarch64 #11744

Closed

2 tasks

xffbai force-pushed the chacha20-a72 branch 2 times, most recently from 7ba6b6c to 52fa99b Compare September 28, 2020 08:19

paulidale approved these changes Dec 9, 2020


xffbai closed this Dec 10, 2020

xffbai deleted the chacha20-a72 branch December 10, 2020 06:41

xffbai reopened this Dec 10, 2020

xffbai force-pushed the chacha20-a72 branch from 52fa99b to 166f370 Compare December 10, 2020 07:49

xffbai closed this Apr 9, 2021
