Optimize chacha20 for aarch64 #12397
Given that we are not adding an entirely new code path here, but simply jumping to an existing one based on the MIDR, I think this approach is reasonable. @dot-asm: any thoughts?
Looks okay but I'm not very fluent with aarch64.
Please rebase to resolve conflicts.
I think you know my thoughts, as they are reflected in the commentary: "penalties are considered tolerable." The question is whether the extra joint in the complexity chain is worth the trouble. And it has to be considered in a broader context. Yes, one can say "look, 10%", but when would you find yourself in such a situation? You'd always use it in an AEAD scheme, and that's what really counts. In other words, it's not 10%, but 7%. Then you also have to recognize that a benchmark does not necessarily represent a real-life environment. I mean, in real life you'll be competing for resources, cache at least, with other components, so it won't even be 7%... So where would one draw the line? I, for one, draw it higher than that...
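One way to read the gap between the 10% and 7% figures is a simple Amdahl-style estimate. The sketch below assumes the ChaCha20 and Poly1305 costs simply add up, which the real chacha20-poly1305 code may not satisfy if the two are interleaved, so treat it as rough arithmetic rather than a model of the implementation. The function names are hypothetical; the only inputs taken from this thread are the A55 figures at 8192 bytes (+10.2% for chacha20, +7.4% for chacha20-poly1305).

```c
/* Amdahl-style estimate, ASSUMING the AEAD time is the plain sum of
 * ChaCha20 time and Poly1305 time. Function names are hypothetical. */

/* Overall throughput ratio when a fraction `chacha_fraction` of the
 * original run time is sped up by `chacha_speedup` (e.g. 1.102). */
double aead_speedup(double chacha_fraction, double chacha_speedup)
{
    return 1.0 / ((1.0 - chacha_fraction)
                  + chacha_fraction / chacha_speedup);
}

/* Inverse question: which ChaCha20 time fraction would explain an
 * observed overall ratio `aead_ratio` given `chacha_speedup`? */
double implied_chacha_fraction(double chacha_speedup, double aead_ratio)
{
    return (1.0 - 1.0 / aead_ratio) / (1.0 - 1.0 / chacha_speedup);
}
```

With the A55 numbers, `implied_chacha_fraction(1.102, 1.074)` comes out at roughly 0.74: under the additive assumption, about three quarters of the AEAD time would be ChaCha20, and the remaining quarter dilutes the gain.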
I agree. In this case, it is simply a conditional branch, but it also means you can only exercise this code path (e.g., for testing/validation purposes) if you run on this exact micro-architecture.
The previous 6 * NEON + 2 * ALU code path is optimized for ThunderX2 and is suboptimal on most other platforms. Detecting the microarchitecture at runtime and choosing a suitable code path can help achieve the best performance.

This PR changes the code path to 4 * NEON + 1 * ALU for A53, A55, A57 and A72 (which is also commonly used in Arm servers) cores. chacha20_neon then processes 320 bytes of data at a time and has better overall performance.

The MIDR_EL1 system register is used to determine the CPU core at runtime. Based on PR openssl#11744.

Performance changes after applying the optimization:

| Benchmark | A55 | A53 | A57 | A72 |
|---|---|---|---|---|
| chacha20@8192 | +10.2% | +9.1% | +5.8% | +4.3% |
| chacha20@16384 | +10.4% | +9.2% | +5.7% | +4.3% |
| chacha20-poly1305@8192 | +7.4% | +6.7% | +4.6% | +3.3% |
| chacha20-poly1305@16384 | +7.5% | +6.9% | +4.8% | +3.6% |

Other cores don't change code path, and performance remains the same (tested on Qualcomm SDA845).
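To make the MIDR-based selection concrete, here is a minimal C sketch of the decision. It is not the code from this PR, which is not shown in this conversation; the enum and function names are hypothetical. It relies only on the architectural layout of MIDR_EL1 (implementer in bits [31:24], part number in bits [15:4]) and on the part numbers Arm publishes for these cores. How the register value is obtained is left out; on Linux, user-space reads of MIDR_EL1 are typically emulated by the kernel when the CPUID hwcap is advertised.

```c
#include <stdint.h>

/* MIDR_EL1 fields: implementer in bits [31:24], part number in [15:4]. */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)

#define IMPLEMENTER_ARM 0x41u   /* ASCII 'A' */
#define PART_CORTEX_A53 0xd03u
#define PART_CORTEX_A55 0xd05u
#define PART_CORTEX_A57 0xd07u
#define PART_CORTEX_A72 0xd08u

/* Hypothetical labels for the two code paths named in the description. */
enum chacha_path {
    CHACHA_PATH_6NEON_2ALU, /* default path, tuned for ThunderX2 */
    CHACHA_PATH_4NEON_1ALU  /* (4 + 1) blocks x 64 bytes = 320 bytes */
};

/* Pick a path from a MIDR_EL1 value supplied by the caller. */
enum chacha_path chacha_select_path(uint32_t midr)
{
    if (MIDR_IMPLEMENTER(midr) != IMPLEMENTER_ARM)
        return CHACHA_PATH_6NEON_2ALU;

    switch (MIDR_PARTNUM(midr)) {
    case PART_CORTEX_A53:
    case PART_CORTEX_A55:
    case PART_CORTEX_A57:
    case PART_CORTEX_A72:
        return CHACHA_PATH_4NEON_1ALU;
    default:
        /* Every other core keeps the existing behaviour. */
        return CHACHA_PATH_6NEON_2ALU;
    }
}
```

Taking the MIDR value as a parameter, rather than reading the register inside the function, is a deliberate choice in this sketch: it lets both branches be exercised on any host, for example `chacha_select_path(0x410FD034)` for a Cortex-A53 encoding.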
166f370 just rebased the code. Although 10% is not a significant improvement for chacha20, I still think this patch can be considered acceptable (as the code path in this patch already existed and the changes are small). Anyway, you can decide whether this patch can be merged. :)