New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read MIDR_EL1 system register on aarch64 #11744
Conversation
Could anyone take a look at this CL as well as the related one #12397? Thanks |
This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago |
Ping for review. |
crypto/armcap.c
Outdated
if (sigsetjmp(ill_jmp, 1) == 0) { | ||
_armv8_cpuid_probe(); | ||
OPENSSL_armcap_P |= ARMV8_CPUID; | ||
} | ||
# endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can drop this hunk: this code is not expected to execute at an exception level where MIDR_EL1 is accessible, and so the only way we can read it is through the CPUID emulation, which is a Linux specific feature, and which is advertised via the CPUID hwcap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did wonder if it was necessary but figured it wasn't harmful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it could be harmful, and it would be better to remove this code entirely at some point.
The problem is that the OS may decide to hide certain capabilities from user space, either because of silicon errata if there is a problem with the feature, or in the case of heterogeneous SMP (big.LITTLE), where a task may migrate from a CPU that has support for certain instructions to one that does not.
In this particular case, I think it is harmless, since the only possible way to access MIDR_EL1 is via the emulation, which will always produce the value the OS thinks is the correct one. But in general, we should abide by the hwcap features only, and avoid SIGILL trapping for feature discovery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the OS supports reading hardware capabilities, that is the only information we should use. But it's not available on all platforms, so we need fallback code.
The OS might not always provide enough information for us to make a good decision about which implementation to use. While a CPU might support 2 implementations we have, newer generation of CPUs might preform better using the other that worked best for the older generation. There can also be differences depending on the range of CPU: mobile vs. server. And as far as I know, we don't really either have code nor benchmark results for it to pick the best implementation.
There is 1 PR making use of this PR, which has a performance again of about 3%. I think we've previously considered 10% performance increase the minimum we should see to consider such changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments on this PR.
I agree on Ard's concern, this hunk of code maybe leads to incorrect behavior on heterogeneous SMP platform. I will correct that.
BTW, Do the other hunks above this one also lead to the same concern? Such as _armv8_sha256_probe(), _armv8_sha1_probe().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a sense, those are even worse: if sha256 is supported by the big CPUs, but not by the little CPUs, and the program happens to be started on big one, you might experience SIGILL crashes if the program gets migrated to a little CPU by the OS scheduler.
MIDR_EL1 system register exposes microarchitecture information so that people can make micro-arch related optimization such as exposing as much instruction level parallelism as possible. MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported. Change-Id: Iabb8a36c5d31b184dba6399f378598058d394d4e
b95ea1a
to
b0c8bc8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the code as Ard suggested. That way detecting CPUID maybe misleads code to unexpected path especially on heterogeneous SMP platform. It's more reliable to let OS make the decision if CPUID is support or not.
crypto/armcap.c
Outdated
if (sigsetjmp(ill_jmp, 1) == 0) { | ||
_armv8_cpuid_probe(); | ||
OPENSSL_armcap_P |= ARMV8_CPUID; | ||
} | ||
# endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments on this PR.
I agree on Ard's concern, this hunk of code maybe leads to incorrect behavior on heterogeneous SMP platform. I will correct that.
BTW, Do the other hunks above this one also lead to the same concern? Such as _armv8_sha256_probe(), _armv8_sha1_probe().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still approved. Needs second review.
Not sure if my review counts or not, but I am fine with the patch as is. However, I am still not convinced we need this patch at all, given that the patch that relies on it is still under discussion, and is unlikely to be merged in its current form given the limited performance gain. So please don't merge this yet. |
I agree that we can't merge a PR with limited performance uplift. What I want to point out is that the original optimization on chacha20 is dedicated to ThunderX2, the author commented that his implementation is not friendly to most platforms but ThunderX2 because he thought ThunderX2 is server oriented, other platforms are client oriented. I believe that the author compromised because there wasn't a approach to detect the platform at run time. So actually this PR provide the possibility making chacha20 running at best performance on various platforms. Moreover we can see the performance data from his implementation, A53 gets 9% performance regression, this is to say, based on this PR, #12397 can revert the performance on A53. Another point I would like to mention is that this PR will benefit to micro architecture related optimization which maybe bring significant performance gain. Actually original chacha20 optimization on ThunderX2 is such an example, this optimization dedicated to ThunderX2 makes the performance 37% uplift. Also for further ARM micro architecture like Zeus we will have more opportunities to optimize crypto algorithms based on this PR. So I thinks this PR is valuable. |
This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago |
This PR is in a state where it requires action by @openssl/committers but the last update was 61 days ago |
This PR is in a state where it requires action by @openssl/committers but the last update was 92 days ago |
Just ping... Thanks. |
ping @openssl/committers for review? |
This pull request is ready to merge |
MIDR_EL1 system register exposes microarchitecture information so that people can make micro-arch related optimization such as exposing as much instruction level parallelism as possible. MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported. Change-Id: Iabb8a36c5d31b184dba6399f378598058d394d4e Reviewed-by: Paul Dale <paul.dale@oracle.com> Reviewed-by: Tomas Mraz <tmraz@fedoraproject.org> (Merged from #11744)
Merged to master. Thank you for the contribution. |
Previous 6 * NEON + 2 * ALU code path is optimized for thunderx2, and is suboptimal on most other platforms. Detecting micro architecture at runtime and choosing suitable code path can help achieve best performance. This PR changes code path into 4 * NEON + 1 * ALU for A53, A55, A57 and A72(which is also commonly used in arm servers) cores. Then chacha20_neon processes 320 bytes data at a time, and has better overall performance. Use MIDR_EL1 system register to determine cpu core at runtime. Based on PR openssl#11744. Peformance changes after applying optimization: A55 A53 A57 A72 chacha20@8192 +10.2% +9.1% +5.8% +4.3% chacha20@16384 +10.4% +9.2% +5.7% +4.3% chacha20-poly1305@8192 +7.4% +6.7% +4.6% +3.3% chacha20-poly1305@16384 +7.5% +6.9% +4.8% +3.6% Other cores don't change code path, and performance remains the same(tested on Qualcomm SDA845). Change-Id: I844b9fcadd94595db0007bb0dbdff8548c47775e CustomizedGitHooks: yes
Previous 6 * NEON + 2 * ALU code path is optimized for thunderx2, and is suboptimal on most other platforms. Detecting micro architecture at runtime and choosing suitable code path can help achieve best performance. This PR changes code path into 4 * NEON + 1 * ALU for A53, A55, A57 and A72(which is also commonly used in arm servers) cores. Then chacha20_neon processes 320 bytes data at a time, and has better overall performance. Use MIDR_EL1 system register to determine cpu core at runtime. Based on PR openssl#11744. Peformance changes after applying optimization: A55 A53 A57 A72 chacha20@8192 +10.2% +9.1% +5.8% +4.3% chacha20@16384 +10.4% +9.2% +5.7% +4.3% chacha20-poly1305@8192 +7.4% +6.7% +4.6% +3.3% chacha20-poly1305@16384 +7.5% +6.9% +4.8% +3.6% Other cores don't change code path, and performance remains the same(tested on Qualcomm SDA845).
MIDR_EL1 system register exposes microarchitecture information so that people can make micro-arch related optimization such as exposing as much instruction level parallelism as possible. MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported. Change-Id: Iabb8a36c5d31b184dba6399f378598058d394d4e Reviewed-by: Paul Dale <paul.dale@oracle.com> Reviewed-by: Tomas Mraz <tmraz@fedoraproject.org> (Merged from openssl/openssl#11744)
MIDR_EL1 system register exposes microarchitecture information so that people can make micro-arch related optimization such as exposing as much instruction level parallelism as possible.
MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported.
We are going to optimize some algorithms (chacha20, AES-GCM etc.) based on this change which can get performance uplifted significantly.
Checklist