Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read MIDR_EL1 system register on aarch64 #11744

Closed
wants to merge 1 commit into from

Conversation

zorrorffm
Copy link
Contributor

MIDR_EL1 system register exposes microarchitecture information so that people can make micro-arch related optimization such as exposing as much instruction level parallelism as possible.

MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported.

We are going to optimize some algorithms (chacha20, AES-GCM etc.) based on this change which can get performance uplifted significantly.

Checklist
  • documentation is added or updated
  • tests are added or updated

@xffbai xffbai mentioned this pull request Jul 9, 2020
2 tasks
@zorrorffm
Copy link
Contributor Author

zorrorffm commented Jul 23, 2020

Could anyone take a look at this CL as well as the related one #12397? Thanks

@paulidale paulidale added approval: review pending This pull request needs review by a committer branch: master Merge to master branch labels Jul 23, 2020
@openssl-machine
Copy link
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago

@paulidale
Copy link
Contributor

Ping for review.

crypto/armcap.c Outdated
if (sigsetjmp(ill_jmp, 1) == 0) {
_armv8_cpuid_probe();
OPENSSL_armcap_P |= ARMV8_CPUID;
}
# endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop this hunk: this code is not expected to execute at an exception level where MIDR_EL1 is accessible, and so the only way we can read it is through the CPUID emulation, which is a Linux specific feature, and which is advertised via the CPUID hwcap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did wonder if it was necessary but figured it wasn't harmful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it could be harmful, and it would be better to remove this code entirely at some point.

The problem is that the OS may decide to hide certain capabilities from user space, either because of silicon errata if there is a problem with the feature, or in the case of heterogeneous SMP (big.LITTLE), where a task may migrate from a CPU that has support for certain instructions to one that does not.

In this particular case, I think it is harmless, since the only possible way to access MIDR_EL1 is via the emulation, which will always produce the value the OS thinks is the correct one. But in general, we should abide by the hwcap features only, and avoid SIGILL trapping for feature discovery.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the OS supports reading hardware capabilities, that is the only information we should use. But it's not available on all platforms, so we need fallback code.

The OS might not always provide enough information for us to make a good decision about which implementation to use. While a CPU might support 2 implementations we have, newer generation of CPUs might preform better using the other that worked best for the older generation. There can also be differences depending on the range of CPU: mobile vs. server. And as far as I know, we don't really either have code nor benchmark results for it to pick the best implementation.

There is 1 PR making use of this PR, which has a performance again of about 3%. I think we've previously considered 10% performance increase the minimum we should see to consider such changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments on this PR.
I agree on Ard's concern, this hunk of code maybe leads to incorrect behavior on heterogeneous SMP platform. I will correct that.

BTW, Do the other hunks above this one also lead to the same concern? Such as _armv8_sha256_probe(), _armv8_sha1_probe().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a sense, those are even worse: if sha256 is supported by the big CPUs, but not by the little CPUs, and the program happens to be started on big one, you might experience SIGILL crashes if the program gets migrated to a little CPU by the OS scheduler.

MIDR_EL1 system register exposes microarchitecture information so that
people can make micro-arch related optimization such as exposing as
much instruction level parallelism as possible.

MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported.

Change-Id: Iabb8a36c5d31b184dba6399f378598058d394d4e
Copy link
Contributor Author

@zorrorffm zorrorffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the code as Ard suggested. That way detecting CPUID maybe misleads code to unexpected path especially on heterogeneous SMP platform. It's more reliable to let OS make the decision if CPUID is support or not.

crypto/armcap.c Outdated
if (sigsetjmp(ill_jmp, 1) == 0) {
_armv8_cpuid_probe();
OPENSSL_armcap_P |= ARMV8_CPUID;
}
# endif
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments on this PR.
I agree on Ard's concern, this hunk of code maybe leads to incorrect behavior on heterogeneous SMP platform. I will correct that.

BTW, Do the other hunks above this one also lead to the same concern? Such as _armv8_sha256_probe(), _armv8_sha1_probe().

Copy link
Contributor

@paulidale paulidale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still approved. Needs second review.

@ardbiesheuvel
Copy link
Contributor

Not sure if my review counts or not, but I am fine with the patch as is. However, I am still not convinced we need this patch at all, given that the patch that relies on it is still under discussion, and is unlikely to be merged in its current form given the limited performance gain. So please don't merge this yet.

@zorrorffm
Copy link
Contributor Author

zorrorffm commented Sep 4, 2020

Not sure if my review counts or not, but I am fine with the patch as is. However, I am still not convinced we need this patch at all, given that the patch that relies on it is still under discussion, and is unlikely to be merged in its current form given the limited performance gain. So please don't merge this yet.

I agree that we can't merge a PR with limited performance uplift. What I want to point out is that the original optimization on chacha20 is dedicated to ThunderX2, the author commented that his implementation is not friendly to most platforms but ThunderX2 because he thought ThunderX2 is server oriented, other platforms are client oriented. I believe that the author compromised because there wasn't a approach to detect the platform at run time. So actually this PR provide the possibility making chacha20 running at best performance on various platforms. Moreover we can see the performance data from his implementation, A53 gets 9% performance regression, this is to say, based on this PR, #12397 can revert the performance on A53.

Another point I would like to mention is that this PR will benefit to micro architecture related optimization which maybe bring significant performance gain. Actually original chacha20 optimization on ThunderX2 is such an example, this optimization dedicated to ThunderX2 makes the performance 37% uplift. Also for further ARM micro architecture like Zeus we will have more opportunities to optimize crypto algorithms based on this PR. So I thinks this PR is valuable.

@openssl-machine
Copy link
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago

@openssl-machine
Copy link
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 61 days ago

@openssl-machine
Copy link
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 92 days ago

@zorrorffm
Copy link
Contributor Author

zorrorffm commented Dec 7, 2020

Just ping...
This PR provides the opportunity to balance the performance on all platform. Based on this PR, #12397 makes all the performance on different platforms(A53, A55, A57, A72) uplifted while no performance regression on ThunderX2 occurs. Some test cases on A55 get 10% performance uplift.

Thanks.

@paulidale
Copy link
Contributor

ping @openssl/committers for review?

@t8m t8m added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Dec 8, 2020
@openssl-machine openssl-machine removed the approval: done This pull request has the required number of approvals label Dec 9, 2020
@openssl-machine
Copy link
Collaborator

This pull request is ready to merge

@openssl-machine openssl-machine added the approval: ready to merge The 24 hour grace period has passed, ready to merge label Dec 9, 2020
openssl-machine pushed a commit that referenced this pull request Dec 9, 2020
MIDR_EL1 system register exposes microarchitecture information so that
people can make micro-arch related optimization such as exposing as
much instruction level parallelism as possible.

MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported.

Change-Id: Iabb8a36c5d31b184dba6399f378598058d394d4e

Reviewed-by: Paul Dale <paul.dale@oracle.com>
Reviewed-by: Tomas Mraz <tmraz@fedoraproject.org>
(Merged from #11744)
@t8m
Copy link
Member

t8m commented Dec 9, 2020

Merged to master. Thank you for the contribution.

@t8m t8m closed this Dec 9, 2020
xffbai added a commit to xffbai/openssl that referenced this pull request Dec 10, 2020
Previous 6 * NEON + 2 * ALU code path is optimized for thunderx2, and
is suboptimal on most other platforms. Detecting micro architecture
at runtime and choosing suitable code path can help achieve best
performance.

This PR changes code path into 4 * NEON + 1 * ALU for A53, A55, A57
and A72(which is also commonly used in arm servers) cores.
Then chacha20_neon processes 320 bytes data at a time, and has
better overall performance.

Use MIDR_EL1 system register to determine cpu core at runtime.
Based on PR openssl#11744.

Peformance changes after applying optimization:
                          A55     A53    A57    A72
chacha20@8192             +10.2%  +9.1%  +5.8%  +4.3%
chacha20@16384            +10.4%  +9.2%  +5.7%  +4.3%
chacha20-poly1305@8192    +7.4%   +6.7%  +4.6%  +3.3%
chacha20-poly1305@16384   +7.5%   +6.9%  +4.8%  +3.6%

Other cores don't change code path, and performance remains the
same(tested on Qualcomm SDA845).

Change-Id: I844b9fcadd94595db0007bb0dbdff8548c47775e
CustomizedGitHooks: yes
@xffbai xffbai mentioned this pull request Dec 10, 2020
2 tasks
xffbai added a commit to xffbai/openssl that referenced this pull request Dec 10, 2020
Previous 6 * NEON + 2 * ALU code path is optimized for thunderx2, and
is suboptimal on most other platforms. Detecting micro architecture
at runtime and choosing suitable code path can help achieve best
performance.

This PR changes code path into 4 * NEON + 1 * ALU for A53, A55, A57
and A72(which is also commonly used in arm servers) cores.
Then chacha20_neon processes 320 bytes data at a time, and has
better overall performance.

Use MIDR_EL1 system register to determine cpu core at runtime.
Based on PR openssl#11744.

Peformance changes after applying optimization:
                          A55     A53    A57    A72
chacha20@8192             +10.2%  +9.1%  +5.8%  +4.3%
chacha20@16384            +10.4%  +9.2%  +5.7%  +4.3%
chacha20-poly1305@8192    +7.4%   +6.7%  +4.6%  +3.3%
chacha20-poly1305@16384   +7.5%   +6.9%  +4.8%  +3.6%

Other cores don't change code path, and performance remains the
same(tested on Qualcomm SDA845).
dongbeiouba pushed a commit to auvkone/BabaSSL that referenced this pull request Mar 4, 2022
MIDR_EL1 system register exposes microarchitecture information so that
people can make micro-arch related optimization such as exposing as
much instruction level parallelism as possible.

MIDR_EL1 register can be read only if HWCAP_CPUID feature is supported.

Change-Id: Iabb8a36c5d31b184dba6399f378598058d394d4e

Reviewed-by: Paul Dale <paul.dale@oracle.com>
Reviewed-by: Tomas Mraz <tmraz@fedoraproject.org>
(Merged from openssl/openssl#11744)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Merge to master branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants