SIMD: Optimize the performance of einsum's submodule multiply by using universal intrinsics #17782
Since #17340 is merged, could you use the NPYV partial load and store intrinsics to handle the remaining scalars?
- It should perform better on AVX2/AVX512F without losing performance on other architectures, since we already unroll by four vectors.
- It guarantees equal accuracy for all array elements when fused intrinsic instructions are enabled.
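To illustrate the suggestion, here is a minimal, self-contained sketch of a multiply loop whose tail goes through a partial load and store instead of a scalar loop. It does not use the real NPYV headers: `vec4f`, `vec_load_tillz`, `vec_store_till`, and `multiply_f32` are hypothetical stand-ins that only mimic the behavior I would expect from partial intrinsics such as `npyv_load_tillz_f32` and `npyv_store_till_f32`, with an assumed width of four float lanes. It is not the code in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LANES 4 /* assumed vector width in float lanes */

/* Hypothetical stand-in for a SIMD register holding LANES floats. */
typedef struct { float lane[LANES]; } vec4f;

/* Full-width load and store. */
static vec4f vec_load(const float *p) {
    vec4f v;
    memcpy(v.lane, p, sizeof v.lane);
    return v;
}

static void vec_store(float *p, vec4f v) {
    memcpy(p, v.lane, sizeof v.lane);
}

/* Partial load: read only the first n lanes and zero the rest. */
static vec4f vec_load_tillz(const float *p, size_t n) {
    assert(n > 0 && n <= LANES);
    vec4f v = {{0}};
    memcpy(v.lane, p, n * sizeof(float));
    return v;
}

/* Partial store: write only the first n lanes. */
static void vec_store_till(float *p, size_t n, vec4f v) {
    assert(n > 0 && n <= LANES);
    memcpy(p, v.lane, n * sizeof(float));
}

/* Lane-wise multiply, emulated with a scalar loop. */
static vec4f vec_mul(vec4f a, vec4f b) {
    vec4f r;
    for (int i = 0; i < LANES; i++) {
        r.lane[i] = a.lane[i] * b.lane[i];
    }
    return r;
}

/* out[i] = a[i] * b[i] for i in [0, count). */
static void multiply_f32(const float *a, const float *b, float *out,
                         size_t count) {
    size_t i = 0;
    /* Main body: full vectors. */
    for (; i + LANES <= count; i += LANES) {
        vec_store(out + i, vec_mul(vec_load(a + i), vec_load(b + i)));
    }
    /* Tail: the same vector operation through a partial load and store,
       so no memory past the end of the arrays is read or written. */
    if (i < count) {
        size_t rem = count - i;
        vec4f va = vec_load_tillz(a + i, rem);
        vec4f vb = vec_load_tillz(b + i, rem);
        vec_store_till(out + i, rem, vec_mul(va, vb));
    }
}
```

Because the tail runs through the same vector operation as the main body, every element is computed the same way. That is the point of the second bullet: when a kernel uses fused multiply-add, a scalar tail could round differently from the vectorized body.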
LGTM. Thank you, Chunlin!
@mattip @eric-wieser Development of version 1.21 has started. Is this the right time to accept SIMD PRs?
Thanks @Qiyu8
This is the third part of #17049. There is no impact on the x86 platform, and performance on ARM increases by about 9% to 16%. Here are the benchmark results:
X86(SSE2 enabled) BENCHMARKS NOT SIGNIFICANTLY CHANGED.
ARM(NEON enabled) PERFORMANCE INCREASED
System Info