-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: prototype integration with ArmCL #795
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for detailed RFC, @diaena. I added a few questions regarding documentation, validation and third party components.
- Step 1: introduce CMake changes to allow inclusion of ArmCL as a dependency; | ||
- Step 2: changes to `src/cpu/platform.hpp` to support new AARCH64 macros; | ||
- Step 3: addition of ArmCL based implementation into `src/cpu/aarch64` and | ||
integration into `src/cpu/cpu_convolution_list.cpp`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add changes to documentation (README.md and doc/*) to cover new dependencies and build options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, no problem. We will include a docs update in the implementation plan.
Hijacked @kawakami-k's comment from a neighbor thread:
I expected there will be some intersection between the RFCs (e.g. the directory structure, maybe some compiler knobs, etc). So, @kawakami-k, please review this RFC too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for very detailed RFC. Taking into account @vpirogov remarks the RFC LGTM.
I don't see any part of this RFC #795 that would cause inconsistency with Fujitsu's RFC, which will be posted soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the very informative RFC and reference to the PoC implementation!
The proposal looks good to me. There is a minor issue with the implementation you described below, specifically keeping the modifiable object (acl_data_
) in immutable primitive descriptor or primitive objects (what we've already discussed offline). But since this is an implementation detail, I think this doesn't affect the RFC per se.
Since version 1.6, oneDNN has provided limited support for AArch64 builds. This minor change is to detect an AArch64 CPU and permit the use of `USE_MKLDNN` in that case. Build flags for oneDNN are also modified accordingly. Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels. These are not optimised for AArch64, but oneDNN v1.7 onwards provides support for a limited set of primitives based Arm Compute Library. See: oneapi-src/oneDNN#795 and: oneapi-src/oneDNN#820 for more details. Support for ACL-based oneDNN primitives in PyTorch will require some further modification,
Summary: Since version 1.6, oneDNN has provided limited support for AArch64 builds. This minor change is to detect an AArch64 CPU and permit the use of `USE_MKLDNN` in that case. Build flags for oneDNN are also modified accordingly. Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels. These are not optimised for AArch64, but oneDNN v1.7 onwards provides support for a limited set of primitives based Arm Compute Library. See: oneapi-src/oneDNN#795 and: oneapi-src/oneDNN#820 for more details. Support for ACL-based oneDNN primitives in PyTorch will require some further modification, Fixes #{issue number} Pull Request resolved: #50400 Reviewed By: izdeby Differential Revision: D25886589 Pulled By: malfet fbshipit-source-id: 2c81277a28ad4528c2d2211381e7c6692d952bc1
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode. Implementation follows a similar approach to oneapi-src#1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR oneapi-src#820 and RFC oneapi-src#795. The primitive offers a minimum of ~ x10 speedup for large tensors (~1million f32s) and is comparable for very small tensors. Co-authored-by: Louis Kaplan <louis.kaplan@arm.com> - [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit? - [X] Have you formatted the code using clang-format? - [X] Have you submitted performance data that demonstrates performance improvements?
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode. Implementation follows a similar approach to oneapi-src#1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR oneapi-src#820 and RFC oneapi-src#795. The primitive offers a minimum of ~ x10 speedup for large tensors (~1million f32s) and is comparable for very small tensors. Co-authored-by: Louis Kaplan <louis.kaplan@arm.com> - [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit? - [X] Have you formatted the code using clang-format? - [X] Have you submitted performance data that demonstrates performance improvements?
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode. Implementation follows a similar approach to oneapi-src#1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR oneapi-src#820 and RFC oneapi-src#795. The primitive offers a minimum of ~ x10 speedup for large tensors (~1million f32s) and is comparable for very small tensors. Co-authored-by: Louis Kaplan <louis.kaplan@arm.com> - [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit? - [X] Have you formatted the code using clang-format? - [X] Have you submitted performance data that demonstrates performance improvements?
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode. Implementation follows a similar approach to #1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR #820 and RFC #795. The primitive offers a minimum of ~ x10 speedup for large tensors (~1million f32s) and is comparable for very small tensors. Co-authored-by: Louis Kaplan <louis.kaplan@arm.com> - [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit? - [X] Have you formatted the code using clang-format? - [X] Have you submitted performance data that demonstrates performance improvements?
The goal for this RFC is to demonstrate an approach for providing optimized implementations of machine learning operators in oneDNN, on AArch64, based on the open-source Arm Compute Library.
Rendered document
Others involved: @nSircombe @cfRod @alenik01