Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: prototype integration with ArmCL #795

Merged
merged 1 commit into from
Aug 17, 2020
Merged

rfc: prototype integration with ArmCL #795

merged 1 commit into from
Aug 17, 2020

Conversation

diaena
Copy link
Contributor

@diaena diaena commented Jul 31, 2020

The goal for this RFC is to demonstrate an approach for providing optimized implementations of machine learning operators in oneDNN, on AArch64, based on the open-source Arm Compute Library.

Rendered document

Others involved: @nSircombe @cfRod @alenik01

@emfomenk emfomenk added the RFC A design document label Jul 31, 2020
Copy link
Member

@vpirogov vpirogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for detailed RFC, @diaena. I added a few questions regarding documentation, validation and third party components.

- Step 1: introduce CMake changes to allow inclusion of ArmCL as a dependency;
- Step 2: changes to `src/cpu/platform.hpp` to support new AARCH64 macros;
- Step 3: addition of ArmCL based implementation into `src/cpu/aarch64` and
integration into `src/cpu/cpu_convolution_list.cpp`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add changes to documentation (README.md and doc/*) to cover new dependencies and build options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, no problem. We will include a docs update in the implementation plan.

@emfomenk
Copy link

emfomenk commented Aug 3, 2020

Hijacked @kawakami-k's comment from a neighbor thread:

Fujitsu plans to upstream JIT kernels for aarch64 to oneDNN v1.6.
In the next few days, I will be putting out an RFC regarding the implementation of AArch64.

I expected there will be some intersection between the RFCs (e.g. the directory structure, maybe some compiler knobs, etc). So, @kawakami-k, please review this RFC too.

Copy link
Contributor

@igorsafo igorsafo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for very detailed RFC. Taking into account @vpirogov remarks the RFC LGTM.

@kawakami-k
Copy link
Contributor

I don't see any part of this RFC #795 that would cause inconsistency with Fujitsu's RFC, which will be posted soon.
Thank you.

Copy link

@emfomenk emfomenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the very informative RFC and reference to the PoC implementation!

The proposal looks good to me. There is a minor issue with the implementation you described below, specifically keeping the modifiable object (acl_data_) in immutable primitive descriptor or primitive objects (what we've already discussed offline). But since this is an implementation detail, I think this doesn't affect the RFC per se.

@emfomenk emfomenk merged commit a1d0f81 into oneapi-src:rfcs Aug 17, 2020
@diaena diaena deleted the arm-rfc branch November 13, 2020 16:19
nSircombe added a commit to nSircombe/pytorch that referenced this pull request Jan 11, 2021
Since version 1.6, oneDNN has provided limited support for AArch64 builds.

This minor change is to detect an AArch64 CPU and permit the use of
`USE_MKLDNN` in that case.

Build flags for oneDNN are also modified accordingly.

Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels.
These are not optimised for AArch64, but oneDNN v1.7 onwards provides support
for a limited set of primitives based Arm Compute Library.
See: oneapi-src/oneDNN#795
and: oneapi-src/oneDNN#820
for more details. Support for ACL-based oneDNN primitives in PyTorch
will require some further modification,
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Jan 13, 2021
Summary:
Since version 1.6, oneDNN has provided limited support for AArch64 builds.

This minor change is to detect an AArch64 CPU and permit the use of
`USE_MKLDNN` in that case.

Build flags for oneDNN are also modified accordingly.

Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels.
These are not optimised for AArch64, but oneDNN v1.7 onwards provides support
for a limited set of primitives based Arm Compute Library.
See: oneapi-src/oneDNN#795
and: oneapi-src/oneDNN#820
for more details. Support for ACL-based oneDNN primitives in PyTorch
will require some further modification,

Fixes #{issue number}

Pull Request resolved: #50400

Reviewed By: izdeby

Differential Revision: D25886589

Pulled By: malfet

fbshipit-source-id: 2c81277a28ad4528c2d2211381e7c6692d952bc1
@jondea jondea mentioned this pull request Oct 15, 2021
6 tasks
lb991 added a commit to lb991/oneDNN that referenced this pull request Apr 26, 2022
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode.
Implementation follows a similar approach to oneapi-src#1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR oneapi-src#820 and RFC oneapi-src#795. The primitive offers a minimum of ~ x10 speedup
for large tensors (~1million f32s) and is comparable for very small tensors.

Co-authored-by: Louis Kaplan <louis.kaplan@arm.com>

- [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [X] Have you formatted the code using clang-format?

- [X] Have you submitted performance data that demonstrates performance improvements?
@lb991 lb991 mentioned this pull request Apr 26, 2022
3 tasks
lb991 added a commit to lb991/oneDNN that referenced this pull request May 3, 2022
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode.
Implementation follows a similar approach to oneapi-src#1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR oneapi-src#820 and RFC oneapi-src#795. The primitive offers a minimum of ~ x10 speedup
for large tensors (~1million f32s) and is comparable for very small tensors.

Co-authored-by: Louis Kaplan <louis.kaplan@arm.com>

- [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [X] Have you formatted the code using clang-format?

- [X] Have you submitted performance data that demonstrates performance improvements?
lb991 added a commit to lb991/oneDNN that referenced this pull request May 13, 2022
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode.
Implementation follows a similar approach to oneapi-src#1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR oneapi-src#820 and RFC oneapi-src#795. The primitive offers a minimum of ~ x10 speedup
for large tensors (~1million f32s) and is comparable for very small tensors.

Co-authored-by: Louis Kaplan <louis.kaplan@arm.com>

- [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [X] Have you formatted the code using clang-format?

- [X] Have you submitted performance data that demonstrates performance improvements?
mgouicem pushed a commit that referenced this pull request May 31, 2022
This PR adds a PReLU primitive which makes use of Compute Library for the Arm® architecture (ACL), optimised for AArch64 targets. The datatype support is just for f32 and can only be used in forward mode.
Implementation follows a similar approach to #1281, including measures to avoid the parallelisation of small workloads. It builds upon the approach originally introduced in the PR #820 and RFC #795. The primitive offers a minimum of ~ x10 speedup
for large tensors (~1million f32s) and is comparable for very small tensors.

Co-authored-by: Louis Kaplan <louis.kaplan@arm.com>

- [X] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [X] Have you formatted the code using clang-format?

- [X] Have you submitted performance data that demonstrates performance improvements?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RFC A design document
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants