
AArch64: Add fixed format support for ACL primitives #1590

Merged
4 commits merged into oneapi-src:master on Apr 11, 2023

Conversation

@jondea (Contributor) commented on Mar 13, 2023

Description

This PR makes the Compute Library for Arm® (ACL) primitives which use GEMM (inner product, matmul, convolution) use fixed format kernels. By this we mean that the weights are reordered to be blocked/interleaved in oneDNN rather than in ACL, so the kernel used needs to have a known and fixed format.
This has multiple positives:

  • It fits better with the oneDNN philosophy of separating the primitive and reorder
  • Weights can be variable, because the reorder is performed at the oneDNN/framework level
  • And therefore, most importantly: primitives can be properly cached in TensorFlow without checking the content of the weights (which is very hacky)

This has a negligible effect on performance overall, because the kernels are similar and the reorders are just happening in a different place.

The broad idea is (a user-level sketch in code follows this list):

  • Ask ACL what WeightsFormat it wants the weights in, via the has_opt_impl function
  • Convert the resulting WeightsFormat into a memory descriptor
  • Get oneDNN to reorder by modifying the weights memory descriptor
  • Set the strides on the weights TensorInfo to reflect the reordered layout
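
From the framework's point of view, this surfaces through oneDNN's standard "any format" flow. The sketch below is illustrative only (an assumed user-level usage, not code from this patch): a primitive descriptor built with format_tag::any reports the blocked layout the fixed format kernel expects, and the plain weights are reordered into it once, up front.

    #include <oneapi/dnnl/dnnl.hpp>
    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream s(eng);

        memory::dims src_dims = {2, 16}, wei_dims = {8, 16}, dst_dims = {2, 8};

        // Plain layouts that the framework holds its data in.
        auto src_md = memory::desc(src_dims, memory::data_type::f32, memory::format_tag::nc);
        auto dst_md = memory::desc(dst_dims, memory::data_type::f32, memory::format_tag::nc);
        // format_tag::any lets the implementation pick the blocked/interleaved
        // layout of the fixed format kernel.
        auto wei_any_md = memory::desc(wei_dims, memory::data_type::f32, memory::format_tag::any);

        auto ip_pd = inner_product_forward::primitive_desc(eng,
                prop_kind::forward_inference, src_md, wei_any_md, dst_md);

        // Framework-held plain weights, and a buffer in the layout the
        // implementation asked for.
        auto wei_plain = memory({wei_dims, memory::data_type::f32, memory::format_tag::oi}, eng);
        auto wei_opt = memory(ip_pd.weights_desc(), eng);

        // One-off reorder into the kernel's fixed format: this is the reorder
        // that previously happened inside ACL.
        if (ip_pd.weights_desc() != wei_plain.get_desc())
            reorder(wei_plain, wei_opt).execute(s, wei_plain, wei_opt);
        s.wait();
        return 0;
    }

Because the reorder is now a separate, cacheable step, the framework can hoist it out of the execution loop, which is what enables the TensorFlow caching mentioned above.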

Things to note:

  • This patch causes the examples cpu-cnn-inference-f32-cpp and cpu-primitives-inner-product-cpp to fail because inner product now implicitly collapses dimensions (a sketch of the collapse follows this list). This behaviour is supported by benchdnn, but not by these two examples. Would it make sense to modify reorder_primitive_desc_create and consistent_with to allow collapsing dimensions?
  • We have also removed wino because it is not yet compatible with this approach, and it is also failing for some unrelated reasons. Due to the existing heuristics, we do not see it used in any models currently, but we may return to this later with a working implementation and more effective heuristics.
  • This patch causes some NCHW fastmath convolutions to no longer be supported; we are working on a fix, but these are rarely used.
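
For concreteness, here is a minimal sketch (assumed shapes, not code from the patch) of the implicit dimension collapse in question: a dense 4D weights descriptor viewed as 2D via memory::desc::reshape, which only succeeds when the collapsed dimensions are dense and contiguous. That is the same consistency question raised above for reorder_primitive_desc_create and consistent_with.

    #include <oneapi/dnnl/dnnl.hpp>
    using namespace dnnl;

    int main() {
        // Dense, plain 4D weights {OC, IC, H, W}.
        auto wei_4d = memory::desc({8, 3, 2, 2}, memory::data_type::f32,
                memory::format_tag::oihw);
        // The collapsed 2D view {OC, IC*H*W} that inner product now works
        // with internally. reshape() throws if the collapsed dimensions
        // are not dense and contiguous.
        auto wei_2d = wei_4d.reshape({8, 3 * 2 * 2});
        (void)wei_2d;
        return 0;
    }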

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit? (see comment in body)
  • Have you formatted the code using clang-format?

@jondea (Contributor, Author) commented on Mar 15, 2023

Any feedback would be appreciated on this. Also, would it still be possible for a backport to v3.1?

src/cpu/aarch64/acl_inner_product.hpp (review thread):

    if (!s_mdw.consistent_with(d_mdw)) return invalid_arguments;
    // If src and dst are not already consistent, try to reshape them
    memory_desc_t reshaped_src_md = *src_md;
    auto reshaped_s_mdw = memory_desc_wrapper(&reshaped_src_md);
@igorsafo (Contributor) commented on Mar 16, 2023

This change applies to the whole library and changes API behavior. If you want to have such a feature, we will need an RFC to discuss available options.

As an alternative option you could uncollapse weights md internally in the implementation before returning to a user (or keep both collapsed and uncollapsed versions for internal use and API calls). This way reorders will work as expected and the change will be isolated to this specific inner product implementation.

@jondea (Contributor, Author) replied:
That's a fair point, I'll have a think and create an RFC if it still makes sense.

Would you be able to expand on what you mean by this:

> As an alternative option you could uncollapse weights md internally in the implementation before returning to a user

I think we need to pass the collapsed memory descriptor back to the user so that they can do the right reorder. The collapsed reorder is different from the non-collapsed one. It is also beneficial: for example, if we have a tensor with H=2, W=2 and IC=3 and we need to pad IC to a multiple of 4, then (see the arithmetic sketch after this list)

  • 2x2x3 -pad-> 2x2x4 -collapse-> 16
  • 2x2x3 -collapse-> 12 -pad-> 12 (12 is already a multiple of 4, so nothing is padded)
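
A tiny arithmetic sketch of why the two orders disagree (illustrative only; rnd_up here is a local helper, not the oneDNN utility):

    #include <cstdio>

    // Round n up to the nearest multiple of `multiple`.
    static int rnd_up(int n, int multiple) {
        return ((n + multiple - 1) / multiple) * multiple;
    }

    int main() {
        const int H = 2, W = 2, IC = 3, mult = 4;
        // Pad IC first, then collapse: 2*2*4 = 16 elements.
        const int pad_then_collapse = H * W * rnd_up(IC, mult);
        // Collapse first, then pad: 2*2*3 = 12 is already a multiple of 4,
        // so padding changes nothing and the two layouts disagree.
        const int collapse_then_pad = rnd_up(H * W * IC, mult);
        std::printf("%d vs %d\n", pad_then_collapse, collapse_then_pad); // 16 vs 12
        return 0;
    }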

src/cpu/aarch64/acl_convolution_utils.cpp (resolved review thread)
@igorsafo (Contributor) replied:

> Any feedback would be appreciated on this. Also, would it still be possible for a backport to v3.1?

There is not much time to get it into v3.1 (production release March 23, 2023 (ww12.4'23)); we are at the end of a validation cycle.
For this to happen we need to make sure:

  • all tests pass on your side
  • the changes are isolated to the AArch64 implementation

@jondea (Contributor, Author) commented on Mar 20, 2023

Understood, thank you for the feedback @igorsafo!

@jondea (Contributor, Author) commented on Mar 29, 2023

I should say that all the tests now pass on AArch64.

jondea and others added 3 commits March 30, 2023 10:22
Remove ACL winograd convolution because it is not compatible with
upcoming fixed format kernel changes, is broken, and does not have a
clear use case
- Use the fixed format interface for ACL, where we first ask ACL what
  format it wants the weights in, and then reorder the weights in
  oneDNN. This has the benefit of us being able to hoist the reorders
  out to the framework level. It also has the added benefit of allowing
  the weights to be modified between executions, and is a necessary step
  towards making the oneDNN-ACL interface stateless.
- Note that this patch causes some fastmath inner products and fastmath
  NCHW convolutions to fall back to non-fastmath kernels

Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
@mgouicem merged commit f62a91b into oneapi-src:master on Apr 11, 2023
@mgouicem (Contributor) commented:
Thanks for the contribution!
