dnn: add gemm_layer in place of fully_connected_layer for onnx models #23897

Merged: 61 commits merged into opencv:4.x on Sep 19, 2023

Conversation

fengyuentau (Member) commented Jun 30, 2023

Merge with opencv/opencv_extra#1073.

This PR integrates the gemm implementation from ficus nn. A Gemm layer is created to replace the InnerProduct layer for ONNX models. Once this PR is merged and proven to work correctly, all InnerProduct layer calls will be replaced by Gemm.
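
For reference, the ONNX Gemm operator computes Y = alpha * A' * B' + beta * C, where A' and B' are optionally transposed. Below is a minimal sketch of these semantics using the long-standing cv::gemm from OpenCV's core module; it only illustrates what the new layer computes and does not use the fastGemm kernels added in this PR. The [197, 768] x [768, 2304] case is one of the ViT scales benchmarked below.

```cpp
// Minimal sketch of ONNX Gemm semantics: Y = alpha * A' * B' + beta * C.
// Uses cv::gemm from the core module purely for illustration; the PR's
// Gemm layer runs its own fastGemm kernels instead.
#include <opencv2/core.hpp>
#include <iostream>

int main()
{
    cv::Mat A = cv::Mat::ones(197, 768, CV_32F);   // input activations
    cv::Mat B = cv::Mat::ones(768, 2304, CV_32F);  // constant weights
    cv::Mat C = cv::Mat::ones(197, 2304, CV_32F);  // bias, pre-broadcast to Y's shape
    cv::Mat Y;

    const double alpha = 1.0, beta = 1.0;
    // trans_a / trans_b from the ONNX attributes would map to the
    // cv::GEMM_1_T / cv::GEMM_2_T flags in the last argument.
    cv::gemm(A, B, alpha, C, beta, Y, 0);

    std::cout << "Y: " << Y.rows << "x" << Y.cols << std::endl;  // 197x2304
    return 0;
}
```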

Checklist:

  • ficus nn gemm integration.
  • dedicated perf test.
  • avoid re-packing for const A. Removed for simplicity.
  • avoid re-packing for const B.
  • acceleration for x86-64 platform.
  • add backends.
  • optimized for Loongson.

Benchmarks:

Input scales are collected from ViTs. The first scale is A, the second is B, and the third is C if present. All cases are without transposition. Check CI reports for more tests on different scales (square, rectangular, ...).

| Updated @ 0907 | WXG | WXI | MXG | MXI | MAG | MAI | UXG | UXI | UAG | UAI |
|---|---|---|---|---|---|---|---|---|---|---|
| [768, 768], [768, 768], [768] | 7.54 | 25.97 | 2.17 | 6.22 | 2.25 | 7.08 | 3.72 | 13.14 | 5.91 | 20.75 |
| [1024, 1024], [1024, 1024], [1024] | 9.53 | 60.19 | 5.01 | 17.38 | 5.19 | 16.89 | 7.47 | 33.31 | 8.21 | 50.76 |
| [50, 768], [768, 2304] | 2.03 | 6.26 | 1.3 | 1.19 | 0.75 | 1.59 | 2.23 | 2.26 | 4.85 | 6.27 |
| [197, 768], [768, 2304] | 6.85 | 20.62 | 1.76 | 4.69 | 1.77 | 6.14 | 3.69 | 9.12 | 4.54 | 17.71 |
| [50, 1024], [1024, 3072] | 3.98 | 8.57 | 1.13 | 2.94 | 1.2 | 4.2 | 2.19 | 4.96 | 5.39 | 8.92 |
| [197, 1024], [1024, 3072] | 7.24 | 38.15 | 3.12 | 10.9 | 3.07 | 14.58 | 6.24 | 19.95 | 5.72 | 31.93 |

As the table shows, basically all xxG columns are faster; the xxG columns stand for Gemm layer performance on the different hardware platforms.

Notations:

  • WXG: windows10_x64-Gemm
  • WXI: windows10_x64-InnerProduct
  • MXG: macos_x64-Gemm
  • MXI: macos_x64-InnerProduct
  • MAG: macos_arm64-Gemm
  • MAI: macos_arm64-InnerProduct
  • UXG: ubuntu_x64-Gemm
  • UXI: ubuntu_x64-InnerProduct
  • UAG: ubuntu_arm64-Gemm
  • UAI: ubuntu_arm64-InnerProduct

All data in ms (milliseconds).

Other platforms:

| Updated @ 0912 | loongnix_loongson-Gemm | loongnix_loongson-InnerProduct |
|---|---|---|
| A=[768, 768], B=[768, 768], C=[768], trans_a=0, trans_b=0 | 3.31 | 4.27 |
| A=[1024, 1024], B=[1024, 1024], C=[1024], trans_a=0, trans_b=0 | 16.9 | 13.21 |
| A=[50, 768], B=[768, 2304], trans_a=0, trans_b=0 | 1.55 | 3.68 |
| A=[197, 768], B=[768, 2304], trans_a=0, trans_b=0 | 4.01 | 11.91 |
| A=[50, 1024], B=[1024, 3072], trans_a=0, trans_b=0 | 2.55 | 35.41 |
| A=[197, 1024], B=[1024, 3072], trans_a=0, trans_b=0 | 6.32 | 75.15 |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

vpisarev (Contributor) commented:

@fengyuentau, could you provide some performance numbers (before/after)?

fengyuentau (Member, Author) commented Jul 13, 2023

@vpisarev, I just posted the preliminary benchmark results in the first comment. Please take a look.

Still working on the potential overflow problem (it turned out to be an initialization problem). Will provide more results on different input sizes after the fix.
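
As an aside for readers wondering what such an initialization problem looks like in a GEMM kernel: if the kernel accumulates into its output buffer without zeroing it first, whatever the allocator left there leaks into the result. Below is a hypothetical naive sketch of the bug and its memset fix, not the actual PR code (though the commit history does contain an "add memset" commit).

```cpp
// Hypothetical accumulating kernel: C[i*N + j] += A[i*K + k] * B[k*N + j].
// Without the memset, "+=" folds stale buffer contents into the output,
// which shows up as sporadic, seemingly overflow-like wrong results.
#include <cstring>

static void naiveGemmAccum(const float* A, const float* B, float* C,
                           int M, int N, int K)
{
    std::memset(C, 0, sizeof(float) * M * N);  // the fix: zero C before accumulating
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```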

fengyuentau (Member, Author) commented:

Hello @opencv-alalek, could you help download the new model onto the default CI node?

@@ -2066,7 +2066,7 @@ TEST_P(Test_ONNX_nets, Googlenet)
applyTestTag(CV_TEST_TAG_DNN_SKIP_IE_NGRAPH);
#endif

const String model = _tf("models/googlenet.onnx", false);

Contributor commented:

We should not introduce regressions for previously supported models.

You could add new test cases, but you can't just drop them.

Member Author commented:

It is simply wrong to force the model to run with batch size >= 2 when it contains a Reshape operator that always resets the batch size to 1 (more details here). Why should we keep a broken case in our tests? It does not make sense to replicate bugs in a refactored implementation.
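
To make the objection concrete, here is a tiny sketch of why a Reshape with a hardcoded batch-1 target cannot absorb a forced batch of 2. The 1x1024 target is the one mentioned later in this thread; the arithmetic is illustrative, not the dnn module's actual shape-inference code.

```cpp
// A Reshape to a fixed [1, 1024] target has room for exactly 1*1024 values,
// so a blob that blobFromImages turned into batch 2 can never fit through it.
#include <cstdio>

int main()
{
    const long target_elems  = 1L * 1024;  // hardcoded Reshape target [1, 1024]
    const long batched_elems = 2L * 1024;  // the same tensor with forced batch = 2

    if (batched_elems != target_elems)
        std::printf("invalid Reshape: %ld elements cannot fill a %ld-element target\n",
                    batched_elems, target_elems);
    return 0;
}
```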

Contributor commented:

@dkurt Could you please take a look?

Member commented:

Am I right that the original model is not available anymore? https://s3.amazonaws.com/download.onnx/models/opset_8/bvlc_googlenet.tar.gz

Model <GoogleNet (ONNX)>
  expect 739732220ba2e3efa88f7c26f13badad9b7514bc
  catch [Errno 2] No such file or directory: 'bvlc_googlenet.tar.gz'
  hash check failed - downloading
  get https://s3.amazonaws.com/download.onnx/models/opset_8/bvlc_googlenet.tar.gz
  catch HTTP Error 403: Forbidden

Also, if the new model from https://drive.google.com/u/0/uc?id=1FucNLURGgdPk4nCxT0378isPigCcbZ1M&export=download is a modified file on a personal Google Drive, I don't think that's a good idea.

Member commented:

Since even the opset 12 version has a Reshape with a hardcoded 1x1024 shape, I think it does not make sense to update the model to use Flatten. However, there is still a need to update the URL/model/data, so I recommend this one: https://github.com/onnx/models/tree/main/vision/classification/inception_and_googlenet/googlenet

Member Author commented:

Hold on. Although I do not fully agree with all the opinions here, I am working on adding this feature back, which needs only two additional transpositions.

Member commented:

@fengyuentau, sure. Here is a brief proposal; if you find it suitable, we can add it as a workaround: dkurt@cad21a3

Member Author commented:

Man, I found that if batching is allowed here, things are even more broken. In TEST_P(Test_ONNX_nets, Googlenet), the shape of ref is 2x1000 instead of 2x1x1000, meaning the output shape of the last Gemm layer in GoogLeNet is also 2x1000. Note that the input of this Gemm layer is 2 x 1 x 1024 (the 2 is the batch added by blobFromImages in the source input). So shape inference is already broken once the test forces a batch. It may break axis-sensitive layers, and it also affects other backends that have their own graph engines and shape inference.

Also, I can see the forced-batch feature causing trouble in vision transformer models. The following is an example.

[screenshot: attention block of a ViT model in ONNX]

Above is an attention block in ONNX. The batch dimension can be blended into other dimensions: in the second-to-last Reshape, the original batch size 1 is blended into 8. If the forced-batch feature brings in batch=2, the Gemm layer won't work because the inner dimensions will not match.
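
A rough numeric sketch of that mismatch follows; the 197 tokens, 768 channels, 8 heads, and 96 channels per head are ViT-B-style assumptions, not values read from the screenshot.

```cpp
// The exporter folded batch = 1 into the head axis, hardcoding the Reshape
// target to [tokens, heads, head_dim] with heads = 1 * 8. With batch = 2 the
// element counts no longer match, so the Reshape (and the Gemm fed by it) fails.
#include <cstdio>

int main()
{
    const int tokens = 197, channels = 768, heads = 8, head_dim = 96;
    const long target = 1L * tokens * heads * head_dim;  // 197 * 8 * 96 = 151296

    for (int batch = 1; batch <= 2; batch++)
    {
        const long actual = 1L * batch * tokens * channels;  // elements available
        std::printf("batch=%d: %ld elements vs fixed target %ld -> %s\n",
                    batch, actual, target,
                    actual == target ? "ok" : "shape mismatch");
    }
    return 0;
}
```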

Member commented:

@fengyuentau, will dkurt@cad21a3 solve the problem at least partially?

Member Author commented:

It should fix the GoogLeNet problem, but it messes up the dimensions in the attention block: the Transpose operators there move the batch dimension to axis 1.

Could we simply follow the standard instead of going further down the wrong path? The forced-batch feature assumes the batch dimension is at axis 0, which may work for most conv nets but not for ViTs.

fengyuentau (Member, Author) commented:

Hello all, more perf stats are provided in the first comment. Please have a look.

@fengyuentau fengyuentau mentioned this pull request Aug 29, 2023
@fengyuentau fengyuentau marked this pull request as ready for review August 30, 2023 06:14
@vpisarev vpisarev merged commit 8a96e34 into opencv:4.x Sep 19, 2023
23 checks passed

opencv-alalek (Contributor) left a comment:

Test binary crashed in OpenCL builds: #24312

@@ -9,6 +9,7 @@ ocv_add_dispatched_file_force_all("int8layers/layers_common" AVX2 AVX512_SKX LAS
ocv_add_dispatched_file_force_all("layers/cpu_kernels/conv_block" AVX AVX2)
ocv_add_dispatched_file_force_all("layers/cpu_kernels/conv_depthwise" AVX AVX2 RVV LASX)
ocv_add_dispatched_file_force_all("layers/cpu_kernels/conv_winograd_f63" AVX AVX2)
ocv_add_dispatched_file_force_all("layers/cpu_kernels/fast_gemm_kernels" AVX AVX2 NEON LASX)

Contributor commented:

NEON

We usually don't use runtime dispatching with NEON (it doesn't work due to a different ABI).
The whole library is compiled with NEON instead.

Member Author commented:

Fixing via #24315

@@ -2597,6 +2597,40 @@ TEST_P(Test_ONNX_layers, where_node)
testONNXModels("where_layer");
}

TEST_P(Test_ONNX_layers, Conformance_Gemm_all_attributes) {

Contributor commented:

Conformance

That is confusing in test logs.
Avoid using 'conformance' wording outside of test_onnx_conformance.cpp.

Member Author commented:

Fixing via #24315

@fengyuentau fengyuentau mentioned this pull request Sep 25, 2023
@asmorkalov asmorkalov mentioned this pull request Sep 28, 2023
thewoz pushed a commit to thewoz/opencv that referenced this pull request Jan 4, 2024
…opencv#23897)

* first commit

* turned C from input to constant; force C constant in impl; better handling 0d/1d cases

* integrate with gemm from ficus nn

* fix const inputs

* adjust threshold for int8 tryQuantize

* adjust threshold for int8 quantized 2

* support batched gemm and matmul; tune threshold for rcnn_ilsvrc13; update googlenet

* add gemm perf against innerproduct

* add perf tests for innerproduct with bias

* fix perf

* add memset

* renamings for next step

* add dedicated perf gemm

* add innerproduct in perf_gemm

* remove gemm and innerproduct perf tests from perf_layer

* add perf cases for vit sizes; prepack constants

* remove batched gemm; fix wrong trans; optimize KC

* remove prepacking for const A; several fixes for const B prepacking

* add todos and gemm expression

* add optimized branch for avx/avx2

* trigger build

* update macros and signature

* update signature

* fix macro

* fix bugs for neon aarch64 & x64

* add backends: cuda, cann, inf_ngraph and vkcom

* fix cuda backend

* test commit for cuda

* test cuda backend

* remove debug message from cuda backend

* use cpu dispatcher

* fix neon macro undef in dispatcher

* fix dispatcher

* fix inner kernel for neon aarch64

* fix compiling issue on armv7; try fixing accuracy issue on other platforms

* broadcast C with beta multiplied; improve func namings

* fix bug for avx and avx2

* put all platform-specific kernels in dispatcher

* fix typos

* attempt to fix compile issues on x64

* run old gemm when neon, avx, avx2 are all not available; add kernel for armv7 neon

* fix typo

* quick fix: add macros for pack4

* quick fix: use vmlaq_f32 for armv7

* quick fix for missing macro of fast gemm pack f32 4

* disable conformance tests when optimized branches are not supported

* disable perf tests when optimized branches are not supported

* decouple cv_try_neon and cv_neon_aarch64

* drop googlenet_2023; add fastGemmBatched

* fix step in fastGemmBatched

* cpu: fix initialization ofb; gpu: support batch

* quick followup fix for cuda

* add default kernels

* quick followup fix to avoid macro redef

* optimized kernels for lasx

* resolve mis-alignment; remove comments

* tune performance for x64 platform

* tune performance for neon aarch64

* tune for armv7

* comment time consuming tests

* quick follow-up fix
@fengyuentau fengyuentau mentioned this pull request Feb 21, 2024
@fengyuentau fengyuentau deleted the refactor_fc branch March 21, 2024 02:12