dnn: add gemm_layer in place of fully_connected_layer for onnx models #23897

Merged: 61 commits merged into opencv:4.x on Sep 19, 2023

Conversation

fengyuentau (Member) commented Jun 30, 2023

Merge with opencv/opencv_extra#1073.

This PR integrates the gemm implementation from ficus nn. A Gemm layer is created to replace the InnerProduct layer for ONNX models. Once this PR is merged and proven to work correctly, all InnerProduct layer calls will be replaced by Gemm.
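
For reference, the ONNX Gemm operator computes Y = alpha * A' * B' + beta * C, where A' and B' are optionally transposed. Below is a minimal sketch of these semantics using the long-standing cv::gemm from OpenCV's core module; it only illustrates what the new layer computes and does not use the fastGemm kernels added in this PR. The [197, 768] x [768, 2304] case is one of the ViT scales benchmarked below.

```cpp
// Minimal sketch of ONNX Gemm semantics: Y = alpha * A' * B' + beta * C.
// Uses cv::gemm from the core module purely for illustration; the PR's
// Gemm layer runs its own fastGemm kernels instead.
#include <opencv2/core.hpp>
#include <iostream>

int main()
{
    cv::Mat A = cv::Mat::ones(197, 768, CV_32F);   // input activations
    cv::Mat B = cv::Mat::ones(768, 2304, CV_32F);  // constant weights
    cv::Mat C = cv::Mat::ones(197, 2304, CV_32F);  // bias, pre-broadcast to Y's shape
    cv::Mat Y;

    const double alpha = 1.0, beta = 1.0;
    // trans_a / trans_b from the ONNX attributes would map to the
    // cv::GEMM_1_T / cv::GEMM_2_T flags in the last argument.
    cv::gemm(A, B, alpha, C, beta, Y, 0);

    std::cout << "Y: " << Y.rows << "x" << Y.cols << std::endl;  // 197x2304
    return 0;
}
```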

Checklist:

  • ficus nn gemm integration.
  • dedicated perf test.
  • avoid re-packing for const A. Removed for simplicity.
  • avoid re-packing for const B.
  • acceleration for x86-64 platform.
  • add backends.
  • optimized for Loongson.

Benchmarks:

Input scales are collected from ViTs. The first scale is A, the second is B, and the third is C if present. All cases are without transposition. Check CI reports for more tests on different scales (square, rectangular, ...).

| Updated @ 0907 | WXG | WXI | MXG | MXI | MAG | MAI | UXG | UXI | UAG | UAI |
|---|---|---|---|---|---|---|---|---|---|---|
| [768, 768], [768, 768], [768] | 7.54 | 25.97 | 2.17 | 6.22 | 2.25 | 7.08 | 3.72 | 13.14 | 5.91 | 20.75 |
| [1024, 1024], [1024, 1024], [1024] | 9.53 | 60.19 | 5.01 | 17.38 | 5.19 | 16.89 | 7.47 | 33.31 | 8.21 | 50.76 |
| [50, 768], [768, 2304] | 2.03 | 6.26 | 1.3 | 1.19 | 0.75 | 1.59 | 2.23 | 2.26 | 4.85 | 6.27 |
| [197, 768], [768, 2304] | 6.85 | 20.62 | 1.76 | 4.69 | 1.77 | 6.14 | 3.69 | 9.12 | 4.54 | 17.71 |
| [50, 1024], [1024, 3072] | 3.98 | 8.57 | 1.13 | 2.94 | 1.2 | 4.2 | 2.19 | 4.96 | 5.39 | 8.92 |
| [197, 1024], [1024, 3072] | 7.24 | 38.15 | 3.12 | 10.9 | 3.07 | 14.58 | 6.24 | 19.95 | 5.72 | 31.93 |

As the table shows, basically all xxG columns are faster; the xxG columns stand for Gemm layer performance on the different hardware platforms.

Notations:

  • WXG: windows10_x64-Gemm
  • WXI: windows10_x64-InnerProduct
  • MXG: macos_x64-Gemm
  • MXI: macos_x64-InnerProduct
  • MAG: macos_arm64-Gemm
  • MAI: macos_arm64-InnerProduct
  • UXG: ubuntu_x64-Gemm
  • UXI: ubuntu_x64-InnerProduct
  • UAG: ubuntu_arm64-Gemm
  • UAI: ubuntu_arm64-InnerProduct

All data in ms (milliseconds).

Other platforms:

| Updated @ 0912 | loongnix_loongson-Gemm | loongnix_loongson-InnerProduct |
|---|---|---|
| A=[768, 768], B=[768, 768], C=[768], trans_a=0, trans_b=0 | 3.31 | 4.27 |
| A=[1024, 1024], B=[1024, 1024], C=[1024], trans_a=0, trans_b=0 | 16.9 | 13.21 |
| A=[50, 768], B=[768, 2304], trans_a=0, trans_b=0 | 1.55 | 3.68 |
| A=[197, 768], B=[768, 2304], trans_a=0, trans_b=0 | 4.01 | 11.91 |
| A=[50, 1024], B=[1024, 3072], trans_a=0, trans_b=0 | 2.55 | 35.41 |
| A=[197, 1024], B=[1024, 3072], trans_a=0, trans_b=0 | 6.32 | 75.15 |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

vpisarev (Contributor) commented:

@fengyuentau, could you provide some performance numbers (before/after)?

fengyuentau (Member, Author) commented Jul 13, 2023

@vpisarev, I just posted the preliminary benchmark results in the first comment. Please take a look.

Still working on the potential overflow problem (it turned out to be an initialization problem). Will provide more results on different input sizes after the fix.
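
As an aside for readers wondering what such an initialization problem looks like in a GEMM kernel: if the kernel accumulates into its output buffer without zeroing it first, whatever the allocator left there leaks into the result. Below is a hypothetical naive sketch of the bug and its memset fix, not the actual PR code (though the commit history does contain an "add memset" commit).

```cpp
// Hypothetical accumulating kernel: C[i*N + j] += A[i*K + k] * B[k*N + j].
// Without the memset, "+=" folds stale buffer contents into the output,
// which shows up as sporadic, seemingly overflow-like wrong results.
#include <cstring>

static void naiveGemmAccum(const float* A, const float* B, float* C,
                           int M, int N, int K)
{
    std::memset(C, 0, sizeof(float) * M * N);  // the fix: zero C before accumulating
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```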

fengyuentau (Member, Author) commented:

Hello @opencv-alalek, could you help download the new model onto the default CI node?

@@ -2066,7 +2066,7 @@ TEST_P(Test_ONNX_nets, Googlenet)
applyTestTag(CV_TEST_TAG_DNN_SKIP_IE_NGRAPH);
#endif

const String model = _tf("models/googlenet.onnx", false);

Contributor commented:

We should not introduce regressions for previously supported models.

You could add new test cases, but you can't just drop them.

Member Author commented:

It is simply wrong to force the model to run with batch size >= 2 when it contains a Reshape operator that always resets the batch size to 1 (more details here). Why should we keep a broken case in our tests? It does not make sense to replicate bugs in a refactored implementation.
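
To make the objection concrete, here is a tiny sketch of why a Reshape with a hardcoded batch-1 target cannot absorb a forced batch of 2. The 1x1024 target is the one mentioned later in this thread; the arithmetic is illustrative, not the dnn module's actual shape-inference code.

```cpp
// A Reshape to a fixed [1, 1024] target has room for exactly 1*1024 values,
// so a blob that blobFromImages turned into batch 2 can never fit through it.
#include <cstdio>

int main()
{
    const long target_elems  = 1L * 1024;  // hardcoded Reshape target [1, 1024]
    const long batched_elems = 2L * 1024;  // the same tensor with forced batch = 2

    if (batched_elems != target_elems)
        std::printf("invalid Reshape: %ld elements cannot fill a %ld-element target\n",
                    batched_elems, target_elems);
    return 0;
}
```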

Contributor commented:

@dkurt Could you please take a look?

Member commented:

Am I right that the original model is not available anymore? https://s3.amazonaws.com/download.onnx/models/opset_8/bvlc_googlenet.tar.gz

Model <GoogleNet (ONNX)>
  expect 739732220ba2e3efa88f7c26f13badad9b7514bc
  catch [Errno 2] No such file or directory: 'bvlc_googlenet.tar.gz'
  hash check failed - downloading
  get https://s3.amazonaws.com/download.onnx/models/opset_8/bvlc_googlenet.tar.gz
  catch HTTP Error 403: Forbidden

Also, if the new model from https://drive.google.com/u/0/uc?id=1FucNLURGgdPk4nCxT0378isPigCcbZ1M&export=download is a modified file on a personal Google Drive, I don't think that's a good idea.

Member commented:

Since even the opset 12 version has a Reshape with a hardcoded 1x1024 shape, I think it does not make sense to update the model to use Flatten. However, there is still a need to update the URL/model/data, so I recommend this one: https://github.com/onnx/models/tree/main/vision/classification/inception_and_googlenet/googlenet

Member Author commented:

Hold on. Although I do not fully agree with all the opinions here, I am working on adding this feature back, which needs only two additional transpositions.

Member commented:

@fengyuentau, sure. Here is a brief proposal; if you find it suitable, we can add it as a workaround: dkurt@cad21a3

Member Author commented:

Man, I found that if batching is allowed here, things are even more broken. In TEST_P(Test_ONNX_nets, Googlenet), the shape of ref is 2x1000 instead of 2x1x1000, meaning the output shape of the last Gemm layer in GoogLeNet is also 2x1000. Note that the input of this Gemm layer is 2 x 1 x 1024 (the 2 is the batch added by blobFromImages in the source input). So shape inference is already broken once the test forces a batch. It may break axis-sensitive layers, and it also affects other backends that have their own graph engines and shape inference.

Also, I can see the forced-batch feature causing trouble in vision transformer models. The following is an example.

[screenshot: attention block of a ViT model in ONNX]

Above is an attention block in ONNX. The batch dimension can be blended into other dimensions: in the second-to-last Reshape, the original batch size 1 is blended into 8. If the forced-batch feature brings in batch=2, the Gemm layer won't work because the inner dimensions will not match.
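
A rough numeric sketch of that mismatch follows; the 197 tokens, 768 channels, 8 heads, and 96 channels per head are ViT-B-style assumptions, not values read from the screenshot.

```cpp
// The exporter folded batch = 1 into the head axis, hardcoding the Reshape
// target to [tokens, heads, head_dim] with heads = 1 * 8. With batch = 2 the
// element counts no longer match, so the Reshape (and the Gemm fed by it) fails.
#include <cstdio>

int main()
{
    const int tokens = 197, channels = 768, heads = 8, head_dim = 96;
    const long target = 1L * tokens * heads * head_dim;  // 197 * 8 * 96 = 151296

    for (int batch = 1; batch <= 2; batch++)
    {
        const long actual = 1L * batch * tokens * channels;  // elements available
        std::printf("batch=%d: %ld elements vs fixed target %ld -> %s\n",
                    batch, actual, target,
                    actual == target ? "ok" : "shape mismatch");
    }
    return 0;
}
```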

Member commented:

@fengyuentau, will dkurt@cad21a3 solve the problem at least partially?

Member Author commented:

It should fix the GoogLeNet problem, but it messes up the dimensions in the attention block: the Transpose operators there move the batch dimension to axis 1.

Could we simply follow the standard instead of going further down the wrong path? The forced-batch feature assumes the batch dimension is at axis 0, which may work for most conv nets but not for ViTs.

fengyuentau (Member, Author) commented:

Hello all, more perf stats are provided in the first comment. Please have a look.

@fengyuentau fengyuentau mentioned this pull request Aug 29, 2023
@fengyuentau fengyuentau marked this pull request as ready for review August 30, 2023 06:14
@vpisarev vpisarev merged commit 8a96e34 into opencv:4.x Sep 19, 2023
23 checks passed

opencv-alalek (Contributor) left a comment:

Test binary crashed in OpenCL builds: #24312

@@ -9,6 +9,7 @@ ocv_add_dispatched_file_force_all("int8layers/layers_common" AVX2 AVX512_SKX LAS
ocv_add_dispatched_file_force_all("layers/cpu_kernels/conv_block" AVX AVX2)
ocv_add_dispatched_file_force_all("layers/cpu_kernels/conv_depthwise" AVX AVX2 RVV LASX)
ocv_add_dispatched_file_force_all("layers/cpu_kernels/conv_winograd_f63" AVX AVX2)
ocv_add_dispatched_file_force_all("layers/cpu_kernels/fast_gemm_kernels" AVX AVX2 NEON LASX)

Contributor commented:

NEON

We usually don't use runtime dispatching with NEON (it doesn't work due to a different ABI).
The whole library is compiled with NEON instead.

Member Author commented:

Fixing via #24315

@@ -2597,6 +2597,40 @@ TEST_P(Test_ONNX_layers, where_node)
testONNXModels("where_layer");
}

TEST_P(Test_ONNX_layers, Conformance_Gemm_all_attributes) {

Contributor commented:

Conformance

That is confusing in test logs.
Avoid using 'conformance' wording outside of test_onnx_conformance.cpp.

Member Author commented:

Fixing via #24315

@fengyuentau fengyuentau mentioned this pull request Sep 25, 2023
@asmorkalov asmorkalov mentioned this pull request Sep 28, 2023
thewoz pushed a commit to thewoz/opencv that referenced this pull request Jan 4, 2024
…opencv#23897)

* first commit

* turned C from input to constant; force C constant in impl; better handling 0d/1d cases

* integrate with gemm from ficus nn

* fix const inputs

* adjust threshold for int8 tryQuantize

* adjust threshold for int8 quantized 2

* support batched gemm and matmul; tune threshold for rcnn_ilsvrc13; update googlenet

* add gemm perf against innerproduct

* add perf tests for innerproduct with bias

* fix perf

* add memset

* renamings for next step

* add dedicated perf gemm

* add innerproduct in perf_gemm

* remove gemm and innerproduct perf tests from perf_layer

* add perf cases for vit sizes; prepack constants

* remove batched gemm; fix wrong trans; optimize KC

* remove prepacking for const A; several fixes for const B prepacking

* add todos and gemm expression

* add optimized branch for avx/avx2

* trigger build

* update macros and signature

* update signature

* fix macro

* fix bugs for neon aarch64 & x64

* add backends: cuda, cann, inf_ngraph and vkcom

* fix cuda backend

* test commit for cuda

* test cuda backend

* remove debug message from cuda backend

* use cpu dispatcher

* fix neon macro undef in dispatcher

* fix dispatcher

* fix inner kernel for neon aarch64

* fix compiling issue on armv7; try fixing accuracy issue on other platforms

* broadcast C with beta multiplied; improve func namings

* fix bug for avx and avx2

* put all platform-specific kernels in dispatcher

* fix typos

* attempt to fix compile issues on x64

* run old gemm when neon, avx, avx2 are all not available; add kernel for armv7 neon

* fix typo

* quick fix: add macros for pack4

* quick fix: use vmlaq_f32 for armv7

* quick fix for missing macro of fast gemm pack f32 4

* disable conformance tests when optimized branches are not supported

* disable perf tests when optimized branches are not supported

* decouple cv_try_neon and cv_neon_aarch64

* drop googlenet_2023; add fastGemmBatched

* fix step in fastGemmBatched

* cpu: fix initialization ofb; gpu: support batch

* quick followup fix for cuda

* add default kernels

* quick followup fix to avoid macro redef

* optimized kernels for lasx

* resolve mis-alignment; remove comments

* tune performance for x64 platform

* tune performance for neon aarch64

* tune for armv7

* comment time consuming tests

* quick follow-up fix
@fengyuentau fengyuentau mentioned this pull request Feb 21, 2024
@fengyuentau fengyuentau deleted the refactor_fc branch March 21, 2024 02:12