
dnn: refactor ONNX MatMul with fastGemm #24694

Merged — 20 commits merged into opencv:4.x from the matmul_refactor branch on Dec 19, 2023

Conversation

@fengyuentau (Member) commented Dec 13, 2023

Done:

  • add backends
    • CUDA
    • OpenVINO
    • CANN
    • OpenCL
    • Vulkan
  • add perf tests
  • const B case

Benchmark

Tests were run on an Apple M1. All timings are in milliseconds (ms).

| Configuration | MatMul (Prepacked) | MatMul | InnerProduct |
| --- | --- | --- | --- |
| A=[12, 197, 197], B=[12, 197, 64], trans_a=0, trans_b=0 | 0.39 | 0.41 | 1.33 |
| A=[12, 197, 64], B=[12, 64, 197], trans_a=0, trans_b=0 | 0.42 | 0.42 | 1.17 |
| A=[12, 50, 64], B=[12, 64, 50], trans_a=0, trans_b=0 | 0.13 | 0.15 | 0.33 |
| A=[12, 50, 50], B=[12, 50, 64], trans_a=0, trans_b=0 | 0.11 | 0.13 | 0.22 |
| A=[16, 197, 197], B=[16, 197, 64], trans_a=0, trans_b=0 | 0.46 | 0.54 | 1.46 |
| A=[16, 197, 64], B=[16, 64, 197], trans_a=0, trans_b=0 | 0.46 | 0.95 | 1.74 |
| A=[16, 50, 64], B=[16, 64, 50], trans_a=0, trans_b=0 | 0.18 | 0.32 | 0.43 |
| A=[16, 50, 50], B=[16, 50, 64], trans_a=0, trans_b=0 | 0.15 | 0.25 | 0.25 |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau fengyuentau marked this pull request as draft December 13, 2023 09:17
@fengyuentau (Member, Author):

The previous performance results for InnerProduct came from its BLAS-calling branch. I now pass B as a constant input so that the non-BLAS branch is taken. The results show that fastGemm is generally faster than the FullyConnected acceleration.

@fengyuentau fengyuentau marked this pull request as ready for review December 19, 2023 09:19
@fengyuentau (Member, Author):

All todo items are checked!

@vpisarev vpisarev self-requested a review December 19, 2023 12:50
@vpisarev (Contributor):

@asmorkalov, this PR looks good to me. It needs to be merged in order to merge the other important PR, #24476.

@asmorkalov (Contributor):

@dkurt Please join the review too.

int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
Contributor:

OpenCV AutoBuffer makes sense here: https://docs.opencv.org/4.x/d8/dd0/classcv_1_1AutoBuffer.html. No problems with memory leaks and it has built-in logic for alloca.
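Something along these lines (just a sketch, reusing buff_size and the lambda from the snippet above):

auto fn = [&](const Range &r) {
    // Sketch of the suggested change: AutoBuffer keeps small buffers in a
    // fixed-size internal array (no heap allocation) and falls back to the
    // heap for larger sizes; either way the memory is released automatically
    // when the object goes out of scope.
    cv::AutoBuffer<char> packed_a_buf(buff_size);
    char* packed_a = packed_a_buf.data();
    // ... pack and multiply as before; no explicit free() or stack handling needed ...
};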

Member Author:

The AutoBuffer documentation says: "If a temporary buffer is usually small (a few K's of memory)". A typical buff_size here would be

FAST_GEMM_F32_PACKED_STRIDE_K * (FAST_GEMM_F32_MC + FAST_GEMM_F32_NC) * 4 / 1024
= 64 * (144 + 72) * 4 / 1024 = 54 KB

Is 54 KB still considered to be a few KBs?

int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
Contributor:

The same idea for AutoBuffer.

int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
Contributor:

AutoBuffer.

int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
Contributor:

AutoBuffer

Comment on lines +440 to +452
half **dev_C_slices = 0;
cudaMalloc((void**)&dev_A_slices, batch_count * sizeof(half*));
cudaMalloc((void**)&dev_B_slices, batch_count * sizeof(half*));
cudaMalloc((void**)&dev_C_slices, batch_count * sizeof(half*));
cudaMemcpy(dev_A_slices, A_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B_slices, B_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_C_slices, C_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);

CUDA4DNN_CHECK_CUBLAS(cublasHgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
Contributor:

An optional optimization with streams is possible here: create a stream, use cudaMemcpyAsync, and attach the stream to the handle with cublasSetStream(). This reduces the number of CPU-GPU synchronizations.
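Roughly like this (a sketch only, reusing the names from the diff above; not tested):

// Sketch: enqueue the pointer-array uploads and the batched GEMM on one
// stream, then synchronize once at the end instead of after every copy.
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(dev_A_slices, A_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_B_slices, B_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_C_slices, C_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice, stream);

cublasSetStream(handle.get(), stream);
CUDA4DNN_CHECK_CUBLAS(cublasHgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaStreamSynchronize(stream); // single sync before freeing the slice arrays
cudaStreamDestroy(stream);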

Member Author:

Do we have examples demonstrating how to use these two APIs?

Member Author:

By the way, Linux-RISC-V-Clang seems to have trouble starting jobs.

Comment on lines +491 to +503
cudaMalloc((void**)&dev_A_slices, batch_count * sizeof(float*));
cudaMalloc((void**)&dev_B_slices, batch_count * sizeof(float*));
cudaMalloc((void**)&dev_C_slices, batch_count * sizeof(float*));
cudaMemcpy(dev_A_slices, A_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B_slices, B_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_C_slices, C_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);

// cuBLAS is column-major
CUDA4DNN_CHECK_CUBLAS(cublasSgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
Contributor:

The same optional recommendation here.

@dkurt previously approved these changes Dec 19, 2023
@dkurt dismissed their stale review December 19, 2023 15:33 — reason: "attention layer"

@asmorkalov asmorkalov merged commit fa5ed62 into opencv:4.x Dec 19, 2023
24 of 26 checks passed
@fengyuentau fengyuentau deleted the matmul_refactor branch December 19, 2023 16:45
@asmorkalov asmorkalov mentioned this pull request Jan 19, 2024
@fengyuentau fengyuentau mentioned this pull request Feb 21, 2024