dnn: refactor ONNX MatMul with fastGemm #24694
Conversation
The previous performance results of
All todo items are checked!
@asmorkalov, this PR looks good to me. It needs to be merged in order to merge the other important PR, #24476.
@dkurt Please join the review too.
```cpp
int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
    char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
```
OpenCV AutoBuffer makes sense here: https://docs.opencv.org/4.x/d8/dd0/classcv_1_1AutoBuffer.html. No problems with memory leaks, and it has built-in logic for alloca.
The AutoBuffer documentation says something like "if a temporary buffer is usually small (a few K's of memory)". A typical buff_size here would be FAST_GEMM_F32_PACKED_STRIDE_K * (FAST_GEMM_F32_MC + FAST_GEMM_F32_NC) * 4 / 1024 = 64 * (144 + 72) * 4 / 1024 = 54 KB. Is 54 KB still considered to be a few KBs?
```cpp
int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
    char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
```
The same idea for AutoBuffer.
```cpp
int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
    char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
```
AutoBuffer.
```cpp
int total_tiles = m_tiles * n_tiles;

auto fn = [&](const Range &r) {
    char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
```
AutoBuffer
```cpp
half **dev_C_slices = 0;
cudaMalloc((void**)&dev_A_slices, batch_count * sizeof(half*));
cudaMalloc((void**)&dev_B_slices, batch_count * sizeof(half*));
cudaMalloc((void**)&dev_C_slices, batch_count * sizeof(half*));
cudaMemcpy(dev_A_slices, A_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B_slices, B_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_C_slices, C_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);

CUDA4DNN_CHECK_CUBLAS(cublasHgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
```
An optional optimization with streams is possible: e.g. create a stream, use cudaMemcpyAsync and cublasSetStream(). It reduces the number of CPU-GPU synchronizations.
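The suggestion might look roughly like the following sketch (untested, requires a CUDA device; it reuses the variables from the quoted snippet and assumes error checking is elided for brevity). The point is that the async uploads and the GEMM are all queued on one stream, so the host synchronizes only once at the end:

```cpp
// Sketch only: queue the uploads and the batched GEMM on a single stream.
cudaStream_t stream;
cudaStreamCreate(&stream);
CUDA4DNN_CHECK_CUBLAS(cublasSetStream(handle.get(), stream));

cudaMemcpyAsync(dev_A_slices, A_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_B_slices, B_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_C_slices, C_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice, stream);

CUDA4DNN_CHECK_CUBLAS(cublasHgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaStreamSynchronize(stream);  // single CPU-GPU sync point
cudaStreamDestroy(stream);
```

Note that cudaMemcpyAsync only overlaps with host work when the source is pinned (page-locked) memory; with pageable host memory the copies still serialize.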
Do we have examples demonstrating how to use these two APIs?
By the way, Linux-RISC-V-Clang seems to have trouble starting jobs.
```cpp
cudaMalloc((void**)&dev_A_slices, batch_count * sizeof(float*));
cudaMalloc((void**)&dev_B_slices, batch_count * sizeof(float*));
cudaMalloc((void**)&dev_C_slices, batch_count * sizeof(float*));
cudaMemcpy(dev_A_slices, A_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B_slices, B_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_C_slices, C_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);

// cuBLAS is column-major
CUDA4DNN_CHECK_CUBLAS(cublasSgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
```
The same optional recommendation here.
Done:
Benchmark
Tests were run on an Apple M1. All data is in milliseconds (ms).
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.