DNN/CUDA: make 'abcd op 1b11' broadcast eltwise operator support cuda #23528

Merged
merged 1 commit into from Apr 24, 2023

@WanliZhong WanliZhong commented Apr 23, 2023

This PR fixes #23278.

The current implementation is temporary. I will try to make more eltwise broadcast cases support CUDA.

The model's inference time drops from 26.7651 ms to 17.8416 ms.
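
For readers unfamiliar with the notation, 'abcd op 1b11' can be read as an element-wise binary operation between a 4-D tensor of shape (a, b, c, d) and a tensor of shape (1, b, 1, 1), where the second operand is broadcast along every axis except the second. The following is a minimal pure-Python sketch under that reading. The function name `broadcast_1b11` and the flat row-major layout are assumptions made for illustration; this is not the CUDA kernel added by the PR.

```python
import operator


def broadcast_1b11(x, y, shape, op=operator.mul):
    """Apply op(x[i0][i1][i2][i3], y[i1]) over a flat, row-major 4-D tensor.

    x     -- flat list of length a*b*c*d
    y     -- flat list of length b (the '1b11' operand)
    shape -- tuple (a, b, c, d)
    """
    a, b, c, d = shape
    assert len(x) == a * b * c * d and len(y) == b
    inner = c * d  # number of consecutive elements that share one y value
    out = [0.0] * len(x)
    for i, value in enumerate(x):
        # Recover the index along the second axis from the flat offset.
        j = (i // inner) % b
        out[i] = op(value, y[j])
    return out


# Hypothetical example: shape (1, 2, 2, 2) multiplied by two per-axis factors.
x = [float(i) for i in range(8)]
factors = [10.0, 100.0]
result = broadcast_1b11(x, factors, (1, 2, 2, 2))
print(result)
```

A GPU implementation would typically assign one thread per output element and compute the same index `j`, but that mapping is not described in this PR text.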

Perf test result

Run this command to generate the result:

.\bin\opencv_perf_dnn.exe '--gtest_filter=CUDA/Layer_NaryEltwise.*/*:CUDA/Layer_NaryEltwise/*.*' --gtest_output=xml:../tmp/1th.xml --perf_threads=1

Use this command to generate the summary:

python ../modules/ts/misc/summary.py -m min 1th.xml 0th.xml -o markdown

result

| Name of Test | 0th | 1th | 1th vs 0th (x-factor) |
|---|---|---|---|
| NHWC_H::CUDA/Layer_NaryEltwise::CUDA/CUDA | 39.003 (fallback to cpu) | 17.936 | 2.17 |
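
The x-factor in the last column is consistent with the baseline time divided by the patched time. A small sketch of that arithmetic, using only the numbers reported in this PR (the helper name `x_factor` is hypothetical):

```python
def x_factor(baseline, patched):
    """Speedup of the patched run relative to the baseline run."""
    return baseline / patched


# Perf test: 0th (CPU fallback) vs 1th (CUDA), values from the summary.
perf_speedup = x_factor(39.003, 17.936)

# Whole-model inference time in ms, values from the PR description.
model_speedup = x_factor(26.7651, 17.8416)

print(f"perf test: {perf_speedup:.2f}x, model: {model_speedup:.2f}x")
```

The perf test isolates the eltwise layer, so its ratio is larger than the whole-model ratio of roughly 1.5.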

Layer by layer data:

  • before being fixed
    onnx_node!ResNet18/0_conv/Conv2D   0.1515ms
    onnx_node!ResNet18/0_PReLU/Relu   0.0193ms
    onnx_node!ResNet18/0_PReLU/Neg_1   0.0145ms
    onnx_node!ResNet18/0_PReLU/Relu_1   0.0121ms
    ResNet18/0_PReLU/Neg:0   0.0167ms
    onnx_node!ResNet18/0_PReLU/mul   2.071ms
    onnx_node!ResNet18/0_PReLU/add   0.0585ms
    onnx_node!ResNet18/stack1_block1_shortcut_conv/Conv2D   0.1643ms
    onnx_node!ResNet18/stack1_block1_1_bn/FusedBatchNormV3   0.0166ms
    onnx_node!ResNet18/stack1_block1_1_conv/Conv2D   0.1179ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu   0.0192ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Neg_1   0.0114ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu_1   0.0095ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/mul   4.0522ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/add   0.0857ms
    onnx_node!ResNet18/stack1_block1_2_conv/Conv2D   0.1803ms
    onnx_node!ResNet18/stack1_block2_1_bn/FusedBatchNormV3   0.013ms
    onnx_node!ResNet18/stack1_block2_1_conv/Conv2D   0.0533ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu   0.0145ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Neg_1   0.0116ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu_1   0.0093ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/mul   2.346ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/add   0.0483ms
    onnx_node!ResNet18/stack1_block2_2_conv/Conv2D   0.0748ms
    onnx_node!ResNet18/stack2_block1_shortcut_conv/Conv2D   0.1015ms
    onnx_node!ResNet18/stack2_block1_1_bn/FusedBatchNormV3   0.0135ms
    onnx_node!ResNet18/stack2_block1_1_conv/Conv2D   0.0639ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu   0.0161ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Neg_1   0.0137ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu_1   0.0133ms
    ResNet18/stack2_block2_2_PReLU/Neg:0   0.0177ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/mul   2.7318ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/add   0.0643ms
    onnx_node!ResNet18/stack2_block1_2_conv/Conv2D   0.1083ms
    onnx_node!ResNet18/stack2_block2_1_bn/FusedBatchNormV3   0.0139ms
    onnx_node!ResNet18/stack2_block2_1_conv/Conv2D   0.0496ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu   0.0147ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Neg_1   0.0115ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu_1   0.0096ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/mul   1.79ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/add   0.045ms
    onnx_node!ResNet18/stack2_block2_2_conv/Conv2D   0.0701ms
    onnx_node!ResNet18/stack3_block1_shortcut_conv/Conv2D   0.0776ms
    onnx_node!ResNet18/stack3_block1_1_bn/FusedBatchNormV3   0.016ms
    onnx_node!ResNet18/stack3_block1_1_conv/Conv2D   0.0479ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu   0.0159ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Neg_1   0.0135ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu_1   0.0121ms
    ResNet18/stack3_block2_2_PReLU/Neg:0   0.0173ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/mul   2.1251ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/add   0.043ms
    onnx_node!ResNet18/stack3_block1_2_conv/Conv2D   0.0793ms
    onnx_node!ResNet18/stack3_block2_1_bn/FusedBatchNormV3   0.012ms
    onnx_node!ResNet18/stack3_block2_1_conv/Conv2D   0.0458ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu   0.0133ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Neg_1   0.0106ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu_1   0.0091ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/mul   1.7766ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/add   0.043ms
    onnx_node!ResNet18/stack3_block2_2_conv/Conv2D   0.0751ms
    onnx_node!ResNet18/stack4_block1_shortcut_conv/Conv2D   0.0758ms
    onnx_node!ResNet18/stack4_block1_1_bn/FusedBatchNormV3   0.0153ms
    onnx_node!ResNet18/stack4_block1_1_conv/Conv2D   0.048ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu   0.0151ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Neg_1   0.013ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu_1   0.012ms
    ResNet18/stack4_block1_2_PReLU/Neg:0   0.0176ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/mul   2.1163ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/add   0.0396ms
    onnx_node!ResNet18/stack4_block1_2_conv/Conv2D   0.0751ms
    onnx_node!ResNet18/stack4_block2_1_bn/FusedBatchNormV3   0.0121ms
    onnx_node!ResNet18/stack4_block2_1_conv/Conv2D   0.0485ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu   0.0158ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Neg_1   0.013ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu_1   0.0121ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/mul   2.0351ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/add   0.037ms
    onnx_node!ResNet18/stack4_block2_2_conv/Conv2D   0.072ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3   0.0142ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3__210   0.0169ms
    onnx_node!ResNet18/E_flatten/Reshape   0.0014ms
    onnx_node!ResNet18/E_dense/MatMul   0.0445ms
    ResNet18/E_batchnorm/ReadVariableOp_1:0   0.0165ms
    onnx_node!ResNet18/pre_embedding/batchnorm/mul_1   0.0156ms
    embedding   0.001ms
  • after being fixed
    onnx_node!ResNet18/0_conv/Conv2D   0.255ms
    onnx_node!ResNet18/0_PReLU/Relu   0.0309ms
    onnx_node!ResNet18/0_PReLU/Neg_1   0.0181ms
    onnx_node!ResNet18/0_PReLU/Relu_1   0.0147ms
    ResNet18/0_PReLU/Neg:0   0.0539ms
    onnx_node!ResNet18/0_PReLU/mul   0.0276ms
    onnx_node!ResNet18/0_PReLU/add   0.018ms
    onnx_node!ResNet18/stack1_block1_shortcut_conv/Conv2D   0.1718ms
    onnx_node!ResNet18/stack1_block1_1_bn/FusedBatchNormV3   0.0215ms
    onnx_node!ResNet18/stack1_block1_1_conv/Conv2D   0.1762ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu   0.0201ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Neg_1   0.0156ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu_1   0.0142ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/mul   0.0199ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/add   0.0478ms
    onnx_node!ResNet18/stack1_block1_2_conv/Conv2D   0.1198ms
    onnx_node!ResNet18/stack1_block2_1_bn/FusedBatchNormV3   0.0139ms
    onnx_node!ResNet18/stack1_block2_1_conv/Conv2D   0.2334ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu   0.0244ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Neg_1   0.0238ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu_1   0.0196ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/mul   0.0256ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/add   0.0204ms
    onnx_node!ResNet18/stack1_block2_2_conv/Conv2D   0.1101ms
    onnx_node!ResNet18/stack2_block1_shortcut_conv/Conv2D   0.1641ms
    onnx_node!ResNet18/stack2_block1_1_bn/FusedBatchNormV3   0.0296ms
    onnx_node!ResNet18/stack2_block1_1_conv/Conv2D   0.0867ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu   0.0253ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Neg_1   0.0223ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu_1   0.0208ms
    ResNet18/stack2_block2_2_PReLU/Neg:0   0.0337ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/mul   0.0334ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/add   0.0306ms
    onnx_node!ResNet18/stack2_block1_2_conv/Conv2D   0.1605ms
    onnx_node!ResNet18/stack2_block2_1_bn/FusedBatchNormV3   0.0266ms
    onnx_node!ResNet18/stack2_block2_1_conv/Conv2D   0.0904ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu   0.0712ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Neg_1   0.0305ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu_1   0.0237ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/mul   0.0299ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/add   0.0257ms
    onnx_node!ResNet18/stack2_block2_2_conv/Conv2D   0.1648ms
    onnx_node!ResNet18/stack3_block1_shortcut_conv/Conv2D   0.147ms
    onnx_node!ResNet18/stack3_block1_1_bn/FusedBatchNormV3   0.0269ms
    onnx_node!ResNet18/stack3_block1_1_conv/Conv2D   0.0805ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu   0.0274ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Neg_1   0.0214ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu_1   0.0969ms
    ResNet18/stack3_block2_2_PReLU/Neg:0   0.03ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/mul   0.0272ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/add   0.0247ms
    onnx_node!ResNet18/stack3_block1_2_conv/Conv2D   0.1316ms
    onnx_node!ResNet18/stack3_block2_1_bn/FusedBatchNormV3   0.0241ms
    onnx_node!ResNet18/stack3_block2_1_conv/Conv2D   0.0792ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu   0.0259ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Neg_1   0.0213ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu_1   0.0962ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/mul   0.0633ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/add   0.0246ms
    onnx_node!ResNet18/stack3_block2_2_conv/Conv2D   0.1131ms
    onnx_node!ResNet18/stack4_block1_shortcut_conv/Conv2D   0.1028ms
    onnx_node!ResNet18/stack4_block1_1_bn/FusedBatchNormV3   0.0273ms
    onnx_node!ResNet18/stack4_block1_1_conv/Conv2D   0.0834ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu   0.031ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Neg_1   0.1031ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu_1   0.0858ms
    ResNet18/stack4_block1_2_PReLU/Neg:0   0.032ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/mul   0.0333ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/add   0.0229ms
    onnx_node!ResNet18/stack4_block1_2_conv/Conv2D   0.1609ms
    onnx_node!ResNet18/stack4_block2_1_bn/FusedBatchNormV3   0.0336ms
    onnx_node!ResNet18/stack4_block2_1_conv/Conv2D   0.0869ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu   0.0314ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Neg_1   0.0235ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu_1   0.0236ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/mul   0.0368ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/add   0.0234ms
    onnx_node!ResNet18/stack4_block2_2_conv/Conv2D   0.1913ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3   0.0269ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3__210   0.0234ms
    onnx_node!ResNet18/E_flatten/Reshape   0.0016ms
    onnx_node!ResNet18/E_dense/MatMul   0.1472ms
    ResNet18/E_batchnorm/ReadVariableOp_1:0   0.0635ms
    onnx_node!ResNet18/pre_embedding/batchnorm/mul_1   0.0692ms
    embedding   0.0019ms

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@WanliZhong WanliZhong added bug category: gpu/cuda (contrib) OpenCV 4.0+: moved to opencv_contrib category: dnn labels Apr 23, 2023
@WanliZhong WanliZhong added this to the 4.8.0 milestone Apr 23, 2023
@WanliZhong WanliZhong requested a review from zihaomu April 23, 2023 09:53
@WanliZhong WanliZhong self-assigned this Apr 23, 2023
@WanliZhong WanliZhong changed the title make 'abcd op 1b11' broadcast support cuda DNN/CUDA: make 'abcd op 1b11' broadcast eltwise operator support cuda Apr 23, 2023
@asmorkalov

@WanliZhong If you get the results from OpenCV perf tests, you can use modules/ts/misc/summary.py to generate an accurate performance comparison report. Just run the test before the patch and after the patch with `--gtest_output=xml:<xml_file_name>`, then run the script with two or more reports.
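
The workflow described here can be sketched as a pair of helpers that assemble the argument lists for the perf run and for summary.py. This is only an illustration: the helper names and the report file names `before.xml` and `after.xml` are hypothetical, and the flags are limited to those that appear in this PR.

```python
def perf_run_cmd(binary, xml_path, gtest_filter, threads=1):
    """Argument list for one perf test run that writes an XML report."""
    return [
        binary,
        f"--gtest_filter={gtest_filter}",
        f"--gtest_output=xml:{xml_path}",
        f"--perf_threads={threads}",
    ]


def summary_cmd(script, reports, metric="min", fmt="markdown"):
    """Argument list for summary.py comparing two or more XML reports."""
    if len(reports) < 2:
        raise ValueError("summary comparison needs at least two reports")
    return ["python", script, "-m", metric, *reports, "-o", fmt]


FILTER = "CUDA/Layer_NaryEltwise.*/*:CUDA/Layer_NaryEltwise/*.*"

# Run once on the unpatched build and once on the patched build.
before = perf_run_cmd("./bin/opencv_perf_dnn", "before.xml", FILTER)
after = perf_run_cmd("./bin/opencv_perf_dnn", "after.xml", FILTER)
report = summary_cmd("../modules/ts/misc/summary.py", ["before.xml", "after.xml"])

for cmd in (before, after, report):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` once the corresponding build is in place; passing a list avoids the shell quoting needed for the `*` characters in the filter.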

@zihaomu

zihaomu commented Apr 24, 2023

Hi @asmorkalov, thanks for your reminder. I will tell Wanli how to do this performance test.

@WanliZhong

Thanks, @asmorkalov. I have updated the summary results.


@zihaomu zihaomu left a comment

LGTM! 👍

@asmorkalov asmorkalov merged commit e3e1f70 into opencv:4.x Apr 24, 2023
19 checks passed
@WanliZhong WanliZhong deleted the issue23278 branch May 16, 2023 12:33
@asmorkalov asmorkalov mentioned this pull request May 31, 2023
Successfully merging this pull request may close these issues.

Performance Loss from OpenCV 4.5.5 to 4.7.0 using CUDA backend