DNN: optimize the speed of general Depth-wise #23952

Merged
merged 2 commits into from
Jul 14, 2023

Conversation

zihaomu
Member

@zihaomu zihaomu commented Jul 9, 2023

Try to solve the issue: #23941

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@zihaomu
Member Author

zihaomu commented Jul 10, 2023

@WanliZhong, Can you check if this patch fixes your issue?

@zihaomu zihaomu requested a review from asmorkalov July 10, 2023 02:35
@zihaomu zihaomu changed the title from "DNN: optimize the speed of general Depth-wise on ARM platform" to "DNN: optimize the speed of general Depth-wise" Jul 10, 2023
@WanliZhong
Member

WanliZhong commented Jul 10, 2023

I have tested this patch on ARM and x86 (using the min value of the measured times). The palm detection model has many 5x5 depth-wise layers, while the handpose and person detection models have only several, so the effect is most evident on the palm detection model.

Intel chip:

| Model | 4.7 | 4.8 | this patch |
|---|---|---|---|
| palm detection | 5.91 ms | 12.1 ms | 6.41 ms |
| handpose estimation | 4.36 ms | 3.65 ms | 3.21 ms |
| person detection | 12.4 ms | 11.0 ms | 9.07 ms |

M2 chip:

| Model | 4.7 | 4.8 | this patch |
|---|---|---|---|
| palm detection | 8.35 ms | 18.76 ms | 9.44 ms |
| handpose estimation | 4.6 ms | 7.04 ms | 4.91 ms |
| person detection | 10.83 ms | 14.06 ms | 10.74 ms |
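
To make the size of the regression easier to read off, the timings in the two tables above can be turned into ratios. The following is a minimal sketch: the helper names (`ratios`, `summarize`) are made up for illustration, and the numbers are copied as reported without any re-measurement.

```python
# Minimum timings in milliseconds, copied from the two tables above.
# Each tuple is (OpenCV 4.7, OpenCV 4.8, this patch).
TIMINGS_MS = {
    "Intel": {
        "palm detection": (5.91, 12.1, 6.41),
        "handpose estimation": (4.36, 3.65, 3.21),
        "person detection": (12.4, 11.0, 9.07),
    },
    "M2": {
        "palm detection": (8.35, 18.76, 9.44),
        "handpose estimation": (4.6, 7.04, 4.91),
        "person detection": (10.83, 14.06, 10.74),
    },
}


def ratios(t47, t48, patched):
    """Return (4.8 time / 4.7 time, 4.8 time / patched time, patched time / 4.7 time)."""
    return t48 / t47, t48 / patched, patched / t47


def summarize(timings=TIMINGS_MS):
    """Flatten the tables into (chip, model, slowdown, speedup, vs_4_7) rows."""
    rows = []
    for chip, models in timings.items():
        for model, (t47, t48, patched) in models.items():
            slowdown, speedup, vs47 = ratios(t47, t48, patched)
            rows.append((chip, model, round(slowdown, 2), round(speedup, 2), round(vs47, 2)))
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Read this way, palm detection in 4.8 takes roughly twice as long as in 4.7 on both chips, and the patch brings it back close to, though still slightly above, the 4.7 time.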

@vpisarev vpisarev self-requested a review July 10, 2023 07:08
@vpisarev
Contributor

@asmorkalov, I believe 4.8.1 should be released as soon as possible with this and a few other fixes that you mentioned (Python-related).

@opencv-alalek
Contributor

Related performance tests should be updated to avoid similar incidents in the future.

@opencv-alalek opencv-alalek added the "pr: needs test" label (New functionality requires minimal tests set) Jul 10, 2023
@zihaomu
Member Author

zihaomu commented Jul 10, 2023

@wanli will be assigned to add such a performance test.

@WanliZhong
Member

WanliZhong commented Jul 11, 2023

Perf test results (using the min value) on MacBook Air M2:

| Name of Test | 480-1th | 481-1th | 481-1th vs 480-1th (x-factor) |
|---|---|---|---|
| conv::Conv::(GFLOPS=0.472, K=[5 x 5], IN={1, 32, 96, 96}, OCN=32, G=32, P=[2 x 2], BIAS, OCV/CPU) | 3.721 | 1.921 | 1.94 |
| conv::Conv::(GFLOPS=0.472, K=[5 x 5], IN={1, 32, 96, 96}, OCN=32, G=32, P=[2 x 2], BIAS, OCV/CPU_FP16) | 2.358 | 1.869 | 1.26 |
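
For readers unfamiliar with the layer being measured, the test name describes a depth-wise convolution: the group count equals the channel count (G=32 with 32 input and 32 output channels), so each channel is filtered by its own 5x5 kernel with 2 pixels of zero padding. The following is a naive pure-Python reference of that operation, written only to show what is computed. It is not the code from this patch, it assumes stride 1 and dilation 1, and the function name is made up for illustration.

```python
def depthwise_conv2d(x, weights, bias, pad):
    """Naive depth-wise convolution (assumes stride 1, dilation 1).

    x:       input as nested lists, shape [C][H][W]
    weights: one filter per channel, shape [C][KH][KW] (groups == channels)
    bias:    one value per channel, shape [C]
    pad:     number of zero-padded pixels on every side
    """
    channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    kh, kw = len(weights[0]), len(weights[0][0])
    out_h = height + 2 * pad - kh + 1
    out_w = width + 2 * pad - kw + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(channels)]

    for c in range(channels):  # channels never mix: that is the depth-wise part
        for oy in range(out_h):
            for ox in range(out_w):
                acc = bias[c]
                for ky in range(kh):
                    iy = oy + ky - pad
                    if iy < 0 or iy >= height:
                        continue  # row falls into the zero padding
                    for kx in range(kw):
                        ix = ox + kx - pad
                        if 0 <= ix < width:
                            acc += x[c][iy][ix] * weights[c][ky][kx]
                out[c][oy][ox] = acc
    return out
```

With K=[5 x 5] and P=[2 x 2] the output keeps the input's spatial size, so a 96x96 input yields a 96x96 output per channel.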

@asmorkalov asmorkalov removed the "pr: needs test" label (New functionality requires minimal tests set) Jul 11, 2023
@zihaomu
Member Author

zihaomu commented Jul 12, 2023

Looks like the default CI has a failure unrelated to this PR.

@asmorkalov
Contributor

asmorkalov commented Jul 14, 2023

Perf results for i5-2500K CPU @ 3.30GHz (No AVX2)

| Name of Test | 4.x depthwise 5x5 1 | fix depthwise 5x5 1 | fix depthwise 5x5 1 vs 4.x depthwise 5x5 1 |
|---|---|---|---|
| conv::Conv::(GFLOPS=0.472, K=[5 x 5], IN={1, 32, 96, 96}, OCN=32, G=32, P=[2 x 2], BIAS, OCV/CPU) | 5.111 | 1.068 | 4.79 |

@asmorkalov
Contributor

Perf results for ARM v7 (Jetson tk1):

| Name of Test | 4.x depthwise 5x5 1 | fix depthwise 5x5 1 | fix depthwise 5x5 1 vs 4.x depthwise 5x5 1 |
|---|---|---|---|
| conv::Conv::(GFLOPS=0.472, K=[5 x 5], IN={1, 32, 96, 96}, OCN=32, G=32, P=[2 x 2], BIAS, OCV/CPU) | 64.849 | 18.178 | 3.57 |
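
As a side note on the test name, the parameters in it are enough to count the multiply-accumulate operations of the layer. The sketch below does that arithmetic for the grouped (depth-wise) case and for a hypothetical dense convolution with the same shapes. The helper names are made up for illustration, stride 1 is assumed, and how the perf harness actually derives its GFLOPS label is not stated in this thread.

```python
def conv_output_size(size, kernel, pad, stride=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_macs(in_shape, out_channels, kernel, pad, groups):
    """Multiply-accumulate count for a grouped 2D convolution (stride 1 assumed)."""
    batch, in_channels, height, width = in_shape
    kh, kw = kernel
    out_h = conv_output_size(height, kh, pad[0])
    out_w = conv_output_size(width, kw, pad[1])
    # Each output value sums over the input channels of its own group only.
    return batch * out_h * out_w * out_channels * (in_channels // groups) * kh * kw


# Parameters taken from the test name: IN={1, 32, 96, 96}, OCN=32, K=[5 x 5], P=[2 x 2].
IN_SHAPE = (1, 32, 96, 96)
depthwise_macs = conv_macs(IN_SHAPE, 32, (5, 5), (2, 2), groups=32)  # G=32
dense_macs = conv_macs(IN_SHAPE, 32, (5, 5), (2, 2), groups=1)       # hypothetical G=1

# Counting one multiply and one add per MAC.
depthwise_gflop = 2 * depthwise_macs / 1e9
dense_gflop = 2 * dense_macs / 1e9

if __name__ == "__main__":
    print(f"depth-wise: {depthwise_gflop:.4f} GFLOP, dense: {dense_gflop:.4f} GFLOP")
```

Under these assumptions the dense count comes to about 0.472 GFLOP and the grouped count to about 0.015 GFLOP. The dense figure coincides with the GFLOPS=0.472 in the test name, which suggests, but does not confirm, that the label does not account for grouping.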

Contributor

@asmorkalov asmorkalov left a comment

👍

@asmorkalov asmorkalov merged commit 1920993 into opencv:4.x Jul 14, 2023
21 checks passed
@asmorkalov asmorkalov mentioned this pull request Jul 27, 2023
asmorkalov pushed a commit that referenced this pull request Sep 27, 2023
DNN: optimize the speed of general Depth-wise #23952

Try to solve the issue: #23941

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
thewoz pushed a commit to thewoz/opencv that referenced this pull request Jan 4, 2024
thewoz pushed a commit to thewoz/opencv that referenced this pull request May 29, 2024
Successfully merging this pull request may close these issues.

Depthwise Convolution layer with 5x5 kernel much slower than 4.7.0
5 participants