DNN: clean old convolution and optimize depth-wise Conv, Conv1D and Conv3D #22905
Conversation
Hi @alalek, it seems that the default CI error is not related to this PR.
#if CV_TRY_AVX2
    if (useAVX2)
        opt_AVX2::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
    else
#endif
#if CV_TRY_AVX
    if (useAVX)
        opt_AVX::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
    else
#endif
#if CV_TRY_RVV
    if (useRVV)
        opt_RVV::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
    else
#endif
#if CV_TRY_LASX
    if (useLASX)
        opt_LASX::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
We need a performance comparison report in the PR description for such massive changes.
It also makes sense to compare memory consumption (to avoid 3x regressions like the recent case).
Hi @alalek. Thanks for your reminder. The purpose of this PR is to let Conv3D and Conv1D execute in the new branch; it will not affect the speed or memory consumption of Conv2D. (Related Conv2D PR: #21910.)
I will update the speed performance test and the memory consumption numbers for Conv3D and Conv1D.
In theory, Conv3D and Conv1D require twice as much memory as before, since we need to repack the weights at the fast_conv initialization stage. It will not reach 7 times (because Conv3D and Conv1D do not support Winograd).
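As a rough illustration of that 2x estimate, here is a minimal sketch; the tensor shapes and the helper function are hypothetical examples, not the actual fast_conv layout:

```python
# Hedged sketch: estimate weight memory before/after repacking.
# Shapes below are made-up examples, not taken from the PR.

def conv3d_weight_bytes(out_ch, in_ch, kd, kh, kw, dtype_bytes=4):
    """Bytes held by an OCN x C x KD x KH x KW float32 weight tensor."""
    return out_ch * in_ch * kd * kh * kw * dtype_bytes

original = conv3d_weight_bytes(64, 32, 3, 3, 3)
# The fast path keeps the original weights plus a repacked copy of the
# same size, so total weight memory roughly doubles (2x, not 7x, since
# no Winograd buffers are allocated for Conv1D/Conv3D).
total = original * 2
print(original, total)
```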
Please use the OpenCV performance testing infrastructure instead of some kind of manual testing:
- A single result is not enough. We must ensure that there are no regressions in other cases. Resnet34_kinetics: there is no such test in opencv_perf_dnn.
- CPU/SIMD code optimization validation should be done with --perf_threads=1.
- i7-12700K has P and E cores. The test process needs to be bound to selected CPUs.
Finally, we have regressions in the conv1d case:

| Name of Test | base | patch | x-factor |
|---|---|---|---|
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1), BIAS, OCV/CPU) | 0.001 | 0.001 | 0.74 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 25}, OCN=2, G=2, P=(2, 2), PM=SAME, OCV/CPU) | 0.001 | 0.001 | 0.72 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 6, 10}, OCN=6, PM=VALID, BIAS, OCV/CPU) | 0.001 | 0.001 | 0.74 |

(CPU: i7-12700K, Linux)
Test launch cmd:
taskset -c 0 ./bin/opencv_perf_dnn '--gtest_filter=*' --gtest_output=xml:../perf/pr22905/0-1th.xml --perf_threads=1
Report cmd:
python3 ${OPENCV_SRC}/modules/ts/misc/summary.py -m min -f conv1d ../perf/pr22905/{0,1}-1th.xml
(use -o markdown for a GitHub report)
> i7-12700K has P and E cores. Need to bind the test process to selected CPUs.

Thanks for your reminder. How about I add a new performance result with a single thread? I'm not sure if this is enough.

> Resnet34_kinetics: there is no such test in opencv_perf_dnn.

Resnet34_kinetics is an accuracy test, and the backbone of Resnet34_Conv3d is a typical implementation of Conv3D.
The performance report is not intended to show a single non-reproducible marketing number.
The performance report is required to track regressions and to avoid them. We need all numbers from the performance tests.
@vpisarev The pipeline used for optimization development has serious gaps. Ignoring the existence of performance tests during optimization is not an acceptable flow. We need to fix that process.
I will update the performance test results using the performance testing infrastructure.
Because the amount of computation in the Conv1D performance tests is so small, I am worried that the results cannot reflect models in real application scenarios.
Hi @alalek, the default CI still fails, and it seems that these errors are not related to this PR.
Hi @alalek, I just checked the CI errors:
these three test regressions only contain the
Ignore them. They are not related to this patch (just check the builder history for the same target branch).
The pull request improves speed and simplifies the implementation (removes old branches). I suggest merging it.
int conv_dim = CONV_2D;
if (inputs[0].dims == 3)
    conv_dim = CONV_1D;
if (inputs[0].dims == 5)
    conv_dim = CONV_3D;
List of regressions for non-SIMD (fully scalar) mode:

cmake ... -DCMAKE_BUILD_TYPE=Release -DCV_DISABLE_OPTIMIZATION=ON

$ python3 ../../dev/modules/ts/misc/summary.py -m mean --regressions-only=0.97 ../perf/pr22905/nosimd-{0,1}-1th.xml -o markdown

| Name of Test | nosimd-0-1th | nosimd-1-1th | x-factor (nosimd-1-1th vs nosimd-0-1th) |
|---|---|---|---|
| YOLOv4_tiny_2::Layer_Slice::OCV/CPU | 0.016 | 0.017 | 0.97 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1), BIAS, OCV/CPU) | 0.001 | 0.001 | 0.78 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 25}, OCN=2, G=2, P=(2, 2), PM=SAME, OCV/CPU) | 0.001 | 0.001 | 0.77 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 6, 10}, OCN=6, PM=VALID, BIAS, OCV/CPU) | 0.001 | 0.001 | 0.91 |
| conv3d::Conv3D::(GFLOPS=0.000, K=[3 x 3 x 3], IN={1, 2, 19, 19, 19}, OCN=2, G=2, S=[2 x 2 x 2], P=(1, 1) x (1, 1) x (1, 1), BIAS, OCV/CPU) | 0.062 | 0.080 | 0.78 |
| conv3d::Conv3D::(GFLOPS=0.001, K=[3 x 3 x 3], IN={1, 2, 25, 19, 19}, OCN=2, G=2, S=[1 x 2 x 2], P=(2, 2) x (2, 2) x (2, 2), PM=SAME, OCV/CPU) | 0.169 | 0.197 | 0.86 |

(CPU: i7-12700K)
There are regressions since the 4.6.0 release:

| Name of Test | nosimd-4.6.0-1th | nosimd-1-1th | x-factor (nosimd-1-1th vs nosimd-4.6.0-1th) |
|---|---|---|---|
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1), BIAS, OCV/CPU) | 0.001 | 0.001 | 0.75 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 25}, OCN=2, G=2, P=(2, 2), PM=SAME, OCV/CPU) | 0.001 | 0.001 | 0.74 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 6, 10}, OCN=6, PM=VALID, BIAS, OCV/CPU) | 0.001 | 0.001 | 0.88 |
| conv3d::Conv3D::(GFLOPS=0.000, K=[3 x 3 x 3], IN={1, 2, 19, 19, 19}, OCN=2, G=2, S=[2 x 2 x 2], P=(1, 1) x (1, 1) x (1, 1), BIAS, OCV/CPU) | 0.063 | 0.080 | 0.79 |
| conv3d::Conv3D::(GFLOPS=0.001, K=[3 x 3 x 3], IN={1, 2, 25, 19, 19}, OCN=2, G=2, S=[1 x 2 x 2], P=(2, 2) x (2, 2) x (2, 2), PM=SAME, OCV/CPU) | 0.161 | 0.197 | 0.82 |
| conv::Conv::(GFLOPS=1.660, K=[3 x 3], IN={1, 128, 75, 75}, OCN=128, G=128, P=[1 x 1], BIAS, OCV/CPU) | 1.066 | 2.823 | 0.38 |
| conv::Conv::(GFLOPS=1.704, K=[3 x 3], IN={1, 256, 38, 38}, OCN=256, G=256, P=[1 x 1], BIAS, OCV/CPU) | 0.585 | 1.609 | 0.36 |
| conv::Conv::(GFLOPS=1.704, K=[3 x 3], IN={1, 512, 19, 19}, OCN=512, G=512, P=[1 x 1], BIAS, OCV/CPU) | 0.281 | 0.975 | 0.29 |
Thanks for your test. I will try to optimize the performance of the new engine on x86; the fix will be delivered in a day or two.
Performance comparison for armv7 with NEON:
There are several significant regressions for both conv3d and conv1d.
Please use
Hi @alalek, @asmorkalov and @vpisarev. I have tried my best to speed up all cases in the performance test. There are still some regressions for Conv1D, but for Conv3D and Conv2D we get a noticeable speedup. Regarding the performance test of Conv1D:
why can't Conv1D get the acceleration of the new engine like Conv2D or Conv3D?
@alalek, @asmorkalov, what is the final decision? I guess we get noticeable acceleration in most cases. Shall we merge it before the release?
Updated benchmark result on armv7+NEON (Jetson TK1):
@zihaomu please squash commits before merge.
The purpose of this PR:
Speed performance test for Conv3D and Conv1D.

Speed test on Apple M1 (ARMv8):
Conv1D, Conv2D and Conv3D performance test with SIMD
Min (ms)
Conv3D model performance test (manual)

Test on i7-12700K (x86_64):
Conv1D, Conv2D and Conv3D performance test with SIMD
Min (ms)
Conv1D, Conv2D and Conv3D performance test without SIMD
Min (ms)
Conv3D model performance test (manual)
Run a thousand times and choose the shortest time.
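The min-of-N methodology can be sketched as follows (a generic timing helper for illustration, not the actual benchmark harness used for these numbers):

```python
import time

def min_time(fn, runs=1000):
    """Run fn `runs` times and return the shortest wall-clock time in seconds.

    Taking the minimum filters out scheduler noise and cache-cold outliers,
    which is why the tables above quote Min (ms)."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Example: time a small workload with 100 repetitions.
elapsed_ms = min_time(lambda: sum(range(1000)), runs=100) * 1000.0
```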
Manually test the code
Memory usage increase
With the fast_conv implementation, we need to repack the convolution weights to run the convolution as fast as possible, and this doubles the memory requirements for Conv1D and Conv3D compared to the previous implementation. Note: this does not affect Conv2D memory consumption, since Conv2D was already supported by fast_conv before this PR.
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.