Fluid: SIMD for Resize Linear 8UC3 #20664
Conversation
Force-pushed from c83d0f9 to 7708f35
I haven't figured out the rest yet.
The rest looks good to me.
static void resetScratch(cv::gapi::fluid::Buffer& /*scratch*/)
Do we need this at all if it has an empty implementation anyway?
Force-pushed from 53c0ea4 to d5c5070
Force-pushed from 8c57290 to 56cd203
Force-pushed from f66aaa1 to 016896b
Force-pushed from 016896b to 3effa66
jenkins cn please retry a build
Force-pushed from 3c67cd4 to 761397e
There are some CI failures; I believe they come from Canny invocations on images which were resized by non-bit-exact resizers. Maybe if we change the evaluation criteria to allow some percentage of pixels to differ (1%, for example), this issue will be fixed? I think there are dedicated functions for such image comparison in our test system.
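A tolerant comparison could look roughly like this sketch (the helper name and signature are mine, not the actual test system's API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of a tolerant image comparison: returns the
// fraction of pixels whose absolute difference exceeds `tol`.
// A CI criterion could then be e.g. fracDiffPixels(a, b) <= 0.01.
inline double fracDiffPixels(const std::vector<uint8_t>& a,
                             const std::vector<uint8_t>& b,
                             int tol = 0)
{
    if (a.size() != b.size() || a.empty())
        return 1.0;  // treat a size mismatch as a full failure
    size_t bad = 0;
    for (size_t i = 0; i < a.size(); ++i)
        if (std::abs(int(a[i]) - int(b[i])) > tol)
            ++bad;
    return double(bad) / double(a.size());
}
```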
@@ -2114,6 +2114,51 @@ PERF_TEST_P_(ResizePerfTest, TestPerformance)
    SANITY_CHECK_NOTHING();
}
PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)
Please rename this test to reflect what it actually tests, or add a verbose comment with a link to StackOverflow discussion and a brief description
Done.
PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)
Why do we need that performance test case? Why is it not accuracy?
What is "done"?
This test case was created to track the performance of the scenario mentioned here:
https://stackoverflow.com/questions/60629331/opencv-gapi-performance-not-good-as-expected
As mentioned in the StackOverflow post, the kernels gathered in this test case have low performance compared to OpenCV, so we need this test to track our performance improvements.
I renamed the test and added an explanatory comment.
Force-pushed from 641e16c to e99b128
Force-pushed from 660670c to d46d768
Currently none of the launched builders on public CI compile/run the provided code (the baseline is SSE3; SSE4.2 is enabled through dispatching only, which is not the case here).
// pixel_1 (rgbx)
vec.val = _mm_insert_epi32(vec.val, *reinterpret_cast<const int*>(&src[chanNum * (*index + pos)]), 0);
// pixel_2 (rgbx)
vec.val = _mm_insert_epi32(vec.val, *reinterpret_cast<const int*>(&src[chanNum * (*(index + 1) + pos)]), 1);
*index → index[0]
*(index + 1) → index[1]
Keep the code readable.
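For reference, the two spellings are guaranteed equivalent by the language, even for a pointer into the middle of an array (illustrative snippet, not the PR's code):

```cpp
#include <cassert>

// For any pointer p and integer i, *(p + i) is defined to be p[i],
// even when p already points into the middle of an array
// (e.g. p = &alpha[x], as in the discussion above).
inline int secondFromOffsetPointer(const int* alpha, int x)
{
    const int* index = &alpha[x];     // pointer to the x-th element
    assert(*(index + 1) == index[1]); // identical reads by definition
    return index[1];                  // i.e. alpha[x + 1]
}
```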
The index pointer assumes &alpha[x], i.e. it is a pointer to the x-th element of the alpha array, so I'm afraid it can't be navigated with the index[0] and index[1] construction.
Otherwise, I would love to use [] to navigate through the array here.
so I'm afraid that it can't be
https://stackoverflow.com/questions/4622461/difference-between-pointer-index-and-pointer
const Size& ,
const int )
{
    CV_Assert("Unsupported number of channel");
Why do we raise an exception for this case? How is it handled?
The fact is that at the moment the Resize kernel in Fluid can process only 3-channel images. In the next PRs I'm going to add 1- and 4-channel versions by adding two more specializations of this template. Processing images with a different number of channels (other than 1, 3 and 4) will not be supported. So when users try to pass an image with another number of channels, we must notify them that this functionality is not supported, which I do with CV_Assert. Could you please tell me what suggestions you have in this regard?
It is very confusing to emit such errors under conditions like "if SSE42" and "inputWidth > 16".
The supported-parameters check should be done at the upper level; optimization code paths should not do that.
Yes, I agree.
The main check for the number of channels is also done two levels up - here.
So this CV_Assert in this line of the current file (gfluidcore_simd_sse42.hpp) is just a formality. It was possible to leave this method empty, but I decided to add an assert just in case.
method empty
An empty implementation is the wrong way. The method should be left undefined (to get an error from the linker), or use static_assert() (to get an error message from the compiler).
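A static_assert version could look like this sketch (the template name and signature are illustrative, not the PR's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the reviewer's suggestion: make unsupported
// channel counts a compile-time error instead of a runtime CV_Assert
// inside the optimization path.
template <int chanNum>
int horizontalPassStub(const uint8_t* /*src*/, uint8_t* /*dst*/, int /*width*/)
{
    static_assert(chanNum == 1 || chanNum == 3 || chanNum == 4,
                  "Resize: only 1-, 3- and 4-channel images are supported");
    // ... the real per-channel implementation would go here ...
    return chanNum;  // returned only so the stub is observable in a test
}
```

Instantiating the template with any other channel count then fails at compile time, so the unsupported case can never reach the optimized code.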
Applied.
#if CV_SSE4_1
#include "gfluidcore_simd_sse42.hpp"
In fact, the current SIMD implementation of Resize is limited to the SSE2, SSE3 and SSE4.1 instruction sets, so I enable it when these sets are present in the system and in the CMake configuration. The real target is SIMD128.
In the future, other kernel SIMD implementations using SSE4.2 instructions may be added to this file.
So it is not so important whether the file is named SSE41 or SSE42.
Anyway, we should not have such a mess.
The sse42 filename suggests that we (or future contributors) can place SSE4.2-specific code there. But really we can't, because this file is used under an sse41 guard.
Select and use one consistent value.
Ok. Done.
Applied.
Force-pushed from d46d768 to c68c199
Perhaps it is better to:
- first fix Fluid's comparison against OpenCV without any optimizations (no threading, no SIMD)
- after that, try to optimize Fluid with SIMD
Currently it is not clear what is really done here, and which "tunes" have an effect.
There are several code parts where SIMD is added through scalar processing (filling registers with "insert" ops) - it may not work at all, or even produce worse results.
@@ -2949,109 +2953,295 @@ GAPI_FLUID_KERNEL(GFluidPhase, cv::gapi::core::GPhase, false)
    }
};
GAPI_FLUID_KERNEL(GFluidResize, cv::gapi::core::GResize, true)
Performance results are questionable.
To properly measure related code we need:
- disabled threading (fluid is 1 threaded for now)
- disable IPP (to compare with existed OpenCV SIMD)
- force SSE4.2 during the build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCPU_BASELINE=SSE4_2 -DCPU_DISPATCH= ...
$ ninja opencv_perf_gapi && OPENCV_IPP=disabled ./bin/opencv_perf_gapi --perf_threads=1 --perf_min_samples=100 --perf_force_samples=200 --gtest_filter=*esize*:*Bottleneck*:*StackOverflow* --gtest_output=xml:../perf/pr20664.xml
Results with i5-6600 are below.
We still have slow test here:
python3 <opencv>/modules/ts/misc/summary.py -f 'Bottle' -m median ../perf/pr20664.xml
...
TestPerformance::BottlenecksPerfTestCPU/BottlenecksPerfTest::(compare_f, { gapi.kernel_package }) 0.946
TestPerformance::BottlenecksPerfTestFluid/BottlenecksPerfTest::(compare_f, { gapi.kernel_package, gapi.kernel_package }) 1.005
All cases from here:
python3 <opencv>/modules/ts/misc/summary.py -f 'StackOverflowPerfTest' -m median ../perf/pr20664.xml
...
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 128x128, { gapi.kernel_package }) 0.064
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 640x480, { gapi.kernel_package }) 0.709
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 1280x720, { gapi.kernel_package }) 2.012
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 1920x1080, { gapi.kernel_package }) 4.580
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 128x128, { gapi.kernel_package, gapi.kernel_package }) 0.071
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 640x480, { gapi.kernel_package, gapi.kernel_package }) 0.758
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 1280x720, { gapi.kernel_package, gapi.kernel_package }) 2.130
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 1920x1080, { gapi.kernel_package, gapi.kernel_package }) 4.822
And here is a comparison of the "CPU" and "Fluid" branches:
$ python3 <opencv>/modules/ts/misc/summary.py -f '8UC3.*1080' -m median ../perf/pr20664.xml -o markdown
Name of Test | 10 |
---|---|
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) | 0.048 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) | 0.175 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) | 0.130 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) | 0.622 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) | 0.078 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) | 0.385 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) | 0.314 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) | 0.728 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) | 2.199 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) | 2.650 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) | 2.643 |
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) | 0.727 |
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) | 0.137 |
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) | 0.666 |
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) | 0.190 |
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) | 0.947 |
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 0, 1920x1080, 32x32, { gapi.kernel_package }) | 0.016 |
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 0, 1920x1080, 64x64, { gapi.kernel_package }) | 0.027 |
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 32x32, { gapi.kernel_package }) | 0.027 |
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 64x64, { gapi.kernel_package }) | 0.036 |
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 3, 1920x1080, 32x32, { gapi.kernel_package }) | 5.146 |
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 3, 1920x1080, 64x64, { gapi.kernel_package }) | 5.221 |
TestPerformance::ResizePerfTestFluid/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 30x30, { gapi.kernel_package }) | 0.049 |
TestPerformance::ResizePerfTestFluid/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 64x64, { gapi.kernel_package }) | 0.083 |
There is some exception for "1920x1080, 0.5, 0.1" case.
Ok, thank you for the performance report.
Could you please tell me whether you ran these test cases for Fluid without my SIMD optimization? Your results are not indicative without the results of the scalar version of Fluid's Resize 8UC3.
I see SIMD speedup, but I also see significant degradation which comes from the Fluid kernel itself.
In summary, despite this hard work we don't reach the baseline performance (no IPP, single-threaded OpenCV code).
For the same result we could just exclude the Resize kernel from Fluid kernels().
In code optimization we need to minimize algorithm complexity and the overhead of processing tasks before trying vectorization or other limited micro-optimizations.
Resize operation is usually processed as some kind of separable filter:
- horizontal pass
- vertical pass
Currently, I don't see such separation in the G-API implementation (I would expect to see GResizeHPass / GResizeVPass kernels in the pipeline).
I see SIMD speedup, but I also see significant degradation which comes from the Fluid kernel itself.
In summary, despite this hard work we don't reach the baseline performance (no IPP, single-threaded OpenCV code).
For the same result we could just exclude the Resize kernel from Fluid kernels().
Fluid kernels, run singly, can often be slower than OpenCV's algorithms. We can observe their true effectiveness only in a pipeline.
So the test results of a lone kernel are not indicative.
Resize operation is usually processed as some kind of separable filter:
- horizontal pass
- vertical pass
Currently, I don't see such separation in the G-API implementation (I would expect to see GResizeHPass / GResizeVPass kernels in the pipeline).
Yes, you are right. And so the algorithm in this PR is divided into vertical and horizontal passes. Please take a look here and here.
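For illustration, here is a scalar sketch of the separable scheme being discussed (single channel, float weights; much simplified compared to the fixed-point Fluid code, and not the PR's actual implementation):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar sketch of separable bilinear resize, single channel:
// for each output row, first blend the two contributing source rows
// vertically into a temporary line, then interpolate horizontally.
std::vector<uint8_t> resizeBilinear1C(const std::vector<uint8_t>& src,
                                      int sw, int sh, int dw, int dh)
{
    std::vector<uint8_t> dst(size_t(dw) * dh);
    std::vector<float> tmp(sw);               // vertically blended line
    const float fx = float(sw) / dw, fy = float(sh) / dh;
    for (int y = 0; y < dh; ++y)
    {
        float sy = (y + 0.5f) * fy - 0.5f;
        float beta = sy - std::floor(sy);
        int y0 = int(std::floor(sy));
        if (y0 < 0)      { y0 = 0;      beta = 0.f; }
        if (y0 > sh - 1) { y0 = sh - 1; beta = 0.f; }
        int y1 = (y0 + 1 < sh) ? y0 + 1 : y0;
        for (int x = 0; x < sw; ++x)          // vertical pass
            tmp[x] = (1.f - beta) * src[size_t(y0) * sw + x]
                   +         beta * src[size_t(y1) * sw + x];
        for (int x = 0; x < dw; ++x)          // horizontal pass
        {
            float sx = (x + 0.5f) * fx - 0.5f;
            float alpha = sx - std::floor(sx);
            int x0 = int(std::floor(sx));
            if (x0 < 0)      { x0 = 0;      alpha = 0.f; }
            if (x0 > sw - 1) { x0 = sw - 1; alpha = 0.f; }
            int x1 = (x0 + 1 < sw) ? x0 + 1 : x0;
            dst[size_t(y) * dw + x] =
                uint8_t((1.f - alpha) * tmp[x0] + alpha * tmp[x1] + 0.5f);
        }
    }
    return dst;
}
```

The vertical pass is shared by all output pixels of a row, which is what makes the two-pass formulation cheaper than computing the full 4-tap blend per output pixel.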
But the Blur kernel hasn't been sped up yet.
So we need other performance tests with Fluid-compatible pipelines after/before resize (replace the problematic blur/canny with element-wise add/sub operations) to see the clear benefit of these optimizations of the resize stage.
I've added a new perf test, called ResizeInSimpleGraph*. An updated perf report will be added a little later.
Hi all,
Not sure what kind of input is expected here from my end, but w.r.t. this:
For the same result we can just exclude Resize kernel from fluid kernels().
The key thing about using Fluid kernels is not their individual performance but pipeline performance. Individual performance may be comparable to, or even poorer than, a single cv:: or IPP call, but when it comes to a pipeline the situation (usually) changes. E.g. (taken from https://pullrequest.opencv.org/buildbot/builders/precommit_linux64/builds/34060/steps/perf_gapi/logs/stdio):
[----------] 15 tests from Benchmark/SobelEdgeDetector
[ RUN ] Benchmark/SobelEdgeDetector.Fluid/0, where GetParam() = 320x240
[ PERFSTAT ] (samples=1 mean=0.61 median=0.61 min=0.61 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.Fluid/0 (4 ms)
[ RUN ] Benchmark/SobelEdgeDetector.Fluid/1, where GetParam() = 640x480
[ PERFSTAT ] (samples=1 mean=2.04 median=2.04 min=2.04 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.Fluid/1 (10 ms)
[ RUN ] Benchmark/SobelEdgeDetector.Fluid/2, where GetParam() = 1280x720
[ PERFSTAT ] (samples=1 mean=5.65 median=5.65 min=5.65 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.Fluid/2 (25 ms)
[ RUN ] Benchmark/SobelEdgeDetector.Fluid/3, where GetParam() = 1920x1080
[ PERFSTAT ] (samples=1 mean=12.70 median=12.70 min=12.70 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.Fluid/3 (55 ms)
[ RUN ] Benchmark/SobelEdgeDetector.Fluid/4, where GetParam() = 3840x2170
[ PERFSTAT ] (samples=1 mean=51.19 median=51.19 min=51.19 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.Fluid/4 (237 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV/0, where GetParam() = 320x240
[ PERFSTAT ] (samples=1 mean=0.50 median=0.50 min=0.50 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV/0 (2 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV/1, where GetParam() = 640x480
[ PERFSTAT ] (samples=1 mean=3.82 median=3.82 min=3.82 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV/1 (15 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV/2, where GetParam() = 1280x720
[ PERFSTAT ] (samples=1 mean=12.31 median=12.31 min=12.31 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV/2 (46 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV/3, where GetParam() = 1920x1080
[ PERFSTAT ] (samples=1 mean=30.73 median=30.73 min=30.73 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV/3 (116 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV/4, where GetParam() = 3840x2170
[ PERFSTAT ] (samples=1 mean=153.73 median=153.73 min=153.73 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV/4 (525 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/0, where GetParam() = 320x240
[ PERFSTAT ] (samples=1 mean=0.52 median=0.52 min=0.52 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/0 (3 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/1, where GetParam() = 640x480
[ PERFSTAT ] (samples=1 mean=3.02 median=3.02 min=3.02 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/1 (14 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/2, where GetParam() = 1280x720
[ PERFSTAT ] (samples=1 mean=9.48 median=9.48 min=9.48 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/2 (46 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/3, where GetParam() = 1920x1080
[ PERFSTAT ] (samples=1 mean=21.38 median=21.38 min=21.38 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/3 (116 ms)
[ RUN ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/4, where GetParam() = 3840x2170
[ PERFSTAT ] (samples=1 mean=87.20 median=87.20 min=87.20 stddev=0.00 (0.0%))
[ OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/4 (452 ms)
[----------] 15 tests from Benchmark/SobelEdgeDetector (1666 ms total)
You may note how the performance differs based on the image resolution.
Regarding splitting the single resize into two kernels based on its separable property: that's an interesting idea but won't give much. In fact, these two passes now run via the scratch buffer (allocated for the kernel), and with such a change this buffer would move to the graph side, while the scheduling & execution overhead may grow.
BTW, don't use perf numbers from precommit/nightly builders. These perf tests are run in sanity-check mode (samples=1) with several concurrent processes/threads from other tests (the results are neither reliable nor reproducible).
Unfortunately latencies of
You would see a 2x speed-up for Resize 8UC3 on the target cases (2x downscaling) if you compared test results before and after my SIMD optimization.
{
    for (; x <= outSz.width - nlanes; x += nlanes)
    {
#ifdef _WIN64
Why do we have this in the middle of SIMD code?
The code size can be reduced 2x, by up to 30 lines.
Anyway, a compiler check must be applied here, not a target-platform check.
Because I didn't see another (more beautiful) way. Could you please share a more beautiful approach if you know one?
Code size can be reduced 2x, up to 30 lines.
Could you please suggest how I can do this?
Create a small static inline function with #ifdef _MSC_VER branches inside.
But it is better to use a SIMD wide load + shuffle instead of scalar loading through int64. Or even read alpha "as is" (don't use 4 clones) and clone the values with SIMD instructions.
BTW, be careful: MSVS2015/2017 may cause some bugs (#20894)
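A sketch of such a wrapper (the function name is mine; only SSE2 intrinsics are used, since the 64-bit-element set/insert intrinsics are what 32-bit MSVC lacks):

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical wrapper hiding the compiler difference in one place:
// pack two 64-bit values into one __m128i (a -> low half, b -> high half).
static inline __m128i load_pair_i64(const int64_t* a, const int64_t* b)
{
#ifdef _MSC_VER
    // Some 64-bit-element intrinsics are unavailable on 32-bit MSVC,
    // so combine two 64-bit loads with an unpack instead.
    __m128i lo = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(a));
    __m128i hi = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    return _mm_unpacklo_epi64(lo, hi);
#else
    return _mm_set_epi64x(*b, *a);  // note: the high element comes first
#endif
}
```

The SIMD loops then call this helper unconditionally, so the per-compiler branching is confined to a single place instead of being repeated in the middle of the kernel.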
cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);

cv::GMat in;
cv::GMat vga = cv::gapi::resize(in, cv::Size(), 0.5, 0.5, 1);
1
Make the code readable; avoid magic constants.
Ok. Changed.
cv::resize(in_mat1, cvvga, cv::Size(), 0.5, 0.5);
cv::cvtColor(cvvga, cvgray, cv::COLOR_BGR2GRAY);
cv::blur(cvgray, cvblurred, cv::Size(3, 3));
cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);
Why do we need two similar tests?
Compare with code from the test above:
cv::resize(in_mat1, cvvga, cv::Size(), 0.5, 0.5);
cv::cvtColor(cvvga, cvgray, cv::COLOR_BGR2GRAY);
cv::blur(cvgray, cvblurred, cv::Size(3, 3));
cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);
Reworked.
__m128i a11 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a22 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]));
__m128i a23 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a33 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
clone[4 * (x + 3)]
You read each value 3 times in general.
Compilers are smart enough and do the CSE job for you, but the code looks bad (it is not in a readable form).
clone
is a meaningless name.
You probably don't need 4 clones of alpha. These "clone" values are stored in a linear block:
- clone[4 * x]
- clone[4 * (x + 1)]
- ...
- clone[4 * (x + 15)]
No need to read them one by one.
You are using scalar reads into SIMD registers here. This is an anti-pattern: the number of load-store units on a CPU core is limited, and scalar filling is much slower than reading into a SIMD register and shuffling.
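A sketch of the load-then-shuffle pattern being suggested (illustrative names; SSE2-only intrinsics here, while real code could also use pshufb/_mm_shuffle_epi8 for arbitrary lane patterns):

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// One wide load pulls in eight 16-bit alpha values at once; in-register
// shuffles then replicate the needed value across all lanes, instead of
// issuing a scalar load per register (which saturates the load-store units).
static inline __m128i broadcast_alpha0(const int16_t* alpha)
{
    __m128i a  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(alpha)); // alpha[0..7]
    __m128i lo = _mm_shufflelo_epi16(a, _MM_SHUFFLE(0, 0, 0, 0)); // alpha[0] in the 4 low lanes
    return _mm_unpacklo_epi64(lo, lo);                            // duplicate to all 8 lanes
}
```

With the alpha values stored linearly (no 4x pre-cloning), one such load can feed several broadcast shuffles, replacing many of the _mm_setr_epi64x scalar fills above.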
__m128i a00 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * x]), *reinterpret_cast<const int64_t*>(&clone[4 * x]));
__m128i a01 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * x]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a11 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a22 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]));
__m128i a23 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a33 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a44 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]));
__m128i a45 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]));
__m128i a55 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]));
__m128i a66 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]));
__m128i a67 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]));
__m128i a77 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]));
__m128i a88 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]));
__m128i a89 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]));
__m128i a99 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]));
__m128i a1010 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]));
__m128i a1011 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]));
__m128i a1111 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]));
__m128i a1212 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]));
__m128i a1213 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]));
__m128i a1313 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]));
__m128i a1414 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]));
__m128i a1415 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]));
__m128i a1515 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]));
Compare:
- the number of declared local variables
- the number of available CPU SSE registers
How can this code be efficient?
Hopefully the compiler is smart enough.
__m128i val1hi = _mm_insert_epi64(_mm_loadl_epi64(reinterpret_cast<const __m128i*>(&src1[2][w])),
                                  *reinterpret_cast<const int64_t*>(&src1[3][w]), 1);
#endif
__m128i val0_0 = _mm_cvtepu8_epi16(val0lo);
It's more a matter of taste
The limited number of load-store units (LSUs) disagrees with you.
However if I apply this advice I would have to re-run all perf tests and gather all the statistics again to check that performance hasn't decreased
This should take up to 10 minutes, including report analysis, if you use the right tools (OpenCV has them!).
#include "opencv2/gapi/own/saturate.hpp"

#include <nmmintrin.h>
This is SSE4.2 header: https://stackoverflow.com/questions/11228855/header-files-for-x86-simd-intrinsics
(to the topic of SSE4.2/4.1 mess)
Applied.
Thank you for update!
No more comments from my side.
Awaiting approval from G-API team member before merge.
@alalek thanks for your thorough review here! Actually, we discussed the approach with Anna yesterday and agreed to explore the "OpenCV way" of packing vectors, to see if it helps reduce load-store pressure and maybe simplify the code while keeping performance at the same or a better level. Anna will follow up on this. :)
Force-pushed from 9904bde to b24752f
Decision: please merge it as-is, so that we have a stable baseline now for future improvements (not to mix things up, and not to roll back if the idea doesn't work out).
Fluid: SIMD for Resize Linear 8UC3
* SIMD for fluid::Resize 8U3C
* Rework horizontal pass + add 8U4C case
* Reproduce stackoverflow test
* StackOverflow test
* SSE42 impl
* SSE42 impl improvement
* GAPI:SSE42 simd opt for Resize 8UC3. Final version
* Fix tests
* Conditional compilation fix
* Applied comments
* Applied comments. Step2
* Applied comments. Step2
SIMD optimization for fluid::core::Resize 8UC3 bilinear.
@smirnov-alexey @TolyaTalamanov @mpashchenkov @rgarnov please take a look. Report:
Resize_ocv_sse42.xlsx