
Fluid: SIMD for Resize Linear 8UC3 #20664

Merged · 12 commits · Nov 10, 2021
Conversation

anna-khakimova
Member

@anna-khakimova anna-khakimova commented Sep 6, 2021

SIMD optimization for fluid::core::Resize 8UC3 bilinear.

@smirnov-alexey @TolyaTalamanov @mpashchenkov @rgarnov please take a look.

Report:
Resize_ocv_sse42.xlsx

force_builders=Linux AVX2,Custom,Custom Win,Custom Mac
build_gapi_standalone:Linux x64=ade-0.1.1f
build_gapi_standalone:Win64=ade-0.1.1f
Xbuild_gapi_standalone:Mac=ade-0.1.1f
build_gapi_standalone:Linux x64 Debug=ade-0.1.1f

build_image:Custom=centos:7
buildworker:Custom=linux-1
build_gapi_standalone:Custom=ade-0.1.1f

Xbuild_image:Custom=ubuntu-openvino-2021.3.0:20.04
build_image:Custom Win=openvino-2021.4.1
build_image:Custom Mac=openvino-2021.2.0

buildworker:Custom Win=windows-3

test_modules:Custom=gapi,python2,python3,java
test_modules:Custom Win=gapi,python2,python3,java
test_modules:Custom Mac=gapi,python2,python3,java

buildworker:Custom=linux-1
# disabled due high memory usage: test_opencl:Custom=ON
Xtest_opencl:Custom=OFF
Xtest_bigdata:Custom=1
Xtest_filter:Custom=*

CPU_BASELINE:Custom Win=AVX512_SKX
CPU_BASELINE:Custom=SSE4_2

@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 3 times, most recently from c83d0f9 to 7708f35 Compare September 7, 2021 09:17
Contributor

@mpashchenkov mpashchenkov left a comment


I haven't figured out the rest yet.

Resolved review threads:
modules/core/include/opencv2/core/hal/intrin_avx512.hpp
modules/gapi/include/opencv2/gapi/core.hpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
@smirnov-alexey
Contributor

The rest looks good to me

Resolved review threads:
modules/gapi/perf/common/gapi_core_perf_tests_inl.hpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
}
}

static void resetScratch(cv::gapi::fluid::Buffer& /*scratch*/)
Contributor


Do we need that at all, if it has an empty implementation anyway?

@anna-khakimova anna-khakimova marked this pull request as draft September 16, 2021 06:28
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 3 times, most recently from 8c57290 to 56cd203 Compare September 23, 2021 23:51
@anna-khakimova anna-khakimova changed the title from "Add SIMD for fluid::Resize 8U3C" to "Add SIMD for fluid::Resize 8UC3" Sep 28, 2021
@anna-khakimova anna-khakimova marked this pull request as ready for review September 28, 2021 11:25
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 2 times, most recently from f66aaa1 to 016896b Compare September 28, 2021 13:29
@asenyaev
Contributor

jenkins cn please retry a build

@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 4 times, most recently from 3c67cd4 to 761397e Compare October 1, 2021 11:54
Contributor

@rgarnov rgarnov left a comment


There are some CI failures; I believe it's because of Canny invocations on images that were resized by non-bit-exact resizers. Maybe if we change the evaluation criteria to allow some percentage of pixels to differ (1%, for example), this issue will be fixed? I think there are dedicated functions for such image comparison in our test system.
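A tolerant comparison of the kind suggested above can be sketched as follows (a minimal illustration with hypothetical names, not the actual helper from the test system): the check passes if at most a given fraction of pixels differ by more than a per-pixel tolerance.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass if at most max_fraction of the pixels differ by more than tol.
static bool almost_equal(const std::vector<std::uint8_t>& a,
                         const std::vector<std::uint8_t>& b,
                         int tol, double max_fraction) {
    if (a.size() != b.size()) return false;
    std::size_t bad = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        int diff = int(a[i]) - int(b[i]);
        if (diff < 0) diff = -diff;
        if (diff > tol) ++bad;  // count pixels outside the tolerance
    }
    return double(bad) <= max_fraction * double(a.size());
}
```

With `max_fraction = 0.01` this corresponds to the "allow 1% of pixels to differ" criterion mentioned above.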

@@ -2114,6 +2114,51 @@ PERF_TEST_P_(ResizePerfTest, TestPerformance)
SANITY_CHECK_NOTHING();
}

PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)
Contributor


Please rename this test to reflect what it actually tests, or add a verbose comment with a link to the StackOverflow discussion and a brief description.

Member Author


Done.

Member


PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)

Why do we need that performance test case? Why is it not accuracy?
What is "done"?

Member Author


This test case was created to track the performance of the scenario mentioned here:
https://stackoverflow.com/questions/60629331/opencv-gapi-performance-not-good-as-expected
As mentioned in the StackOverflow post, the kernels gathered in this test case have low performance compared to OpenCV, so we need this test to track our performance improvements.
I renamed the test and added an explanatory comment.

Resolved review threads:
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
@rgarnov rgarnov self-requested a review October 1, 2021 16:41
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 2 times, most recently from 641e16c to e99b128 Compare October 10, 2021 21:21
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 3 times, most recently from 660670c to d46d768 Compare October 12, 2021 12:24
Member

@alalek alalek left a comment


Currently none of the launched builders on public CI compile/run the provided code (the baseline is SSE3; SSE4.2 is enabled through dispatching only - which is not the case here).

Resolved review thread:
modules/gapi/perf/common/gapi_core_perf_tests_inl.hpp
@@ -2114,6 +2114,51 @@ PERF_TEST_P_(ResizePerfTest, TestPerformance)
SANITY_CHECK_NOTHING();
}

PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)
Member


PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)

Why do we need that performance test case? Why is it not accuracy?
What is "done"?

Resolved review threads:
modules/gapi/test/internal/gapi_int_recompilation_test.cpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
Comment on lines 56 to 40
// pixel_1 (rgbx)
vec.val = _mm_insert_epi32(vec.val, *reinterpret_cast<const int*>(&src[chanNum * (*index + pos)]), 0);
// pixel_2 (rgbx)
vec.val = _mm_insert_epi32(vec.val, *reinterpret_cast<const int*>(&src[chanNum * (*(index + 1) + pos)]), 1);
Member


*index → index[0]
*(index + 1) → index[1]

Keep code readable.

Member Author

@anna-khakimova anna-khakimova Oct 13, 2021


The index pointer is assumed to be &alpha[x], i.e. a pointer to the x-th element of the alpha array, so I'm afraid it can't be navigated with the index[0] and index[1] construction.
Otherwise, I would love to use [] to navigate through the array here.
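For reference, a minimal check (with hypothetical names) that subscripting works on a pointer into the middle of an array, since in C++ `p[i]` is defined as `*(p + i)` for any pointer:

```cpp
// 'alpha' and 'index' are illustrative stand-ins for the arrays in the PR,
// not the actual variables.
int demo() {
    short alpha[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    int x = 3;
    const short* index = &alpha[x];  // points into the middle of alpha
    // index[0] and *index, index[1] and *(index + 1) are exactly equivalent.
    return (index[0] == *index && index[1] == *(index + 1)) ? 1 : 0;
}
```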

Resolved review threads:
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
const Size& ,
const int )
{
CV_Assert("Unsupported number of channel");
Member


Why do we raise an exception for this case? How is it handled?

Member Author


The fact is that at the moment the Resize kernel in Fluid can process only 3-channel images. In the next PRs I'm going to add 1- and 4-channel versions by adding two more specializations of this template. Processing images with any other number of channels (besides 1, 3 and 4) will not be supported, so when users try to pass an image with another number of channels, we must notify them that this functionality is not supported. I am doing this with CV_Assert. Could you please tell me what you would suggest in this regard?

Member


It is very confusing to emit such errors under conditions like "if SSE42" and "inputWidth > 16".
The supported-parameters check should be done at the upper level; optimization code paths should not do that.

Member Author

@anna-khakimova anna-khakimova Oct 13, 2021


Yes, I agree.
The main check for the number of channels is also done two levels up - here.
So this CV_Assert in the current file (gfluidcore_simd_sse42.hpp) is just a formality. It would have been possible to leave this method empty, but I decided to add an assert just in case.

Member


method empty

An empty implementation is the wrong way.
It should be left undefined (to get an error from the linker),
or use static_assert() (to get an error message from the compiler).
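The static_assert() approach can be sketched like this (names are hypothetical, not the PR's actual code): the template refuses to compile for unsupported channel counts, so misuse is caught at build time instead of at run time.

```cpp
// Hypothetical kernel wrapper: only 1-, 3- and 4-channel variants are
// meaningful, so any other instantiation fails at compile time.
template <int chanNum>
struct ResizeRow {
    static_assert(chanNum == 1 || chanNum == 3 || chanNum == 4,
                  "Unsupported number of channels");
    // Placeholder for the real row-processing kernel.
    static int run() { return chanNum; }
};
```

For example, `ResizeRow<3>::run()` compiles and runs, while `ResizeRow<2>::run()` produces the "Unsupported number of channels" diagnostic from the compiler.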

Member Author


Applied.

Comment on lines 22 to 23
#if CV_SSE4_1
#include "gfluidcore_simd_sse42.hpp"
Member


SSE4_1
sse42

Why do we need this mess?

@dmatveev Which ISA is the real target?

Member Author

@anna-khakimova anna-khakimova Oct 13, 2021


In fact, the current SIMD implementation of Resize is limited to the SSE2, SSE3 and SSE4.1 instruction sets, so I enable it when these sets are present in the system and in the cmake configuration. The real target is SIMD128.
In the future, SIMD implementations of other kernels using SSE4.2 instructions may be added to this file.
So it is not so important whether the file is named SSE41 or SSE42.

Member

@alalek alalek Oct 14, 2021


Anyway, we should not have such a mess.

The sse42 filename suggests that we (or future contributors) can place SSE4.2-specific code there. But really we can't, because this file is used under an sse41 guard.

Select and use one consistent value.

Member Author


Ok. Done.

Member Author


Applied.

Member

@alalek alalek left a comment


Perhaps it is better:

  • fix fluid comparing with OpenCV without any optimizations (no threading, no SIMD)
  • after that try to optimize Fluid with SIMD

Currently it is not clear what is really done here, and which "tunes" have an effect.

There are several code parts where SIMD is added through scalar processing (filling registers with "insert" ops) - this may not work at all, or may even give worse results.

@@ -2949,109 +2953,295 @@ GAPI_FLUID_KERNEL(GFluidPhase, cv::gapi::core::GPhase, false)
}
};

GAPI_FLUID_KERNEL(GFluidResize, cv::gapi::core::GResize, true)
Member

@alalek alalek Oct 14, 2021


Performance results are questionable.

To properly measure related code we need:

  • disabled threading (fluid is 1 threaded for now)
  • disable IPP (to compare with existed OpenCV SIMD)
  • force SSE4.2 during the build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCPU_BASELINE=SSE4_2 -DCPU_DISPATCH= ...
$ ninja opencv_perf_gapi && OPENCV_IPP=disabled ./bin/opencv_perf_gapi --perf_threads=1 --perf_min_samples=100 --perf_force_samples=200 --gtest_filter=*esize*:*Bottleneck*:*StackOverflow* --gtest_output=xml:../perf/pr20664.xml

Results with i5-6600 are below.

We still have slow test here:

python3 <opencv>/modules/ts/misc/summary.py -f 'Bottle' -m median ../perf/pr20664.xml
...
TestPerformance::BottlenecksPerfTestCPU/BottlenecksPerfTest::(compare_f, { gapi.kernel_package })                        0.946
TestPerformance::BottlenecksPerfTestFluid/BottlenecksPerfTest::(compare_f, { gapi.kernel_package, gapi.kernel_package }) 1.005

All cases from here:

python3 <opencv>/modules/ts/misc/summary.py -f 'StackOverflowPerfTest' -m median ../perf/pr20664.xml
...
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 128x128, { gapi.kernel_package })                          0.064
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 640x480, { gapi.kernel_package })                          0.709
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 1280x720, { gapi.kernel_package })                         2.012
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 1920x1080, { gapi.kernel_package })                        4.580
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 128x128, { gapi.kernel_package, gapi.kernel_package })   0.071
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 640x480, { gapi.kernel_package, gapi.kernel_package })   0.758
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 1280x720, { gapi.kernel_package, gapi.kernel_package })  2.130
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 1920x1080, { gapi.kernel_package, gapi.kernel_package }) 4.822

And here comparing "CPU" and "Fluid" branches:

$ python3 <opencv>/modules/ts/misc/summary.py -f '8UC3.*1080' -m median ../perf/pr20664.xml -o markdown
Name of Test 10
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 0.048
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 0.175
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 0.130
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.622
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 0.078
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 0.385
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 0.314
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.728
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 2.199
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 2.650
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 2.643
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.727
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 0.137
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 0.666
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 0.190
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.947
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 0, 1920x1080, 32x32, { gapi.kernel_package }) 0.016
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 0, 1920x1080, 64x64, { gapi.kernel_package }) 0.027
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 32x32, { gapi.kernel_package }) 0.027
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 64x64, { gapi.kernel_package }) 0.036
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 3, 1920x1080, 32x32, { gapi.kernel_package }) 5.146
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 3, 1920x1080, 64x64, { gapi.kernel_package }) 5.221
TestPerformance::ResizePerfTestFluid/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 30x30, { gapi.kernel_package }) 0.049
TestPerformance::ResizePerfTestFluid/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 64x64, { gapi.kernel_package }) 0.083

There is some exception for "1920x1080, 0.5, 0.1" case.

Member Author

@anna-khakimova anna-khakimova Oct 14, 2021


Ok, thank you for the performance report.
Could you please tell me whether you ran these test cases for Fluid without my SIMD optimization? Your testing results are not indicative without results for the scalar version of Fluid's Resize 8UC3.

Member


I see a SIMD speedup, but I also see a significant degradation which comes from the Fluid kernel itself.
In summary, after this hard work we still don't reach the baseline performance (no IPP, single-threaded OpenCV code).
For the same result we could just exclude the Resize kernel from fluid kernels().

In code optimization we need to minimize algorithm complexity and overhead of processing tasks before trying to perform vectorization or other limited micro-optimizations.

Resize operation is usually processed as some kind of separable filter:

  • horizontal pass
  • vertical pass

Currently, I don't see such separation in G-API implementation (I expecting to see GResizeHPass / GResizeVPass kernels in the pipeline).
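The separable scheme described above can be sketched in scalar form (a simplified single-channel float illustration, not the PR's code): the vertical pass blends two neighbouring source rows into a temporary row, and the horizontal pass then blends neighbouring columns of that temporary row.

```cpp
#include <cstddef>
#include <vector>

// Vertical pass: tmp[x] = row0[x]*(1-beta) + row1[x]*beta
static std::vector<float> vertical_pass(const std::vector<float>& row0,
                                        const std::vector<float>& row1,
                                        float beta) {
    std::vector<float> tmp(row0.size());
    for (std::size_t x = 0; x < row0.size(); ++x)
        tmp[x] = row0[x] * (1.0f - beta) + row1[x] * beta;
    return tmp;
}

// Horizontal pass: out[i] = tmp[sx[i]]*(1-alpha[i]) + tmp[sx[i]+1]*alpha[i],
// where sx[i] is the left source column for output column i.
static std::vector<float> horizontal_pass(const std::vector<float>& tmp,
                                          const std::vector<int>& sx,
                                          const std::vector<float>& alpha) {
    std::vector<float> out(sx.size());
    for (std::size_t i = 0; i < sx.size(); ++i)
        out[i] = tmp[sx[i]] * (1.0f - alpha[i]) + tmp[sx[i] + 1] * alpha[i];
    return out;
}
```

Running both passes per output row is what allows the intermediate `tmp` to stay in a small scratch buffer instead of a full intermediate image.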

Member Author

@anna-khakimova anna-khakimova Oct 18, 2021


I see SIMD speedup, but I see significant degradation which comes from Fluid kernel itself.
In the summary of this hard work we don't reach the baseline performance (no IPP, single threaded OpenCV code).
For the same result we can just exclude Resize kernel from fluid kernels().

Fluid kernels taken individually can often run slower than OpenCV's algorithms; we can observe their true effectiveness only in a pipeline.
So testing results for a standalone kernel are not indicative.

Member Author

@anna-khakimova anna-khakimova Oct 18, 2021


Resize operation is usually processed as some kind of separable filter:

  • horizontal pass
  • vertical pass

Currently, I don't see such separation in G-API implementation (I expecting to see GResizeHPass / GResizeVPass kernels in the pipeline).

Yes, you are right. And accordingly, the algorithm in this PR is split into vertical and horizontal passes. Please take a look here and here.

Member


But the Blur kernel hasn't been sped up yet.

So we need other performance tests with Fluid-compatible pipelines after/before resize (replacing the problematic blur/canny with elementwise add/sub operations) to see the clear benefit of these optimizations of the resize stage.

Member Author

@anna-khakimova anna-khakimova Nov 7, 2021


I've added a new perf test, called ResizeInSimpleGraph*. An updated perf report will be added a little later.

Contributor


Hi all,

Not sure what kind of input is expected from my end here, but w.r.t. this:

For the same result we can just exclude Resize kernel from fluid kernels().

The key thing about using fluid kernels is not their individual performance but pipeline performance. Individual performance may be comparable to or even poorer than a single cv:: or IPP call, but when it comes to a pipeline the situation (usually) changes. E.g. (taken from https://pullrequest.opencv.org/buildbot/builders/precommit_linux64/builds/34060/steps/perf_gapi/logs/stdio):

[----------] 15 tests from Benchmark/SobelEdgeDetector
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/0, where GetParam() = 320x240
[ PERFSTAT ]    (samples=1   mean=0.61   median=0.61   min=0.61   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/0 (4 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/1, where GetParam() = 640x480
[ PERFSTAT ]    (samples=1   mean=2.04   median=2.04   min=2.04   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/1 (10 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/2, where GetParam() = 1280x720
[ PERFSTAT ]    (samples=1   mean=5.65   median=5.65   min=5.65   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/2 (25 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/3, where GetParam() = 1920x1080
[ PERFSTAT ]    (samples=1   mean=12.70   median=12.70   min=12.70   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/3 (55 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/4, where GetParam() = 3840x2170
[ PERFSTAT ]    (samples=1   mean=51.19   median=51.19   min=51.19   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/4 (237 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/0, where GetParam() = 320x240
[ PERFSTAT ]    (samples=1   mean=0.50   median=0.50   min=0.50   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/0 (2 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/1, where GetParam() = 640x480
[ PERFSTAT ]    (samples=1   mean=3.82   median=3.82   min=3.82   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/1 (15 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/2, where GetParam() = 1280x720
[ PERFSTAT ]    (samples=1   mean=12.31   median=12.31   min=12.31   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/2 (46 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/3, where GetParam() = 1920x1080
[ PERFSTAT ]    (samples=1   mean=30.73   median=30.73   min=30.73   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/3 (116 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/4, where GetParam() = 3840x2170
[ PERFSTAT ]    (samples=1   mean=153.73   median=153.73   min=153.73   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/4 (525 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/0, where GetParam() = 320x240
[ PERFSTAT ]    (samples=1   mean=0.52   median=0.52   min=0.52   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/0 (3 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/1, where GetParam() = 640x480
[ PERFSTAT ]    (samples=1   mean=3.02   median=3.02   min=3.02   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/1 (14 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/2, where GetParam() = 1280x720
[ PERFSTAT ]    (samples=1   mean=9.48   median=9.48   min=9.48   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/2 (46 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/3, where GetParam() = 1920x1080
[ PERFSTAT ]    (samples=1   mean=21.38   median=21.38   min=21.38   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/3 (116 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/4, where GetParam() = 3840x2170
[ PERFSTAT ]    (samples=1   mean=87.20   median=87.20   min=87.20   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/4 (452 ms)
[----------] 15 tests from Benchmark/SobelEdgeDetector (1666 ms total)

You may note how the performance differs based on the image resolution.

Contributor


Regarding splitting the single resize into two kernels based on its separable property: that's an interesting idea, but it won't give much - in fact, these two passes already run via the scratch buffer (allocated for the kernel), and with such a change this buffer would move to the graph side, while the scheduling & execution overhead may grow.

Member


BTW, don't use perf numbers from precommit/nightly builders. Those perf tests run in sanity-check mode (samples=1) with several concurrent processes/threads from other tests, so the results are neither reliable nor reproducible.

@anna-khakimova
Member Author

anna-khakimova commented Oct 14, 2021

There are several code parts where SIMD is added through scalar processing (filling registers with "insert" ops) - this may not work at all, or may even give worse results.

Unfortunately, the latencies of _mm_insert_epi64() and _mm_insert_epi32() aren't described in the official documentation.
However, in my experience and that of other optimization engineers, these instructions work well. For example, _mm_insert_epi64() works even a little faster than its load-instruction analog _mm_loadl_epi64() and other load instructions.
Could you please provide examples where _mm_insert_epi64() doesn't work, or works slower than other ways of loading data into a vector?

@anna-khakimova
Member Author

anna-khakimova commented Oct 14, 2021

Currently it is not clear what is really done here, and which "tunes" have an effect.

You would see a 2x speed-up for Resize 8UC3 in the target cases (2x downscaling) if you compared the test results before and after my SIMD optimization.

{
for (; x <= outSz.width - nlanes; x += nlanes)
{
#ifdef _WIN64
Member


Why do we have that in the middle of SIMD code?
The code size can be reduced 2x, by up to 30 lines.

In any case, a compiler check must be applied here, not a target-platform check.

Member Author


Because I didn't see another (more elegant) way. Could you please share a better approach if you know of one?

Member Author


Code size can be reduced 2x, up to 30 lines.

Could you please suggest how I can do this?

Member

@alalek alalek Oct 16, 2021


Create a small static inline function with #ifdef _MSC_VER branches inside.
But it is better to use a SIMD wide load + shuffle instead of scalar loading through int64. Or even read alpha "as is" (don't use 4 clones) and clone the values with SIMD instructions.

BTW, be careful: MSVS2015/2017 may cause some bugs (#20894)
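The suggested helper could look roughly like this (a sketch with a hypothetical name, assuming the missing 64-bit scalar-to-XMM intrinsics on 32-bit MSVC are the reason for the original #ifdef):

```cpp
#include <emmintrin.h>  // SSE2 only
#include <cstdint>
#include <cstring>

// Hypothetical helper: hides the compiler-specific way of moving a 64-bit
// scalar into the low half of an XMM register, so the hot loop contains a
// single call instead of duplicated #ifdef blocks.
static inline __m128i load_i64(const void* p) {
    std::int64_t v;
    std::memcpy(&v, p, sizeof(v));  // safe unaligned read
#ifdef _MSC_VER
    // 32-bit MSVC lacks _mm_cvtsi64_si128; _mm_loadl_epi64 works everywhere.
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&v));
#else
    return _mm_cvtsi64_si128(v);    // upper 64 bits are zeroed
#endif
}
```

Both branches produce the same register contents, so the surrounding SIMD code stays identical across compilers.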

cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);

cv::GMat in;
cv::GMat vga = cv::gapi::resize(in, cv::Size(), 0.5, 0.5, 1);
Member


1

Make the code readable, avoid magic constants (here, cv::INTER_LINEAR instead of 1).

Member Author


Ok. Changed.

cv::resize(in_mat1, cvvga, cv::Size(), 0.5, 0.5);
cv::cvtColor(cvvga, cvgray, cv::COLOR_BGR2GRAY);
cv::blur(cvgray, cvblurred, cv::Size(3, 3));
cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);
Member


Why do we need two similar tests?

Compare with code from the test above:

    cv::resize(in_mat1, cvvga, cv::Size(), 0.5, 0.5);
    cv::cvtColor(cvvga, cvgray, cv::COLOR_BGR2GRAY);
    cv::blur(cvgray, cvblurred, cv::Size(3, 3));
    cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);

Member Author


Reworked.

__m128i a11 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a22 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]));
__m128i a23 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a33 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
Member


clone[4 * (x + 3)]

In general, you read each value 3 times.
Compilers are smart enough to do the CSE job for you, but the code looks bad (not in a readable form).


clone

A meaningless name.
You probably don't need 4 clones of alpha.


These "clone" values are stored in linear block:

  • clone[4 * x]
  • clone[4 * (x + 1)]
  • ...
  • clone[4 * (x + 15)]

No need to read them one by one.


You are using scalar reads into SIMD registers here. This is an anti-pattern:
the number of load-store units per CPU core is limited, and
it is much slower than a wide read into a SIMD register followed by a shuffle.

Comment on lines 307 to 330
__m128i a00 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * x]), *reinterpret_cast<const int64_t*>(&clone[4 * x]));
__m128i a01 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * x]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a11 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a22 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]));
__m128i a23 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a33 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a44 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]));
__m128i a45 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]));
__m128i a55 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]));
__m128i a66 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]));
__m128i a67 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]));
__m128i a77 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]));
__m128i a88 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]));
__m128i a89 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]));
__m128i a99 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]));
__m128i a1010 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]));
__m128i a1011 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]));
__m128i a1111 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]));
__m128i a1212 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]));
__m128i a1213 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]));
__m128i a1313 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]));
__m128i a1414 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]));
__m128i a1415 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]));
__m128i a1515 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]));
Member

Compare:

  • the number of declared local variables
  • the number of available CPU SSE registers (16 XMM registers in 64-bit mode)

How can this code be efficient?
Hopefully the compiler is smart enough.

modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp (resolved)
__m128i val1hi = _mm_insert_epi64(_mm_loadl_epi64(reinterpret_cast<const __m128i*>(&src1[2][w])),
*reinterpret_cast<const int64_t*>(&src1[3][w]), 1);
#endif
__m128i val0_0 = _mm_cvtepu8_epi16(val0lo);
Member

It's more a matter of taste

The limited number of load-store units (LSUs) doesn't agree with you.


However, if I apply this advice I will have to re-run all the perf tests and gather all the statistics again to check that performance hasn't decreased

This should take up to 10 minutes, including report analysis, if you use the right tools (OpenCV has them!).


#include "opencv2/gapi/own/saturate.hpp"

#include <nmmintrin.h>
Member

This is SSE4.2 header: https://stackoverflow.com/questions/11228855/header-files-for-x86-simd-intrinsics

(to the topic of SSE4.2/4.1 mess)

Member Author

Applied.

@anna-khakimova anna-khakimova changed the title Add SIMD for fluid::Resize 8UC3 Fluid: SIMD for Resize Linear 8UC3 Nov 7, 2021
Member

@alalek alalek left a comment

Thank you for update!
No more comments from my side.

Awaiting approval from G-API team member before merge.

@dmatveev
Contributor

dmatveev commented Nov 9, 2021

Thank you for update!
No more comments from my side.
Awaiting approval from G-API team member before merge.

@alalek thanks for your thorough review here! Actually, we discussed the approach with Anna yesterday and agreed to explore the "OpenCV way" of packing vectors, to see if it helps reduce load-store pressure and maybe simplify the code while keeping performance at the same or a better level. Anna will follow up on this. :)

Contributor

@dmatveev dmatveev left a comment

Decision: please merge it as-is, so we have a stable baseline now for future improvement (so as not to mix things up, and not to roll back if the idea doesn't work out)

@alalek alalek merged commit 3cfca01 into opencv:4.x Nov 10, 2021
@alalek alalek mentioned this pull request Dec 30, 2021
@alalek alalek mentioned this pull request Feb 22, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
Fluid: SIMD for Resize Linear 8UC3

* SIMD for fluid::Resize 8U3C

* Rework horizontal pass + add 8U4C case

* Reproduce stackoverflow test

* StackOverflow test

* SSE42 impl

* SSE42 impl improvement

* GAPI:SSE42 simd opt for Resize 8UC3. Final version

* Fix tests

* Conditional compilation fix

* Applied comments

* Applied comments. Step2

* Applied comments. Step2
10 participants