
Fluid: SIMD for Resize Linear 8UC3 #20664

Merged · 12 commits · Nov 10, 2021
Conversation

anna-khakimova
Member

@anna-khakimova anna-khakimova commented Sep 6, 2021

SIMD optimization for fluid::core::Resize 8UC3 bilinear.

@smirnov-alexey @TolyaTalamanov @mpashchenkov @rgarnov please take a look.

Report:
Resize_ocv_sse42.xlsx

force_builders=Linux AVX2,Custom,Custom Win,Custom Mac
build_gapi_standalone:Linux x64=ade-0.1.1f
build_gapi_standalone:Win64=ade-0.1.1f
Xbuild_gapi_standalone:Mac=ade-0.1.1f
build_gapi_standalone:Linux x64 Debug=ade-0.1.1f

build_image:Custom=centos:7
buildworker:Custom=linux-1
build_gapi_standalone:Custom=ade-0.1.1f

Xbuild_image:Custom=ubuntu-openvino-2021.3.0:20.04
build_image:Custom Win=openvino-2021.4.1
build_image:Custom Mac=openvino-2021.2.0

buildworker:Custom Win=windows-3

test_modules:Custom=gapi,python2,python3,java
test_modules:Custom Win=gapi,python2,python3,java
test_modules:Custom Mac=gapi,python2,python3,java

buildworker:Custom=linux-1
# disabled due high memory usage: test_opencl:Custom=ON
Xtest_opencl:Custom=OFF
Xtest_bigdata:Custom=1
Xtest_filter:Custom=*

CPU_BASELINE:Custom Win=AVX512_SKX
CPU_BASELINE:Custom=SSE4_2

@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 3 times, most recently from c83d0f9 to 7708f35 Compare September 7, 2021 09:17
Contributor

@mpashchenkov mpashchenkov left a comment


I haven't figured out the rest yet.

Resolved review threads:
modules/core/include/opencv2/core/hal/intrin_avx512.hpp
modules/gapi/include/opencv2/gapi/core.hpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
@smirnov-alexey
Contributor

The rest looks good to me

Resolved review threads:
modules/gapi/perf/common/gapi_core_perf_tests_inl.hpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
}
}

static void resetScratch(cv::gapi::fluid::Buffer& /*scratch*/)
Contributor


Do we need that at all, if it has an empty implementation anyway?

@anna-khakimova anna-khakimova marked this pull request as draft September 16, 2021 06:28
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 3 times, most recently from 8c57290 to 56cd203 Compare September 23, 2021 23:51
@anna-khakimova anna-khakimova changed the title from "Add SIMD for fluid::Resize 8U3C" to "Add SIMD for fluid::Resize 8UC3" Sep 28, 2021
@anna-khakimova anna-khakimova marked this pull request as ready for review September 28, 2021 11:25
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 2 times, most recently from f66aaa1 to 016896b Compare September 28, 2021 13:29
@asenyaev
Contributor

jenkins cn please retry a build

@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 4 times, most recently from 3c67cd4 to 761397e Compare October 1, 2021 11:54
Contributor

@rgarnov rgarnov left a comment


There are some CI failures; I believe it's because of Canny invocations on images that were resized by non-bit-exact resizers. Maybe if we change the evaluation criteria to allow some percentage of pixels to differ (1%, for example), this issue will be fixed? I think there are dedicated functions for such image comparison in our test system.
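A tolerant comparison of the kind suggested above can be sketched as follows (a minimal illustration with hypothetical names, not the actual helper from the test system): the check passes if at most a given fraction of pixels differ by more than a per-pixel tolerance.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass if at most max_fraction of the pixels differ by more than tol.
static bool almost_equal(const std::vector<std::uint8_t>& a,
                         const std::vector<std::uint8_t>& b,
                         int tol, double max_fraction) {
    if (a.size() != b.size()) return false;
    std::size_t bad = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        int diff = int(a[i]) - int(b[i]);
        if (diff < 0) diff = -diff;
        if (diff > tol) ++bad;  // count pixels outside the tolerance
    }
    return double(bad) <= max_fraction * double(a.size());
}
```

With `max_fraction = 0.01` this corresponds to the "allow 1% of pixels to differ" criterion mentioned above.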

@@ -2114,6 +2114,51 @@ PERF_TEST_P_(ResizePerfTest, TestPerformance)
SANITY_CHECK_NOTHING();
}

PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)
Contributor


Please rename this test to reflect what it actually tests, or add a verbose comment with a link to the StackOverflow discussion and a brief description.

Member Author


Done.

Member


PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)

Why do we need that performance test case? Why is it not accuracy?
What is "done"?

Member Author


This test case was created to track the performance of the scenario mentioned here:
https://stackoverflow.com/questions/60629331/opencv-gapi-performance-not-good-as-expected
As mentioned in the StackOverflow post, the kernels gathered in this test case have low performance compared to OpenCV, so we need this test to track our performance improvements.
I renamed the test and added an explanatory comment.

Resolved review threads:
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
@rgarnov rgarnov self-requested a review October 1, 2021 16:41
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 2 times, most recently from 641e16c to e99b128 Compare October 10, 2021 21:21
@anna-khakimova anna-khakimova force-pushed the ak/resize_simd branch 3 times, most recently from 660670c to d46d768 Compare October 12, 2021 12:24
Member

@alalek alalek left a comment


Currently none of the launched builders on public CI compile/run the provided code (the baseline is SSE3; SSE4.2 is enabled through dispatching only - which is not the case here).

Resolved review thread:
modules/gapi/perf/common/gapi_core_perf_tests_inl.hpp
@@ -2114,6 +2114,51 @@ PERF_TEST_P_(ResizePerfTest, TestPerformance)
SANITY_CHECK_NOTHING();
}

PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)
Member


PERF_TEST_P_(StackOverflowPerfTest, TestPerformance)

Why do we need that performance test case? Why is it not accuracy?
What is "done"?

Resolved review threads:
modules/gapi/test/internal/gapi_int_recompilation_test.cpp
modules/gapi/src/backends/fluid/gfluidcore.cpp
Comment on lines 56 to 40
// pixel_1 (rgbx)
vec.val = _mm_insert_epi32(vec.val, *reinterpret_cast<const int*>(&src[chanNum * (*index + pos)]), 0);
// pixel_2 (rgbx)
vec.val = _mm_insert_epi32(vec.val, *reinterpret_cast<const int*>(&src[chanNum * (*(index + 1) + pos)]), 1);
Member


*index → index[0]
*(index + 1) → index[1]

Keep code readable.

Member Author

@anna-khakimova anna-khakimova Oct 13, 2021


The index pointer is assumed to be &alpha[x], i.e. a pointer to the x-th element of the alpha array, so I'm afraid it can't be navigated with the index[0] and index[1] construction.
Otherwise, I would love to use [] to navigate through the array here.
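For reference, a minimal check (with hypothetical names) that subscripting works on a pointer into the middle of an array, since in C++ `p[i]` is defined as `*(p + i)` for any pointer:

```cpp
// 'alpha' and 'index' are illustrative stand-ins for the arrays in the PR,
// not the actual variables.
int demo() {
    short alpha[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    int x = 3;
    const short* index = &alpha[x];  // points into the middle of alpha
    // index[0] and *index, index[1] and *(index + 1) are exactly equivalent.
    return (index[0] == *index && index[1] == *(index + 1)) ? 1 : 0;
}
```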

Resolved review threads:
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp
const Size& ,
const int )
{
CV_Assert("Unsupported number of channel");
Member


Why do we raise an exception for this case? How is it handled?

Member Author


The fact is that at the moment the Resize kernel in Fluid can process only 3-channel images. In the next PRs I'm going to add 1- and 4-channel versions by adding two more specializations of this template. Processing images with any other number of channels (besides 1, 3 and 4) will not be supported, so when users try to pass an image with another number of channels, we must notify them that this functionality is not supported. I am doing this with CV_Assert. Could you please tell me what you would suggest in this regard?

Member


It is very confusing to emit such errors under conditions like "if SSE42" and "inputWidth > 16".
The supported-parameters check should be done at the upper level; optimization code paths should not do that.

Member Author

@anna-khakimova anna-khakimova Oct 13, 2021


Yes, I agree.
The main check for the number of channels is also done two levels up - here.
So this CV_Assert in the current file (gfluidcore_simd_sse42.hpp) is just a formality. It would have been possible to leave this method empty, but I decided to add an assert just in case.

Member


method empty

An empty implementation is the wrong way.
It should be left undefined (to get an error from the linker),
or use static_assert() (to get an error message from the compiler).
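The static_assert() approach can be sketched like this (names are hypothetical, not the PR's actual code): the template refuses to compile for unsupported channel counts, so misuse is caught at build time instead of at run time.

```cpp
// Hypothetical kernel wrapper: only 1-, 3- and 4-channel variants are
// meaningful, so any other instantiation fails at compile time.
template <int chanNum>
struct ResizeRow {
    static_assert(chanNum == 1 || chanNum == 3 || chanNum == 4,
                  "Unsupported number of channels");
    // Placeholder for the real row-processing kernel.
    static int run() { return chanNum; }
};
```

For example, `ResizeRow<3>::run()` compiles and runs, while `ResizeRow<2>::run()` produces the "Unsupported number of channels" diagnostic from the compiler.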

Member Author


Applied.

Comment on lines 22 to 23
#if CV_SSE4_1
#include "gfluidcore_simd_sse42.hpp"
Member


SSE4_1
sse42

Why do we need this mess?

@dmatveev Which ISA is the real target?

Member Author

@anna-khakimova anna-khakimova Oct 13, 2021


In fact, the current SIMD implementation of Resize is limited to the SSE2, SSE3 and SSE4.1 instruction sets, so I enable it when these sets are present in the system and in the cmake configuration. The real target is SIMD128.
In the future, SIMD implementations of other kernels using SSE4.2 instructions may be added to this file.
So it is not so important whether the file is named SSE41 or SSE42.

Member

@alalek alalek Oct 14, 2021


Anyway, we should not have such a mess.

The sse42 filename suggests that we (or future contributors) can place SSE4.2-specific code there. But really we can't, because this file is used under an sse41 guard.

Select and use one consistent value.

Member Author


Ok. Done.

Member Author


Applied.

Member

@alalek alalek left a comment


Perhaps it is better:

  • fix fluid comparing with OpenCV without any optimizations (no threading, no SIMD)
  • after that try to optimize Fluid with SIMD

Currently it is not clear what is really done here, and which "tunes" have an effect.

There are several code parts where SIMD is added through scalar processing (filling registers with "insert" ops) - this may not work at all, or may even give worse results.

@@ -2949,109 +2953,295 @@ GAPI_FLUID_KERNEL(GFluidPhase, cv::gapi::core::GPhase, false)
}
};

GAPI_FLUID_KERNEL(GFluidResize, cv::gapi::core::GResize, true)
Member

@alalek alalek Oct 14, 2021


Performance results are questionable.

To properly measure related code we need:

  • disabled threading (fluid is 1 threaded for now)
  • disable IPP (to compare with existed OpenCV SIMD)
  • force SSE4.2 during the build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCPU_BASELINE=SSE4_2 -DCPU_DISPATCH= ...
$ ninja opencv_perf_gapi && OPENCV_IPP=disabled ./bin/opencv_perf_gapi --perf_threads=1 --perf_min_samples=100 --perf_force_samples=200 --gtest_filter=*esize*:*Bottleneck*:*StackOverflow* --gtest_output=xml:../perf/pr20664.xml

Results with i5-6600 are below.

We still have slow test here:

python3 <opencv>/modules/ts/misc/summary.py -f 'Bottle' -m median ../perf/pr20664.xml
...
TestPerformance::BottlenecksPerfTestCPU/BottlenecksPerfTest::(compare_f, { gapi.kernel_package })                        0.946
TestPerformance::BottlenecksPerfTestFluid/BottlenecksPerfTest::(compare_f, { gapi.kernel_package, gapi.kernel_package }) 1.005

All cases from here:

python3 <opencv>/modules/ts/misc/summary.py -f 'StackOverflowPerfTest' -m median ../perf/pr20664.xml
...
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 128x128, { gapi.kernel_package })                          0.064
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 640x480, { gapi.kernel_package })                          0.709
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 1280x720, { gapi.kernel_package })                         2.012
TestPerformance::StackOverflowPerfTestCPU/StackOverflowPerfTest::(compare_f, 8UC3, 1920x1080, { gapi.kernel_package })                        4.580
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 128x128, { gapi.kernel_package, gapi.kernel_package })   0.071
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 640x480, { gapi.kernel_package, gapi.kernel_package })   0.758
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 1280x720, { gapi.kernel_package, gapi.kernel_package })  2.130
TestPerformance::StackOverflowPerfTestFluid/StackOverflowPerfTest::(compare_f, 8UC3, 1920x1080, { gapi.kernel_package, gapi.kernel_package }) 4.822

And here comparing "CPU" and "Fluid" branches:

$ python3 <opencv>/modules/ts/misc/summary.py -f '8UC3.*1080' -m median ../perf/pr20664.xml -o markdown
Name of Test 10
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 0.048
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 0.175
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 0.130
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 0, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.622
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 0.078
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 0.385
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 0.314
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.728
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 2.199
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 2.650
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 2.643
TestPerformance::ResizeFxFyPerfTestCPU/ResizeFxFyPerfTest::(compare_f, 8UC3, 3, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.727
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.1, { gapi.kernel_package }) 0.137
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.1, 0.5, { gapi.kernel_package }) 0.666
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.1, { gapi.kernel_package }) 0.190
TestPerformance::ResizeFxFyPerfTestFluid/ResizeFxFyPerfTest::(compare_f, 8UC3, 1, 1920x1080, 0.5, 0.5, { gapi.kernel_package }) 0.947
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 0, 1920x1080, 32x32, { gapi.kernel_package }) 0.016
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 0, 1920x1080, 64x64, { gapi.kernel_package }) 0.027
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 32x32, { gapi.kernel_package }) 0.027
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 64x64, { gapi.kernel_package }) 0.036
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 3, 1920x1080, 32x32, { gapi.kernel_package }) 5.146
TestPerformance::ResizePerfTestCPU/ResizePerfTest::(compare_f, 8UC3, 3, 1920x1080, 64x64, { gapi.kernel_package }) 5.221
TestPerformance::ResizePerfTestFluid/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 30x30, { gapi.kernel_package }) 0.049
TestPerformance::ResizePerfTestFluid/ResizePerfTest::(compare_f, 8UC3, 1, 1920x1080, 64x64, { gapi.kernel_package }) 0.083

There is some exception for "1920x1080, 0.5, 0.1" case.

Member Author

@anna-khakimova anna-khakimova Oct 14, 2021


Ok, thank you for the performance report.
Could you please tell me whether you ran these test cases for Fluid without my SIMD optimization? Your testing results are not indicative without results for the scalar version of Fluid's Resize 8UC3.

Member


I see a SIMD speedup, but I also see a significant degradation which comes from the Fluid kernel itself.
In summary, after this hard work we still don't reach the baseline performance (no IPP, single-threaded OpenCV code).
For the same result we could just exclude the Resize kernel from fluid kernels().

In code optimization we need to minimize algorithm complexity and overhead of processing tasks before trying to perform vectorization or other limited micro-optimizations.

Resize operation is usually processed as some kind of separable filter:

  • horizontal pass
  • vertical pass

Currently, I don't see such separation in G-API implementation (I expecting to see GResizeHPass / GResizeVPass kernels in the pipeline).
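The separable scheme described above can be sketched in scalar form (a simplified single-channel float illustration, not the PR's code): the vertical pass blends two neighbouring source rows into a temporary row, and the horizontal pass then blends neighbouring columns of that temporary row.

```cpp
#include <cstddef>
#include <vector>

// Vertical pass: tmp[x] = row0[x]*(1-beta) + row1[x]*beta
static std::vector<float> vertical_pass(const std::vector<float>& row0,
                                        const std::vector<float>& row1,
                                        float beta) {
    std::vector<float> tmp(row0.size());
    for (std::size_t x = 0; x < row0.size(); ++x)
        tmp[x] = row0[x] * (1.0f - beta) + row1[x] * beta;
    return tmp;
}

// Horizontal pass: out[i] = tmp[sx[i]]*(1-alpha[i]) + tmp[sx[i]+1]*alpha[i],
// where sx[i] is the left source column for output column i.
static std::vector<float> horizontal_pass(const std::vector<float>& tmp,
                                          const std::vector<int>& sx,
                                          const std::vector<float>& alpha) {
    std::vector<float> out(sx.size());
    for (std::size_t i = 0; i < sx.size(); ++i)
        out[i] = tmp[sx[i]] * (1.0f - alpha[i]) + tmp[sx[i] + 1] * alpha[i];
    return out;
}
```

Running both passes per output row is what allows the intermediate `tmp` to stay in a small scratch buffer instead of a full intermediate image.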

Member Author

@anna-khakimova anna-khakimova Oct 18, 2021


I see SIMD speedup, but I see significant degradation which comes from Fluid kernel itself.
In the summary of this hard work we don't reach the baseline performance (no IPP, single threaded OpenCV code).
For the same result we can just exclude Resize kernel from fluid kernels().

Fluid kernels taken individually can often run slower than OpenCV's algorithms; we can observe their true effectiveness only in a pipeline.
So testing results for a standalone kernel are not indicative.

Member Author

@anna-khakimova anna-khakimova Oct 18, 2021


Resize operation is usually processed as some kind of separable filter:

  • horizontal pass
  • vertical pass

Currently, I don't see such separation in G-API implementation (I expecting to see GResizeHPass / GResizeVPass kernels in the pipeline).

Yes, you are right. And accordingly, the algorithm in this PR is split into vertical and horizontal passes. Please take a look here and here.

Member


But the Blur kernel hasn't been sped up yet.

So we need other performance tests with Fluid-compatible pipelines after/before resize (replacing the problematic blur/canny with elementwise add/sub operations) to see the clear benefit of these optimizations of the resize stage.

Member Author

@anna-khakimova anna-khakimova Nov 7, 2021


I've added a new perf test, called ResizeInSimpleGraph*. An updated perf report will be added a little later.

Contributor


Hi all,

Not sure what kind of input is expected from my end here, but w.r.t. this:

For the same result we can just exclude Resize kernel from fluid kernels().

The key thing about using fluid kernels is not their individual performance but pipeline performance. Individual performance may be comparable to or even poorer than a single cv:: or IPP call, but when it comes to a pipeline the situation (usually) changes. E.g. (taken from https://pullrequest.opencv.org/buildbot/builders/precommit_linux64/builds/34060/steps/perf_gapi/logs/stdio):

[----------] 15 tests from Benchmark/SobelEdgeDetector
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/0, where GetParam() = 320x240
[ PERFSTAT ]    (samples=1   mean=0.61   median=0.61   min=0.61   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/0 (4 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/1, where GetParam() = 640x480
[ PERFSTAT ]    (samples=1   mean=2.04   median=2.04   min=2.04   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/1 (10 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/2, where GetParam() = 1280x720
[ PERFSTAT ]    (samples=1   mean=5.65   median=5.65   min=5.65   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/2 (25 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/3, where GetParam() = 1920x1080
[ PERFSTAT ]    (samples=1   mean=12.70   median=12.70   min=12.70   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/3 (55 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.Fluid/4, where GetParam() = 3840x2170
[ PERFSTAT ]    (samples=1   mean=51.19   median=51.19   min=51.19   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.Fluid/4 (237 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/0, where GetParam() = 320x240
[ PERFSTAT ]    (samples=1   mean=0.50   median=0.50   min=0.50   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/0 (2 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/1, where GetParam() = 640x480
[ PERFSTAT ]    (samples=1   mean=3.82   median=3.82   min=3.82   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/1 (15 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/2, where GetParam() = 1280x720
[ PERFSTAT ]    (samples=1   mean=12.31   median=12.31   min=12.31   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/2 (46 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/3, where GetParam() = 1920x1080
[ PERFSTAT ]    (samples=1   mean=30.73   median=30.73   min=30.73   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/3 (116 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV/4, where GetParam() = 3840x2170
[ PERFSTAT ]    (samples=1   mean=153.73   median=153.73   min=153.73   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV/4 (525 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/0, where GetParam() = 320x240
[ PERFSTAT ]    (samples=1   mean=0.52   median=0.52   min=0.52   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/0 (3 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/1, where GetParam() = 640x480
[ PERFSTAT ]    (samples=1   mean=3.02   median=3.02   min=3.02   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/1 (14 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/2, where GetParam() = 1280x720
[ PERFSTAT ]    (samples=1   mean=9.48   median=9.48   min=9.48   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/2 (46 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/3, where GetParam() = 1920x1080
[ PERFSTAT ]    (samples=1   mean=21.38   median=21.38   min=21.38   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/3 (116 ms)
[ RUN      ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/4, where GetParam() = 3840x2170
[ PERFSTAT ]    (samples=1   mean=87.20   median=87.20   min=87.20   stddev=0.00 (0.0%))
[       OK ] Benchmark/SobelEdgeDetector.OpenCV_Smarter/4 (452 ms)
[----------] 15 tests from Benchmark/SobelEdgeDetector (1666 ms total)

You may note how the performance differs based on the image resolution.

Contributor


Regarding splitting the single resize into two kernels based on its separable property: that's an interesting idea, but it won't give much - in fact, these two passes already run via the scratch buffer (allocated for the kernel), and with such a change this buffer would move to the graph side, while the scheduling & execution overhead may grow.

Member


BTW, don't use perf numbers from precommit/nightly builders. Those perf tests run in sanity-check mode (samples=1) with several concurrent processes/threads from other tests, so the results are neither reliable nor reproducible.

@anna-khakimova
Member Author

anna-khakimova commented Oct 14, 2021

There are several code parts where SIMD is added through scalar processing (filling registers with "insert" ops) - this may not work at all, or may even give worse results.

Unfortunately, the latencies of _mm_insert_epi64() and _mm_insert_epi32() aren't described in the official documentation.
However, in my experience and that of other optimization engineers, these instructions work well. For example, _mm_insert_epi64() works even a little faster than its load-instruction analog _mm_loadl_epi64() and other load instructions.
Could you please provide examples where _mm_insert_epi64() doesn't work, or works slower than other ways of loading data into a vector?

@anna-khakimova
Member Author

anna-khakimova commented Oct 14, 2021

Currently it is not clear what is really done here, and which "tunes" have an effect.

You would see a 2x speed-up for Resize 8UC3 in the target cases (2x downscaling) if you compared the test results before and after my SIMD optimization.

{
for (; x <= outSz.width - nlanes; x += nlanes)
{
#ifdef _WIN64
Member


Why do we have that in the middle of SIMD code?
The code size can be reduced 2x, by up to 30 lines.

In any case, a compiler check must be applied here, not a target-platform check.

Member Author


Because I didn't see another (more elegant) way. Could you please share a better approach if you know of one?

Member Author


Code size can be reduced 2x, up to 30 lines.

Could you please suggest how I can do this?

Member

@alalek alalek Oct 16, 2021


Create a small static inline function with #ifdef _MSC_VER branches inside.
But it is better to use a SIMD wide load + shuffle instead of scalar loading through int64. Or even read alpha "as is" (don't use 4 clones) and clone the values with SIMD instructions.

BTW, be careful: MSVS2015/2017 may cause some bugs (#20894)
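The suggested helper could look roughly like this (a sketch with a hypothetical name, assuming the missing 64-bit scalar-to-XMM intrinsics on 32-bit MSVC are the reason for the original #ifdef):

```cpp
#include <emmintrin.h>  // SSE2 only
#include <cstdint>
#include <cstring>

// Hypothetical helper: hides the compiler-specific way of moving a 64-bit
// scalar into the low half of an XMM register, so the hot loop contains a
// single call instead of duplicated #ifdef blocks.
static inline __m128i load_i64(const void* p) {
    std::int64_t v;
    std::memcpy(&v, p, sizeof(v));  // safe unaligned read
#ifdef _MSC_VER
    // 32-bit MSVC lacks _mm_cvtsi64_si128; _mm_loadl_epi64 works everywhere.
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&v));
#else
    return _mm_cvtsi64_si128(v);    // upper 64 bits are zeroed
#endif
}
```

Both branches produce the same register contents, so the surrounding SIMD code stays identical across compilers.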

cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);

cv::GMat in;
cv::GMat vga = cv::gapi::resize(in, cv::Size(), 0.5, 0.5, 1);
Member


1

Make the code readable, avoid magic constants (here, cv::INTER_LINEAR instead of 1).

Member Author


Ok. Changed.

cv::resize(in_mat1, cvvga, cv::Size(), 0.5, 0.5);
cv::cvtColor(cvvga, cvgray, cv::COLOR_BGR2GRAY);
cv::blur(cvgray, cvblurred, cv::Size(3, 3));
cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);
Member


Why do we need two similar tests?

Compare with code from the test above:

    cv::resize(in_mat1, cvvga, cv::Size(), 0.5, 0.5);
    cv::cvtColor(cvvga, cvgray, cv::COLOR_BGR2GRAY);
    cv::blur(cvgray, cvblurred, cv::Size(3, 3));
    cv::Canny(cvblurred, out_mat_ocv, 32, 128, 3);

Member Author


Reworked.

__m128i a11 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a22 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]));
__m128i a23 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a33 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
Member


clone[4 * (x + 3)]

In general, you read each value 3 times.
Compilers are smart enough to do the CSE job for you, but the code looks bad (not in a readable form).


clone

A meaningless name.
You probably don't need 4 clones of alpha.


These "clone" values are stored in linear block:

  • clone[4 * x]
  • clone[4 * (x + 1)]
  • ...
  • clone[4 * (x + 15)]

No need to read them one by one.


You are using scalar reads into SIMD registers here. This is an anti-pattern:
the number of load-store units per CPU core is limited, and
it is much slower than a wide read into a SIMD register followed by a shuffle.

Comment on lines 307 to 330
__m128i a00 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * x]), *reinterpret_cast<const int64_t*>(&clone[4 * x]));
__m128i a01 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * x]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a11 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 1)]));
__m128i a22 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]));
__m128i a23 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 2)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a33 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 3)]));
__m128i a44 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]));
__m128i a45 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 4)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]));
__m128i a55 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 5)]));
__m128i a66 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]));
__m128i a67 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 6)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]));
__m128i a77 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 7)]));
__m128i a88 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]));
__m128i a89 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 8)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]));
__m128i a99 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 9)]));
__m128i a1010 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]));
__m128i a1011 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 10)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]));
__m128i a1111 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 11)]));
__m128i a1212 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]));
__m128i a1213 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 12)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]));
__m128i a1313 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 13)]));
__m128i a1414 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]));
__m128i a1415 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 14)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]));
__m128i a1515 = _mm_setr_epi64x(*reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]), *reinterpret_cast<const int64_t*>(&clone[4 * (x + 15)]));
Member

Compare:

  • the number of declared local variables
  • the number of available CPU SSE registers (16 XMM registers in 64-bit mode)

How can this code be efficient?
Hopefully the compiler is smart enough.

modules/gapi/src/backends/fluid/gfluidcore_simd_sse42.hpp (resolved)
__m128i val1hi = _mm_insert_epi64(_mm_loadl_epi64(reinterpret_cast<const __m128i*>(&src1[2][w])),
*reinterpret_cast<const int64_t*>(&src1[3][w]), 1);
#endif
__m128i val0_0 = _mm_cvtepu8_epi16(val0lo);
Member

It's more a matter of taste

The limited number of load-store units (LSUs) doesn't agree with you.


However, if I apply this advice I will have to re-run all the perf tests and gather all the statistics again to check that performance hasn't decreased

This should take up to 10 minutes, including report analysis, if you use the right tools (OpenCV has them!).


#include "opencv2/gapi/own/saturate.hpp"

#include <nmmintrin.h>
Member

This is SSE4.2 header: https://stackoverflow.com/questions/11228855/header-files-for-x86-simd-intrinsics

(to the topic of SSE4.2/4.1 mess)

Member Author

Applied.

@anna-khakimova anna-khakimova changed the title Add SIMD for fluid::Resize 8UC3 Fluid: SIMD for Resize Linear 8UC3 Nov 7, 2021
Member

@alalek alalek left a comment

Thank you for update!
No more comments from my side.

Awaiting approval from G-API team member before merge.

@dmatveev
Contributor

dmatveev commented Nov 9, 2021

Thank you for update!
No more comments from my side.
Awaiting approval from G-API team member before merge.

@alalek thanks for your thorough review here! Actually, we discussed the approach with Anna yesterday and agreed to explore the "OpenCV way" of packing vectors, to see if it helps reduce load-store pressure and maybe simplify the code while keeping performance at the same or a better level. Anna will follow up on this. :)

Contributor

@dmatveev dmatveev left a comment

Decision: please merge it as-is, so we have a stable baseline now for future improvement (so as not to mix things up, and not to roll back if the idea doesn't work out)

@alalek alalek merged commit 3cfca01 into opencv:4.x Nov 10, 2021
@alalek alalek mentioned this pull request Dec 30, 2021
@alalek alalek mentioned this pull request Feb 22, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
Fluid: SIMD for Resize Linear 8UC3

* SIMD for fluid::Resize 8U3C

* Rework horizontal pass + add 8U4C case

* Reproduce stackoverflow test

* StackOverflow test

* SSE42 impl

* SSE42 impl improvement

* GAPI:SSE42 simd opt for Resize 8UC3. Final version

* Fix tests

* Conditional compilation fix

* Applied comments

* Applied comments. Step2

* Applied comments. Step2
10 participants