The performance of `cv::parallel_for_` is quite low on Windows #25260

LiuPeiqiCN · 2024-03-25T03:48:58Z

System Information

OpenCV version: 4.8.0
Operating System / Platform: Windows 11
Compiler & compiler version: Visual Studio 2022, MSVC 19.38 (_MSC_VER 1938)

Detailed description

On Windows, cv::parallel_for_ performs poorly. I suspect that some optimization options were not enabled when OpenCV was built, and that this causes the issue.

Steps to reproduce

I am studying the book Digital Image Processing, 4th edition, by Rafael C. Gonzalez and Richard E. Woods, and I am trying to implement the contraharmonic mean filter described in Section 5.3, Restoration in the Presence of Noise Only—Spatial Filtering.
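For reference, the filter is commonly written as below, where g is the noisy input, S_xy is the window centred on (x, y), and Q is the filter order. The notation is reproduced from memory rather than copied from the book, so treat the symbols as an assumption. It does match the code that follows, where s1 accumulates the numerator and s2 the denominator.

```latex
\hat{f}(x, y) =
  \frac{\sum_{(r, c) \in S_{xy}} g(r, c)^{Q + 1}}
       {\sum_{(r, c) \in S_{xy}} g(r, c)^{Q}}
```

Q = 0 reduces this to the arithmetic mean and Q = -1 to the harmonic mean; positive Q suppresses pepper noise and negative Q suppresses salt noise.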

I used three methods for Contraharmonic Mean Filter implementation: Halide, cv::parallel_for_, and Windows PPL.

From the test results, OpenCV's performance is clearly worse than PPL's. However, I would rather not use PPL because it is specific to Windows. I would like to achieve similar performance with OpenCV and some simple code.

At least on Windows, OpenCV does not seem to select the correct parallel backend. I strongly suspect that this may significantly reduce the performance of many OpenCV operators.

Code:

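#include <cfloat>   // FLT_EPSILON
#include <chrono>   // timing in the test code
#include <cmath>    // std::pow
#include <iostream> // std::cout in the test code
#include <opencv2/opencv.hpp>
#include <ppl.h>    // Windows only; needed for the commented-out PPL variant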

void IPL::halide::contraharmonicMeanFilter(const cv::Mat& src, cv::Mat& dst, int radius, float Q, bool useHalide)
{
    if (useHalide && src.rows >= 32 && src.cols >= 32)
    {
        //if (dst.empty())
        //    dst = cv::Mat(src.size(), src.type());
        //CV_Assert(src.type() == dst.type() && src.size() == dst.size() && radius >= 1);
        //CV_Assert(src.isContinuous() && dst.isContinuous());
        //Halide::Runtime::Buffer<uint8_t> hal_src = Halide::Runtime::Buffer<uint8_t>::make_interleaved(src.data, src.cols, src.rows, src.channels());
        //Halide::Runtime::Buffer<uint8_t> hal_dst(dst.data, src.channels(), dst.cols, dst.rows);
        //Halide_ContraharmonicMeanFilter(hal_src, radius, Q, hal_dst);
    }
    else
    {
        cv::Mat padded;

        cv::copyMakeBorder(src, padded, radius, radius, radius, radius, cv::BORDER_REPLICATE);
        dst = padded.clone();

        int srcW = src.cols;
        int srcH = src.rows;
        int paddedW = padded.cols;
        int paddedH = padded.rows;
        size_t paddedStep = padded.step1();
        int channel = src.channels();
        int nChannelProc = MIN(channel, 3);

        uchar* op = padded.data;
        uchar* np = dst.data;

        cv::parallel_for_(cv::Range(0, nChannelProc), [&](const cv::Range& range) {
            for (int ch = range.start; ch < range.end; ch++)
            {
                cv::parallel_for_(cv::Range(radius, paddedH - radius), [&](const cv::Range& range) {
                    for (int i = range.start; i < range.end; i++)
                    {
                        for (int j = radius; j < paddedW - radius; j++)
                        {
                            float s1 = 0.0f;
                            float s2 = 0.0f;
#pragma omp simd
                            for (int m = i - radius; m <= i + radius; m++)
                            {
                                for (int n = j - radius; n <= j + radius; n++)
                                {
                                    float tmp = *(op + m * paddedStep + n * channel + ch);
                                    float value = std::pow(tmp, Q);
                                    s1 += value * tmp;
                                    s2 += value;
                                }
                            }
                            if (s2 <= FLT_EPSILON)
                                s2 = 1e-9f;
                            *(np + i * paddedStep + j * channel + ch) = cv::saturate_cast<uchar>(s1 / s2);
                        }
                    }
                });
            }
        });

        /*Concurrency::parallel_for(0, MIN(channel, 3), [&](int ch) {
            Concurrency::parallel_for(radius, paddedH - radius, [&](int i) {
                for (int j = radius; j < paddedW - radius; j++)
                {
                    float s1 = 0.0f;
                    float s2 = 0.0f;
#pragma omp simd
                    for (int m = i - radius; m <= i + radius; m++)
                    {
                        for (int n = j - radius; n <= j + radius; n++)
                        {
                            float tmp = *(op + m * paddedStep + n * channel + ch);
                            float value = std::pow(tmp, Q);
                            s1 += value * tmp;
                            s2 += value;
                        }
                    }
                    if (s2 <= FLT_EPSILON)
                        s2 = 1e-9f;
                    *(np + i * paddedStep + j * channel + ch) = cv::saturate_cast<uchar>(s1 / s2);
                }
            });
        });*/
        dst = dst(cv::Rect(radius, radius, srcW, srcH));
    }
}

Test code:

int main()
{
    //cv::parallel::setParallelForBackend(std::make_shared<cv::parallel::tbb::ParallelForBackend>());
    cv::Mat src = cv::imread("images/lena.png", cv::IMREAD_COLOR); // lena512 color
    cv::Mat dst;

    int iter = 5;
    while (iter--)
    {
        auto start = std::chrono::high_resolution_clock::now();
        IPL::halide::contraharmonicMeanFilter(src, dst, 2, 3, false);
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0;
        std::cout << duration << "ms" << std::endl;
    }
}

Output:

| `cv::parallel_for_` | PPL | Halide |
| --- | --- | --- |
| 65.026 ms | 28.517 ms | 8.697 ms |
| 81.711 ms | 30.769 ms | 8.548 ms |
| 70.233 ms | 31.237 ms | 9.144 ms |
| 80.467 ms | 30.658 ms | 8.06 ms |
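Per-run wall-clock numbers like these fluctuate from run to run. A small helper that discards warm-up runs and reports the median is one way to make such comparisons steadier. This is only a sketch: `medianMillis` and its parameters are hypothetical names, not part of OpenCV or of the test code above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Runs fn() `warmup` times untimed, then `runs` times timed, and returns
// the median elapsed time in milliseconds.
template <typename Fn>
double medianMillis(Fn fn, int warmup, int runs)
{
    assert(warmup >= 0 && runs >= 1);
    for (int i = 0; i < warmup; ++i)
        fn(); // untimed: lets thread pools and caches settle

    std::vector<double> ms;
    ms.reserve(runs);
    for (int i = 0; i < runs; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // Partial sort so that the middle element is the median.
    const auto mid = ms.begin() + ms.size() / 2;
    std::nth_element(ms.begin(), mid, ms.end());
    return *mid;
}
```

In the test above, the call to contraharmonicMeanFilter could be wrapped in a lambda and passed as `fn`. `steady_clock` is used here because it is monotonic.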

OpenCV build information:

General configuration for OpenCV 4.8.0 =====================================
  Version control:               unknown

  Extra modules:
    Location (extra):            D:/vcpkg/buildtrees/opencv4/src/4.8.0-8d756cdf2d.clean/modules
    Version control (extra):     unknown

  Platform:
    Timestamp:                   2024-03-13T05:58:07Z
    Host:                        Windows 10.0.22631 AMD64
    CMake:                       3.27.1
    CMake generator:             Ninja
    CMake build tool:            D:/vcpkg/downloads/tools/ninja/1.10.2-windows/ninja.exe
    MSVC:                        1938
    Configuration:               Release

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (14 files):         + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (0 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (7 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (33 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (5 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.38.33130/bin/Hostx64/x64/cl.exe  (ver 19.38.33130.0)
    C++ flags (Release):         /nologo /DWIN32 /D_WINDOWS /W4 /utf-8 /Zc:inline /GR /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /wd4819  /MD /O2 /Oi /Gy /DNDEBUG
    C++ flags (Debug):           /nologo /DWIN32 /D_WINDOWS /W4 /utf-8 /Zc:inline /GR /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /wd4819  /D_DEBUG /MDd /Ob0 /Od
    C Compiler:                  C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.38.33130/bin/Hostx64/x64/cl.exe
    C flags (Release):           /nologo /DWIN32 /D_WINDOWS /W3 /utf-8 /Zc:inline /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS       /MD /O2 /Oi /Gy /DNDEBUG
    C flags (Debug):             /nologo /DWIN32 /D_WINDOWS /W3 /utf-8 /Zc:inline /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS     /D_DEBUG /MDd /Ob0 /Od
    Linker flags (Release):      /machine:x64  /nologo /DEBUG /INCREMENTAL:NO /OPT:REF /OPT:ICF    /debug
    Linker flags (Debug):        /machine:x64  /nologo    /debug /INCREMENTAL
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy hdf hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot quality rapid reg saliency shape stereo stitching structured_light superres surface_matching text tracking video videoio videostab wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    rgbd sfm world
    Disabled by dependency:      -
    Unavailable:                 cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev cvv freetype gapi java julia matlab ovis python2 python3 ts viz
    Applications:                -
    Documentation:               NO
    Non-free algorithms:         NO

  Windows RT support:            NO

  GUI:                           WIN32UI
    Win32 UI:                    YES

  Media I/O:
    ZLib:                        optimized D:/vcpkg/installed/x64-windows/lib/zlib.lib debug D:/vcpkg/installed/x64-windows/debug/lib/zlibd.lib (ver 1.3.0)
    JPEG:                        optimized D:/vcpkg/installed/x64-windows/lib/jpeg.lib debug D:/vcpkg/installed/x64-windows/debug/lib/jpeg.lib (ver 62)
    WEBP:                        (ver 1.3.2)
    PNG:                         optimized D:/vcpkg/installed/x64-windows/lib/libpng16.lib debug D:/vcpkg/installed/x64-windows/debug/lib/libpng16d.lib (ver 1.6.40)
    TIFF:                        optimized D:/vcpkg/installed/x64-windows/lib/tiff.lib debug D:/vcpkg/installed/x64-windows/debug/lib/tiffd.lib (ver 42 / 4.6.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DirectShow:                  YES
    Media Foundation:            YES
      DXVA:                      YES

  Parallel framework:            Concurrency

  Trace:                         YES (built-in)

  Other third-party libraries:
    Eigen:                       YES (ver 3.4.0)
    Custom HAL:                  NO
    Protobuf:                    optimized D:/vcpkg/installed/x64-windows/bin/libprotobuf.dll debug D:/vcpkg/installed/x64-windows/debug/bin/libprotobufd.dll   version (3.21.12.0)
    Flatbuffers:                 23.5.26

  OpenCL:                        YES (NVD3D11)
    Include path:                D:/vcpkg/buildtrees/opencv4/src/4.8.0-2bf495557d.clean/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python (for build):            NO

  Install to:                    D:/vcpkg/packages/opencv4_x64-windows
-----------------------------------------------------------------

Issue submission checklist

I report the issue, it's not a question
I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
I updated to the latest OpenCV version and the issue is still there
There is reproducer code and related data files (videos, images, onnx, etc)


asmorkalov · 2024-03-29T08:17:01Z

@LiuPeiqiCN Thanks for the issue report. Several notes on the code:

- Nested parallel_for_ calls do not make sense.
- OpenMP uses its own thread pool and scheduler. Mixing OpenCV's parallel framework (Concurrency by default) with OpenMP may cause collisions and deadlocks. At the very least, you will get two concurrent thread pools.

I suggest removing the nested parallelism and the OpenMP pragma, then rechecking the results.

LiuPeiqiCN · 2024-04-01T02:41:19Z

@asmorkalov Thanks for your response. I had never questioned whether nested parallelism was effective, so I consulted the documentation and found that PPL, OpenMP, and TBB all claim to support it. I therefore tested PPL and TBB separately in two configurations:

1. Using nested parallelism.
2. Outer loop serial, inner loop parallel_for.

However, there was no significant difference in elapsed time between the two: nested parallelism brought no real performance improvement. Perhaps it only helps in specific cases. As the TBB documentation suggests, it makes sense when the outer level of parallelism is not enough to utilize the system but each parallel work item is big enough to be parallelized.

Once I removed the outer level cv::parallel_for_, the time consumption was almost the same as with PPL and TBB.
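The single-level structure described above can be sketched as follows: rows are the only parallel dimension, and channels are looped inside each row. So that the sketch compiles without OpenCV, a hypothetical `parallelRows` helper built on std::thread stands in for cv::parallel_for_, and the image is a plain interleaved byte buffer with borders replicated by clamping. None of these names come from the original code, and the rounding is only an approximation of cv::saturate_cast.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for cv::parallel_for_: splits [begin, end) into
// contiguous chunks and runs body(chunkBegin, chunkEnd) on its own thread.
template <typename Body>
void parallelRows(int begin, int end, int nThreads, Body body)
{
    assert(nThreads >= 1);
    if (begin >= end) return;
    const int chunk = (end - begin + nThreads - 1) / nThreads;
    std::vector<std::thread> workers;
    for (int s = begin; s < end; s += chunk)
        workers.emplace_back(body, s, std::min(s + chunk, end));
    for (auto& w : workers) w.join();
}

// Contraharmonic mean over an interleaved 8-bit image.
// One level of parallelism only: rows. Channels are handled serially.
std::vector<std::uint8_t> contraharmonicMean(const std::vector<std::uint8_t>& src,
                                             int width, int height, int channels,
                                             int radius, float Q, int nThreads)
{
    assert(src.size() == static_cast<std::size_t>(width) * height * channels);
    std::vector<std::uint8_t> dst(src.size());

    parallelRows(0, height, nThreads, [&](int rowBegin, int rowEnd) {
        for (int y = rowBegin; y < rowEnd; ++y)
            for (int x = 0; x < width; ++x)
                for (int ch = 0; ch < channels; ++ch)
                {
                    float num = 0.0f, den = 0.0f;
                    for (int dy = -radius; dy <= radius; ++dy)
                        for (int dx = -radius; dx <= radius; ++dx)
                        {
                            // Clamping reproduces a replicated border.
                            const int yy = std::min(std::max(y + dy, 0), height - 1);
                            const int xx = std::min(std::max(x + dx, 0), width - 1);
                            const float v = src[(static_cast<std::size_t>(yy) * width + xx) * channels + ch];
                            const float p = std::pow(v, Q);
                            num += p * v;
                            den += p;
                        }
                    if (den <= 1e-9f) den = 1e-9f;
                    float out = num / den;
                    if (!std::isfinite(out)) out = 0.0f; // e.g. zero pixels with negative Q
                    out = std::min(std::max(out + 0.5f, 0.0f), 255.0f);
                    dst[(static_cast<std::size_t>(y) * width + x) * channels + ch] =
                        static_cast<std::uint8_t>(out);
                }
    });
    return dst;
}
```

With OpenCV, the same shape would be a single cv::parallel_for_ over cv::Range(0, rows) with the channel loop moved inside the body. Each output pixel is computed independently, so the result does not depend on the number of threads.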

crisluengo · 2024-07-22T14:23:59Z

According to the source code, “nested parallel_for_() calls are not parallelized” (it actually checks to see if it’s being nested and works sequentially if that’s the case).
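A guard of that kind can be sketched as below, assuming a single process-wide atomic flag that the outermost call sets and clears. Every name here is hypothetical, and the code is not copied from OpenCV. It only illustrates why the inner call in the reproducer gains nothing.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

namespace sketch {

// True while some outermost parallelFor call is in flight (assumed design).
std::atomic<bool> nestedFlag{false};

// Counts calls that ran serially because they were nested.
std::atomic<int> serialFallbacks{0};

template <typename Body>
void parallelFor(int begin, int end, int nThreads, Body body)
{
    assert(nThreads >= 1);
    if (begin >= end) return;

    // exchange(true) returns the previous value, so `false` means that
    // this call is the outermost one and may actually go parallel.
    const bool outermost = !nestedFlag.exchange(true);
    if (!outermost)
    {
        // Nested call: run the whole range on the calling thread.
        ++serialFallbacks;
        body(begin, end);
        return;
    }

    const int chunk = (end - begin + nThreads - 1) / nThreads;
    std::vector<std::thread> workers;
    for (int s = begin; s < end; s += chunk)
    {
        const int e = std::min(s + chunk, end);
        workers.emplace_back([&body, s, e] { body(s, e); });
    }
    for (auto& w : workers) w.join();

    nestedFlag = false; // the outermost call is done; allow parallelism again
}

} // namespace sketch
```

Under this scheme, an outer loop over three channels takes at most three threads, and every inner row loop falls back to the serial path. That would be consistent with the nested version being slower than a single parallel loop over rows. The sketch ignores exception safety: a body that throws would leave the flag set.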

According to the documentation… nothing. The function is not documented at all! How can people write so much code and not write three lines of documentation to help the user actually use all of that code???

fengyuentau · 2024-08-15T09:21:13Z

> According to the documentation… nothing. The function is not documented at all! How can people write so much code and not write three lines of documentation to help the user actually use all of that code???

@crisluengo We welcome contributions that improve the documentation as well :)

LiuPeiqiCN added the bug label Mar 25, 2024

LiuPeiqiCN changed the title ~~In some cases, the performance of cv::parallel_for_ is quite low~~ The performance of cv::parallel_for_ is quite low on Windows Mar 25, 2024

asmorkalov added the needs investigation and optimization labels and removed the bug label Mar 26, 2024

LiuPeiqiCN closed this as completed Apr 1, 2024

fengyuentau mentioned this issue May 8, 2024

parallel_for_ flagNestedParallelFor cause multi task in difference thread may not parallel #25556

Closed

4 tasks

fengyuentau mentioned this issue Aug 17, 2024

onnx dnn performance decreases in multithreading #26037

Closed

4 tasks
