The performance of cv::parallel_for_ is quite low on Windows #25260

Closed
1 of 4 tasks
LiuPeiqiCN opened this issue Mar 25, 2024 · 4 comments
Labels
needs investigation (Collect and attach more details: build flags, stacktraces, input dumps, etc.), optimization

Comments

@LiuPeiqiCN
Contributor

LiuPeiqiCN commented Mar 25, 2024

System Information

OpenCV version: 4.8.0
Operating System / Platform: Windows 11
Compiler & compiler version: VS2022, MSVC 1938

Detailed description

On Windows, the performance of cv::parallel_for_ is unsatisfactory. I suspect that some optimization options were not enabled when OpenCV was compiled, and that this is the cause.

Steps to reproduce

I am studying the book Digital Image Processing, 4th edition, Rafael C. Gonzalez • Richard E. Woods, and trying to implement the Contraharmonic Mean Filter described in Chapter 5.3, Restoration in the Presence of Noise Only—Spatial Filtering.
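For reference, the filter can be written as below. This is the standard textbook definition as I understand it; the symbols (g for the noisy input, S_xy for the window centered at (x, y), Q for the filter order) are notation assumed here, not names from the code.

```latex
\hat{f}(x, y) =
  \frac{\sum_{(r, c) \in S_{xy}} g(r, c)^{Q+1}}
       {\sum_{(r, c) \in S_{xy}} g(r, c)^{Q}}
```

In the code that follows, s1 accumulates the numerator and s2 the denominator over a (2 * radius + 1) x (2 * radius + 1) window.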

I used three methods for Contraharmonic Mean Filter implementation: Halide, cv::parallel_for_, and Windows PPL.

In my tests, cv::parallel_for_ is more than twice as slow as PPL (see the output below). However, I would rather not use PPL because it is Windows-specific. I would prefer to achieve similar performance using OpenCV and some simple code.

At least on Windows, OpenCV does not seem to select the right parallel backend. I strongly suspect this may significantly degrade the performance of many OpenCV operators.
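One way to see which backend a given build uses is to read the "Parallel framework" entry of the build information. The sketch below is only an illustration: parallelFramework is a hypothetical helper, not an OpenCV function, and it assumes the dump has the same layout as the build information attached further down.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not part of OpenCV): returns the text that follows
// "Parallel framework:" in a build-information dump, or "" if the entry
// is missing.
std::string parallelFramework(const std::string& buildInfo)
{
    const std::string key = "Parallel framework:";
    const char* blanks = " \t\r";
    std::istringstream lines(buildInfo);
    std::string line;
    while (std::getline(lines, line))
    {
        const std::size_t pos = line.find(key);
        if (pos == std::string::npos)
            continue;
        // Trim surrounding whitespace from the value after the key.
        const std::string value = line.substr(pos + key.size());
        const std::size_t first = value.find_first_not_of(blanks);
        if (first == std::string::npos)
            return "";
        const std::size_t last = value.find_last_not_of(blanks);
        return value.substr(first, last - first + 1);
    }
    return "";
}

// With OpenCV linked, the input would come from the library itself:
//   std::string backend = parallelFramework(cv::getBuildInformation());
//   int threads = cv::getNumThreads();
```

For the build attached to this issue, the helper would return "Concurrency".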

Code:

#include <cfloat>             // FLT_EPSILON
#include <cmath>              // std::pow
#include <opencv2/opencv.hpp>
#include <ppl.h>              // Windows only

void IPL::halide::contraharmonicMeanFilter(const cv::Mat& src, cv::Mat& dst, int radius, float Q, bool useHalide)
{
    if (useHalide && src.rows >= 32 && src.cols >= 32)
    {
        //if (dst.empty())
        //    dst = cv::Mat(src.size(), src.type());
        //CV_Assert(src.type() == dst.type() && src.size() == dst.size() && radius >= 1);
        //CV_Assert(src.isContinuous() && dst.isContinuous());
        //Halide::Runtime::Buffer<uint8_t> hal_src = Halide::Runtime::Buffer<uint8_t>::make_interleaved(src.data, src.cols, src.rows, src.channels());
        //Halide::Runtime::Buffer<uint8_t> hal_dst(dst.data, src.channels(), dst.cols, dst.rows);
        //Halide_ContraharmonicMeanFilter(hal_src, radius, Q, hal_dst);
    }
    else
    {
        cv::Mat padded;

        cv::copyMakeBorder(src, padded, radius, radius, radius, radius, cv::BORDER_REPLICATE);
        dst = padded.clone();

        int srcW = src.cols;
        int srcH = src.rows;
        int paddedW = padded.cols;
        int paddedH = padded.rows;
        size_t paddedStep = padded.step1();
        int channel = src.channels();
        int nChannelProc = MIN(channel, 3);

        uchar* op = padded.data;
        uchar* np = dst.data;

        cv::parallel_for_(cv::Range(0, nChannelProc), [&](const cv::Range& range) {
            for (int ch = range.start; ch < range.end; ch++)
            {
                cv::parallel_for_(cv::Range(radius, paddedH - radius), [&](const cv::Range& range) {
                    for (int i = range.start; i < range.end; i++)
                    {
                        for (int j = radius; j < paddedW - radius; j++)
                        {
                            float s1 = 0.0f;
                            float s2 = 0.0f;
#pragma omp simd
                            for (int m = i - radius; m <= i + radius; m++)
                            {
                                for (int n = j - radius; n <= j + radius; n++)
                                {
                                    float tmp = *(op + m * paddedStep + n * channel + ch);
                                    float value = std::pow(tmp, Q);
                                    s1 += value * tmp;
                                    s2 += value;
                                }
                            }
                            if (s2 <= FLT_EPSILON)
                                s2 = 1e-9f;
                            *(np + i * paddedStep + j * channel + ch) = cv::saturate_cast<uchar>(s1 / s2);
                        }
                    }
                });
            }
        });

        /*Concurrency::parallel_for(0, MIN(channel, 3), [&](int ch) {
            Concurrency::parallel_for(radius, paddedH - radius, [&](int i) {
                for (int j = radius; j < paddedW - radius; j++)
                {
                    float s1 = 0.0f;
                    float s2 = 0.0f;
#pragma omp simd
                    for (int m = i - radius; m <= i + radius; m++)
                    {
                        for (int n = j - radius; n <= j + radius; n++)
                        {
                            float tmp = *(op + m * paddedStep + n * channel + ch);
                            float value = std::pow(tmp, Q);
                            s1 += value * tmp;
                            s2 += value;
                        }
                    }
                    if (s2 <= FLT_EPSILON)
                        s2 = 1e-9f;
                    *(np + i * paddedStep + j * channel + ch) = cv::saturate_cast<uchar>(s1 / s2);
                }
            });
        });*/
        dst = dst(cv::Rect(radius, radius, srcW, srcH));
    }
}

Test code:

int main()
{
    //cv::parallel::setParallelForBackend(std::make_shared<cv::parallel::tbb::ParallelForBackend>());
    cv::Mat src = cv::imread("images/lena.png", cv::IMREAD_COLOR); // lena512 color
    cv::Mat dst;

    int iter = 5;
    while (iter--)
    {
        auto start = std::chrono::high_resolution_clock::now();
        IPL::halide::contraharmonicMeanFilter(src, dst, 2, 3, false);
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0;
        std::cout << duration << "ms" << std::endl;
    }
}

Output:

cv::parallel_for_    PPL         Halide
65.026ms             28.517ms    8.697ms
81.711ms             30.769ms    8.548ms
70.233ms             31.237ms    9.144ms
80.467ms             30.658ms    8.06ms
OpenCV build information:
General configuration for OpenCV 4.8.0 =====================================
  Version control:               unknown

  Extra modules:
    Location (extra):            D:/vcpkg/buildtrees/opencv4/src/4.8.0-8d756cdf2d.clean/modules
    Version control (extra):     unknown

  Platform:
    Timestamp:                   2024-03-13T05:58:07Z
    Host:                        Windows 10.0.22631 AMD64
    CMake:                       3.27.1
    CMake generator:             Ninja
    CMake build tool:            D:/vcpkg/downloads/tools/ninja/1.10.2-windows/ninja.exe
    MSVC:                        1938
    Configuration:               Release

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (14 files):         + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (0 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (7 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (33 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (5 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.38.33130/bin/Hostx64/x64/cl.exe  (ver 19.38.33130.0)
    C++ flags (Release):         /nologo /DWIN32 /D_WINDOWS /W4 /utf-8 /Zc:inline /GR /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /wd4819  /MD /O2 /Oi /Gy /DNDEBUG
    C++ flags (Debug):           /nologo /DWIN32 /D_WINDOWS /W4 /utf-8 /Zc:inline /GR /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /wd4819  /D_DEBUG /MDd /Ob0 /Od
    C Compiler:                  C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.38.33130/bin/Hostx64/x64/cl.exe
    C flags (Release):           /nologo /DWIN32 /D_WINDOWS /W3 /utf-8 /Zc:inline /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS       /MD /O2 /Oi /Gy /DNDEBUG
    C flags (Debug):             /nologo /DWIN32 /D_WINDOWS /W3 /utf-8 /Zc:inline /MP   /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise /FS     /D_DEBUG /MDd /Ob0 /Od
    Linker flags (Release):      /machine:x64  /nologo /DEBUG /INCREMENTAL:NO /OPT:REF /OPT:ICF    /debug
    Linker flags (Debug):        /machine:x64  /nologo    /debug /INCREMENTAL
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy hdf hfs highgui img_hash imgcodecs imgproc intensity_transform line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot quality rapid reg saliency shape stereo stitching structured_light superres surface_matching text tracking video videoio videostab wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    rgbd sfm world
    Disabled by dependency:      -
    Unavailable:                 cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev cvv freetype gapi java julia matlab ovis python2 python3 ts viz
    Applications:                -
    Documentation:               NO
    Non-free algorithms:         NO

  Windows RT support:            NO

  GUI:                           WIN32UI
    Win32 UI:                    YES

  Media I/O:
    ZLib:                        optimized D:/vcpkg/installed/x64-windows/lib/zlib.lib debug D:/vcpkg/installed/x64-windows/debug/lib/zlibd.lib (ver 1.3.0)
    JPEG:                        optimized D:/vcpkg/installed/x64-windows/lib/jpeg.lib debug D:/vcpkg/installed/x64-windows/debug/lib/jpeg.lib (ver 62)
    WEBP:                        (ver 1.3.2)
    PNG:                         optimized D:/vcpkg/installed/x64-windows/lib/libpng16.lib debug D:/vcpkg/installed/x64-windows/debug/lib/libpng16d.lib (ver 1.6.40)
    TIFF:                        optimized D:/vcpkg/installed/x64-windows/lib/tiff.lib debug D:/vcpkg/installed/x64-windows/debug/lib/tiffd.lib (ver 42 / 4.6.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DirectShow:                  YES
    Media Foundation:            YES
      DXVA:                      YES

  Parallel framework:            Concurrency

  Trace:                         YES (built-in)

  Other third-party libraries:
    Eigen:                       YES (ver 3.4.0)
    Custom HAL:                  NO
    Protobuf:                    optimized D:/vcpkg/installed/x64-windows/bin/libprotobuf.dll debug D:/vcpkg/installed/x64-windows/debug/bin/libprotobufd.dll   version (3.21.12.0)
    Flatbuffers:                 23.5.26

  OpenCL:                        YES (NVD3D11)
    Include path:                D:/vcpkg/buildtrees/opencv4/src/4.8.0-2bf495557d.clean/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python (for build):            NO

  Install to:                    D:/vcpkg/packages/opencv4_x64-windows
-----------------------------------------------------------------

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)

LiuPeiqiCN added the bug label Mar 25, 2024
LiuPeiqiCN changed the title from "In some cases, the performance of cv::parallel_for_ is quite low" to "The performance of cv::parallel_for_ is quite low on Windows" Mar 25, 2024
asmorkalov added the needs investigation and optimization labels and removed the bug label Mar 26, 2024

@asmorkalov
Contributor

@LiuPeiqiCN Thanks for the issue report. Several notes on the code:

  • Nested parallel_for does not make sense.
  • OpenMP uses its own thread pool and scheduler. Mixing OpenCV's parallel framework (Concurrency by default) with OpenMP may cause collisions and deadlocks. At the very least, you will get two concurrent thread pools.

I suggest removing the nested parallelism and OpenMP, then rechecking the results.
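
As a rough illustration of that suggestion, the kernel can be written as a plain row-range function with the channel loop moved inside, so a single cv::parallel_for_ over rows is enough. This is a sketch under assumptions, not a tuned implementation: filterRows and its parameters are hypothetical names, and the rounding only approximates cv::saturate_cast<uchar>.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: filters rows [rowBegin, rowEnd) of a padded,
// interleaved 8-bit image. At most three channels are processed inside
// the row loop, so only one level of parallelism (over rows) is needed.
// No OpenMP pragma is used.
void filterRows(const std::uint8_t* padded, std::uint8_t* out,
                int paddedW, std::size_t step, int channels,
                int radius, float Q, int rowBegin, int rowEnd)
{
    const int nProc = std::min(channels, 3);
    for (int i = rowBegin; i < rowEnd; ++i)
        for (int j = radius; j < paddedW - radius; ++j)
            for (int ch = 0; ch < nProc; ++ch)
            {
                float s1 = 0.0f; // sum of v^(Q+1)
                float s2 = 0.0f; // sum of v^Q
                for (int m = i - radius; m <= i + radius; ++m)
                    for (int n = j - radius; n <= j + radius; ++n)
                    {
                        const float v = padded[m * step + n * channels + ch];
                        const float p = std::pow(v, Q);
                        s1 += p * v;
                        s2 += p;
                    }
                if (s2 <= FLT_EPSILON)
                    s2 = 1e-9f;
                // Clamp to [0, 255] and round to the nearest integer.
                const float r = std::min(255.0f, std::max(0.0f, s1 / s2));
                out[i * step + j * channels + ch] =
                    static_cast<std::uint8_t>(std::lround(r));
            }
}

// Assumed usage with the variables from the original code:
//   cv::parallel_for_(cv::Range(radius, paddedH - radius),
//       [&](const cv::Range& r) {
//           filterRows(op, np, paddedW, paddedStep, channel,
//                      radius, Q, r.start, r.end);
//       });
```

Because each row range writes only to its own rows of the output, the ranges can run concurrently without synchronization.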

@LiuPeiqiCN
Contributor Author

LiuPeiqiCN commented Apr 1, 2024

@asmorkalov Thanks for your response. I had never questioned whether nested parallelism was effective, so I consulted the documentation and found that PPL, OpenMP, and TBB all claim to support it. I then tested PPL and TBB separately in two configurations:

  • Using nested parallelism.
  • Outer loop serial, inner loop parallel_for.

However, there was no significant difference in run time between the two: nested parallelism brought no real performance improvement. Perhaps it is effective only in specific cases. As TBB suggests, it makes sense when the outer-level parallelism is not enough to utilize the system but each piece of parallel work is big enough to be parallelized.

Once I removed the outer-level cv::parallel_for_, the run time was almost the same as with PPL and TBB.

@crisluengo

According to the source code, “nested parallel_for_() calls are not parallelized” (it actually checks whether it is being nested and works sequentially if that’s the case).

According to the documentation… nothing. The function is not documented at all! How can people write so much code and not write three lines of documentation to help the user actually use all of that code???

@fengyuentau
Member

According to the documentation… nothing. The function is not documented at all! How can people write so much code and not write three lines of documentation to help the user actually use all of that code???

@crisluengo We welcome contributions that improve the documentation as well :)
