Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parallel_for_ flagNestedParallelFor cause multi task in difference thread may not parallel #25556

Closed
4 tasks done
DaydreamCoding opened this issue May 7, 2024 · 8 comments
Closed
4 tasks done

Comments

@DaydreamCoding
Copy link

System Information

opencv 4.9

Detailed description

    static std::atomic<bool> flagNestedParallelFor(false);
    if (isNotNestedRegion)
            isNotNestedRegion = !flagNestedParallelFor.exchange(true); // multi threads in this line, may cause isNotNestedRegion to be false
    if (isNotNestedRegion)
    {
        try
        {
            parallel_for_impl(range, body, nstripes); // before done, flagNestedParallelFor always true, and another job could not fall into parallel_for_impl
            flagNestedParallelFor = false;
        }
        catch (...)
        {
            flagNestedParallelFor = false;
            throw;
        }
    }
    else // nested parallel_for_() calls are not parallelized
    {
        CV_UNUSED(nstripes);
        body(range);
    }

When multiple threads are using cv::resize, the cv::resize of some threads cannot be fall into parallel for

Steps to reproduce

auto t1 = std::thread([]() {
    cv::resize(); // this fall into parallel_for_impl
});

auto t2 = std::thread([]() {
    cv::resize(); // can not fall into parallel_for_impl
});

t1.join();
t2.join();

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
@asmorkalov
Copy link
Contributor

@opencv-alalek could you take a look?

@fengyuentau
Copy link
Member

fengyuentau commented May 8, 2024

It is not a bug. Parallel for is designed to work in this way. Nested parallel for does not lead to better performance. See #25260 (comment).

@DaydreamCoding
Copy link
Author

DaydreamCoding commented May 9, 2024

It is not a bug. Parallel for is designed to work in this way. Nested parallel for does not lead to better performance. See #25260 (comment).

This design is flawed, resulting in external non-nested calls that probabilistically cannot be concurrent. And for multi-threaded calls to a function that uses parral_for, this scheme may be invalidated

auto t1 = std::thread([]() {
    cv::resize(); // this fall into parallel_for_impl
});

auto t2 = std::thread([]() {
    cv::resize(); // can not fall into parallel_for_impl
});

t1.join();
t2.join();

@opencv-alalek

@cdeln
Copy link

cdeln commented Jun 21, 2024

Just replace static with thread_local? Might be worth scanning the rest of the code base for non-reentrant functions.

@opencv-alalek
Copy link
Contributor

duplicate of #9336

@opencv-alalek opencv-alalek closed this as not planned Won't fix, can't repro, duplicate, stale Jun 22, 2024
@cdeln
Copy link

cdeln commented Jun 22, 2024

@opencv-alalek This is not a duplicate of #9336 . That (low quality, no repro code) issue talks about nested (i.e. vertical) parallelization, this issue is about horizontal.

@cdeln
Copy link

cdeln commented Jun 22, 2024

Here is minimal reproducing example (MRE).
It starts by allocating an big 20k x 20k image, and then performs two image up-samplings in parallel using lanczos4 interpolation (to maximize processing requirements).

#include <opencv2/imgproc.hpp>

#include <thread>
#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int main()
{
    int rows = 20000;
    int cols = 20000;
    cv::Size size(cols, rows);
    cv::Mat src(size, CV_8UC1);
    cv::Mat out1, out2;

    auto time0 = high_resolution_clock::now();
    decltype(time0) time1, time2;

    auto thread1 = thread([&]() {
        cv::resize(src, out1, {}, 2, 2, cv::INTER_LANCZOS4);
        time1 = high_resolution_clock::now();
    });

    auto thread2 = thread([&]() {
        cv::resize(src, out2, {}, 2, 2, cv::INTER_LANCZOS4);
        time2 = high_resolution_clock::now();
    });

    thread1.join();
    thread2.join();

    cout << "Time resize (thread 1) : " << duration_cast<nanoseconds>(time1 - time0).count() * 1e-9 << endl;
    cout << "Time resize (thread 2) : " << duration_cast<nanoseconds>(time2 - time0).count() * 1e-9 << endl;
    return 0;
}

Running it twice gives

Run 1:

Time resize (thread 1) : 9.20218
Time resize (thread 2) : 2.04223

Run 2:

Time resize (thread 1) : 2.11035
Time resize (thread 2) : 7.64846

I run htop -d 1 while running the executable and you can easily observe the behavior of "first one thread claiming all cores" followed by "the other thread running with single core". I do not think this behavior is intentional? The expected behavior (from a user perspective) is that each thread claims as many as allowed from cv::setNumThreads. If the user performs their own external parallelization (like in the MRE above), they can configure the maximum number of cores each sub-thread allocates in the internal parallel for. Hope this example helps to clear things out.

@cdeln
Copy link

cdeln commented Jun 23, 2024

Ah sry, ofc you cant replace static with thread_local, brain fart on my side. This issue might require more effort to solve that I thought. Not experienced enough with the code base to help any further atm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants