Deadlock issue in OpenBLAS with TBB #1336

goplanid · 2024-04-01T18:38:55Z

Brief Description: I am trying out this OpenBLAS PR [https://github.com/OpenMathLib/OpenBLAS/pull/4577] with TBB. I first register a callback in my code to change the threading backend dynamically: instead of creating its own threads, OpenBLAS passes the work to the registered callback. I run gemm from TBB tasks and want to use TBB again to execute the callback.

Issue: I am facing a deadlock in OpenBLAS (multiple threads get stuck in the inner_threads function in OpenBLAS). OpenBLAS appears to deadlock when it is used with fewer threads than the number of available threads.

Below is my test code and steps to reproduce it.

#include <iostream>
#include <cblas.h>
#include <vector>
#include <tbb/tbb.h>
#include <chrono>

const int MATRIX_DIMENSION = 1000; // Adjust as needed
const bool delay_threading = true; // When true, register the TBB-based threading callback in main()

class MatrixMultiplicationTask {
private:
    const std::vector<double>& A;
    const std::vector<double>& B;
    std::vector<double>& C;

public:
    MatrixMultiplicationTask(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C)
        : A(A), B(B), C(C) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        MATRIX_DIMENSION, MATRIX_DIMENSION, MATRIX_DIMENSION,
                        1.0, A.data(), MATRIX_DIMENSION, B.data(), MATRIX_DIMENSION,
                        0.0, &C[i * MATRIX_DIMENSION], MATRIX_DIMENSION);
        }
    }
};

class InnerLoopTask {
private:
    openblas_dojob_callback dojob;
    int numjobs;
    size_t jobdata_elsize;
    void* jobdata;
    int dojob_data;

public:
    InnerLoopTask(openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void* jobdata, int dojob_data)
        : dojob(dojob), numjobs(numjobs), jobdata_elsize(jobdata_elsize), jobdata(jobdata), dojob_data(dojob_data) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // Address of the i-th job descriptor inside the jobdata array
            void* element_addr = static_cast<char*>(jobdata) + static_cast<size_t>(i) * jobdata_elsize;
            dojob(i, element_addr, dojob_data);
        }
    }
};

class MyObserver : public tbb::task_scheduler_observer {
public:
    MyObserver() {
        observe(true);
    }

    ~MyObserver() {
        observe(false);
    }

    void on_scheduler_entry(bool is_worker) override {
        std::cout << "Task scheduler entry" << std::endl;
    }

    void on_scheduler_exit(bool is_worker) override {
        std::cout << "Task scheduler exit" << std::endl;
    }
};

void myfunction_ (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
    //MyObserver observer;
    //observer.observe(true);
    InnerLoopTask innerLoopTask(dojob, numjobs, jobdata_elsize, jobdata, dojob_data);
    //tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
}


int main() {
    // Dynamically create matrices using std::vector for easier management
    std::vector<double> A(MATRIX_DIMENSION * MATRIX_DIMENSION, 8.0);
    std::vector<double> B(MATRIX_DIMENSION * MATRIX_DIMENSION, 5.0);
    std::vector<double> C(MATRIX_DIMENSION * MATRIX_DIMENSION, 0.5);

    if (delay_threading)
        openblas_set_threads_callback_function(myfunction_);

    auto start = std::chrono::high_resolution_clock::now();

    tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));

    auto stop = std::chrono::high_resolution_clock::now();

    // Output a portion of the result (printing the entire matrix would be too much)
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            std::cout << C[i * MATRIX_DIMENSION + j] << "\t";
        }
        std::cout << std::endl;
    }

    // Compute the duration
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " milliseconds\n";

    return 0;
}

Build command: g++ -std=c++11 -o tbb_nested tbb_nested.cpp -ltbb -lpthread -I/home/openblas/include -L/home/openblas/lib -lopenblas -Wl,-rpath,/home/openblas/lib

Help needed: As you can see, this is a case of nested parallelism:
outer loop: tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));
inner loop: tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);

In the above code, the outer loop runs for 2 iterations, and each outer iteration runs the inner loop for numjobs iterations. My code has a dependency: innerLoopTask can only operate when exactly numjobs threads are used. What is the best nested-parallelism solution that TBB provides for this problem? Kindly advise.


goplanid · 2024-04-03T06:30:45Z

@anton-malakhov

dnmokhov · 2024-04-05T21:30:20Z

Hi @goplanid,

To guarantee parallelism in the inner loop, you could use TBB in the outer loop only. In the inner loop, you could launch numjobs threads (e.g., with std::thread) in myfunction_, with each thread performing an InnerLoopTask.
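A minimal sketch of that suggestion follows, assuming the callback signature implied by the test code above, dojob(index, element_address, dojob_data). The alias dojob_fn and the helper run_jobs_on_threads are hypothetical names for illustration and are not part of OpenBLAS or oneTBB.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for openblas_dojob_callback, inferred from how the
// test code above invokes it: dojob(index, element_address, dojob_data).
using dojob_fn = void (*)(int, void*, int);

// Gives each of the numjobs jobs a dedicated std::thread, so no job has to
// wait for a free pool worker, then blocks until all of them have finished.
void run_jobs_on_threads(dojob_fn dojob, int numjobs, std::size_t jobdata_elsize,
                         void* jobdata, int dojob_data) {
    assert(numjobs >= 0);
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(numjobs));
    for (int i = 0; i < numjobs; ++i) {
        // Address of the i-th job descriptor inside the jobdata array
        void* element_addr =
            static_cast<char*>(jobdata) + static_cast<std::size_t>(i) * jobdata_elsize;
        workers.emplace_back(dojob, i, element_addr, dojob_data);
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
}
```

In the test code above, myfunction_ could forward its arguments to such a helper in place of the inner tbb::parallel_for. Creating threads on every call has a cost, so this trades some overhead for the guarantee that every job gets its own thread.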

You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
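The throttling arithmetic can be sketched as below. The helper names are hypothetical, and clamping the result to at least 1 is an assumption added here to cover the cases where numjobs exceeds the hardware thread count or std::thread::hardware_concurrency() reports 0.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: number of outer TBB threads to allow when every outer
// iteration launches numjobs threads of its own (hardware_threads / numjobs,
// never less than 1).
unsigned outer_concurrency_limit(unsigned hardware_threads, unsigned numjobs) {
    if (numjobs == 0) {
        return std::max(1u, hardware_threads);
    }
    return std::max(1u, hardware_threads / numjobs);
}

// Same computation using the thread count reported by the standard library,
// which may be 0 when the value cannot be determined.
unsigned outer_concurrency_limit_for_this_machine(unsigned numjobs) {
    return outer_concurrency_limit(std::thread::hardware_concurrency(), numjobs);
}
```

For example, with 32 hardware threads and numjobs = 8 the helper yields 4. That value could then be passed to tbb::global_control with tbb::global_control::max_allowed_parallelism (the call that is commented out in myfunction_ above) before the outer tbb::parallel_for starts.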

dnmokhov added the help wanted label Apr 10, 2024
