Brief Description: I am trying out this OpenBLAS PR [https://github.com/OpenMathLib/OpenBLAS/pull/4577] with TBB. I first register a callback in my code to dynamically change the threading backend. Instead of creating its own threads, OpenBLAS passes the work to the registered callback. I use TBB for running gemm and again want to use TBB for executing the callback.
Issue: I am hitting a deadlock in OpenBLAS: multiple threads get stuck in the inner_threads function. OpenBLAS appears to deadlock when it is used with fewer threads than the number of available threads.
Below are my test code and the steps to reproduce the problem.
#include <iostream>
#include <cblas.h>
#include <vector>
#include <tbb/tbb.h>
#include <chrono>

const int MATRIX_DIMENSION = 1000; // Adjust as needed
bool delay_threading = true;       // When true, register the TBB-based threading callback

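// Outer-loop body: every iteration runs one full dgemm through OpenBLAS.
class MatrixMultiplicationTask {
private:
    const std::vector<double>& A;
    const std::vector<double>& B;
    std::vector<double>& C;

public:
    MatrixMultiplicationTask(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C)
        : A(A), B(B), C(C) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // Caution: the output starts at row i of C but still spans
            // MATRIX_DIMENSION rows, so for i > 0 it runs past the end of C
            // and overlaps the rows written by the other iteration. A clean
            // test would give each iteration its own output buffer.
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        MATRIX_DIMENSION, MATRIX_DIMENSION, MATRIX_DIMENSION,
                        1.0, A.data(), MATRIX_DIMENSION, B.data(), MATRIX_DIMENSION,
                        0.0, &C[i * MATRIX_DIMENSION], MATRIX_DIMENSION);
        }
    }
};
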
// Inner-loop body: runs the jobs that OpenBLAS hands to the threading callback.
class InnerLoopTask {
private:
    openblas_dojob_callback dojob;
    int numjobs;
    size_t jobdata_elsize;
    void* jobdata;
    int dojob_data;

public:
    InnerLoopTask(openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize,
                  void* jobdata, int dojob_data)
        : dojob(dojob), numjobs(numjobs), jobdata_elsize(jobdata_elsize),
          jobdata(jobdata), dojob_data(dojob_data) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // Address of the i-th job descriptor inside the jobdata array.
            void* element_addr = static_cast<char*>(jobdata) + static_cast<size_t>(i) * jobdata_elsize;
            dojob(i, element_addr, dojob_data);
        }
    }
};

// Optional debugging aid: logs when threads enter and leave the TBB scheduler.
class MyObserver : public tbb::task_scheduler_observer {
public:
    MyObserver() {
        observe(true);
    }
    ~MyObserver() {
        observe(false);
    }
    void on_scheduler_entry(bool /*is_worker*/) override {
        std::cout << "Task scheduler entry" << std::endl;
    }
    void on_scheduler_exit(bool /*is_worker*/) override {
        std::cout << "Task scheduler exit" << std::endl;
    }
};

// Threading callback registered with OpenBLAS: runs the numjobs jobs with TBB.
void myfunction_(int sync, openblas_dojob_callback dojob, int numjobs,
                 size_t jobdata_elsize, void* jobdata, int dojob_data)
{
    //MyObserver observer;
    //observer.observe(true);
    InnerLoopTask innerLoopTask(dojob, numjobs, jobdata_elsize, jobdata, dojob_data);
    //tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
}

int main() {
    // Dynamically create matrices using std::vector for easier management
    std::vector<double> A(MATRIX_DIMENSION * MATRIX_DIMENSION, 8.0);
    std::vector<double> B(MATRIX_DIMENSION * MATRIX_DIMENSION, 5.0);
    std::vector<double> C(MATRIX_DIMENSION * MATRIX_DIMENSION, 0.5);

    if (delay_threading) {
        openblas_set_threads_callback_function(myfunction_);
    }

    auto start = std::chrono::high_resolution_clock::now();
    tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A, B, C));
    auto stop = std::chrono::high_resolution_clock::now();

    // Output a portion of the result (printing the entire matrix would be too much)
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            std::cout << C[i * MATRIX_DIMENSION + j] << "\t";
        }
        std::cout << std::endl;
    }

    // Compute the duration
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " milliseconds\n";
    return 0;
}
Help needed: As you can see, I have the following case of nested parallelism:
outer loop: tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));
inner loop: tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
In the code above, the outer loop runs for 2 iterations, and each outer iteration runs numjobs iterations of the inner loop. My code has a dependency: innerLoopTask only works when exactly numjobs threads execute it. What is the best way to handle this kind of nested parallelism with TBB? Kindly advise.
To guarantee parallelism in the inner loop, you could use TBB in the outer loop only. In the inner loop, you could launch numjobs threads (e.g., with std::thread) in myfunction_, with each thread performing an InnerLoopTask.
You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
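A minimal sketch of that suggestion, using only the standard library so it can be tried without OpenBLAS or TBB. The names `dojob_fn`, `run_jobs_on_dedicated_threads`, and `outer_concurrency` are hypothetical and not part of either library. `dojob_fn` is a stand-in for `openblas_dojob_callback`, with a parameter list inferred from how the test code calls `dojob`; the real signature should be checked against the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for openblas_dojob_callback (assumption): mirrors the call
// dojob(index, element_address, dojob_data) used in the test code.
using dojob_fn = void (*)(int, void*, int);

// Hypothetical replacement body for myfunction_: every job gets its own
// std::thread, so no job has to wait for a TBB worker to become free.
void run_jobs_on_dedicated_threads(int /*sync*/, dojob_fn dojob, int numjobs,
                                   std::size_t jobdata_elsize, void* jobdata,
                                   int dojob_data) {
    assert(dojob != nullptr);
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(std::max(numjobs, 0)));
    for (int i = 0; i < numjobs; ++i) {
        // Address of the i-th job descriptor inside the jobdata array.
        void* element = static_cast<char*>(jobdata) +
                        static_cast<std::size_t>(i) * jobdata_elsize;
        workers.emplace_back(dojob, i, element, dojob_data);
    }
    // Return to the caller only after every job has finished.
    for (std::thread& worker : workers) {
        worker.join();
    }
}

// Hypothetical helper: how many outer TBB tasks to allow so that
// outer tasks * numjobs stays close to the hardware thread count.
// std::thread::hardware_concurrency() may return 0, hence the guard.
int outer_concurrency(int hardware_threads, int numjobs) {
    if (hardware_threads < 1 || numjobs < 1) {
        return 1;
    }
    return std::max(1, hardware_threads / numjobs);
}
```

In the test program, a function like this would be registered with `openblas_set_threads_callback_function` in place of `myfunction_`, and the value from `outer_concurrency` would be passed to `tbb::global_control` with `tbb::global_control::max_allowed_parallelism` before the outer `tbb::parallel_for`. Creating threads on every call has a cost, so a persistent pool of numjobs threads may be worth considering if the callback fires often.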
Build command for the test code above: g++ -std=c++11 -o tbb_nested tbb_nested.cpp -ltbb -lpthread -I/home/openblas/include -L/home/openblas/lib -lopenblas -Wl,-rpath,/home/openblas/lib