Deadlock between std::thread, TBB TaskGroup, and OpenMP #353

Closed

sloh-ms opened this issue Feb 26, 2021 · 6 comments
@sloh-ms

sloh-ms commented Feb 26, 2021

UPDATE: I've added a code comment with the solution (where to put taskGroup.wait()) for future reference. See the comments below for the explanation.

Moving from https://stackoverflow.com/questions/66346225/ to here.

I'm working in some preexisting code that uses a number of multi-threading techniques: std::thread, plus Intel TBB task_group, plus OpenMP. 🙄 It looks like I've hit a race condition involving thread reuse that causes both thread::join and OpenMP to fail to return.

The scenario is that the main thread kicks off a bunch of I/O worker std::threads, which themselves initiate some tasks, and the tasks have some segments of code that use OpenMP for parallelism. The main thread does std::thread::join() to wait for the std::threads, then tbb::TaskGroup::wait() to wait for the tasks to complete.

#include <Windows.h>
#include <tbb/task_group.h>
#include <tbb/concurrent_vector.h>
#include <iostream>
#include <sstream>
#include <thread>

void DoCPUIntensiveWork(int chunkIndex);

int main ()
{
    unsigned int hardwareConcurrency = 4;
    tbb::concurrent_vector<std::shared_ptr<std::thread>> ioThreads;
    tbb::task_group taskGroup;

    wprintf(L"Starting %u IO threads\n", hardwareConcurrency);
    for (unsigned int cx = 0; cx < hardwareConcurrency; ++cx)
    {
        ioThreads.push_back(std::shared_ptr<std::thread>(new std::thread([&taskGroup, /*&taskGroupLock,*/ cx]
            {
                wprintf(L"IO thread %u starting\r\n", GetCurrentThreadId());
            
                // Not doing any actual IO

                taskGroup.run([cx]
                    {
                        wprintf(L"CPU task %u starting\r\n", GetCurrentThreadId());
                        DoCPUIntensiveWork(cx);
                        wprintf(L"CPU task %u done\r\n", GetCurrentThreadId());
                    });

                // Sleep(1000);  // Un-commenting this will make the program terminate, but it's a hack
                // taskGroup.wait();  // EDIT: Un-commenting this is the correct fix, see comments below

                wprintf(L"IO thread %u done\r\n", GetCurrentThreadId());
            })));
    }

    // Join the IO workers
    for (std::shared_ptr<std::thread>& thread : ioThreads)
    {
        std::stringstream ss;
        ss << thread->get_id();
        wprintf(L"Wait for thread %S\r\n", ss.str().c_str());
        thread->join();  // Main thread hangs here
    }

    wprintf(L"IO work complete\n");

    // And then wait for the CPU tasks
    taskGroup.wait();

    wprintf(L"CPU work complete\n");

    return 0;
}

And the CPU-Intensive work includes usage of OpenMP.

// Note: I shrunk these numbers down until the amount of work is actually
// small, not CPU-intensive at all, and it still hangs
static const int GlobalBufferChunkSize = 64;
static const int NumGlobalBufferChunks = 4;
static const int StrideSize = 16;
static const int OverwriteCount = 4;
BYTE GlobalBuffer[NumGlobalBufferChunks * GlobalBufferChunkSize];

void DoCPUIntensiveWork(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);

#pragma omp parallel for num_threads(2)
    for (int i = 0; i < (GlobalBufferChunkSize / StrideSize); i++)
    {
        BYTE* pStride = pChunk + (i * StrideSize);
        for (int j = 0; j < OverwriteCount; j++)
        {
            memset(pStride, i, StrideSize);
        }
    }  // Task thread hangs here
}

This code hangs; the main thread waits on thread->join() forever, even on a test case that has only a single IO job / CPU-intensive task. I added the printfs you see above, and the output showed that the IO job finished quickly, that thread exited, and then the CPU-intensive task spun up with the same thread ID before the main thread even reached the join() call.

Starting 4 IO threads
Wait for thread 24452  <-- Main thread never returns from this wait
IO thread 24452 starting
IO thread 23504 starting
IO thread 21356 starting
IO thread 23504 done
IO thread 24452 done  <-- The IO thread the main thread has waited for, is done
IO thread 21356 done
CPU task 23504 starting  <-- Thread reuse is on a *different* thread in this case (usually it's the same thread)
IO thread 20728 starting
CPU task 5240 starting
IO thread 20728 done
... hang forever ...

The problem reproduces whenever an IO thread ID is reused for a TBB task after the IO thread finishes. In my code here, that's every time; in my actual more complicated project, it doesn't happen 100% of the time. The thread->join() call is always still sitting there waiting, even though the IO work is all done and it should be able to move on to wait for the tasks. When I looked in the debugger for the case listed above, thread->join() was indeed still waiting on thread ID 24452, and a thread with that ID did exist, but that thread was completely done executing. Most of the time the main thread ends up waiting on the same thread that gets reused, but apparently that isn't a requirement to reproduce the problem. There just has to be reuse of any thread.
This seems like a bug: thread::join() should return even if TBB reuses the thread, because the work the std::thread was created with has completed. Somehow TBB is breaking the semantics of thread::join().

Second funny thing, the task on the reused thread never completes. When I look at the thread that's executing the task in the debugger, it's sitting at the end of the OpenMP parallel execution. I don't see any other threads executing parallel work. There are a number of threads from vcomp140[d].dll sitting around in ntdll.dll code, for which I don't have symbols - I presume these are just waiting for new work, not doing my task. The CPU is idle. I'm pretty confident nobody is looping. So, the TBB task is hung somewhere in the OpenMP multi-threading implementation.
This seems like another bug: OpenMP breaks when used on a TBB task, although it only happens if the thread happens to be a reused std::thread.

Here's the callstack of the hung task:

ntdll.dll!NtWaitForSingleObject()	Unknown
KernelBase.dll!WaitForSingleObjectEx()	Unknown
vcomp140d.dll!_vcomp::PartialBarrier1::Block() Line 196	C++
vcomp140d.dll!InvokeThreadTeam(_THREAD_TEAM * ptm, void(*)(void *, unsigned long) pvContext, void *) Line 850	C++
vcomp140d.dll!_vcomp_fork(int if_test, int arg_count, void(*)() funclet, ...) Line 230	C++
BugRepro.exe!DoCPUIntensiveWork(int chunkIndex) Line 84	C++
BugRepro.exe!main::__l4::void <lambda>(void)::__l2::<lambda>() Line 34	C++
BugRepro.exe!tbb::internal::function_task<void <lambda>(void)>::execute() Line 1060	C++
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<0> & context_guard, tbb::task * t, __int64 isolation) Line 474	C++
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child) Line 636	C++
tbb_debug.dll!tbb::internal::generic_scheduler::cleanup_master(bool blocking_terminate) Line 1360	C++
tbb_debug.dll!tbb::internal::governor::auto_terminate(void * arg) Line 224	C++
tbb_debug.dll!tbb::internal::governor::terminate_auto_initialized_scheduler() Line 146	C++
tbb_debug.dll!DllMain(void * __formal, unsigned long reason, void * lpvReserved) Line 267	C++
tbb_debug.dll!dllmain_dispatch(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) Line 281	C++
tbb_debug.dll!_DllMainCRTStartup(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) Line 335	C++
ntdll.dll!LdrpCallInitRoutine()	Unknown
ntdll.dll!LdrShutdownThread()	Unknown
ntdll.dll!RtlExitUserThread()	Unknown
KernelBase.dll!FreeLibraryAndExitThread()	Unknown
ucrtbased.dll!common_end_thread(const unsigned int return_code) Line 264	C++
ucrtbased.dll!_endthreadex(unsigned int return_code) Line 278	C++
ucrtbased.dll!thread_start<unsigned int (__cdecl*)(void *),1>(void * const parameter) Line 104	C++
kernel32.dll!BaseThreadInitThunk()	Unknown
ntdll.dll!RtlUserThreadStart()	Unknown

So, somewhere between std::thread and TBB tasks and OpenMP parallelism there's a race condition triggered by thread reuse.

I have found two workarounds that make the hang go away:

  1. Put a Sleep(1000) at the end of the IO thread, so IO thread IDs aren't reused by the tasks.
  2. Remove the use of OpenMP parallelism. Either by removing the pragma or by changing to num_threads(1).

I'm reporting this more as a possible bug than as a cry for help. I've worked around it for now by removing OpenMP. The above is my full code, though you'll need to link with TBB to build it.

This is a copy of TBB obtained within the last month via VCPKG, built for Windows via the VCPKG commands.

@alexey-katranov
Contributor

Thank you for the callstack. It seems we are deadlocked under the loader lock. Do you use OpenMP from MSVC? If possible, can you load symbols for vcomp140d (and maybe other system libs)?

@sloh-ms
Author

sloh-ms commented Mar 2, 2021

Ahh, got it! Sorry, I should have figured out how to get that before. Updated above. I don't think I have source for OpenMP to look at exactly what it's doing inside PartialBarrier1::Block(), though. (As far as I know this is from MSVC.)

@alexey-katranov
Contributor

The OpenMP sources are not required. I just wanted to be sure that it is a barrier. My initial supposition on SO about oversubscription is incorrect and I think it is a loader lock issue.

The std::thread is finishing, and DllMain is called for TBB, i.e. under the loader lock. Then TBB cleans up its structures, finds a missed task (the one created with taskGroup.run on this thread), and executes it. Inside the task we start OpenMP, which has an implicit barrier at the end of the scope, i.e. OpenMP tries to synchronize the calling thread with the other OpenMP threads. However, https://docs.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices states:

You should never perform the following tasks from within DllMain:
Call CreateThread. Creating a thread can work if you do not synchronize with other threads, but it is risky.

Therefore, the thread ID is not reused; it is really the same thread. Sleep(1000) helps because a TBB worker thread then has time to take the task, which prevents the task from being executed during the std::thread's exit.

From the TBB perspective, it is not good practice to run tasks and not wait for them before the thread completes:

  • The thread starts task execution in its own exit path (destructor-time cleanup), as happens here
  • In rare cases, tasks might be invisible to the TBB worker threads, so this thread should execute them via wait() (the automatic execution on thread exit prevents the missed-task issue)

In oneTBB, we resolved the first issue, but we still have the second one. So, you can either try oneTBB or use enqueued tasks (it is not really convenient with task_group because of a possible race), e.g., wrap taskGroup.run([cx]{...}) in the following code:

// Needs <tbb/task_arena.h> and <atomic>
tbb::task_arena arena{tbb::task_arena::attach()};  // attach to the current thread's arena
std::atomic<bool> is_run{false};
arena.enqueue([&taskGroup, &is_run, cx] {
    taskGroup.run([cx]{...});
    is_run = true;
});
while (!is_run) std::this_thread::yield();

We need the is_run synchronization until we implement the task_handle approach (uxlfoundation/oneAPI-spec#292). There is a racy window between enqueue and run, i.e. the main thread might exit taskGroup.wait() before all the tasks have really been executed.

@sloh-ms
Author

sloh-ms commented Mar 2, 2021

I see! I was unaware that the task is "tied" to the current thread, and the authors of this code probably were too. So that's why the Sleep() on the IO thread fixes the deadlock - the task is always completed in that case before the IO thread exits and cleans up while holding the loader lock. I don't entirely understand your second bullet ("In rare cases"), but are you proposing I should fix this by using the arena, or by having the IO thread wait for the task to complete? I think the latter. I just tried it by adding a taskGroup.wait() on the IO thread before it exits, and that removes the deadlock.
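
For anyone who finds this later, the change is just moving the wait into the IO-thread lambda, at the spot marked in the repro above; a minimal sketch reusing the names from my code:

// ... at the end of the IO-thread lambda, right after taskGroup.run(...) ...

// Wait on the IO thread itself, so the task cannot stay "tied" to this
// thread and get executed later by TBB's per-thread cleanup, which runs
// from DllMain under the loader lock when the std::thread exits.
taskGroup.wait();

wprintf(L"IO thread %u done\r\n", GetCurrentThreadId());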

Hooray! Thank you for the help.

@alexey-katranov
Contributor

alexey-katranov commented Mar 2, 2021

You are welcome!

I just tried it by adding a taskGroup.wait() on the IO thread before it exits

I did not think that it was possible (e.g., if it is a driver thread for callbacks), but it is also a valid solution.

@sloh-ms
Author

sloh-ms commented Mar 2, 2021

I've added an edit above showing where I needed to put the taskGroup.wait() to get rid of the deadlock, for future reference.

sloh-ms closed this as completed Mar 2, 2021