Horovod hangs forever when running with a data-parallel model (one process, multiple GPUs) #2944

Open
weberxie opened this issue May 28, 2021 · 1 comment
Comments

@weberxie

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.15
  3. Horovod version: 0.19
  4. MPI version: 4.0.3
  5. CUDA version: 11
  6. NCCL version: 2.7.8
  7. Python version: 3.6
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: CentOS 7
  11. GCC version:
  12. CMake version:

We are using Horovod with data-parallel models: each process drives 4 GPUs, and there are 6 processes in total (a rough sketch of this kind of setup is included after the stack traces below). Training hangs forever: 2 GPUs belonging to 2 of the processes stay at 0% utilization while the other GPUs stay at 100%. The key stack traces from the hung processes are:

Thread 204 (Thread 0x7f00d3e5e700 (LWP 6483)):
#0 0x00007f01da356de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1 0x00007f00e53edaa7 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#2 0x00007f00e55994f4 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#3 0x00007f00e55998a7 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#4 0x00007f00e540628b in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#5 0x00007f00e564aac0 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#6 0x00007f00e53c0083 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#7 0x00007f00e53c1395 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#8 0x00007f00e54674f3 in cuLaunchKernel () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#9 0x00007f0118de682b in cudart::cudaApiLaunchKernel(void const*, dim3, dim3, void**, unsigned long, CUstream_st*) () from /usr/local/cuda/lib64/libnccl.so.2
#10 0x00007f0118e285f6 in cudaLaunchKernel () from /usr/local/cuda/lib64/libnccl.so.2
#11 0x00007f0118d91075 in ncclBarrierEnqueueWait (comm=0x7ee6c9fb6800) at enqueue.cc:215
#12 0x00007f0118d91c42 in ncclEnqueueCheck (info=info@entry=0x7f00d3e5d2a0) at enqueue.cc:565
#13 0x00007f0118da9720 in ncclAllReduce (sendbuff=0x7eea42ea0200, recvbuff=, count=count@entry=4096, datatype=ncclFloat16, op=op@entry=ncclSum, comm=comm@entry=0x7ee6c9fb6800, stream=0x7f0026b448b0) at collectives/all_reduce.cc:16
#14 0x00007f00e7900218 in horovod::common::NCCLAllreduce::Execute (this=0x7f007e3e6d20, entries=std::vector of length 1, capacity 1 = {...}, response=...) at horovod/common/ops/nccl_operations.cc:144
#15 0x00007f00e78c35d1 in horovod::common::OperationManager::ExecuteAllreduce (this=this@entry=0x7f007e3f2710, entries=std::vector of length 1, capacity 1 = {...}, response=...) at horovod/common/ops/operation_manager.cc:41
#16 0x00007f00e78c3a41 in horovod::common::OperationManager::ExecuteOperation (this=0x7f007e3f2710, entries=std::vector of length 1, capacity 1 = {...}, response=...) at horovod/common/ops/operation_manager.cc:90
#17 0x00007f00e789c184 in horovod::common::(anonymous namespace)::PerformOperation (response=..., state=..., this=, this=, this=, this=, this=) at horovod/common/operations.cc:302
#18 0x00007f00e78a0550 in RunLoopOnce (state=...) at horovod/common/operations.cc:607
#19 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:528
#20 0x00007f0120cd7830 in execute_native_thread_routine () from /usr/local/lib64/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#21 0x00007f01da352ea5 in start_thread () from /usr/lib64/libpthread.so.0
#22 0x00007f01d99728dd in clone () from /usr/lib64/libc.so.6

Thread 201 (Thread 0x7f006789e700 (LWP 6486)):
#0 0x00007f01da358b3b in do_futex_wait.constprop.1 () from /usr/lib64/libpthread.so.0
#1 0x00007f01da358bcf in __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0
#2 0x00007f01da358c6b in sem_wait@@GLIBC_2.2.5 () from /usr/lib64/libpthread.so.0
#3 0x00007f00e53eefe2 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#4 0x00007f00e5455a9a in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#5 0x00007f00e53c87e5 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#6 0x00007f00e538e18b in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#7 0x00007f00e548c6f7 in cuEventSynchronize () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#8 0x00007f011667cb4e in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#9 0x00007f01166a91a8 in cudaEventSynchronize () from /usr/local/cuda/lib64/libcudart.so.11.0
#10 0x00007f00e78f7523 in WaitForEvents (timeline=..., entries=std::vector of length 1, capacity 1 = {...}, event_queue=std::queue wrapping: std::deque with 0 elements, this=0x1d12e10) at horovod/common/ops/cuda_operations.cc:87
#11 horovod::common::GPUContext::WaitForEvents (this=, event_queue=std::queue wrapping: std::deque with 0 elements, entries=std::vector of length 1, capacity 1 = {...}, timeline=...) at horovod/common/ops/gpu_context_impl.cc:17
#12 0x00007f00e78f8979 in operator() (__closure=0x7f0070852040) at horovod/common/ops/gpu_operations.cc:67
#13 std::_Function_handler<void(), horovod::common::GPUOpContext::FinalizeGPUQueue(const std::vector<horovod::common::TensorTableEntry>&, bool)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-4/root/usr/include/c++/5.3.1/functional:1871
#14 0x00007f00e78b92ff in operator() (this=0x7f006789de90) at /opt/rh/devtoolset-4/root/usr/include/c++/5.3.1/functional:2267
#15 horovod::common::ThreadPool::loop (this=0x7f00e7b2d9f8 <horovod::common::(anonymous namespace)::gpu_context+24>) at horovod/common/thread_pool.cc:62
#16 0x00007f0120cd7830 in execute_native_thread_routine () from /usr/local/lib64/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#17 0x00007f01da352ea5 in start_thread () from /usr/lib64/libpthread.so.0
#18 0x00007f01d99728dd in clone () from /usr/lib64/libc.so.6

So it seems like some kernels are deadlocked. Could anyone advise whether this might be caused by incorrect usage of Horovod? Thanks in advance!
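For reference, the following is a rough, hypothetical sketch of the kind of setup described above, not the actual training script: one process builds in-graph towers on its 4 local GPUs, averages the tower gradients locally, and lets Horovod allreduce the averaged gradients across the 6 processes. The model, tensor shapes, and synthetic feed data are illustrative assumptions only.

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

# Hypothetical sketch: one process owns 4 GPUs (in-graph towers), 6 such
# processes run under MPI, and Horovod allreduces the locally averaged
# gradients across processes.
hvd.init()
NUM_LOCAL_GPUS = 4  # GPUs driven by this single process (assumption)

features = tf.placeholder(tf.float32, [None, 128])
labels = tf.placeholder(tf.int32, [None])
opt = tf.train.GradientDescentOptimizer(0.01)

# Build one tower per local GPU, sharing variables via AUTO_REUSE.
tower_grads = []
for i in range(NUM_LOCAL_GPUS):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=tf.AUTO_REUSE):
        logits = tf.layers.dense(features, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        tower_grads.append(opt.compute_gradients(loss))

# Average each gradient over the local towers, then allreduce across processes.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    var = grads_and_vars[0][1]
    local_avg = tf.reduce_mean(tf.stack(grads), axis=0)
    averaged.append((hvd.allreduce(local_avg), var))
train_op = opt.apply_gradients(averaged)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    # Synthetic data purely for illustration.
    batch = {features: np.random.rand(32, 128).astype(np.float32),
             labels: np.random.randint(0, 10, size=32)}
    for _ in range(100):
        sess.run(train_op, feed_dict=batch)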

@weberxie weberxie added the bug label May 28, 2021
@chongxiaoc
Collaborator

Do you have a Python script that can reproduce the problem?
