We are using Horovod with data-parallel models: each process drives 4 GPUs, and there are 6 processes in total. Training hangs forever; the GPUs of 2 of the processes sit at 0% utilization while all the others stay at 100%. The key stack traces from the hung processes are:
Thread 204 (Thread 0x7f00d3e5e700 (LWP 6483)):
#0 0x00007f01da356de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1 0x00007f00e53edaa7 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#2 0x00007f00e55994f4 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#3 0x00007f00e55998a7 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#4 0x00007f00e540628b in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#5 0x00007f00e564aac0 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#6 0x00007f00e53c0083 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#7 0x00007f00e53c1395 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#8 0x00007f00e54674f3 in cuLaunchKernel () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#9 0x00007f0118de682b in cudart::cudaApiLaunchKernel(void const*, dim3, dim3, void**, unsigned long, CUstream_st*) () from /usr/local/cuda/lib64/libnccl.so.2
#10 0x00007f0118e285f6 in cudaLaunchKernel () from /usr/local/cuda/lib64/libnccl.so.2
#11 0x00007f0118d91075 in ncclBarrierEnqueueWait (comm=0x7ee6c9fb6800) at enqueue.cc:215
#12 0x00007f0118d91c42 in ncclEnqueueCheck (info=info@entry=0x7f00d3e5d2a0) at enqueue.cc:565
#13 0x00007f0118da9720 in ncclAllReduce (sendbuff=0x7eea42ea0200, recvbuff=, count=count@entry=4096, datatype=ncclFloat16, op=op@entry=ncclSum, comm=comm@entry=0x7ee6c9fb6800, stream=0x7f0026b448b0) at collectives/all_reduce.cc:16
#14 0x00007f00e7900218 in horovod::common::NCCLAllreduce::Execute (this=0x7f007e3e6d20, entries=std::vector of length 1, capacity 1 = {...}, response=...) at horovod/common/ops/nccl_operations.cc:144
#15 0x00007f00e78c35d1 in horovod::common::OperationManager::ExecuteAllreduce (this=this@entry=0x7f007e3f2710, entries=std::vector of length 1, capacity 1 = {...}, response=...) at horovod/common/ops/operation_manager.cc:41
#16 0x00007f00e78c3a41 in horovod::common::OperationManager::ExecuteOperation (this=0x7f007e3f2710, entries=std::vector of length 1, capacity 1 = {...}, response=...) at horovod/common/ops/operation_manager.cc:90
#17 0x00007f00e789c184 in horovod::common::(anonymous namespace)::PerformOperation (response=..., state=..., this=, this=, this=, this=, this=) at horovod/common/operations.cc:302
#18 0x00007f00e78a0550 in RunLoopOnce (state=...) at horovod/common/operations.cc:607
#19 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:528
#20 0x00007f0120cd7830 in execute_native_thread_routine () from /usr/local/lib64/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#21 0x00007f01da352ea5 in start_thread () from /usr/lib64/libpthread.so.0
#22 0x00007f01d99728dd in clone () from /usr/lib64/libc.so.6
Thread 201 (Thread 0x7f006789e700 (LWP 6486)):
#0 0x00007f01da358b3b in do_futex_wait.constprop.1 () from /usr/lib64/libpthread.so.0
#1 0x00007f01da358bcf in __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0
#2 0x00007f01da358c6b in sem_wait@@GLIBC_2.2.5 () from /usr/lib64/libpthread.so.0
#3 0x00007f00e53eefe2 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#4 0x00007f00e5455a9a in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#5 0x00007f00e53c87e5 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#6 0x00007f00e538e18b in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#7 0x00007f00e548c6f7 in cuEventSynchronize () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#8 0x00007f011667cb4e in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#9 0x00007f01166a91a8 in cudaEventSynchronize () from /usr/local/cuda/lib64/libcudart.so.11.0
#10 0x00007f00e78f7523 in WaitForEvents (timeline=..., entries=std::vector of length 1, capacity 1 = {...}, event_queue=std::queue wrapping: std::deque with 0 elements, this=0x1d12e10) at horovod/common/ops/cuda_operations.cc:87
#11 horovod::common::GPUContext::WaitForEvents (this=, event_queue=std::queue wrapping: std::deque with 0 elements, entries=std::vector of length 1, capacity 1 = {...}, timeline=...) at horovod/common/ops/gpu_context_impl.cc:17
#12 0x00007f00e78f8979 in operator() (__closure=0x7f0070852040) at horovod/common/ops/gpu_operations.cc:67
#13 std::_Function_handler<void(), horovod::common::GPUOpContext::FinalizeGPUQueue(const std::vectorhorovod::common::TensorTableEntry&, bool)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-4/root/usr/include/c++/5.3.1/functional:1871
#14 0x00007f00e78b92ff in operator() (this=0x7f006789de90) at /opt/rh/devtoolset-4/root/usr/include/c++/5.3.1/functional:2267
#15 horovod::common::ThreadPool::loop (this=0x7f00e7b2d9f8 <horovod::common::(anonymous namespace)::gpu_context+24>) at horovod/common/thread_pool.cc:62
#16 0x00007f0120cd7830 in execute_native_thread_routine () from /usr/local/lib64/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#17 0x00007f01da352ea5 in start_thread () from /usr/lib64/libpthread.so.0
#18 0x00007f01d99728dd in clone () from /usr/lib64/libc.so.6
So it seems that some kernels are deadlocked. Could anyone advise whether this is caused by incorrect usage of Horovod? Thanks in advance!
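For context on how a hang with this signature can arise: blocking collectives like ncclAllReduce must be issued in the same order on every rank, and if ranks disagree on that order, each rank waits inside a collective its peers never enter. The sketch below is a toy model in pure Python (no Horovod or NCCL involved; the "rank" threads, barrier names, and 0.5 s timeout are all illustrative assumptions), using a barrier to stand in for a blocking collective:

```python
import threading

# Toy model of two "ranks" issuing two blocking collectives, "A" and "B".
# Each collective is a 2-party barrier that both ranks must enter. If the
# ranks enqueue the collectives in different orders, rank0 waits in A while
# rank1 waits in B, so neither barrier ever completes -- the same shape as
# a mismatched allreduce hang. A short timeout stands in for "hangs forever".
barriers = {"A": threading.Barrier(2), "B": threading.Barrier(2)}
results = {}

def rank(name, order):
    try:
        for op in order:
            barriers[op].wait(timeout=0.5)  # blocking "collective"
        results[name] = "done"
    except threading.BrokenBarrierError:
        results[name] = "hung"

t0 = threading.Thread(target=rank, args=("rank0", ["A", "B"]))
t1 = threading.Thread(target=rank, args=("rank1", ["B", "A"]))  # wrong order
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # both ranks time out instead of completing
```

With matching orders (both ranks running `["A", "B"]`) both threads finish normally, which is why checking that every rank performs the same allreduces in the same sequence is a common first step when debugging hangs like this.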