Maybe #1123 is related? The stack trace feels familiar, although on closer inspection it is different, so I'm not sure.
You might try LD_PRELOAD=libhwloc.so or this Python snippet:
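The snippet itself was not captured in this thread. A minimal sketch of what it most likely does, assuming the intent is the ctypes equivalent of `LD_PRELOAD=libhwloc.so` (force-loading libhwloc with global symbol visibility before Horovod/MPI initializes); the library name and the `preload_hwloc` helper are assumptions, not code from the original comment:

```python
# Hypothetical reconstruction: load libhwloc globally *before* importing
# Horovod, so Open MPI's plugins resolve hwloc symbols from one shared copy
# (same effect as LD_PRELOAD=libhwloc.so). Adjust the library name/path to
# match your installation.
import ctypes
import ctypes.util

def preload_hwloc():
    """Try to load libhwloc with RTLD_GLOBAL; return True on success."""
    path = ctypes.util.find_library("hwloc") or "libhwloc.so"
    try:
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        return True
    except OSError:
        return False

# Call this before `import horovod.tensorflow as hvd` (or the framework
# variant you use).
preload_hwloc()
```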
Environment:
Hello, I have been testing Open MPI 3.0.0 and Horovod 0.18.2 under SGE.
It works fine when launched from a terminal (without the job scheduler),
but under SGE my MPI process occasionally hangs during the initialization step.
So I attached gdb to the Python process and checked the backtraces of its threads.
Two functions appear to be stuck:
1. opal_timer_linux_get_cycles_sys_timer
#0 0x00002b11ef2a8795 in opal_timer_linux_get_cycles_sys_timer () from /APP/openmpi3/lib/libopen-pal.so.40
#1 0x00002b11ef215f79 in opal_progress () from /APP/openmpi3/lib/libopen-pal.so.40
#2 0x00002b11eec857b3 in ompi_request_default_test_all () from /APP/openmpi3/lib/libmpi.so.40
#3 0x00002b121294112a in NBC_Progress () from /APP/openmpi3/lib/openmpi/mca_coll_libnbc.so
#4 0x00002b121294058f in ompi_coll_libnbc_progress () from /APP/openmpi3/lib/openmpi/mca_coll_libnbc.so
#5 0x00002b11ef215eec in opal_progress () from /APP/openmpi3/lib/libopen-pal.so.40
#6 0x00002b11ef21c5a5 in sync_wait_mt () from /APP/openmpi3/lib/libopen-pal.so.40
#7 0x00002b11eec6f62b in ompi_comm_nextcid () from /APP/openmpi3/lib/libmpi.so.40
#8 0x00002b11eec6ac76 in ompi_comm_dup_with_info () from /APP/openmpi3/lib/libmpi.so.40
#9 0x00002b11eec9e270 in PMPI_Comm_dup () from /APP/openmpi3/lib/libmpi.so.40
#10 0x00002b11eea25c36 in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:1463
#11 0x00002b11eb55f070 in ?? () from /lib64/libstdc++.so.6
#12 0x00002b11a404ddd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00002b11a4a68ead in clone () from /lib64/libc.so.6
2. ompi_comm_request_progress
(gdb) info thread
Id Target Id Frame
5 Thread 0x2b3d3aead700 (LWP 61690) "python" 0x00002b3d30148965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x2b3d8cf4c700 (LWP 61691) "python" 0x00002b3d7ae6fb9d in ompi_comm_request_progress () from /APP/openmpi3/lib/libmpi.so.40
3 Thread 0x2b3d8dd69700 (LWP 61692) "python" 0x00002b3d30b5520d in poll () from /lib64/libc.so.6
2 Thread 0x2b3d8ea43700 (LWP 61693) "python" 0x00002b3d30b60483 in epoll_wait () from /lib64/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread 0x2b3d8cf4c700 (LWP 61691))]
#0 0x00002b3d7ae6fb9d in ompi_comm_request_progress () from /APP/openmpi3/lib/libmpi.so.40
(gdb) bt
#0 0x00002b3d7ae6fb9d in ompi_comm_request_progress () from /APP/openmpi3/lib/libmpi.so.40
#1 0x00002b3d7b415eec in opal_progress () from /APP/openmpi3/lib/libopen-pal.so.40
#2 0x00002b3d7ae857b3 in ompi_request_default_test_all () from /APP/openmpi3/lib/libmpi.so.40
#3 0x00002b3d9e94212a in NBC_Progress () from /APP/openmpi3/lib/openmpi/mca_coll_libnbc.so
#4 0x00002b3d9e94158f in ompi_coll_libnbc_progress () from /APP/openmpi3/lib/openmpi/mca_coll_libnbc.so
#5 0x00002b3d7b415eec in opal_progress () from /APP/openmpi3/lib/libopen-pal.so.40
#6 0x00002b3d7b41c5a5 in sync_wait_mt () from /APP/openmpi3/lib/libopen-pal.so.40
#7 0x00002b3d7ae6f62b in ompi_comm_nextcid () from /APP/openmpi3/lib/libmpi.so.40
#8 0x00002b3d7ae6ac76 in ompi_comm_dup_with_info () from /APP/openmpi3/lib/libmpi.so.40
#9 0x00002b3d7ae9e270 in PMPI_Comm_dup () from /APP/openmpi3/lib/libmpi.so.40
#10 0x00002b3d7ac25c36 in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:1463
#11 0x00002b3d77656070 in ?? () from /lib64/libstdc++.so.6
#12 0x00002b3d30144dd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00002b3d30b5fead in clone () from /lib64/libc.so.6
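As an aside, the gdb sessions above require attaching to the live process; when that is inconvenient, the Python-side view of a hung process's threads can be captured with the standard faulthandler module. A sketch (it shows only Python frames, not the native Open MPI frames that gdb shows above; `dump_all_thread_stacks` is an illustrative helper, not part of Horovod):

```python
# Dump every Python thread's stack to a file; handy when a process hangs
# and a debugger is not available. Only Python frames appear here; native
# frames (e.g. inside libmpi) still require gdb.
import faulthandler

def dump_all_thread_stacks(path):
    """Write the current stack of every Python thread to `path`."""
    with open(path, "w") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)

# Optionally, register a signal so stacks can be dumped on demand from
# outside the process (e.g. `kill -USR1 <pid>`):
# import signal
# faulthandler.register(signal.SIGUSR1, all_threads=True)
```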
Also, the same setup works fine with Open MPI 4.0.0 and Horovod 0.18.2 under SGE.
I would appreciate any advice.