OpenMPI 3.0.0 hangs initialize step in SGE #1500

Closed
johnkim126 opened this issue Nov 6, 2019 · 3 comments

Comments

@johnkim126

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.3.0
  3. Horovod version: 0.18.2
  4. MPI version: 3.0.0
  5. CUDA version: 8.0
  6. NCCL version:
  7. Python version: 3.6.8
  8. OS and version: RHEL 7.6
  9. GCC version: 4.8.5

Hello, I have been testing OpenMPI 3.0.0 and Horovod 0.18.2 under SGE.
It works fine when launched directly from a terminal (without the job scheduler),
but under SGE my MPI processes occasionally hang during the initialization step.

So I attached gdb to the python process and checked the backtraces of its threads.

They show the following functions where the process is stuck:

1. opal_timer_linux_get_cycles_sys_timer function.
#0 0x00002b11ef2a8795 in opal_timer_linux_get_cycles_sys_timer () from /APP/openmpi3/lib/libopen-pal.so.40
#1 0x00002b11ef215f79 in opal_progress () from /APP/openmpi3/lib/libopen-pal.so.40
#2 0x00002b11eec857b3 in ompi_request_default_test_all () from /APP/openmpi3/lib/libmpi.so.40
#3 0x00002b121294112a in NBC_Progress () from /APP/openmpi3/lib/openmpi/mca_coll_libnbc.so
#4 0x00002b121294058f in ompi_coll_libnbc_progress () from /APP/openmpi3/lib/openmpi/mca_coll_libnbc.so
#5 0x00002b11ef215eec in opal_progress () from /APP/openmpi3/lib/libopen-pal.so.40
#6 0x00002b11ef21c5a5 in sync_wait_mt () from /APP/openmpi3/lib/libopen-pal.so.40
#7 0x00002b11eec6f62b in ompi_comm_nextcid () from /APP/openmpi3/lib/libmpi.so.40
#8 0x00002b11eec6ac76 in ompi_comm_dup_with_info () from /APP/openmpi3/lib/libmpi.so.40
#9 0x00002b11eec9e270 in PMPI_Comm_dup () from /APP/openmpi3/lib/libmpi.so.40
#10 0x00002b11eea25c36 in horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:1463
#11 0x00002b11eb55f070 in ?? () from /lib64/libstdc++.so.6
#12 0x00002b11a404ddd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00002b11a4a68ead in clone () from /lib64/libc.so.6

2. ompi_comm_request_progress
(gdb) info thread
Id Target Id Frame
5 Thread 0x2b3d3aead700 (LWP 61690) "python" 0x00002b3d30148965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x2b3d8cf4c700 (LWP 61691) "python" 0x00002b3d7ae6fb9d in ompi_comm_request_progress () from /APP/openmpi3/lib/libmpi.so.40
3 Thread 0x2b3d8dd69700 (LWP 61692) "python" 0x00002b3d30b5520d in poll () from /lib64/libc.so.6
2 Thread 0x2b3d8ea43700 (LWP 61693) "python" 0x00002b3d30b60483 in epoll_wait () from /lib64/libc.so.6
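
For reference, backtraces like the ones above can also be collected non-interactively, which is convenient when the hang only shows up inside a scheduler job. This is a minimal sketch, assuming gdb is on the PATH and the user is allowed to attach to the process. The helper names `build_gdb_backtrace_command` and `dump_backtraces` are hypothetical and not part of Horovod or OpenMPI.

```python
import shutil
import subprocess


def build_gdb_backtrace_command(pid):
    """Return the argv for a one-shot gdb run that prints every thread's backtrace."""
    return [
        "gdb",
        "-p", str(pid),                 # attach to the already running process
        "-batch",                       # run the commands below, then exit
        "-ex", "thread apply all bt",   # backtrace of every thread
    ]


def dump_backtraces(pid, timeout=60):
    """Run gdb against `pid` and return its combined output.

    Returns None if gdb is not installed. Attaching can still fail
    (for example because of ptrace restrictions); in that case the
    returned text contains gdb's error message instead of backtraces.
    """
    if shutil.which("gdb") is None:
        return None
    result = subprocess.run(
        build_gdb_backtrace_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text output; also works on Python 3.6
        timeout=timeout,
    )
    return result.stdout
```

Calling `dump_backtraces(pid)` with the PID of a hung python rank on the compute node should return text similar to the listings above, provided the attach succeeds.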

It works fine with OpenMPI 4.0.0 and Horovod 0.18.2 under SGE.
I would appreciate any advice.

@johnkim126
Author

johnkim126 commented Nov 6, 2019

I'm testing with this Python script (https://github.com/horovod/horovod/blob/master/examples/tensorflow_mnist.py)

command:
$MPIHOME/bin/mpirun --report-bindings --verbose -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=^lo,docker0,docker_gwbridge -mca btl_tcp_if_exclude lo,docker0,docker_gwbridge -mca pml ob1 -mca btl ^openib -np $NSLOTS -machinefile $TMPDIR/machines python /APP/tensorflow_mnist_old.py

@albertz

albertz commented Jun 18, 2020

Maybe #1123 is related? I feel some familiarity with the stack trace, although on closer inspection, it is different, so not sure.
You might try LD_PRELOAD=libhwloc.so or this Python snippet:

import ctypes
ctypes.CDLL("libhwloc.so", mode=ctypes.RTLD_GLOBAL)
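
A slightly more defensive variant of the snippet above, as a sketch only: it looks the library up first and reports whether the preload succeeded instead of raising when libhwloc is not installed. The helper name `preload_global` is made up for illustration, and whether preloading hwloc actually avoids this particular hang is not confirmed.

```python
import ctypes
import ctypes.util


def preload_global(lib_name):
    """Try to load a shared library with RTLD_GLOBAL.

    `lib_name` is the bare name without the "lib" prefix or ".so"
    suffix, e.g. "hwloc". Returns True if the library was loaded,
    False if it could not be found or failed to load.
    """
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return False
    try:
        # RTLD_GLOBAL makes the library's symbols visible to
        # libraries that are loaded afterwards.
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return False
    return True


# This has to run before Horovod (and thus libmpi) is imported.
hwloc_loaded = preload_global("hwloc")
```

For the `LD_PRELOAD` route, the variable would presumably have to reach every rank; with the mpirun command posted above that could be attempted by adding `-x LD_PRELOAD=libhwloc.so` next to the other `-x` options (untested).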

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 6, 2020
@stale stale bot closed this as completed Nov 13, 2020