
mpirun hangs intermittently #10586

Closed
devreal opened this issue Jul 19, 2022 · 15 comments

@devreal
Contributor

devreal commented Jul 19, 2022

I'm seeing mpirun hanging during startup on our system. Running mpirun in a loop eventually hangs, typically after a few dozen iterations:

for i in $(seq 1 100 ); do echo $i &&  mpirun -n 1 hostname ; done

The system has dual-socket 64-core AMD EPYC Rome nodes connected through InfiniBand (ConnectX-6). I built Open MPI main with GCC 10.3.0 using the following git tags:

Open MPI: v2.x-dev-9896-g3bda0109c4
PRRTE: psrvr-v2.0.0rc1-4370-gdf7d17d0a3
PMIX: v1.1.3-3554-g6c9d3dde

My configure line is:

../configure --prefix=$HOME/opt-hawk/openmpi-main-ucx/ --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ --disable-man-pages --with-xpmem=$HOME/opt-hawk/xpmem --enable-debug

It appears that the more processes I spawn, the higher the chance that the hang actually occurs. I should also note that if I allocate a single node from PBS the hang does not seem to occur, but if I allocate 8 nodes I can fairly reliably hit the hang even when spawning a single process. I'm not sure where to look here or which knobs to turn to get meaningful debug output. Any suggestions are more than welcome :)
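For illustration, the scaled-up variant of the loop above looks roughly like this (process counts picked arbitrarily):

for n in 1 2 4 8; do
  for i in $(seq 1 100); do echo "n=$n i=$i" && mpirun -n $n hostname; done
done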

@hppritcha
Member

Is this system running slurm?

@awlauria
Contributor

I'll see if I can reproduce locally.

@hppritcha
Member

Never mind, I see the PBS comment.
I am seeing this type of behavior on a Slurm system.

@devreal
Contributor Author

devreal commented Jul 19, 2022

Argh, I wasn't on latest main. Updated, problem persists:

Open MPI: v2.x-dev-9961-gc6dca98c71
PRRTE and PMIx are the same as above.

@awlauria
Contributor

It may be worth trying the latest PRRTE/PMIx main; a fix may have come in since the last submodule update.
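One way to test that locally, as a sketch (the branch names are assumed and may differ from what the submodules actually track):

cd 3rd-party/openpmix && git fetch origin && git checkout origin/master && cd -
cd 3rd-party/prrte && git fetch origin && git checkout origin/master && cd -

then re-run autogen.pl / configure / make from the top-level tree as before.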

@devreal
Contributor Author

devreal commented Jul 19, 2022

Mhh, I'm seeing build-time issues with the current PMIx:

  CC       prm_tm.lo
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c: In function ‘tm_notify’:
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:46: error: unused parameter ‘status’ [-Werror=unused-parameter]
   54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
      |                                ~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:54:73: error: unused parameter ‘source’ [-Werror=unused-parameter]
   54 | static pmix_status_t tm_notify(pmix_status_t status, const pmix_proc_t *source,
      |                                                      ~~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:50: error: unused parameter ‘range’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                ~~~~~~~~~~~~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:75: error: unused parameter ‘info’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                                         ~~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:55:90: error: unused parameter ‘ninfo’ [-Werror=unused-parameter]
   55 |                                pmix_data_range_t range, const pmix_info_t info[], size_t ninfo,
      |                                                                                   ~~~~~~~^~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:49: error: unused parameter ‘cbfunc’ [-Werror=unused-parameter]
   56 |                                pmix_op_cbfunc_t cbfunc, void *cbdata)
      |                                ~~~~~~~~~~~~~~~~~^~~~~~
../../../../../../../3rd-party/openpmix/src/mca/prm/tm/prm_tm.c:56:63: error: unused parameter ‘cbdata’ [-Werror=unused-parameter]
   56 |                                pmix_op_cbfunc_t cbfunc, void *cbdata)
      |                                 

@devreal
Contributor Author

devreal commented Jul 19, 2022

OK, so I got the latest PMIx and PRRTE to build with CFLAGS=-Wno-unused-parameter. I can still reproduce the problem when starting a single process in a PBS job with 8 nodes. I cannot reproduce it in a job with only one or two nodes allocated.
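For reference, the workaround was roughly just adding that flag to the configure line from above:

../configure CFLAGS=-Wno-unused-parameter --prefix=$HOME/opt-hawk/openmpi-main-ucx/ --with-ucx=/opt/hlrs/non-spack/mpi/openmpi/ucx/1.12.0/ --disable-man-pages --with-xpmem=$HOME/opt-hawk/xpmem --enable-debug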

Funny regression: I can run mpirun -n 1 -N 1 ... but setting -n 1 -N 2 leads to an error:

--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

  App: hostname
  Number of procs:  1
  Procs mapped:  1
  Total number of procs:  2
  PPR: 2:node

Please revise the conflict and try again.
--------------------------------------------------------------------------

@devreal
Contributor Author

devreal commented Jul 20, 2022

Seeing the same occasional hang with v5.0.x.

@jsquyres jsquyres added the bug label Jul 21, 2022
@devreal
Contributor Author

devreal commented Jul 21, 2022

I attached to a prterun that hung but couldn't see anything useful, other than that every thread is waiting for something to happen...
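For reference, I attached roughly like this (the pgrep pattern is just an example and assumes a single prterun on the node):

gdb -p $(pgrep -x prterun)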

(gdb) thread apply all bt

Thread 4 (Thread 0x1481724a1700 (LWP 141944)):
#0  0x00001481741a929f in select () from /lib64/libc.so.6
#1  0x0000148175063395 in listen_thread (obj=<optimized out>) at ../../../../../../../3rd-party/prrte/src/mca/oob/tcp/oob_tcp_listener.c:602
#2  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x1481726a2700 (LWP 141943)):
#0  0x00001481741a929f in select () from /lib64/libc.so.6
#1  0x0000148174af0522 in listen_thread (obj=<optimized out>) at ../../../../../../3rd-party/openpmix/src/mca/ptl/base/ptl_base_listener.c:167
#2  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#3  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x1481729b5700 (LWP 141936)):
#0  0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2  0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3  0x00001481749d3f21 in progress_engine (obj=<optimized out>) at ../../../../3rd-party/openpmix/src/runtime/pmix_progress_threads.c:228
#4  0x000014817448214a in start_thread () from /lib64/libpthread.so.0
#5  0x00001481741b1dc3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x1481735ced80 (LWP 141929)):
#0  0x00001481741b20f7 in epoll_wait () from /lib64/libc.so.6
#1  0x0000148174f63993 in epoll_dispatch () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#2  0x0000148174f58e48 in event_base_loop () from /zhome/academic/HLRS/hlrs/hpcjschu/opt-hawk/openmpi-v5.0.x-ucx/lib/libevent_core-2.1.so.7
#3  0x00000000004055f7 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../../3rd-party/prrte/src/tools/prte/prte.c:732

I also played around with some of the verbosity MCA parameters (roughly the kind of thing sketched below) but didn't see anything useful. Any ideas on how to debug this further?
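A sketch of what I mean (the parameter names here are just examples, not necessarily the relevant frameworks):

mpirun --prtemca plm_base_verbose 10 --prtemca state_base_verbose 10 -n 1 hostname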

@awlauria
Contributor

@devreal can you reproduce when calling prterun instead of mpirun? And prun? I wonder if it is a cleanup issue where something in /tmp is not getting cleaned up, or isn't getting cleaned up fast enough.

One thing you could try is, after every run, manually removing (in your script) /tmp/prte.$HOSTNAME.* to see if that clears it up.
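Something like this, for example (the /tmp pattern is just my guess at the session-directory naming):

for i in $(seq 1 100); do
  mpirun -n 1 hostname
  rm -rf /tmp/prte.$HOSTNAME.*
done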

@devreal
Contributor Author

devreal commented Jul 22, 2022

can you reproduce when calling prterun instead of mpirun?

Yes, it hangs too.

and prun?

I get the following error:

prun failed to initialize, likely due to no DVM being available

Not sure what to do about that.

One thing you could try is, after every run, manually removing (in your script) /tmp/prte.$HOSTNAME.* to see if that clears it up.

I don't see any prte files in /tmp, neither before nor after a run that hangs.

Is there a way to get extra debug output that might help dig into where the launch gets stuck?

@awlauria
Contributor

awlauria commented Jul 27, 2022

With prun you have to daemonize prte first: prte --daemonize. However, since you hit it with both mpirun and prterun, I'd say the odds of it not reproducing with prun are remote.
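Roughly, assuming the usual DVM workflow:

prte --daemonize
prun -n 1 hostname
pterm

(pterm just shuts the DVM down afterwards.)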

For what it's worth, I can't reproduce it when using the latest and greatest PRRTE + PMIx.

@rhc54
Contributor

rhc54 commented Aug 9, 2022

I believe this is fixed by openpmix/prrte#1401 and openpmix/prrte#1403, which should now be in your main branch. Those changes are also in the PMIx v4.2 and PRRTE v3.0 branches, so they should come into OMPI v5 once updated.

@devreal
Contributor Author

devreal commented Aug 16, 2022

@rhc54 provided a fix in openpmix/prrte#1436 and ported it back to the PRTE 3.0 branch in openpmix/prrte#1437. @awlauria can we bump the PRTE pointers for both main and 5.0.x?
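(For reference, a pointer bump would be roughly the following, with the SHA placeholder standing in for whatever commit contains the fix:)

cd 3rd-party/prrte
git fetch origin && git checkout <sha-with-the-fix>
cd ../..
git add 3rd-party/prrte
git commit -m "Bump PRRTE submodule pointer"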

@devreal
Contributor Author

devreal commented Nov 2, 2022

This has been fixed in prrte. Thanks @rhc54, closing.

@devreal devreal closed this as completed Nov 2, 2022