mpirun hangs intermittently #10586
Comments
Is this system running slurm? |
I'll see if I can reproduce locally. |
Never mind, I see the PBS comment. |
Argh, I wasn't on latest Open MPI: v2.x-dev-9961-gc6dca98c71 |
It may be worth trying the latest prrte/pmix main, a fix may have come since the last submodule update. |
Mhh, I'm seeing build time issues with current PMIx:
|
OK, so I got the latest PMIx and PRRTE to build with Funny regression: I can run
|
Seeing the same occasional hang with v5.0.x. |
I attached to a
I also played around with some of the verbosity mca parameters but didn't see anything useful. Any ideas on how to debug this further? |
@devreal can you reproduce when calling prterun instead of mpirun? And prun? I wonder if it is a cleanup issue where something in /tmp is not getting cleaned up, or isn't getting cleaned up fast enough. One thing you could try is manually removing /tmp/prte.$HOSTNAME.* after every run (in your script) to see if that clears it up. |
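The cleanup step suggested above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the helper name `clean_prte_session_dirs` and its parameters are hypothetical, and it assumes the leftover entries match `/tmp/prte.$HOSTNAME.*` exactly as described in the comment. Note that `$HOSTNAME` and `socket.gethostname()` can differ (short name vs. FQDN), so the hostname can be passed explicitly.

```python
import glob
import os
import shutil
import socket


def clean_prte_session_dirs(tmp_root="/tmp", hostname=None):
    """Remove entries matching <tmp_root>/prte.<hostname>.* and return their paths.

    Hypothetical helper: the pattern is taken from the suggestion in the thread;
    everything else (name, parameters, return value) is an assumption.
    """
    # Prefer an explicit hostname, then $HOSTNAME, then the OS-reported name.
    host = hostname or os.environ.get("HOSTNAME") or socket.gethostname()
    pattern = os.path.join(glob.escape(tmp_root), f"prte.{glob.escape(host)}.*")

    removed = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path) and not os.path.islink(path):
            # Session directories may be non-empty, so remove recursively.
            shutil.rmtree(path, ignore_errors=True)
        else:
            try:
                os.remove(path)
            except OSError:
                pass  # Entry vanished or is not removable; ignore in this sketch.
        removed.append(path)
    return removed
```

Calling `clean_prte_session_dirs()` between iterations of the reproducer loop would show whether stale session directories play a role. It only touches the local node's `/tmp`, so it says nothing about leftovers on the other allocated nodes.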
Yes, it hangs too.
I get the following error:
Not sure what to do about that.
I don't see any Is there a way to get extra debug output that might help dig into where the launch gets stuck? |
With For what it's worth I can't reproduce when using the latest and greatest prte + pmix. |
I believe this is fixed by openpmix/prrte#1401 and openpmix/prrte#1403, which should now be in your main branch. Those changes are also in the PMIx v4.2 and PRRTE v3.0 branches, so they should come into OMPI v5 once updated. |
@rhc54 provided a fix in openpmix/prrte#1436 and ported it back to the PRTE 3.0 branch in openpmix/prrte#1437. @awlauria can we bump the PRTE pointers for both |
This has been fixed in prrte. Thanks @rhc54, closing. |
I'm seeing mpirun hanging during startup on our system. Running mpirun in a loop eventually hangs, typically after a few dozen iterations:

The system has dual-socket 64-core AMD Epyc Rome nodes connected through InfiniBand ConnectX-6. I built Open MPI main with GCC 10.3.0 using the following git tags:
Open MPI: v2.x-dev-9896-g3bda0109c4
PRRTE: psrvr-v2.0.0rc1-4370-gdf7d17d0a3
PMIX: v1.1.3-3554-g6c9d3dde
My configure line is:
It appears that the more processes I spawn, the more likely the hang is to occur. I should also note that if I allocate a single node from PBS the hang does not seem to occur, but if I allocate 8 nodes I can fairly reliably trigger the hang even when spawning a single process. I'm not sure where to look here or which knobs to turn in order to get meaningful debug output. Any suggestions are more than welcome :)
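For anyone trying to reproduce this, one way to make the "run in a loop until it hangs" procedure report the failing iteration is a small driver with a per-run timeout. This is a minimal sketch under stated assumptions: the function name `run_until_hang`, the iteration count, and the timeout are all hypothetical, and the original loop from this report is not reproduced here.

```python
import subprocess


def run_until_hang(cmd, max_iters=100, timeout_s=60.0):
    """Run cmd repeatedly; return the 1-based iteration that timed out, or None.

    A run that exceeds timeout_s is treated as a hang. subprocess.run() kills
    the direct child on timeout, which also means the hung process is gone
    afterwards and can no longer be inspected with a debugger.
    """
    for iteration in range(1, max_iters + 1):
        try:
            # check=True surfaces ordinary launch failures as CalledProcessError
            # instead of silently counting them as successful iterations.
            subprocess.run(
                cmd,
                check=True,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return iteration
    return None
```

A call such as `run_until_hang(["mpirun", "-n", "1", "hostname"], max_iters=200, timeout_s=30)` would return the iteration at which the launch first failed to finish within 30 seconds. That command line is an assumed stand-in, not the one used in this report.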