Description
I had a direct email contact that I'll just quote below, because it contains a good amount of detail:
I ran into a strange issue with OpenMPI 3.0.0 and Torque 6.1.0 that I was hoping you had some insight on. When I launch a 2-node test job (HPL in this case) from the command line with the following command, everything works fine:
/shared/openmpi-3.0.0/gcc-4.8.5/bin/mpirun -machinefile $PBS_NODEFILE -np $ncpus ./xhpl
If I put that same command in a torque submission script, I get the correct number of MPI processes on both nodes, but the job never really launches. It behaves much as if a firewall were blocking traffic. This occurs irrespective of whether I try to run on Infiniband or GigE. Adding " -show-progress" to the mpirun options gives
App launch reported: 2 (out of 2) daemons - 0 (out of 128) procs
App launch reported: 2 (out of 2) daemons - 0 (out of 128) procs
I confirmed OpenMPI is compiled with torque support:
/shared/openmpi-3.0.0/gcc-4.8.5/bin/ompi_info | grep tm
MCA ess: tm (MCA v2.1.0, API v3.0.0, Component v3.0.0)
MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v3.0.0)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v3.0.0)
However, if I use OpenMPI 2.1.2 and modify the mpirun line as follows, everything runs happily within torque:
/shared/openmpi-2.1.2/gcc-4.8.5/bin/mpirun -machinefile $PBS_NODEFILE -np $ncpus ./xhpl
I asked the person to update and try with the latest nightly snapshots in v3.0.x, v3.1.x, and master. Here's the reply:
Same issue with openmpi-3.0.x-201711040323-888fac7 and openmpi-3.1.x-201711040241-d4ad767. Using openmpi-master-201711060242-ec6b2e1, the job immediately terminates with
[gpu008:49268] [[6179,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_receive.c at line 342
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[6179,0],0] FORCE-TERMINATE AT (null):1 - error base/plm_base_receive.c(343)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
Finally, the person confirmed that if they remove the --machinefile option while running under TM, everything works fine.
So it seems like there is a problem when using --machinefile with TM support.
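For anyone trying to reproduce this, here is a rough sketch of a Torque submission script that runs both cases back to back. It is not the reporter's actual script. The job name, the `nodes=2:ppn=64` request (guessed from the 128 procs on 2 nodes above), the 300-second `timeout`, and the `build_cmd` helper are all assumptions made for illustration; the install prefix and `./xhpl` are taken from the report.

```shell
#!/bin/bash
#PBS -N machinefile-repro
#PBS -l nodes=2:ppn=64
#PBS -j oe
# Assumption: the resource request above is a guess (2 nodes x 64 = 128 procs).

# Install prefix from the report; override MPIRUN to test another build.
MPIRUN=${MPIRUN:-/shared/openmpi-3.0.0/gcc-4.8.5/bin/mpirun}

# Hypothetical helper: print the mpirun command line for one case.
#   $1 = "with" or "without" (the -machinefile option)
#   $2 = number of MPI processes
build_cmd() {
    if [ "$1" = "with" ]; then
        printf '%s -machinefile %s -np %s ./xhpl\n' "$MPIRUN" "$PBS_NODEFILE" "$2"
    else
        printf '%s -np %s ./xhpl\n' "$MPIRUN" "$2"
    fi
}

# Only launch when actually running inside a Torque job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # One line per allocated slot in the node file.
    ncpus=$(wc -l < "$PBS_NODEFILE")
    for mode in without with; do
        echo "== $mode -machinefile =="
        # timeout (GNU coreutils) stops a hung launch from using the whole
        # walltime. The unquoted expansion assumes no spaces in the paths.
        timeout 300 $(build_cmd "$mode" "$ncpus")
        echo "exit status: $?"
    done
fi
```

If the behavior matches the report, the "without" case should run normally and the "with" case should hang until `timeout` kills it (exit status 124).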