Using TM launch support with --machinefile causes ORTE error #4468

@jsquyres

I received a report by direct email, which I'll quote below because it contains a good amount of detail:


I ran into a strange issue with OpenMPI 3.0.0 and Torque 6.1.0 I was hoping you had some insight on. When I launch a 2-node test job (HPL in this case) from the command line with the following command, everything works fine:

/shared/openmpi-3.0.0/gcc-4.8.5/bin/mpirun -machinefile $PBS_NODEFILE -np $ncpus ./xhpl

If I put that same command in a torque submission script, I get the correct number of MPI processes on both nodes, but the job never really launches. It behaves much as if a firewall were blocking traffic. This occurs irrespective of whether I try to run on Infiniband or GigE. Adding " -show-progress" to the mpirun options gives

                App launch reported: 2 (out of 2) daemons - 0 (out of 128) procs
                App launch reported: 2 (out of 2) daemons - 0 (out of 128) procs

I confirmed OpenMPI is compiled with torque support:

/shared/openmpi-3.0.0/gcc-4.8.5/bin/ompi_info | grep tm
                 MCA ess: tm (MCA v2.1.0, API v3.0.0, Component v3.0.0)
                 MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v3.0.0)
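
A plain `grep tm` matches any line containing "tm", so a slightly stricter check can be useful when scripting this. The sketch below is only an illustration: `has_tm_support` is a hypothetical helper name, and it assumes the `MCA <framework>: tm` line format shown in the output above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI): read ompi_info output on stdin
# and succeed only if both the tm launcher (plm) and the tm allocator (ras)
# components are listed.
has_tm_support() {
    local out
    out=$(cat)
    printf '%s\n' "$out" | grep -Eq 'MCA plm: tm( |$)' &&
    printf '%s\n' "$out" | grep -Eq 'MCA ras: tm( |$)'
}

# Example use (path taken from the report above):
#   /shared/openmpi-3.0.0/gcc-4.8.5/bin/ompi_info | has_tm_support && echo "tm components present"
```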

However, if I use OpenMPI 2.1.2 and modify the mpirun line as follows, everything runs happily within torque:

/shared/openmpi-2.1.2/gcc-4.8.5/bin/mpirun -machinefile $PBS_NODEFILE -np $ncpus ./xhpl

I asked the person to update and try with the latest nightly snapshots in v3.0.x, v3.1.x, and master. Here's the reply:


Same issue with openmpi-3.0.x-201711040323-888fac7 and openmpi-3.1.x-201711040241-d4ad767. Using openmpi-master-201711060242-ec6b2e1, the job immediately terminates with

[gpu008:49268] [[6179,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_receive.c at line 342
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[6179,0],0] FORCE-TERMINATE AT (null):1 - error base/plm_base_receive.c(343)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

Finally, the person confirmed that if they remove the --machinefile option while running under TM, everything works fine.
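
For reference, a submission script using that workaround might look roughly like the sketch below. It is not the reporter's actual script: the `#PBS -l nodes=2:ppn=64` request is an assumption inferred from the "2 daemons / 128 procs" output above, `count_slots` is a hypothetical helper, and deriving `ncpus` from the node file is only one common way to set it.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=64
# Assumption: 2 nodes x 64 slots, inferred from the -show-progress output.

MPIRUN=/shared/openmpi-3.0.0/gcc-4.8.5/bin/mpirun

# Hypothetical helper: Torque writes one line per allocated slot to the
# node file, so the line count gives the number of processes to start.
count_slots() {
    wc -l < "$1" | tr -d '[:space:]'
}

# Launch only inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    ncpus=$(count_slots "$PBS_NODEFILE")
    # No -machinefile option: with TM support compiled in, mpirun gets
    # the allocated nodes from Torque directly.
    "$MPIRUN" -np "$ncpus" ./xhpl
fi
```

Outside a Torque job the script does nothing, since `PBS_NODEFILE` is unset.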

So it seems like there is a problem when using --machinefile with TM support.
