
Conversation

@karasevb
Member

The OMPI component query cannot choose the ext2x pmix component; it uses the s1 component instead.

Reproduce:

env OMPI_MCA_pmix_base_async_modex=1 OMPI_MCA_pmix_base_collect_data=0 OMPI_MCA_pmix_base_verbose=100 \
srun -N1 -n 2 --mpi=pmix ./ring_c

...

[node08:10329] mca:base:select: Auto-selecting pmix components
[node08:10329] mca:base:select:( pmix) Querying component [isolated]
[node08:10329] mca:base:select:( pmix) Query of component [isolated] set priority to 0
[node08:10329] mca:base:select:( pmix) Querying component [ext2x]
[node08:10329] mca:base:select:( pmix) Query of component [ext2x] set priority to 5
[node08:10329] mca:base:select:( pmix) Querying component [flux]
[node08:10329] mca:base:select:( pmix) Querying component [s1]
[node08:10329] mca:base:select:( pmix) Query of component [s1] set priority to 10
[node08:10329] mca:base:select:( pmix) Selected component [s1]
[node08:10329] mca: base: close: component isolated closed
[node08:10329] mca: base: close: unloading component isolated
[node08:10329] mca: base: close: component ext2x closed
[node08:10329] mca: base: close: unloading component ext2x
[node08:10329] mca: base: close: unloading component flux
[node08:10329] [[2,24],0] pmix:s1: assigned tmp name

...

[node08:10454] [[2,26],1] pmix:s1 got key btl.tcp.4.0
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[2,26],1]) is on host: node08
  Process 2 ([[2,26],0]) is on host: unknown!
  BTLs attempted: self openib tcp vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[node08:10454] *** An error occurred in MPI_Init
[node08:10454] *** reported by process [140574279729178,140724603453441]
[node08:10454] *** on a NULL communicator
[node08:10454] *** Unknown error
[node08:10454] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node08:10454] ***    and potentially your MPI job)
In: PMI_Abort(1, N/A)
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
slurmstepd: error: *** STEP 2.26 ON node08 CANCELLED AT 2018-06-20T13:04:14 ***
srun: error: eio_message_socket_accept: slurm_receive_msg[10.209.45.150]: Zero Bytes were transmitted or received
srun: error: node08: task 0: Killed
srun: Terminating job step 2.26
srun: error: node08: task 1: Exited with exit code 1

These are the *URI* variables exported by different versions of the PMIx server:

  • PMIx v2.1:
    PMIX_SERVER_URI2
    PMIX_SERVER_URI21
    PMIX_SERVER_URI
    PMIX_SERVER_URI2USOCK
  • PMIx v2.0:
    PMIX_SERVER_URI2

The OMPI pmix component checks only PMIX_SERVER_URI (this is enough to detect PMIx 2.1 only): https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/ext2x/ext2x_component.c#L147. That check does not detect PMIx v2.0.
This commit adds an extra check that covers the PMIx 2.0 environment variable (PMIX_SERVER_URI2).
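
The gist of the change is a second environment probe next to the existing one. A minimal sketch of the idea, not the exact code in ext2x_component.c (the function name and the standalone main are illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: a PMIx server advertises its rendezvous point through environment
 * variables, so the client-side component should only claim the job when one
 * of them is present.  Probing PMIX_SERVER_URI alone misses PMIx v2.0, which
 * exports only PMIX_SERVER_URI2, hence the extra getenv(). */
static int pmix_server_present(void)
{
    if (NULL != getenv("PMIX_SERVER_URI") ||    /* exported by PMIx v2.1 */
        NULL != getenv("PMIX_SERVER_URI2")) {   /* exported by PMIx v2.0 and v2.1 */
        return 1;   /* server found */
    }
    return 0;       /* no server visible in the environment */
}

int main(void)
{
    printf("PMIx server detected: %s\n", pmix_server_present() ? "yes" : "no");
    return 0;
}

When a probe like this succeeds, the component can raise its priority above s1's (10 in the log above) instead of staying at the fallback value of 5, so ext2x wins the selection.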

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
@rhc54
Contributor

rhc54 commented Jun 20, 2018

Just to clarify, your statement "this is enough to detect PMIx 2.1 only" isn't quite correct. PMIX_SERVER_URI provides the rendezvous point for the usock component, not the tcp component. The tcp component uses PMIX_SERVER_URI2. So if the user has enabled usock in the system, then the current code works fine. However, ORTE disables usock by default, and so you need to also look for the tcp rendezvous envar.
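
A quick way to see which rendezvous points a given launcher actually publishes (and therefore which of these envars the detection has to probe) is to dump the PMIX_SERVER_URI* variables from the job environment. A small standalone helper, not part of OMPI:

#include <stdio.h>
#include <string.h>

extern char **environ;

/* Print every PMIX_SERVER_URI* variable visible to this process; run it under
 * "srun --mpi=pmix" to see which rendezvous points the PMIx server exported
 * for this job. */
int main(void)
{
    for (char **e = environ; NULL != *e; e++) {
        if (0 == strncmp(*e, "PMIX_SERVER_URI", strlen("PMIX_SERVER_URI"))) {
            printf("%s\n", *e);
        }
    }
    return 0;
}

Under a PMIx v2.0 server this prints only PMIX_SERVER_URI2 (the tcp rendezvous), which is why the tcp envar has to be part of the check.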

@rhc54 rhc54 merged commit 4bd7459 into open-mpi:master Jun 20, 2018
@rhc54
Contributor

rhc54 commented Jun 20, 2018

Sorry - I didn't see that you had requested reviews from multiple people.

@artpol84
Contributor

@rhc54 it’s ok, 👍
