
Attempt to detect when we are direct-launched without the necessary P… #3778

Merged
merged 1 commit into open-mpi:master from rhc54:topic/warn on Jun 29, 2017

Conversation

rhc54 (Contributor) commented Jun 28, 2017

Attempt to detect when we are direct-launched without the necessary PMI support, and thus are incorrectly identified as being "singleton". Advise the user on the required PMI(x) support and error out.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

rhc54 (Contributor, Author) commented Jun 28, 2017

@hppritcha @jsquyres @bwbarrett Please see if this meets your requirements.

rhc54 (Contributor, Author) commented Jun 28, 2017

Here is what it looks like for SLURM:

$ srun -n 1 ./mpi_spin
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[rhc001:189810] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: rhc001: task 0: Exited with exit code 1
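
For reference, the two build paths described in that help message correspond roughly to the following configure invocations (an illustrative sketch only; the paths are placeholders for your installation, not taken from the PR):

# SLURM 16.05 or later: build SLURM itself against PMIx
# (run in the SLURM source tree; /path/to/pmix is a placeholder)
$ ./configure --with-pmix=/path/to/pmix
$ make && make install

# SLURM older than 16.05: build Open MPI against SLURM's PMI-1/PMI-2 library
# (run in the Open MPI source tree; the PMI library path is a placeholder)
$ ./configure --with-pmi=/path/to/slurm/pmi
$ make && make install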

bwbarrett (Member) commented:

I'll give it a try today, but that looks awesome!

hppritcha (Member) commented:

I'm not sure the ALPS check is correct. I'll double-check for an env variable we can use. Not really important, though, since configure auto-detects Cray PMI; a user would have to explicitly request no PMI support.

rhc54 (Contributor, Author) commented Jun 29, 2017

Thanks @hppritcha - I wasn't sure about the ALPS check either, but figured I should at least give it a try. I was more concerned about the singleton case where someone builds OMPI without ALPS support (e.g., ALPS is installed somewhere unexpected) and then launches a job using aprun. We can remove the ALPS code if you feel it isn't needed.

hppritcha (Member) commented:

@rhc54 this might help:

n17276@kaibab:~>aprun -n 1 env | grep ALPS
PE_PRODUCT_LIST=CRAY_PMI:CRAY_LIBSCI:TOTALVIEW:TOTALVIEW-SUPPORT:GNU:GCC:CRAYPE:CRAYPE_MC12:CRAY_LLM:CRAY_XPMEM:CRAY_DMAPP:CRAY_UGNI:CRAY_UDREG:CRAY_ALPS
CRAY_ALPS_POST_LINK_OPTS=-L/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/lib64
CRAY_ALPS_INCLUDE_OPTS=-I/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/include
ALPS_APP_DEPTH=1
ALPS_APP_ID=21657911
ALPS_APP_PE=0
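
A minimal sketch of how such an environment-based check can work (hypothetical code, not the actual patch merged here: detect_direct_launch is an invented helper, SLURM_STEP_ID is one of the variables srun sets, and ALPS_APP_ID comes from the aprun output above):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: guess which launcher direct-launched this process
 * by looking for environment variables those launchers are known to set. */
static const char *detect_direct_launch(void)
{
    if (getenv("SLURM_STEP_ID") != NULL) {  /* set by srun for each job step */
        return "srun";
    }
    if (getenv("ALPS_APP_ID") != NULL) {    /* set by aprun, per the output above */
        return "aprun";
    }
    return NULL;  /* nothing detected: treat the process as a true singleton */
}

int main(void)
{
    const char *launcher = detect_direct_launch();
    if (NULL != launcher) {
        /* Mirrors the PR's intent: advise the user, then error out. */
        fprintf(stderr,
                "Direct launched via \"%s\" but built without PMI(x) support\n",
                launcher);
        return 1;
    }
    return 0;
}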

rhc54 (Contributor, Author) commented Jun 29, 2017

@hppritcha Done - thanks!

rhc54 merged commit 7cbea77 into open-mpi:master on Jun 29, 2017
rhc54 deleted the topic/warn branch on June 29, 2017 at 23:53