@igor-ivanov
Member

There is an issue with an attempted double initialization of pmix1xx
when the ess selection order is singleton,pmi.
When the order is pmi,singleton it does not happen.

@rhc54, @jladd-mlnx could you take a look?

It was observed with a simple hello application run using srun:
env PMIX_VERBOSE=100 OMPI_MCA_opal_pmix_base_verbose=100 OMPI_MCA_orte_ess_base_verbose=100 srun -n2 ./hello.out
and the selection logic looks as follows:
mca:base:select:( ess) Querying component [singleton]
mca:base:select:( ess) Query of component [singleton] set priority to 25
mca:base:select:( ess) Querying component [env]
mca:base:select:( ess) Querying component [pmi]

This order is ok:
mca:base:select:( ess) Querying component [env]
mca:base:select:( ess) Querying component [hnp]
mca:base:select:( ess) Querying component [pmi]
mca:base:select:( ess) Query of component [singleton] set priority to 25
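
The two traces above follow a query-then-pick-highest-priority pattern. As a rough illustration only (this is a toy model, not Open MPI's selection code), the sketch below shows why the query order can matter even when the winner does not change: whatever a component does inside its query runs in enumeration order. The component names and the singleton priority of 25 come from the log; the `select` helper, the declining `env` query, and the pmi priority of 50 are assumptions made up for the example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy model: a component is (name, query). The query returns a
# priority, or None when the component declines to run.
Component = Tuple[str, Callable[[], Optional[int]]]


def select(components: List[Component]) -> Tuple[Optional[str], List[str]]:
    """Query every component in list order; return (winner, query_order)."""
    best_name: Optional[str] = None
    best_priority = -1
    query_order: List[str] = []
    for name, query in components:
        # Any side effect of query() happens here, in enumeration order.
        query_order.append(name)
        priority = query()
        if priority is not None and priority > best_priority:
            best_name, best_priority = name, priority
    return best_name, query_order


singleton = ("singleton", lambda: 25)  # 25 is the priority shown in the log
env = ("env", lambda: None)            # assumed to decline in this toy model
pmi = ("pmi", lambda: 50)              # 50 is a made-up value for illustration

bad = select([singleton, env, pmi])
good = select([env, pmi, singleton])
print(bad)
print(good)
```

In this model both orders pick the same component, but the queries run in a different sequence, so a query that initializes shared state behaves differently depending on which component was enumerated first.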

:bot:assign: @rhc54
:bot:label:bug

@rhc54
Contributor

rhc54 commented Oct 14, 2015

I'm puzzled - how did you wind up with the "bad" order? It shouldn't be possible, I believe.

@igor-ivanov
Member Author

@rhc54 I do not have a reasonable answer. I investigated this case without success. I tried setting the order in the OMPI_MCA_orte_ess variable to env,pmi,etc., but that does not seem to work (probably it is not supposed to). So I do not understand how the selection order is determined, but I do see it on a real launch on a specific cluster. I checked this issue on two different clusters using the same master code point, built with the same configure options, and saw two different orderings.
I believe that my change should not break the expected logic and should additionally fix the wrong case. What do you think?

@ggouaillardet
Contributor

@rhc54 we previously ran into a potentially similar issue. components are loaded in the order the OS enumerates them, and this is site dependent. i suggested we could sort the components by name to have something 100% reproducible, and @jsquyres pointed out there is a risk that similar bugs might then remain undetected.
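
To make the enumeration point concrete, here is a small sketch, assuming component files named like `mca_ess_<name>.so` (the `mca_ess_singleton.*` pattern appears later in this thread; the `.so` suffix and both helper functions are assumptions for the example). Python's `os.listdir` documents its result as being in arbitrary order, which mirrors the site-dependent order described above, and sorting by name is the reproducible alternative that was suggested.

```python
import os
import tempfile


def component_files(directory, prefix="mca_ess_"):
    """Return matching file names in raw enumeration order (not guaranteed stable)."""
    return [name for name in os.listdir(directory) if name.startswith(prefix)]


def sorted_component_files(directory, prefix="mca_ess_"):
    """Return the same names sorted, so every site sees the same order."""
    return sorted(component_files(directory, prefix))


with tempfile.TemporaryDirectory() as d:
    # Create empty stand-in files for the ess components named in the log.
    for comp in ("singleton", "env", "pmi", "slurm", "tool", "hnp"):
        open(os.path.join(d, f"mca_ess_{comp}.so"), "w").close()
    raw = component_files(d)
    stable = sorted_component_files(d)

print(raw)     # order depends on the filesystem
print(stable)  # same order everywhere
```

The trade-off raised in the comment still applies: a fixed order makes runs reproducible, but it can also hide order-dependent bugs like this one.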

@igor-ivanov could you please use the -l srun parameter so we can see which task is printing which message?
do you run on one or two nodes?
if you always run on the very same node, is your issue 100% reproducible?
if this is the case, then you can try
mv /.../lib/openmpi/mca_ess_singleton.* /tmp
mv /tmp/mca_ess_singleton.* /.../lib/openmpi
(once if openmpi is installed on NFS, or on each node)
that should at least make things consistent, and hopefully work around the issue
and if this still does not work, you can try the same commands with pmi in place of singleton

btw, :bot:xxx are only valid in the ompi-release repository
if you have write access to the ompi repository, you can directly set label(s) and assignee from the web interface

@rhc54
Contributor

rhc54 commented Oct 16, 2015

I wouldn't worry about moving the singleton component around - we understand the issue. The concern I have is whether changing the priority will cause singletons to fail.

@igor-ivanov Could you please get a slurm allocation, and then run:

env OMPI_MCA_orte_ess_base_verbose=100 ./hello_out

In other words, try running a singleton without srun and let's see if your change still allows the ess/singleton component to be selected.

@igor-ivanov
Member Author

@rhc54 I see the same behaviour in your case: without the suggested change it segfaults; with the fix applied the launch is OK.

$salloc -N2
$env PMIX_VERBOSE=0 OMPI_MCA_opal_pmix_base_verbose=0 OMPI_MCA_orte_ess_base_verbose=100 ./a.out

@ggouaillardet Initially I ran on a single node (srun -n2). I have also verified the issue on two nodes (srun -N2) with the same result. The result is 100% reproducible, i.e. one cluster always shows one ordering and the other cluster always shows the other.

Issue case (no fix):

$env PMIX_VERBOSE=0 OMPI_MCA_opal_pmix_base_verbose=0 OMPI_MCA_orte_ess_base_verbose=100 srun -n2 -l ./a.out
0: [clx-orion-113:32334] mca: base: components_register: registering framework ess components
0: [clx-orion-113:32334] mca: base: components_register: found loaded component singleton
1: [clx-orion-113:32335] mca: base: components_register: registering framework ess components
1: [clx-orion-113:32335] mca: base: components_register: found loaded component singleton
1: [clx-orion-113:32335] mca: base: components_register: component singleton register function successful
1: [clx-orion-113:32335] mca: base: components_register: found loaded component env
0: [clx-orion-113:32334] mca: base: components_register: component singleton register function successful
0: [clx-orion-113:32334] mca: base: components_register: found loaded component env
0: [clx-orion-113:32334] mca: base: components_register: component env has no register or open function
1: [clx-orion-113:32335] mca: base: components_register: component env has no register or open function
0: [clx-orion-113:32334] mca: base: components_register: found loaded component pmi
0: [clx-orion-113:32334] mca: base: components_register: component pmi has no register or open function
1: [clx-orion-113:32335] mca: base: components_register: found loaded component pmi
1: [clx-orion-113:32335] mca: base: components_register: component pmi has no register or open function
1: [clx-orion-113:32335] mca: base: components_register: found loaded component slurm
1: [clx-orion-113:32335] mca: base: components_register: component slurm has no register or open function
0: [clx-orion-113:32334] mca: base: components_register: found loaded component slurm
0: [clx-orion-113:32334] mca: base: components_register: component slurm has no register or open function
0: [clx-orion-113:32334] mca: base: components_register: found loaded component tool
0: [clx-orion-113:32334] mca: base: components_register: component tool has no register or open function
1: [clx-orion-113:32335] mca: base: components_register: found loaded component tool
1: [clx-orion-113:32335] mca: base: components_register: component tool has no register or open function
0: [clx-orion-113:32334] mca: base: components_register: found loaded component hnp
0: [clx-orion-113:32334] mca: base: components_register: component hnp has no register or open function
1: [clx-orion-113:32335] mca: base: components_register: found loaded component hnp
1: [clx-orion-113:32335] mca: base: components_register: component hnp has no register or open function
0: [clx-orion-113:32334] mca: base: components_open: opening ess components
0: [clx-orion-113:32334] mca: base: components_open: found loaded component singleton
0: [clx-orion-113:32334] mca: base: components_open: component singleton open function successful
0: [clx-orion-113:32334] mca: base: components_open: found loaded component env
0: [clx-orion-113:32334] mca: base: components_open: component env open function successful
0: [clx-orion-113:32334] mca: base: components_open: found loaded component pmi
0: [clx-orion-113:32334] mca: base: components_open: component pmi open function successful
0: [clx-orion-113:32334] mca: base: components_open: found loaded component slurm
0: [clx-orion-113:32334] mca: base: components_open: component slurm open function successful
1: [clx-orion-113:32335] mca: base: components_open: opening ess components
1: [clx-orion-113:32335] mca: base: components_open: found loaded component singleton
1: [clx-orion-113:32335] mca: base: components_open: component singleton open function successful
1: [clx-orion-113:32335] mca: base: components_open: found loaded component env
1: [clx-orion-113:32335] mca: base: components_open: component env open function successful
1: [clx-orion-113:32335] mca: base: components_open: found loaded component pmi
1: [clx-orion-113:32335] mca: base: components_open: component pmi open function successful
1: [clx-orion-113:32335] mca: base: components_open: found loaded component slurm
1: [clx-orion-113:32335] mca: base: components_open: component slurm open function successful
1: [clx-orion-113:32335] mca: base: components_open: found loaded component tool
1: [clx-orion-113:32335] mca: base: components_open: component tool open function successful
1: [clx-orion-113:32335] mca: base: components_open: found loaded component hnp
1: [clx-orion-113:32335] mca:
1:  base: components_open: component hnp open function successful
0: [clx-orion-113:32334] mca: base: components_open: found loaded component tool
0: [clx-orion-113:32334] mca: base: components_open: component tool open function successful
0: [clx-orion-113:32334] mca: base: components_open: found loaded component hnp
0: [clx-orion-113:32334] mca: base: components_open: component hnp open function successful
0: [clx-orion-113:32334] mca:base:select: Auto-selecting ess components
0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [singleton]
1: [clx-orion-113:32335] mca:base:select: Auto-selecting ess components
1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [singleton]
0: [clx-orion-113:32334] mca:base:select:(  ess) Query of component [singleton] set priority to 25
0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [env]
0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [pmi]
1: [clx-orion-113:32335] mca:base:select:(  ess) Query of component [singleton] set priority to 25
1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [env]
1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [pmi]
0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [slurm]
0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [tool]
0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [hnp]
0: [clx-orion-113:32334] mca:base:select:(  ess) Selected component [singleton]
0: [clx-orion-113:32334] mca: base: close: component env closed
0: [clx-orion-113:32334] mca: base: close: unloading component env
0: [clx-orion-113:32334] mca: base: close: component pmi closed
0: [clx-orion-113:32334] mca: base: close: unloading component pmi
1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [slurm]
1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [tool]
1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [hnp]
1: [clx-orion-113:32335] mca:base:select:(  ess) Selected component [singleton]
0: [clx-orion-113:32334] mca: base: close: component slurm closed
0: [clx-orion-113:32334] mca: base: close: unloading component slurm
1: [clx-orion-113:32335] mca: base: close: component env closed
1: [clx-orion-113:32335] mca: base: close: unloading component env
0: [clx-orion-113:32334] mca: base: close: component tool closed
0: [clx-orion-113:32334] mca: base: close: unloading component tool
0: [clx-orion-113:32334] mca: base: close: component hnp closed
1: [clx-orion-113:32335] mca: base: close: component pmi closed
1: [clx-orion-113:32335] mca: base: close: unloading component pmi
0: [clx-orion-113:32334] mca: base: close: unloading component hnp
1: [clx-orion-113:32335] mca: base: close: component slurm closed
1: [clx-orion-113:32335] mca: base: close: unloading component slurm
1: [clx-orion-113:32335] mca: base: close: component tool closed
1: [clx-orion-113:32335] mca: base: close: unloading component tool
1: [clx-orion-113:32335] mca: base: close: component hnp closed
1: [clx-orion-113:32335] mca: base: close: unloading component hnp
0: [clx-orion-113:32338] mca: base: components_register: registering framework ess components
0: [clx-orion-113:32338] mca: base: components_register: found loaded component singleton
0: [clx-orion-113:32338] mca: base: components_register: component singleton register function successful
0: [clx-orion-113:32338] mca: base: components_register: found loaded component env
0: [clx-orion-113:32338] mca: base: components_register: component env has no register or open function
0: [clx-orion-113:32338] mca: base: components_register: found loaded component pmi
0: [clx-orion-113:32338] mca: base: components_register: component pmi has no register or open function
0: [clx-orion-113:32338] mca: base: components_register: found loaded component slurm
0: [clx-orion-113:32338] mca: base: components_register: component slurm has no register or open function
0: [clx-orion-113:32338] mca: base: components_register: found loaded component tool
0: [clx-orion-113:32338] mca: base: components_register: component tool has no register or open function
0: [clx-orion-113:32338] mca: base: components_register: found loaded component hnp
0: [clx-orion-113:32338] mca: base: components_register: component hnp has no register or open function
0: [clx-orion-113:32338] mca: base: components_open: opening ess components
0: [clx-orion-113:32338] mca: base: components_open: found loaded component singleton
0: [clx-orion-113:32338] mca: base: components_open: component singleton open function successful
0: [clx-orion-113:32338] mca: base: components_open: found loaded component env
0: [clx-orion-113:32338] mca: base: components_open: component env open function successful
0: [clx-orion-113:32338] mca: base: components_open: found loaded component pmi
0: [clx-orion-113:32338] mca: base: components_open: component pmi open function successful
0: [clx-orion-113:32338] mca: base: components_open: found loaded component slurm
0: [clx-orion-113:32338] mca: base: components_open: component slurm open function successful
0: [clx-orion-113:32338] mca: base: components_open: found loaded component tool
0: [clx-orion-113:32338] mca: base: components_open: component tool open function successful
0: [clx-orion-113:32338] mca: base: components_open: found loaded component hnp
0: [clx-orion-113:32338] mca: base: components_open: component hnp open function successful
0: [clx-orion-113:32338] mca:base:select: Auto-selecting ess components
0: [clx-orion-113:32338] mca:base:select:(  ess) Querying component [singleton]
0: [clx-orion-113:32338] mca:base:select:(  ess) Querying component [env]
0: [clx-orion-113:32338] mca:base:select:(  ess) Querying component [pmi]
0: [clx-orion-113:32338] mca:base:select:(  ess) Querying component [slurm]
0: [clx-orion-113:32338] mca:base:select:(  ess) Querying component [tool]
0: [clx-orion-113:32338] mca:base:select:(  ess) Querying component [hnp]
0: [clx-orion-113:32338] mca:base:select:(  ess) Query of component [hnp] set priority to 100
0: [clx-orion-113:32338] mca:base:select:(  ess) Selected component [hnp]
0: [clx-orion-113:32338] mca: base: close: component singleton closed
0: [clx-orion-113:32338] mca: base: close: unloading component singleton
0: [clx-orion-113:32338] mca: base: close: component env closed
0: [clx-orion-113:32338] mca: base: close: unloading component env
0: [clx-orion-113:32338] mca: base: close: component pmi closed
0: [clx-orion-113:32338] mca: base: close: unloading component pmi
0: [clx-orion-113:32338] mca: base: close: component slurm closed
0: [clx-orion-113:32338] mca: base: close: unloading component slurm
0: [clx-orion-113:32338] mca: base: close: component tool closed
0: [clx-orion-113:32338] mca: base: close: unloading component tool
1: [clx-orion-113:32339] mca: base: components_register: registering framework ess components
1: [clx-orion-113:32339] mca: base: components_register: found loaded component singleton
1: [clx-orion-113:32339] mca: base: components_register: component singleton register function successful
1: [clx-orion-113:32339] mca: base: components_register: found loaded component env
1: [clx-orion-113:32339] mca: base: components_register: component env has no register or open function
1: [clx-orion-113:32339] mca: base: components_register: found loaded component pmi
1: [clx-orion-113:32339] mca: base: components_register: component pmi has no register or open function
1: [clx-orion-113:32339] mca: base: components_register: found loaded component slurm
1: [clx-orion-113:32339] mca: base: components_register: component slurm has no register or open function
1: [clx-orion-113:32339] mca: base: components_register: found loaded component tool
1: [clx-orion-113:32339] mca: base: components_register: component tool has no register or open function
1: [clx-orion-113:32339] mca: base: components_register: found loaded component hnp
1: [clx-orion-113:32339] mca: base: components_register: component hnp has no register or open function
1: [clx-orion-113:32339] mca: base: components_open: opening ess components
1: [clx-orion-113:32339] mca: base: components_open: found loaded component singleton
1: [clx-orion-113:32339] mca: base: components_open: component singleton open function successful
1: [clx-orion-113:32339] mca: base: components_open: found loaded component env
1: [clx-orion-113:32339] mca: base: components_open: component env open function successful
1: [clx-orion-113:32339] mca: base: components_open: found loaded component pmi
1: [clx-orion-113:32339] mca: base: components_open: component pmi open function successful
1: [clx-orion-113:32339] mca: base: components_open: found loaded component slurm
1: [clx-orion-113:32339] mca: base: components_open: component slurm open function successful
1: [clx-orion-113:32339] mca: base: components_open: found loaded component tool
1: [clx-orion-113:32339] mca: base: components_open: component tool open function successful
1: [clx-orion-113:32339] mca: base: components_open: found loaded component hnp
1: [clx-orion-113:32339] mca: base: components_open: component hnp open function successful
1: [clx-orion-113:32339] mca:base:select: Auto-selecting ess components
1: [clx-orion-113:32339] mca:base:select:(  ess) Querying component [singleton]
1: [clx-orion-113:32339] mca:base:select:(  ess) Querying component [env]
1: [clx-orion-113:32339] mca:base:select:(  ess) Querying component [pmi]
1: [clx-orion-113:32339] mca:base:select:(  ess) Querying component [slurm]
1: [clx-orion-113:32339] mca:base:select:(  ess) Querying component [tool]
1: [clx-orion-113:32339] mca:base:select:(  ess) Querying component [hnp]
1: [clx-orion-113:32339] mca:base:select:(  ess) Query of component [hnp] set priority to 100
1: [clx-orion-113:32339] mca:base:select:(  ess) Selected component [hnp]
1: [clx-orion-113:32339] mca: base: close: component singleton closed
1: [clx-orion-113:32339] mca: base: cl
1: ose: unloading component singleton
1: [clx-orion-113:32339] mca: base: close: component env closed
1: [clx-orion-113:32339] mca: base: close: unloading component env
1: [clx-orion-113:32339] mca: base: close: component pmi closed
1: [clx-orion-113:32339] mca: base: close: unloading component pmi
1: [clx-orion-113:32339] mca: base: close: component slurm closed
1: [clx-orion-113:32339] mca: base: close: unloading component slurm
1: [clx-orion-113:32339] mca: base: close: component tool closed
1: [clx-orion-113:32339] mca: base: close: unloading component tool
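
Since the output of the two tasks is interleaved, the query order per process is easier to read when extracted mechanically. The following is a minimal sketch, assuming `srun -l` style lines exactly as in the log above (`<task>: [<host>:<pid>] mca:base:select:(  ess) Querying component [<name>]`); the `query_order` helper and the regular expression are written for this example and are not part of any Open MPI tool. Results are keyed by task and pid because the log contains more than one process per task number.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [singleton]
QUERY_RE = re.compile(
    r"^(?P<task>\d+): \[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"mca:base:select:\(\s*(?P<framework>\w+)\) Querying component \[(?P<name>\w+)\]"
)


def query_order(lines):
    """Map (task, pid) to the list of component names in the order they were queried."""
    order = defaultdict(list)
    for line in lines:
        m = QUERY_RE.match(line)
        if m:
            order[(int(m["task"]), int(m["pid"]))].append(m["name"])
    return dict(order)


# A few lines copied from the log above, still interleaved.
sample = [
    "0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [singleton]",
    "1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [singleton]",
    "0: [clx-orion-113:32334] mca:base:select:(  ess) Query of component [singleton] set priority to 25",
    "0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [env]",
    "0: [clx-orion-113:32334] mca:base:select:(  ess) Querying component [pmi]",
    "1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [env]",
    "1: [clx-orion-113:32335] mca:base:select:(  ess) Querying component [pmi]",
]

for key, names in sorted(query_order(sample).items()):
    print(key, ",".join(names))
```

Feeding the full log through the same function would list one ordering per process, which makes it straightforward to compare the two clusters.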

@rhc54
Contributor

rhc54 commented Oct 18, 2015

Okay, I dug thru this tonight. I'm afraid your patch isn't complete as problems in pmix component selection continued. So I have created a patch that appears to fully resolve the issue. Please give it a shot.

@rhc54 rhc54 closed this Oct 18, 2015
@ggouaillardet
Contributor

is your patch rhc54/ompi@363f62a, which you did not commit yet?

@rhc54
Contributor

rhc54 commented Oct 18, 2015

yes, that is correct - given the issues with slurm integration, i'd like another pair of eyes on it

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
…onfigury

configury: UCX uses CPPFLAGS (instead of CFLAGS)