orte/ess: Fix issue in pmi:pmix1xx initialization #1022
A double pmix1xx initialization is attempted when the ess selection order is singleton,pmi. It does not happen when the order is pmi,singleton.
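One common way to make a repeated initialization harmless is a reference-counted init/finalize pair. The sketch below only illustrates that idea: `demo_pmix_init`, `demo_pmix_finalize`, `demo_pmix_initialized` and the counters are hypothetical names, not the Open MPI or PMIx API, and this is not necessarily how the fix in this PR works.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PMIx client layer; NOT the Open MPI API. */
static int init_refcount = 0;    /* outstanding init calls */
static int real_init_calls = 0;  /* how often the real setup actually ran */

/* Run the real setup on the first call only; later calls just take a reference. */
static int demo_pmix_init(void)
{
    if (init_refcount++ > 0) {
        return 0;  /* already initialized, nothing else to do */
    }
    real_init_calls++;  /* the one-time setup would go here */
    return 0;
}

/* Report whether at least one caller still holds a reference. */
static bool demo_pmix_initialized(void)
{
    return init_refcount > 0;
}

/* Drop one reference; the real teardown would run when the count reaches zero. */
static int demo_pmix_finalize(void)
{
    if (init_refcount == 0) {
        return -1;  /* finalize without a matching init */
    }
    --init_refcount;
    return 0;
}
```

With such a guard, two ess components that both call the init routine during selection trigger the underlying setup only once.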
|
I'm puzzled - how did you wind up with the "bad" order? It shouldn't be possible, I believe. |
|
@rhc54 I do not have a reasonable answer. I investigated this case without success. I tried setting the order in the OMPI_MCA_orte_ess variable to env,pmi,etc., but it does not seem to work (and probably is not supposed to). So I do not understand how the order is selected, but I do see this order in a real launch on a specific cluster. I checked this on two different clusters using the same master code point, built with the same configure options, and saw two different orderings. |
|
@rhc54 we previously ran into a potentially similar issue. Components are loaded in the order the OS enumerates them, and this is site dependent. I suggested we could sort the components by name to get something 100% reproducible, and @jsquyres pointed out the risk that similar bugs might then remain undetected. @igor-ivanov could you please use the … btw, :bot:xxx commands are only valid in the ompi-release repository |
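The "sort the components by name" suggestion can be sketched as follows. This is a minimal illustration under stated assumptions: `sort_component_names` and `compare_names` are hypothetical helpers, and the real component loader works on component structures rather than a plain array of strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int compare_names(const void *a, const void *b)
{
    const char *const *lhs = a;
    const char *const *rhs = b;
    return strcmp(*lhs, *rhs);
}

/* Sort component names in place so that the query order no longer
 * depends on the order in which the OS enumerates the plugin files. */
static void sort_component_names(const char **names, size_t count)
{
    qsort(names, count, sizeof(names[0]), compare_names);
}
```

For example, sorting `{"singleton", "env", "pmi", "hnp"}` yields `env, hnp, pmi, singleton` on every site. The trade-off raised above still applies: a fixed order makes runs reproducible, but it can also hide bugs that only appear under a different order.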
|
I wouldn't worry about moving the singleton component around - we understand the issue. The concern I have is whether changing the priority will cause singletons to fail. @igor-ivanov Could you please get a slurm allocation, and then run: … In other words, try running a singleton without srun and let's see if your change still allows the ess/singleton component to be selected. |
|
@rhc54 I see the same behaviour in your case: before the suggested change it segfaults; with the fix applied the launch is ok. @ggouaillardet Initially I ran on a single node. Issue case (no fix): |
|
Okay, I dug thru this tonight. I'm afraid your patch isn't complete as problems in pmix component selection continued. So I have created a patch that appears to fully resolve the issue. Please give it a shot. |
|
Is your patch rhc54/ompi@363f62a, which you have not committed yet? |
|
yes, that is correct - given the issues with slurm integration, i'd like another pair of eyes on it |
A double pmix1xx initialization is attempted
when the ess selection order is singleton,pmi.
It does not happen when the order is pmi,singleton.
@rhc54, @jladd-mlnx could you look at
It was observed with a simple hello application launched using srun:
env PMIX_VERBOSE=100 OMPI_MCA_opal_pmix_base_verbose=100 OMPI_MCA_orte_ess_base_verbose=100 srun -n2 ./hello.out
and the selection log looks as follows:
mca:base:select:( ess) Querying component [singleton]
mca:base:select:( ess) Query of component [singleton] set priority to 25
mca:base:select:( ess) Querying component [env]
mca:base:select:( ess) Querying component [pmi]
This order is ok:
mca:base:select:( ess) Querying component [env]
mca:base:select:( ess) Querying component [hnp]
mca:base:select:( ess) Querying component [pmi]
mca:base:select:( ess) Query of component [singleton] set priority to 25
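The logs above show each ess component being queried in turn, with singleton reporting priority 25. The sketch below illustrates a priority-based selection of that general shape. It is a simplified model, not the actual MCA selection code: `demo_component_t`, `demo_select`, the query functions, and the priority 35 given to pmi are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component descriptor; NOT the real MCA structure. */
typedef struct {
    const char *name;
    int (*query)(int *priority);  /* returns 0 if the component can run */
} demo_component_t;

static int query_count = 0;  /* total number of query calls made */

/* Priority 25 matches the log above; 35 for pmi is a made-up value. */
static int query_singleton(int *priority) { query_count++; *priority = 25; return 0; }
static int query_pmi(int *priority)       { query_count++; *priority = 35; return 0; }
/* This component declines to run in the modeled environment. */
static int query_env(int *priority)       { query_count++; (void)priority; return -1; }

/* Query every component and return the runnable one with the highest priority. */
static const demo_component_t *demo_select(const demo_component_t *components,
                                           size_t count)
{
    const demo_component_t *best = NULL;
    int best_priority = -1;
    for (size_t i = 0; i < count; i++) {
        int priority = -1;
        if (components[i].query(&priority) != 0) {
            continue;  /* component declined */
        }
        if (priority > best_priority) {
            best = &components[i];
            best_priority = priority;
        }
    }
    return best;
}
```

In this model the winner does not depend on the order of the array, but every query function still runs, in enumeration order. Any side effect inside a query function (such as initializing a shared library) is therefore order dependent even though the final selection is not.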
:bot:assign: @rhc54
:bot:label:bug