Skip to content

ompi/dpm: set arch to spawn procs in heterogeneous mode #1292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

ggouaillardet
Copy link
Contributor

Thanks Siegmar Gross for reporting this issue.

@rhc54
Copy link
Contributor

rhc54 commented Jan 8, 2016

@ggouaillardet Looks good to me - thanks for tracking it down!

👍

@ggouaillardet
Copy link
Contributor Author

i am still wondering whether this is the correct fix ...

in a non dynamic program, proc_arch is set in ompi_proc_complete_init_single which is invoked by ompi_proc_complete_init in ompi_mpi_init
among other things, ompi_proc_complete_init_single sets the proc_arch

i made the very mimimal changes in ompi_dpm_connect_accept in order to fix the issue, but i am wondering if that is really enough. for example, do we still miss something and should we invoke ompi_proc_complete_init_single instead ?

@hppritcha
Copy link
Member

@rhc54 could you double check whether this is the right way to fix this.

@hppritcha hppritcha added this to the v2.0.0 milestone Jan 12, 2016
@rhc54
Copy link
Contributor

rhc54 commented Jan 26, 2016

@ggouaillardet I don't think we can call ompi_proc_complete_init_single as it's trying to receive the locality instead of just setting it. However, that is the only point of departure between the two places. So I would suggest considering adding a flag to the function to indicate whether the locality should be recovered or set, and then reuse the code so we avoid issues in the future.

Make sense?

… extra parameter

to optionally retrieve hostname and locality.
Thanks Siegmar Gross for reporting this issue.
@ggouaillardet
Copy link
Contributor Author

@rhc54 i made the changes to ompi_proc_complete_init_single

can you especially double check ompi/dpm/dpm.c ?
i #ifdef'ed out some code since i have no idea whether i should do something with the locality or not

@rhc54
Copy link
Contributor

rhc54 commented Feb 13, 2016

@ggouaillardet Sorry for delay - this just fell off my radar. Your changes to ompi/dpm involving the locality are incorrect - we must flag the dynamically connected procs as "non-local" so the shared memory components don't attempt to talk to them as we don't support that mode. Other than that, this looks fine.

@ggouaillardet
Copy link
Contributor Author

Is "we" btl/sm and btl/vader ?
If yes, should dpm (ompi) let the btl (opal) figure out what to do if they do not support that ?

@rhc54
Copy link
Contributor

rhc54 commented Feb 13, 2016

Yes - neither of the shared memory components support communication across jobs. You are welcome to propose a change to their logic so they don't use "non-local" as their flag, though I'm not sure they have any plans to support cross-job shared memory.

@ggouaillardet
Copy link
Contributor Author

@rhc54

from mca_btl_sm_add_procs

    for (proc = 0; proc < (int32_t)nprocs; proc++) {
        /* check to see if this proc can be reached via shmem (i.e.,
           if they're on my local host and in my job) */
        if (procs[proc]->proc_name.jobid != my_proc->proc_name.jobid ||
            !OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags)) {
            peers[proc] = NULL;
            continue;
        }

and from vader_add_procs

    for (int32_t proc = 0, local_rank = 0 ; proc < (int32_t) nprocs ; ++proc) {
        /* check to see if this proc can be reached via shmem (i.e.,
           if they're on my local host and in my job) */
        if (procs[proc]->proc_name.jobid != my_proc->proc_name.jobid ||
            !OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags)) {
            peers[proc] = NULL;
            continue;
        }

neither sm nor vader can be used between two different jobs, and regardless their proc_flags

did i miss something ?

if not, i will simplify the commit, and optionally retrieve locality since that cannot hurt.

@rhc54
Copy link
Contributor

rhc54 commented Feb 18, 2016

lol - ok, they already addressed this issue. Sure, go ahead!

@ggouaillardet
Copy link
Contributor Author

this PR is now replaced by #1385

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
…_exp_atomic_cap

configury + btl/openib: fix a typo
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 25, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 26, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 28, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 29, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 30, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 30, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 30, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 31, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants