-
Notifications
You must be signed in to change notification settings - Fork 911
ompi/dpm: set arch to spawn procs in heterogeneous mode #1292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@ggouaillardet Looks good to me - thanks for tracking it down! 👍 |
i am still wondering whether this is the correct fix ... in a non dynamic program, i made the very mimimal changes in |
@rhc54 could you double check whether this is the right way to fix this. |
@ggouaillardet I don't think we can call ompi_proc_complete_init_single as it's trying to receive the locality instead of just setting it. However, that is the only point of departure between the two places. So I would suggest considering adding a flag to the function to indicate whether the locality should be recovered or set, and then reuse the code so we avoid issues in the future. Make sense? |
… extra parameter to optionally retrieve hostname and locality.
Thanks Siegmar Gross for reporting this issue.
7d37d49
to
9b8f9e8
Compare
@rhc54 i made the changes to can you especially double check |
@ggouaillardet Sorry for delay - this just fell off my radar. Your changes to ompi/dpm involving the locality are incorrect - we must flag the dynamically connected procs as "non-local" so the shared memory components don't attempt to talk to them as we don't support that mode. Other than that, this looks fine. |
Is "we" btl/sm and btl/vader ? |
Yes - neither of the shared memory components support communication across jobs. You are welcome to propose a change to their logic so they don't use "non-local" as their flag, though I'm not sure they have any plans to support cross-job shared memory. |
from for (proc = 0; proc < (int32_t)nprocs; proc++) {
/* check to see if this proc can be reached via shmem (i.e.,
if they're on my local host and in my job) */
if (procs[proc]->proc_name.jobid != my_proc->proc_name.jobid ||
!OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags)) {
peers[proc] = NULL;
continue;
} and from for (int32_t proc = 0, local_rank = 0 ; proc < (int32_t) nprocs ; ++proc) {
/* check to see if this proc can be reached via shmem (i.e.,
if they're on my local host and in my job) */
if (procs[proc]->proc_name.jobid != my_proc->proc_name.jobid ||
!OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags)) {
peers[proc] = NULL;
continue;
} neither did i miss something ? if not, i will simplify the commit, and optionally retrieve locality since that cannot hurt. |
lol - ok, they already addressed this issue. Sure, go ahead! |
this PR is now replaced by #1385 |
…_exp_atomic_cap configury + btl/openib: fix a typo
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Thanks Siegmar Gross for reporting this issue.