
Conversation

rhc54 commented May 12, 2017

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

rhc54 commented May 12, 2017

Refs #3525

rhc54 commented May 20, 2017

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:43    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 01:16    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:02    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:07    | 8    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 11:03    | 505  |      | 1        |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 00:09    | 6    | 1    |          | 1    | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 10:05    | 1    |      |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | intel           | 4.0.0a1     | 19:48    | 468  | 4    | 2        | 4    | Test_Run-intel-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 4.0.0a1     | 13:26    | 425  | 6    |          | 47   | Test_Run-intel_skip-my_installation-4.0.0a1-my_installation.html         |
| Test Run    | java            | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:42    | 19   |      |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+


    Total Tests:    1453
    Total Failures: 14
    Total Passed:   1439
    Total Duration: 3443 secs. (57:23)

rhc54 commented May 20, 2017

This change fixed the dynamics tests (loop_spawn and no-disconnect), but I'm not entirely satisfied with how it works: mpirun now takes several seconds to compute all the proc locations before sending the launch message when running at exascale sizes. I can improve that somewhat by deferring the assignment of hwloc locales in mpirun to the backend, the same as is done for the compute-node daemons. That would eliminate the hwloc tree traversal that is the primary source of the delay, but it needs more thought on the implementation for the mappers that operate per-hwloc-object. I propose to defer this optimization until after v3.0.0.
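To make that cost concrete, here is a minimal standalone sketch (not Open MPI code; it only assumes a working hwloc installation and a hypothetical round-robin placement) of the per-proc topology lookups mpirun performs when it precomputes every proc's locale - repeated for every proc in the job, this is the work that adds up to the delay at exascale sizes.

/* sketch.c - illustration only, not OMPI code; build with: gcc sketch.c -lhwloc */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    if (ncores <= 0) {
        fprintf(stderr, "no cores found in topology\n");
        return 1;
    }

    int nprocs = 8;   /* stand-in for the job size - huge at exascale */
    for (int rank = 0; rank < nprocs; rank++) {
        /* one topology lookup per proc: this is what mpirun would repeat
         * for every proc when it computes all locales up front */
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                                 rank % ncores);
        char *cpuset;
        hwloc_bitmap_asprintf(&cpuset, core->cpuset);
        printf("rank %d -> core %u (cpuset %s)\n", rank, core->os_index, cpuset);
        free(cpuset);
    }

    hwloc_topology_destroy(topo);
    return 0;
}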

The mindist mapper change has been implemented, but is untested. Assigning that to @artpol84 as I have no way of testing it.

The seq and rank_file mappers are a problem as they read input files that may not be available on the backend. I've decided that the best way forward there is to have mpirun simply generate the full location-aware launch message for these mappers, under the assumption that they are not usually used at scale. Alternatively, we could package the input file in the launch message, or use ORTE's "preload" capability to push the input file to the orted's session directory. Someone is welcome to tackle those approaches if they have the interest. These two mappers still need to be updated - at this point, they will error out.

rhc54 changed the title from "Add debug verbosity to the orte data server and pmix pub/lookup functions" to "Update the distributed mapping system to maintain coherence" on May 20, 2017
rhc54 commented May 20, 2017

I'll look at the spawn_multiple problem, but perhaps someone could look at the rest of these failures - they have nothing to do with this PR so far as I can tell:

https://mtt.open-mpi.org/index.php?do_redir=2443

and these timeouts:

https://mtt.open-mpi.org/index.php?do_redir=2444

ggouaillardet commented:

@rhc54 no-disconnect fails for me, even with -np 1 and on a single host

$ mpirun -np 1 --oversubscribe --bind-to none ./no-disconnect
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
level = 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 0 (pid:56331)
level = 1
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 1 (pid:56334)
level = 2
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation. This can happen if you request a map type
(e.g., loadbalance) and the corresponding mapper was not built.

  Mapper result:    mapped
  #procs mapped:    1
  #nodes assigned:  0

--------------------------------------------------------------------------
[motomachi:56337] *** An error occurred in MPI_Comm_spawn
[motomachi:56337] *** reported by process [140556137332739,0]
[motomachi:56337] *** on communicator MPI_COMM_SELF
[motomachi:56337] *** MPI_ERR_SPAWN: could not spawn processes
[motomachi:56337] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[motomachi:56337] ***    and potentially your MPI job)

it seems mapping fails when the spawner is not vpid 0 from jobid 1

ggouaillardet commented:

@rhc54 that looks like a race condition

i applied this patch to add some debug info

diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index 8e4da04..bd94994 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -492,6 +492,9 @@ int orte_odls_base_default_construct_child_list(opal_buffer_t *buffer,
    /* reset any node map flags we used so the next job will start clean */
     for (n=0; n < jdata->map->nodes->size; n++) {
         if (NULL != (node = (orte_node_t*)opal_pointer_array_get_item(jdata->map->nodes, n))) {
+            opal_output_verbose(1, orte_odls_base_framework.framework_output,
+                                "%s job %s node %s %s will be unmapped",
+                                __func__, ORTE_JOBID_PRINT(jdata->jobid), node->name);
             ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
         }
     }
diff --git a/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c b/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
index c0b08e2..f2d82e2 100644
--- a/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -546,6 +546,11 @@ int orte_rmaps_rr_byobj(orte_job_t *jdata,
                 }
             }
             /* add this node to the map, if reqd */
+            opal_output_verbose(1, orte_rmaps_base_framework.framework_output,
+                                "mca:rmaps:rr jobid %s node %s %s mapped",
+                                ORTE_JOBID_PRINT(jdata->jobid),
+                                node->name,
+                                ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_MAPPED)?"is ":"is not");
             if (!ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_MAPPED)) {
                 if (ORTE_SUCCESS > (idx = opal_pointer_array_add(jdata->map->nodes, (void*)node))) {
                     ORTE_ERROR_LOG(idx);

here is the output

$ mpirun --mca rmaps_base_verbose 1 --mca odls_base_verbose 1 -np 1 ./no-disconnect
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,1]
[n0:21975] mca:rmaps:rr jobid [20338,1] node n0 is not mapped
[n0:21975] orte_odls_base_default_construct_child_list job [20338,1] node n0 [20338,1] will be unmapped
[n0:21975] [[20338,0],0] odls:dispatch [[20338,1],0] to thread 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
level = 0
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,2]
[n0:21975] mca:rmaps:rr jobid [20338,2] node n0 is not mapped
[n0:21975] orte_odls_base_default_construct_child_list job [20338,2] node n0 [20338,2] will be unmapped
[n0:21975] [[20338,0],0] odls:dispatch [[20338,2],0] to thread 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 0 (pid:21980)
level = 1
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,3]
[n0:21975] mca:rmaps:rr jobid [20338,3] node n0 is not mapped
[n0:21975] orte_odls_base_default_construct_child_list job [20338,3] node n0 0 will be unmapped
[n0:21975] [[20338,0],0] odls:dispatch [[20338,3],0] to thread 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 1 (pid:21983)
level = 2
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,4]
[n0:21975] mca:rmaps:rr jobid [20338,4] node n0 is not mapped
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,5]
[n0:21975] mca:rmaps:rr jobid [20338,5] node n0 is  mapped
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation. This can happen if you request a map type
(e.g., loadbalance) and the corresponding mapper was not built.

  Mapper result:    mapped
  #procs mapped:    1
  #nodes assigned:  0

--------------------------------------------------------------------------
[n0:21975] orte_odls_base_default_construct_child_list job [20338,4] node n0 [[20338,4],0] will be unmapped
[n0:21986] *** An error occurred in MPI_Comm_spawn
[n0:21986] *** reported by process [139914187505667,0]
[n0:21986] *** on communicator MPI_COMM_SELF
[n0:21986] *** MPI_ERR_SPAWN: could not spawn processes
[n0:21986] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n0:21986] ***    and potentially your MPI job)

it looks like the race condition involves the ORTE_NODE_FLAG_MAPPED flag on a node:
it is tested and set in rmaps, and unset in odls.
the race occurs if rmaps is invoked twice in a row (which can happen when MPI_Comm_spawn() is invoked by several tasks in parallel). the second time, the flag is already set, and that leads to the error.
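For illustration, a self-contained sketch of the pattern just described (the type and function names are hypothetical stand-ins, not the actual ORTE symbols): the mapper adds a node only when the MAPPED flag is clear and then sets it, while the flag is cleared again in a later, separate event. When two mapping events run back to back, the second one finds the flag still set, assigns zero nodes, and the job fails to map.

/* race_sketch.c - illustration only; hypothetical stand-ins for orte_node_t
 * and ORTE_NODE_FLAG_MAPPED */
#include <stdbool.h>
#include <stdio.h>

typedef struct { const char *name; bool mapped; } node_t;

/* mapper: add the node only if it is not already flagged as mapped */
static int map_job(node_t *node)
{
    if (!node->mapped) {
        node->mapped = true;
        return 1;            /* one node assigned */
    }
    return 0;                /* 0 nodes assigned -> "job failed to map" */
}

/* launch/odls path: clears the flag, but only when its own event runs */
static void launch_cleanup(node_t *node) { node->mapped = false; }

int main(void)
{
    node_t n0 = { "n0", false };

    /* serialized spawns: map, clean up, map again - both succeed */
    printf("serialized:  %d node(s)\n", map_job(&n0));
    launch_cleanup(&n0);
    printf("serialized:  %d node(s)\n", map_job(&n0));
    launch_cleanup(&n0);

    /* concurrent spawns: two mapping events fire before any cleanup,
     * so the second one sees the stale flag and maps nothing */
    printf("concurrent:  %d node(s)\n", map_job(&n0));
    printf("concurrent:  %d node(s)\n", map_job(&n0));
    return 0;
}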

ggouaillardet commented:

while trying to figure out a fix, i noticed this

diff --git a/orte/mca/state/base/state_base_fns.c b/orte/mca/state/base/state_base_fns.c
index 69cfa89..c165f2c 100644
--- a/orte/mca/state/base/state_base_fns.c
+++ b/orte/mca/state/base/state_base_fns.c
@@ -1,6 +1,8 @@
 /*
  * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
  * Copyright (c) 2014-2017 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2017      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -897,10 +899,10 @@ void orte_state_base_check_all_complete(int fd, short args, void *cbdata)
             }
             /* set the node location to NULL */
             opal_pointer_array_set_item(map->nodes, index, NULL);
-            /* maintain accounting */
-            OBJ_RELEASE(node);
             /* flag that the node is no longer in a map */
             ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
+            /* maintain accounting */
+            OBJ_RELEASE(node);
         }
         OBJ_RELEASE(map);
         jdata->map = NULL;

rhc54 commented May 24, 2017

@ggouaillardet I pushed a fix - see what you think. I also fixed the issue you identified above, but did it a little differently: there is no need to unset the mapped flag at that point - indeed, unsetting it there creates another race condition.
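For illustration, a minimal sketch of that approach (hypothetical stand-ins again, not the actual ORTE code): the flag is used only as bookkeeping inside a single mapping event and is cleared before the event returns, so a later mapping event - even one triggered by a concurrent MPI_Comm_spawn - always starts with a clean flag.

/* fix_sketch.c - illustration only; hypothetical stand-ins for orte_node_t
 * and ORTE_NODE_FLAG_MAPPED */
#include <stdbool.h>
#include <stdio.h>

typedef struct { const char *name; bool mapped; } node_t;

static int map_job(node_t **nodes, int nnodes)
{
    int assigned = 0;
    for (int i = 0; i < nnodes; i++) {
        if (!nodes[i]->mapped) {       /* avoid adding the same node twice */
            nodes[i]->mapped = true;
            assigned++;
        }
    }
    /* reset within the same event: nothing after this point needs the flag,
     * so the next mapping event never sees stale state */
    for (int i = 0; i < nnodes; i++) {
        nodes[i]->mapped = false;
    }
    return assigned;
}

int main(void)
{
    node_t n0 = { "n0", false };
    node_t *nodes[] = { &n0 };

    /* two back-to-back mapping events (e.g. concurrent comm_spawns) */
    printf("first:  %d node(s) assigned\n", map_job(nodes, 1));
    printf("second: %d node(s) assigned\n", map_job(nodes, 1));
    return 0;
}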

rhc54 commented May 25, 2017

I have this working now, so far as I can tell. I have updated the seq, rankfile, and mindist mappers to preserve their functionality by setting them to work in the "old" mode where mpirun computes everything and sends it to the backend daemons. Thus, they will scale poorly compared to the other mappers, but at least will still function until someone who cares can update them.

ggouaillardet commented:

@rhc54 that indeed fixed no-disconnect!
i found an issue with loop_spawn:

mpirun --host n0:1,n1:2 -np 1 ./loop_spawn

mpirun crashes and here is where and why

(gdb) bt
#0  0x00007f73ef3bd192 in orte_util_nidmap_generate_ppn (jdata=0xf21130, ppn=0x7fffd97a3f70) at ../../../src/ompi-master/orte/util/nidmap.c:1234
#1  0x00007f73ef4044d3 in orte_odls_base_default_get_add_procs_data (buffer=0xf22a60, job=3594387463) at ../../../../../src/ompi-master/orte/mca/odls/base/odls_base_default_fns.c:251
#2  0x00007f73ef4142e6 in orte_plm_base_launch_apps (fd=-1, args=4, cbdata=0xdb36f0) at ../../../../../src/ompi-master/orte/mca/plm/base/plm_base_launch_support.c:523
#3  0x00007f73ef0c05b0 in event_process_active_single_queue (base=0xd7bf50, activeq=0xd7c4d0) at ../../../../../../../src/ompi-master/opal/mca/event/libevent2022/libevent/event.c:1370
#4  0x00007f73ef0c0828 in event_process_active (base=0xd7bf50) at ../../../../../../../src/ompi-master/opal/mca/event/libevent2022/libevent/event.c:1440
#5  0x00007f73ef0c0e7a in opal_libevent2022_event_base_loop (base=0xd7bf50, flags=1) at ../../../../../../../src/ompi-master/opal/mca/event/libevent2022/libevent/event.c:1644
#6  0x00000000004015aa in orterun (argc=6, argv=0x7fffd97a4408) at ../../../../../src/ompi-master/orte/tools/orterun/orterun.c:199
#7  0x0000000000400f74 in main (argc=6, argv=0x7fffd97a4408) at ../../../../../src/ompi-master/orte/tools/orterun/main.c:13
(gdb) p nptr->procs
$1 = (opal_pointer_array_t *) 0x0
(gdb) whatis nptr
type = orte_node_t *
(gdb) p *nptr->super.super.obj_class
$2 = {cls_name = 0x7f73ef4482e1 "(void *)0) != ((opal_object_t *) (map->nodes))->obj_class", cls_parent = 0x7f73ef365a60, cls_construct = 0x7f73ef38f030 <orte_proc_construct>, 
  cls_destruct = 0x7f73ef38f168 <orte_proc_destruct>, cls_initialized = 1, cls_depth = 3, cls_construct_array = 0xdb3b10, cls_destruct_array = 0xdb3b28, cls_sizeof = 256}

note that nptr is an orte_node_t *, but it really points to an orte_proc_t object (!)

the patch below can be used as a very temporary workaround

diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index b9003c9..1d17e1c 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -503,7 +503,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
                                      ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                                      node->name, node->slots, node->slots_inuse));
                 opal_list_remove_item(allocated_nodes, item);
-                OBJ_RELEASE(item);  /* "un-retain" it */
+                // OBJ_RELEASE(item);  /* "un-retain" it */
                 item = next;
                 continue;
             }

rhc54 commented May 25, 2017

I'm at a loss on how to interpret that report combined with the patch. The two code areas appear to be completely unrelated. If a mapper is truly putting an orte_proc_t on the node array instead of an orte_node_t, then I would expect it to always fail. Yet it appears to be running correctly for me.

Can you tell me a little more about this failure? Is it on the very first comm_spawn, or a later one?

ggouaillardet commented:

my interpretation is that the orte_node_t is not correctly retained, so it ends up being freed.
at some point, an orte_proc_t is allocated at the same address as the orte_node_t that should never have been freed.
bottom line: no one ever puts an orte_proc_t in the node array. the node array contains a dangling pointer, and that pointer now happens to point to an orte_proc_t.
makes sense?

iirc the crash occurs around the spawn of the 5th child
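For illustration, a self-contained sketch of that failure mode (plain malloc/free with hypothetical types, not the OPAL object system): once the under-retained node is freed, the allocator is free to hand the same memory to the next allocation, so a stale node pointer can end up aliasing a freshly created proc object - which matches the gdb output showing an orte_node_t pointer whose class is orte_proc_t.

/* uaf_sketch.c - illustration only; hypothetical stand-ins for
 * orte_node_t and orte_proc_t */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char kind[16]; int slots; } node_t;
typedef struct { char kind[16]; int rank;  } proc_t;

int main(void)
{
    node_t *node = malloc(sizeof *node);
    strcpy(node->kind, "node");

    node_t *stale = node;      /* e.g. the pointer still held in the job map */

    /* the missing retain means the only reference gets dropped ... */
    free(node);

    /* ... and the allocator may reuse the same memory for the next object */
    proc_t *proc = malloc(sizeof *proc);
    strcpy(proc->kind, "proc");

    if ((void *)proc == (void *)stale) {
        /* 'stale' is typed as a node but now holds a proc */
        printf("stale node pointer aliases a proc: kind=%s\n", stale->kind);
    } else {
        printf("allocator did not reuse the address on this run\n");
    }
    free(proc);
    return 0;
}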

rhc54 commented May 25, 2017

Thanks - that helps a great deal. I'll check it out.

rhc54 commented May 25, 2017

I'm afraid I'm batting zero here - I can't replicate it, valgrind isn't flagging it, and I can't find the problem by inspecting the code. Can you perhaps do a little more detective work for me?

rhc54 commented May 25, 2017

@ggouaillardet Please give this updated version a try. I found a bug that impacted the case where the HNP is not in the allocation - I can't replicate that here, but it might be the situation you are in?

artpol84 commented:

@karasevb please do a runtime check to verify that mindist is working.

ggouaillardet commented:

@rhc54 i can still reproduce the issue with the latest updates.
do you use the very same command line?

n0$ mpirun -np 1 --host n0:1,n1:2 ./loop_spawn

the same error occurs whether i use the loop_spawn from orte/test/mpi or the one from dynamic in the ibm test suite

here is another possible fix

diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index 5633789..cf8b9b7 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -351,6 +351,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
         /* the list is empty - if the HNP is allocated, then add it */
         if (orte_hnp_is_allocated) {
             nd = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, 0);
+            OBJ_RETAIN(nd);
             opal_list_append(allocated_nodes, &nd->super);
         } else {
             nd = NULL;

i just noticed that the issue only occurs with that exact command line.
for example

n0$ mpirun -np 1 --host n0:2,n1:1 ./loop_spawn

or

n0$ mpirun -np 1 --host n2:1,n1:2 ./loop_spawn

both work just fine
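For what it's worth, the reasoning behind that one-line retain, as a self-contained sketch (hypothetical refcount helpers, not the actual OBJ_RETAIN/OBJ_RELEASE macros): every container that stores a pointer to a refcounted object should hold its own reference, so that the later "un-retain" when the node is pruned from the allocated_nodes list releases the list's reference rather than the node pool's.

/* retain_sketch.c - illustration only; hypothetical refcount helpers */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int refcount; const char *name; } node_t;

static node_t *node_new(const char *name)
{
    node_t *n = malloc(sizeof *n);
    n->refcount = 1;                 /* reference held by the node pool */
    n->name = name;
    return n;
}
static void node_retain(node_t *n) { n->refcount++; }
static void node_release(node_t *n)
{
    if (--n->refcount == 0) { printf("%s freed\n", n->name); free(n); return; }
    printf("%s refcount now %d\n", n->name, n->refcount);
}

int main(void)
{
    node_t *nd = node_new("n0");

    /* adding the node to a second container (the allocated_nodes list):
     * without this retain the list would merely borrow the pool's reference */
    node_retain(nd);

    /* later the node is pruned from the list and "un-retained";
     * with the retain above this release is balanced ... */
    node_release(nd);

    /* ... and the pool's reference keeps the node alive until teardown */
    node_release(nd);
    return 0;
}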

rhc54 commented May 26, 2017

Yes, I replicated your cmd line exactly. I'll try again, and also try with your change (which looks correct to me).

rhc54 commented May 26, 2017

Okay, I found the difference. You must not have any hostfile (e.g., a default one) at all. That sends you down a different code path that hits the line you flagged in your fix. If I delete my default hostfile envar, then I can replicate this failure.

It's always the little differences we never think to mention that get us 😄

I confirm your patch fixes it - will commit

Add debug verbosity to the orte data server and pmix pub/lookup functions

Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).

Remove unneeded test

Fix memory corruption by re-initializing variable to NULL in loop

Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.

Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.

Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.
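As a rough sketch of the control flow this attribute enables (hypothetical names and data structures standing in for the real ORTE job object and attribute API, which are not reproduced here):

/* fully_described_sketch.c - illustration only; hypothetical model of the
 * ORTE_JOB_FULLY_DESCRIBED behavior, not the actual ORTE API */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *mapper;       /* e.g. "seq", "rankfile", "mindist", "round_robin" */
    bool fully_described;     /* models the ORTE_JOB_FULLY_DESCRIBED attribute */
} job_t;

/* mpirun side: mappers whose input files are unavailable on the backend
 * compute everything up front and mark the job fully described */
static void finish_mapping(job_t *job)
{
    if (0 == strcmp(job->mapper, "seq") ||
        0 == strcmp(job->mapper, "rankfile") ||
        0 == strcmp(job->mapper, "mindist")) {
        job->fully_described = true;   /* locations + bindings ride in the launch msg */
    }
}

/* daemon side: only recompute the map when the launch message is sparse */
static void construct_child_list(const job_t *job)
{
    if (job->fully_described) {
        printf("%s: take proc locations from the launch message\n", job->mapper);
    } else {
        printf("%s: compute proc locations locally\n", job->mapper);
    }
}

int main(void)
{
    job_t seq = { "seq", false };
    job_t rr  = { "round_robin", false };
    finish_mapping(&seq);  construct_child_list(&seq);
    finish_mapping(&rr);   construct_child_list(&rr);
    return 0;
}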

Have the mindist module add procs to the job's proc array as it is a fully described module

Protect the hnp-not-in-allocation case

Per patch suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
rhc54 commented May 26, 2017

@karasevb FWIW: the Mellanox Jenkins tests the mindist mapper, and it passes that test. The mapper won't scale as well as the others until someone updates it to add backend support.

This patch also fixes --novm operations 😸

rhc54 merged commit 10b103a into open-mpi:master on May 26, 2017
rhc54 deleted the topic/nodis branch on May 26, 2017 at 04:09