
Conversation

rhc54 commented May 12, 2017

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

rhc54 commented May 12, 2017

Refs #3525

rhc54 commented May 20, 2017

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:43    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 01:16    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:02    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:07    | 8    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 11:03    | 505  |      | 1        |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 00:09    | 6    | 1    |          | 1    | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 10:05    | 1    |      |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | intel           | 4.0.0a1     | 19:48    | 468  | 4    | 2        | 4    | Test_Run-intel-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 4.0.0a1     | 13:26    | 425  | 6    |          | 47   | Test_Run-intel_skip-my_installation-4.0.0a1-my_installation.html         |
| Test Run    | java            | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:42    | 19   |      |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+


    Total Tests:    1453
    Total Failures: 14
    Total Passed:   1439
    Total Duration: 3443 secs. (57:23)

rhc54 commented May 20, 2017

This change fixed the dynamics tests (loop_spawn and no-disconnect), but I'm not entirely satisfied with how it works: mpirun now takes several seconds to compute all the proc locations before sending the launch message when running at exascale sizes. I can improve that somewhat by deferring the assignment of hwloc locales in mpirun to the backend, the same as is done for the compute-node daemons. That would eliminate the hwloc tree traversal that is the primary source of the delay, but it needs more thought on the implementation for the mappers that operate per-hwloc-object. I propose to defer this optimization until after v3.0.0.
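To make that cost concrete, here is a minimal standalone sketch (not Open MPI code; it only assumes a working hwloc installation and a hypothetical round-robin placement) of the per-proc topology lookups mpirun performs when it precomputes every proc's locale - repeated for every proc in the job, this is the work that adds up to the delay at exascale sizes.

/* sketch.c - illustration only, not OMPI code; build with: gcc sketch.c -lhwloc */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    if (ncores <= 0) {
        fprintf(stderr, "no cores found in topology\n");
        return 1;
    }

    int nprocs = 8;   /* stand-in for the job size - huge at exascale */
    for (int rank = 0; rank < nprocs; rank++) {
        /* one topology lookup per proc: this is what mpirun would repeat
         * for every proc when it computes all locales up front */
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                                 rank % ncores);
        char *cpuset;
        hwloc_bitmap_asprintf(&cpuset, core->cpuset);
        printf("rank %d -> core %u (cpuset %s)\n", rank, core->os_index, cpuset);
        free(cpuset);
    }

    hwloc_topology_destroy(topo);
    return 0;
}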

The mindist mapper change has been implemented, but is untested. Assigning that to @artpol84 as I have no way of testing it.

The seq and rank_file mappers are a problem as they read input files that may not be available on the backend. I've decided that the best way forward there is to have mpirun simply generate the full location-aware launch message for these mappers, under the assumption that they are not usually used at scale. Alternatively, we could package the input file in the launch message, or use ORTE's "preload" capability to push the input file to the orted's session directory. Someone is welcome to tackle those approaches if they have the interest. These two mappers still need to be updated - at this point, they will error out.

rhc54 changed the title from "Add debug verbosity to the orte data server and pmix pub/lookup functions" to "Update the distributed mapping system to maintain coherence" on May 20, 2017
rhc54 commented May 20, 2017

I'll look at the spawn_multiple problem, but perhaps someone could look at the rest of these failures - they have nothing to do with this PR so far as I can tell:

https://mtt.open-mpi.org/index.php?do_redir=2443

and these timeouts:

https://mtt.open-mpi.org/index.php?do_redir=2444

ggouaillardet commented:

@rhc54 no-disconnect fails for me, even with -np 1 and on a single host

$ mpirun -np 1 --oversubscribe --bind-to none ./no-disconnect
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
level = 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 0 (pid:56331)
level = 1
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 1 (pid:56334)
level = 2
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation. This can happen if you request a map type
(e.g., loadbalance) and the corresponding mapper was not built.

  Mapper result:    mapped
  #procs mapped:    1
  #nodes assigned:  0

--------------------------------------------------------------------------
[motomachi:56337] *** An error occurred in MPI_Comm_spawn
[motomachi:56337] *** reported by process [140556137332739,0]
[motomachi:56337] *** on communicator MPI_COMM_SELF
[motomachi:56337] *** MPI_ERR_SPAWN: could not spawn processes
[motomachi:56337] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[motomachi:56337] ***    and potentially your MPI job)

it seems mapping fails when the spawner is not vpid 0 from jobid 1

ggouaillardet commented:

@rhc54 that looks like a race condition

i applied this patch to add some debug info

diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index 8e4da04..bd94994 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -492,6 +492,9 @@ int orte_odls_base_default_construct_child_list(opal_buffer_t *buffer,
    /* reset any node map flags we used so the next job will start clean */
     for (n=0; n < jdata->map->nodes->size; n++) {
         if (NULL != (node = (orte_node_t*)opal_pointer_array_get_item(jdata->map->nodes, n))) {
+            opal_output_verbose(1, orte_odls_base_framework.framework_output,
+                                "%s job %s node %s %s will be unmapped",
+                                __func__, ORTE_JOBID_PRINT(jdata->jobid), node->name);
             ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
         }
     }
diff --git a/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c b/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
index c0b08e2..f2d82e2 100644
--- a/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -546,6 +546,11 @@ int orte_rmaps_rr_byobj(orte_job_t *jdata,
                 }
             }
             /* add this node to the map, if reqd */
+            opal_output_verbose(1, orte_rmaps_base_framework.framework_output,
+                                "mca:rmaps:rr jobid %s node %s %s mapped",
+                                ORTE_JOBID_PRINT(jdata->jobid),
+                                node->name,
+                                ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_MAPPED)?"is ":"is not");
             if (!ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_MAPPED)) {
                 if (ORTE_SUCCESS > (idx = opal_pointer_array_add(jdata->map->nodes, (void*)node))) {
                     ORTE_ERROR_LOG(idx);

here is the output

$ mpirun --mca rmaps_base_verbose 1 --mca odls_base_verbose 1 -np 1 ./no-disconnect
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,1]
[n0:21975] mca:rmaps:rr jobid [20338,1] node n0 is not mapped
[n0:21975] orte_odls_base_default_construct_child_list job [20338,1] node n0 [20338,1] will be unmapped
[n0:21975] [[20338,0],0] odls:dispatch [[20338,1],0] to thread 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
level = 0
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,2]
[n0:21975] mca:rmaps:rr jobid [20338,2] node n0 is not mapped
[n0:21975] orte_odls_base_default_construct_child_list job [20338,2] node n0 [20338,2] will be unmapped
[n0:21975] [[20338,0],0] odls:dispatch [[20338,2],0] to thread 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 0 (pid:21980)
level = 1
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,3]
[n0:21975] mca:rmaps:rr jobid [20338,3] node n0 is not mapped
[n0:21975] orte_odls_base_default_construct_child_list job [20338,3] node n0 0 will be unmapped
[n0:21975] [[20338,0],0] odls:dispatch [[20338,3],0] to thread 0
Verify that this test is truly working because conncurrent MPI_Comm_spawns has not worked before.
Parent sent: level 1 (pid:21983)
level = 2
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,4]
[n0:21975] mca:rmaps:rr jobid [20338,4] node n0 is not mapped
[n0:21975] [[20338,0],0] rmaps:seq called on job [20338,5]
[n0:21975] mca:rmaps:rr jobid [20338,5] node n0 is  mapped
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation. This can happen if you request a map type
(e.g., loadbalance) and the corresponding mapper was not built.

  Mapper result:    mapped
  #procs mapped:    1
  #nodes assigned:  0

--------------------------------------------------------------------------
[n0:21975] orte_odls_base_default_construct_child_list job [20338,4] node n0 [[20338,4],0] will be unmapped
[n0:21986] *** An error occurred in MPI_Comm_spawn
[n0:21986] *** reported by process [139914187505667,0]
[n0:21986] *** on communicator MPI_COMM_SELF
[n0:21986] *** MPI_ERR_SPAWN: could not spawn processes
[n0:21986] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n0:21986] ***    and potentially your MPI job)

it looks like the race condition involves the ORTE_NODE_FLAG_MAPPED flag on a node:
it is tested and set in rmaps, and unset in odls.
the race occurs if rmaps is invoked twice in a row (which can happen when MPI_Comm_spawn() is invoked by several tasks in parallel). the second time, the flag is already set, and that leads to the error.
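For illustration, a self-contained sketch of the pattern just described (the type and function names are hypothetical stand-ins, not the actual ORTE symbols): the mapper adds a node only when the MAPPED flag is clear and then sets it, while the flag is cleared again in a later, separate event. When two mapping events run back to back, the second one finds the flag still set, assigns zero nodes, and the job fails to map.

/* race_sketch.c - illustration only; hypothetical stand-ins for orte_node_t
 * and ORTE_NODE_FLAG_MAPPED */
#include <stdbool.h>
#include <stdio.h>

typedef struct { const char *name; bool mapped; } node_t;

/* mapper: add the node only if it is not already flagged as mapped */
static int map_job(node_t *node)
{
    if (!node->mapped) {
        node->mapped = true;
        return 1;            /* one node assigned */
    }
    return 0;                /* 0 nodes assigned -> "job failed to map" */
}

/* launch/odls path: clears the flag, but only when its own event runs */
static void launch_cleanup(node_t *node) { node->mapped = false; }

int main(void)
{
    node_t n0 = { "n0", false };

    /* serialized spawns: map, clean up, map again - both succeed */
    printf("serialized:  %d node(s)\n", map_job(&n0));
    launch_cleanup(&n0);
    printf("serialized:  %d node(s)\n", map_job(&n0));
    launch_cleanup(&n0);

    /* concurrent spawns: two mapping events fire before any cleanup,
     * so the second one sees the stale flag and maps nothing */
    printf("concurrent:  %d node(s)\n", map_job(&n0));
    printf("concurrent:  %d node(s)\n", map_job(&n0));
    return 0;
}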

ggouaillardet commented:

while trying to figure out a fix, i noticed this

diff --git a/orte/mca/state/base/state_base_fns.c b/orte/mca/state/base/state_base_fns.c
index 69cfa89..c165f2c 100644
--- a/orte/mca/state/base/state_base_fns.c
+++ b/orte/mca/state/base/state_base_fns.c
@@ -1,6 +1,8 @@
 /*
  * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
  * Copyright (c) 2014-2017 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2017      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -897,10 +899,10 @@ void orte_state_base_check_all_complete(int fd, short args, void *cbdata)
             }
             /* set the node location to NULL */
             opal_pointer_array_set_item(map->nodes, index, NULL);
-            /* maintain accounting */
-            OBJ_RELEASE(node);
             /* flag that the node is no longer in a map */
             ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
+            /* maintain accounting */
+            OBJ_RELEASE(node);
         }
         OBJ_RELEASE(map);
         jdata->map = NULL;

rhc54 commented May 24, 2017

@ggouaillardet I pushed a fix - see what you think. I also fixed the issue you identified above, but did it a little differently: there is no need to unset the mapped flag at that point - indeed, unsetting it there creates another race condition.
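For illustration, a minimal sketch of that approach (hypothetical stand-ins again, not the actual ORTE code): the flag is used only as bookkeeping inside a single mapping event and is cleared before the event returns, so a later mapping event - even one triggered by a concurrent MPI_Comm_spawn - always starts with a clean flag.

/* fix_sketch.c - illustration only; hypothetical stand-ins for orte_node_t
 * and ORTE_NODE_FLAG_MAPPED */
#include <stdbool.h>
#include <stdio.h>

typedef struct { const char *name; bool mapped; } node_t;

static int map_job(node_t **nodes, int nnodes)
{
    int assigned = 0;
    for (int i = 0; i < nnodes; i++) {
        if (!nodes[i]->mapped) {       /* avoid adding the same node twice */
            nodes[i]->mapped = true;
            assigned++;
        }
    }
    /* reset within the same event: nothing after this point needs the flag,
     * so the next mapping event never sees stale state */
    for (int i = 0; i < nnodes; i++) {
        nodes[i]->mapped = false;
    }
    return assigned;
}

int main(void)
{
    node_t n0 = { "n0", false };
    node_t *nodes[] = { &n0 };

    /* two back-to-back mapping events (e.g. concurrent comm_spawns) */
    printf("first:  %d node(s) assigned\n", map_job(nodes, 1));
    printf("second: %d node(s) assigned\n", map_job(nodes, 1));
    return 0;
}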

rhc54 commented May 25, 2017

I have this working now, so far as I can tell. I have updated the seq, rankfile, and mindist mappers to preserve their functionality by setting them to work in the "old" mode where mpirun computes everything and sends it to the backend daemons. Thus, they will scale poorly compared to the other mappers, but at least will still function until someone who cares can update them.

ggouaillardet commented:

@rhc54 that indeed fixed no-disconnect!
i found an issue with loop_spawn:

mpirun --host n0:1,n1:2 -np 1 ./loop_spawn

mpirun crashes and here is where and why

(gdb) bt
#0  0x00007f73ef3bd192 in orte_util_nidmap_generate_ppn (jdata=0xf21130, ppn=0x7fffd97a3f70) at ../../../src/ompi-master/orte/util/nidmap.c:1234
#1  0x00007f73ef4044d3 in orte_odls_base_default_get_add_procs_data (buffer=0xf22a60, job=3594387463) at ../../../../../src/ompi-master/orte/mca/odls/base/odls_base_default_fns.c:251
#2  0x00007f73ef4142e6 in orte_plm_base_launch_apps (fd=-1, args=4, cbdata=0xdb36f0) at ../../../../../src/ompi-master/orte/mca/plm/base/plm_base_launch_support.c:523
#3  0x00007f73ef0c05b0 in event_process_active_single_queue (base=0xd7bf50, activeq=0xd7c4d0) at ../../../../../../../src/ompi-master/opal/mca/event/libevent2022/libevent/event.c:1370
#4  0x00007f73ef0c0828 in event_process_active (base=0xd7bf50) at ../../../../../../../src/ompi-master/opal/mca/event/libevent2022/libevent/event.c:1440
#5  0x00007f73ef0c0e7a in opal_libevent2022_event_base_loop (base=0xd7bf50, flags=1) at ../../../../../../../src/ompi-master/opal/mca/event/libevent2022/libevent/event.c:1644
#6  0x00000000004015aa in orterun (argc=6, argv=0x7fffd97a4408) at ../../../../../src/ompi-master/orte/tools/orterun/orterun.c:199
#7  0x0000000000400f74 in main (argc=6, argv=0x7fffd97a4408) at ../../../../../src/ompi-master/orte/tools/orterun/main.c:13
(gdb) p nptr->procs
$1 = (opal_pointer_array_t *) 0x0
(gdb) whatis nptr
type = orte_node_t *
(gdb) p *nptr->super.super.obj_class
$2 = {cls_name = 0x7f73ef4482e1 "(void *)0) != ((opal_object_t *) (map->nodes))->obj_class", cls_parent = 0x7f73ef365a60, cls_construct = 0x7f73ef38f030 <orte_proc_construct>, 
  cls_destruct = 0x7f73ef38f168 <orte_proc_destruct>, cls_initialized = 1, cls_depth = 3, cls_construct_array = 0xdb3b10, cls_destruct_array = 0xdb3b28, cls_sizeof = 256}

note that nptr is an orte_node_t *, but it really points to an orte_proc_t object (!)

the patch below can be used as a very temporary workaround

diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index b9003c9..1d17e1c 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -503,7 +503,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
                                      ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                                      node->name, node->slots, node->slots_inuse));
                 opal_list_remove_item(allocated_nodes, item);
-                OBJ_RELEASE(item);  /* "un-retain" it */
+                // OBJ_RELEASE(item);  /* "un-retain" it */
                 item = next;
                 continue;
             }

rhc54 commented May 25, 2017

I'm at a loss on how to interpret that report combined with the patch. The two code areas appear to be completely unrelated. If a mapper is truly putting an orte_proc_t on the node array instead of an orte_node_t, then I would expect it to always fail. Yet it appears to be running correctly for me.

Can you tell me a little more about this failure? Is it on the very first comm_spawn, or a later one?

ggouaillardet commented:

my interpretation is that the orte_node_t is not correctly retained, so it ends up being freed.
at some point, an orte_proc_t is allocated at the same address as the orte_node_t that should never have been freed.
bottom line: no one ever puts an orte_proc_t in the node array. the node array contains a dangling pointer, and that pointer now happens to point to an orte_proc_t.
makes sense?

iirc the crash occurs around the spawn of the 5th child
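For illustration, a self-contained sketch of that failure mode (plain malloc/free with hypothetical types, not the OPAL object system): once the under-retained node is freed, the allocator is free to hand the same memory to the next allocation, so a stale node pointer can end up aliasing a freshly created proc object - which matches the gdb output showing an orte_node_t pointer whose class is orte_proc_t.

/* uaf_sketch.c - illustration only; hypothetical stand-ins for
 * orte_node_t and orte_proc_t */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char kind[16]; int slots; } node_t;
typedef struct { char kind[16]; int rank;  } proc_t;

int main(void)
{
    node_t *node = malloc(sizeof *node);
    strcpy(node->kind, "node");

    node_t *stale = node;      /* e.g. the pointer still held in the job map */

    /* the missing retain means the only reference gets dropped ... */
    free(node);

    /* ... and the allocator may reuse the same memory for the next object */
    proc_t *proc = malloc(sizeof *proc);
    strcpy(proc->kind, "proc");

    if ((void *)proc == (void *)stale) {
        /* 'stale' is typed as a node but now holds a proc */
        printf("stale node pointer aliases a proc: kind=%s\n", stale->kind);
    } else {
        printf("allocator did not reuse the address on this run\n");
    }
    free(proc);
    return 0;
}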

rhc54 commented May 25, 2017

Thanks - that helps a great deal. I'll check it out.

rhc54 commented May 25, 2017

I'm afraid I'm batting zero here - I can't replicate it, valgrind isn't flagging it, and I can't find the problem by inspecting the code. Can you perhaps do a little more detective work for me?

rhc54 commented May 25, 2017

@ggouaillardet Please give this updated version a try. I found a bug that impacted the case where the HNP is not in the allocation - I can't replicate that here, but it might be the situation you are in?

artpol84 commented:

@karasevb please do a runtime check to verify that mindist is working.

ggouaillardet commented:

@rhc54 i can still reproduce the issue with the latest updates.
do you use the very same command line?

n0$ mpirun -np 1 --host n0:1,n1:2 ./loop_spawn

the same error occurs whether i use the loop_spawn from orte/test/mpi or the one from dynamic in the ibm test suite

here is another possible fix

diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index 5633789..cf8b9b7 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -351,6 +351,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
         /* the list is empty - if the HNP is allocated, then add it */
         if (orte_hnp_is_allocated) {
             nd = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, 0);
+            OBJ_RETAIN(nd);
             opal_list_append(allocated_nodes, &nd->super);
         } else {
             nd = NULL;

i just noticed that the issue only occurs with that exact command line.
for example

n0$ mpirun -np 1 --host n0:2,n1:1 ./loop_spawn

or

n0$ mpirun -np 1 --host n2:1,n1:2 ./loop_spawn

both work just fine
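For what it's worth, the reasoning behind that one-line retain, as a self-contained sketch (hypothetical refcount helpers, not the actual OBJ_RETAIN/OBJ_RELEASE macros): every container that stores a pointer to a refcounted object should hold its own reference, so that the later "un-retain" when the node is pruned from the allocated_nodes list releases the list's reference rather than the node pool's.

/* retain_sketch.c - illustration only; hypothetical refcount helpers */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int refcount; const char *name; } node_t;

static node_t *node_new(const char *name)
{
    node_t *n = malloc(sizeof *n);
    n->refcount = 1;                 /* reference held by the node pool */
    n->name = name;
    return n;
}
static void node_retain(node_t *n) { n->refcount++; }
static void node_release(node_t *n)
{
    if (--n->refcount == 0) { printf("%s freed\n", n->name); free(n); return; }
    printf("%s refcount now %d\n", n->name, n->refcount);
}

int main(void)
{
    node_t *nd = node_new("n0");

    /* adding the node to a second container (the allocated_nodes list):
     * without this retain the list would merely borrow the pool's reference */
    node_retain(nd);

    /* later the node is pruned from the list and "un-retained";
     * with the retain above this release is balanced ... */
    node_release(nd);

    /* ... and the pool's reference keeps the node alive until teardown */
    node_release(nd);
    return 0;
}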

rhc54 commented May 26, 2017

Yes, I replicated your cmd line exactly. I'll try again, and also try with your change (which looks correct to me).

rhc54 commented May 26, 2017

Okay, I found the difference. You must not have any hostfile (e.g., a default one) at all. That sends you down a different code path that hits the line you flagged in your fix. If I delete my default hostfile envar, then I can replicate this failure.

It's always the little differences we never think to mention that get us 😄

I confirm your patch fixes it - will commit

Add debug verbosity to the orte data server and pmix pub/lookup functions

Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).

Remove unneeded test

Fix memory corruption by re-initializing variable to NULL in loop

Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.

Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.

Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.
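As a rough sketch of the control flow this attribute enables (hypothetical names and data structures standing in for the real ORTE job object and attribute API, which are not reproduced here):

/* fully_described_sketch.c - illustration only; hypothetical model of the
 * ORTE_JOB_FULLY_DESCRIBED behavior, not the actual ORTE API */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *mapper;       /* e.g. "seq", "rankfile", "mindist", "round_robin" */
    bool fully_described;     /* models the ORTE_JOB_FULLY_DESCRIBED attribute */
} job_t;

/* mpirun side: mappers whose input files are unavailable on the backend
 * compute everything up front and mark the job fully described */
static void finish_mapping(job_t *job)
{
    if (0 == strcmp(job->mapper, "seq") ||
        0 == strcmp(job->mapper, "rankfile") ||
        0 == strcmp(job->mapper, "mindist")) {
        job->fully_described = true;   /* locations + bindings ride in the launch msg */
    }
}

/* daemon side: only recompute the map when the launch message is sparse */
static void construct_child_list(const job_t *job)
{
    if (job->fully_described) {
        printf("%s: take proc locations from the launch message\n", job->mapper);
    } else {
        printf("%s: compute proc locations locally\n", job->mapper);
    }
}

int main(void)
{
    job_t seq = { "seq", false };
    job_t rr  = { "round_robin", false };
    finish_mapping(&seq);  construct_child_list(&seq);
    finish_mapping(&rr);   construct_child_list(&rr);
    return 0;
}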

Have the mindist module add procs to the job's proc array as it is a fully described module

Protect the hnp-not-in-allocation case

Per patch suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
rhc54 commented May 26, 2017

@karasevb FWIW: the Mellanox Jenkins tests the mindist mapper, and it passes that test. The mapper won't scale as well as the others until someone updates it to add backend support.

This patch also fixes --novm operations 😸

rhc54 merged commit 10b103a into open-mpi:master on May 26, 2017
rhc54 deleted the topic/nodis branch on May 26, 2017 at 04:09