Update the distributed mapping system to maintain coherence #3524
Conversation
|
Refs #3525
|
This change fixed the dynamics tests (loop_spawn and no-disconnect), but I'm not entirely satisfied with how it works: at exascale sizes, mpirun now takes several seconds to compute all the proc node locations before sending the launch message. I can improve that somewhat by deferring the assignment of hwloc locales from mpirun to the backend, the same as is done for the compute node daemons. That would eliminate the hwloc tree traversal that is the primary source of the delay, but it needs more thought for the mappers that operate per hwloc object. I propose to defer this optimization until after v3.0.0.

The mindist mapper change has been implemented but is untested. Assigning that to @artpol84 as I have no way of testing it.

The seq and rank_file mappers are a problem, as they read input files that may not be available on the backend. The best way forward there is to have mpirun simply generate the full location-aware launch message for these mappers, under the assumption that they are not usually used at scale. Alternatively, we could package the input file in the launch message, or use ORTE's "preload" capability to push the input file to the orted's session directory. Someone is welcome to tackle those if they have interest. These two mappers still need to be updated - at this point, they will error out.
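For orientation, here is a minimal sketch (not the PR's actual code) of how the two modes described above might be distinguished, assuming ORTE's job-attribute API (orte_set_attribute / orte_get_attribute) and the ORTE_JOB_FULLY_DESCRIBED attribute this work introduces; header paths and helper names are illustrative.

```c
/* Illustrative sketch only, not the PR's code.  Assumes ORTE's attribute
 * API and the ORTE_JOB_FULLY_DESCRIBED job attribute described in this PR. */
#include "orte/runtime/orte_globals.h"   /* orte_job_t */
#include "orte/util/attr.h"              /* orte_set_attribute, orte_get_attribute */

/* mpirun side: mappers whose input may not exist on the backend (seq,
 * rank_file, mindist) mark the job as fully described, so the launch
 * message carries every proc location and binding. */
static void mark_fully_described(orte_job_t *jdata)
{
    orte_set_attribute(&jdata->attributes, ORTE_JOB_FULLY_DESCRIBED,
                       ORTE_ATTR_GLOBAL, NULL, OPAL_BOOL);
}

/* backend daemon side: only compute the map locally when the job was
 * NOT fully described by mpirun. */
static bool needs_backend_mapping(orte_job_t *jdata)
{
    return !orte_get_attribute(&jdata->attributes, ORTE_JOB_FULLY_DESCRIBED,
                               NULL, OPAL_BOOL);
}
```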
|
I'll look at the spawn_multiple problem, but perhaps someone could look at the rest of these failures - they have nothing to do with this PR so far as I can tell: https://mtt.open-mpi.org/index.php?do_redir=2443 and these timeouts:
|
@rhc54 it seems mapping fails when the spawner is not vpid 0 of jobid 1.
|
@rhc54 that looks like a race condition. I applied this patch to add some debug info:
diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index 8e4da04..bd94994 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -492,6 +492,9 @@ int orte_odls_base_default_construct_child_list(opal_buffer_t *buffer,
/* reset any node map flags we used so the next job will start clean */
for (n=0; n < jdata->map->nodes->size; n++) {
if (NULL != (node = (orte_node_t*)opal_pointer_array_get_item(jdata->map->nodes, n))) {
+ opal_output_verbose(1, orte_odls_base_framework.framework_output,
+ "%s job %s node %s will be unmapped",
+ __func__, ORTE_JOBID_PRINT(jdata->jobid), node->name);
ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
}
}
diff --git a/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c b/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
index c0b08e2..f2d82e2 100644
--- a/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/orte/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -546,6 +546,11 @@ int orte_rmaps_rr_byobj(orte_job_t *jdata,
}
}
/* add this node to the map, if reqd */
+ opal_output_verbose(1, orte_rmaps_base_framework.framework_output,
+ "mca:rmaps:rr jobid %s node %s %s mapped",
+ ORTE_JOBID_PRINT(jdata->jobid),
+ node->name,
+ ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_MAPPED)?"is ":"is not");
if (!ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_MAPPED)) {
if (ORTE_SUCCESS > (idx = opal_pointer_array_add(jdata->map->nodes, (void*)node))) {
ORTE_ERROR_LOG(idx);
Here is the output; it looks like the race condition involves the …
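As a side note, a minimal sketch of how such debug lines surface, assuming the opal_output verbosity API used in the patch above: opal_output_verbose() prints only when the owning framework's verbosity (set, for example, via the odls_base_verbose or rmaps_base_verbose MCA parameters) is at least the requested level, so the added messages stay silent in normal runs. The helper below is illustrative, not part of the patch.

```c
/* Illustrative helper, not part of the patch above: report a node's
 * mapped-flag state on a framework's debug stream.  The message is
 * emitted only when that framework's verbosity level is >= 1. */
#include <stdbool.h>
#include "opal/util/output.h"

static void report_mapped_state(int framework_output, const char *node_name,
                                bool is_mapped)
{
    opal_output_verbose(1, framework_output,
                        "node %s %s mapped",
                        node_name, is_mapped ? "is" : "is not");
}
```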
|
While trying to figure out a fix, I noticed this: the node's flags are touched after OBJ_RELEASE, which can access freed memory if that was the last reference, so the release should come last:
diff --git a/orte/mca/state/base/state_base_fns.c b/orte/mca/state/base/state_base_fns.c
index 69cfa89..c165f2c 100644
--- a/orte/mca/state/base/state_base_fns.c
+++ b/orte/mca/state/base/state_base_fns.c
@@ -1,6 +1,8 @@
/*
* Copyright (c) 2011-2012 Los Alamos National Security, LLC.
* Copyright (c) 2014-2017 Intel, Inc. All rights reserved.
+ * Copyright (c) 2017 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -897,10 +899,10 @@ void orte_state_base_check_all_complete(int fd, short args, void *cbdata)
}
/* set the node location to NULL */
opal_pointer_array_set_item(map->nodes, index, NULL);
- /* maintain accounting */
- OBJ_RELEASE(node);
/* flag that the node is no longer in a map */
ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
+ /* maintain accounting */
+ OBJ_RELEASE(node);
}
OBJ_RELEASE(map);
jdata->map = NULL;
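A minimal sketch of why the reordering matters, assuming OPAL's reference-counted object system (OBJ_RETAIN/OBJ_RELEASE); the helper is illustrative only:

```c
/* Illustrative only.  OBJ_RELEASE drops one reference and frees the object
 * once the count reaches zero, so any access to the object's fields (such
 * as clearing a flag) must happen before the owner gives up its reference. */
#include "opal/class/opal_object.h"

static void detach_item(opal_object_t *item)
{
    /* ...touch the item's fields here, while our reference still
     * guarantees it is alive... */
    OBJ_RELEASE(item);   /* may free 'item'; do not dereference it afterwards */
}
```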
|
@ggouaillardet I pushed a fix - see what you think. I also fixed the issue you identified above, but did it a little differently. There is no need to unset the mapped flag at that point - indeed, that is another race condition problem.
|
I have this working now, so far as I can tell. I have updated the seq, rankfile, and mindist mappers to preserve their functionality by setting them to work in the "old" mode where mpirun computes everything and sends it to the backend daemons. Thus, they will scale poorly compared to the other mappers, but at least will still function until someone who cares can update them.
|
@rhc54 that fixed the mpirun crashes. Here is where and why - note the patch below can only be used as a very temporary workaround:
diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index b9003c9..1d17e1c 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -503,7 +503,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
node->name, node->slots, node->slots_inuse));
opal_list_remove_item(allocated_nodes, item);
- OBJ_RELEASE(item); /* "un-retain" it */
+ // OBJ_RELEASE(item); /* "un-retain" it */
item = next;
continue;
}
|
I'm at a loss on how to interpret that report combined with the patch. The two code areas appear to be completely unrelated. If a mapper is truly putting an orte_proc_t on the node array instead of an orte_node_t, then I would expect it to always fail. Yet it appears to be running correctly for me. Can you tell me a little more about this failure? Is it on the very first comm_spawn, or a later one?
|
My interpretation is that the … IIRC, the crash occurs around the spawn of the 5th child.
|
Thanks - that helps a great deal. I'll check it out.
|
I'm afraid I'm batting zero here - I can't replicate it, valgrind isn't flagging it, and I can't find the problem by inspecting the code. Can you perhaps do a little more detective work for me?
|
@ggouaillardet Please give this updated version a try. I found a bug that impacted the case where the HNP is not in the allocation - I can't replicate that here, but that might be the situation you are in?
|
@karasevb please do a runtime check that mindist is working.
|
@rhc54 I can reproduce the issue with the latest updates; the same error occurs with … Here is another possible fix:
diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index 5633789..cf8b9b7 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -351,6 +351,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
/* the list is empty - if the HNP is allocated, then add it */
if (orte_hnp_is_allocated) {
nd = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, 0);
+ OBJ_RETAIN(nd);
opal_list_append(allocated_nodes, &nd->super);
} else {
nd = NULL;
I just noted you must use the very same command line; … or … both work just fine.
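For context, a minimal sketch of the ownership convention this one-line fix restores, assuming OPAL's reference-counted lists (OBJ_RETAIN/OBJ_RELEASE, opal_list_append); the helper and header paths are illustrative, not the PR's code:

```c
/* Illustrative helper, not the PR's code.  A node taken from the global
 * orte_node_pool is already owned by the pool; putting the same object on
 * another list must add a reference, so that emptying that list (and
 * OBJ_RELEASE-ing each item) cannot free memory the pool still points to. */
#include "opal/class/opal_list.h"
#include "opal/class/opal_pointer_array.h"
#include "orte/runtime/orte_globals.h"   /* orte_node_pool, orte_node_t */

static void append_hnp_node(opal_list_t *allocated_nodes)
{
    orte_node_t *nd;

    nd = (orte_node_t *) opal_pointer_array_get_item(orte_node_pool, 0);
    if (NULL == nd) {
        return;
    }
    OBJ_RETAIN(nd);                                /* the list owns its own reference */
    opal_list_append(allocated_nodes, &nd->super);
}
```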
|
Yes, I replicated your cmd line exactly. I'll try again, and also try with your change (which looks correct to me).
|
Okay, I found the difference. You must not have any hostfile (e.g., a default one) at all. That sends you down a different code path that hits the line you flagged in your fix. If I delete my default hostfile envar, then I can replicate this failure. It's always the little differences we never think to mention that get us 😄 I confirm your patch fixes it - will commit |
…ions
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).
Remove unneeded test.
Fix memory corruption by re-initializing variable to NULL in loop.
Resolve the race condition identified by @ggouaillardet by resetting the mapped flag within the same event where it was set. There is no need to retain the flag beyond that point as it isn't used again.
Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them. Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.
Have the mindist module add procs to the job's proc array, as it is a fully described module.
Protect the hnp-not-in-allocation case.
Per patch suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
|
@karasevb FWIW: the Mellanox Jenkins tests the mindist mapper, and it passes that test. The mapper won't scale as well as the others until someone updates it to add backend support. This patch also fixes …
Signed-off-by: Ralph Castain <rhc@open-mpi.org>