consolidated per-host affinity display in --mca ompi_display_comm 1 #8842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
bot:aws:retest
The prrte --display bind option is a per-rank option, but this option
instead consolidates the binding reports on a per-host basis, and
uses a visual display based on the natural hardware ordering of the
hwloc tree.
Example output:
```
% mpirun --bind-to core --host hostA:4,hostB:4 --mca ompi_display_comm 1 \
    --mca ompi_display_comm_aff_columns 72 ./x

Host 0 [hostA] ranks 0 - 3
Host 1 [hostB] ranks 4 - 7
Affinity per host: (with ompi_display_comm_aff_columns 72)
H0: [<aaaa/cccc/..../..../..../..../..../..../..../..../..../..../..../.
../..../..../..../..../..../..../..../....>][<bbbb/dddd/..../..../
.../..../..../..../..../..../..../..../..../..../..../..../..../..
./..../..../..../....>] Lranks 0-3
H1: [<aaaa/cccc/..../..../..../..../..../..../..../..../..../..../..../.
../..../..../..../..../..../..../..../....>][<bbbb/dddd/..../..../
.../..../..../..../..../..../..../..../..../..../..../..../..../..
./..../..../..../....>] Lranks 0-3
```
It tries to consolidate all the ranks on a host into a single entry as shown above, but
if the ranks' bindings overlap it falls back to multiple lines for that
host (and if it runs out of letters it moves to another line and
starts over at "a").
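To make the letter scheme concrete, here is a minimal, hypothetical sketch (not the PR's code) of the underlying idea: walk the PUs of an hwloc topology in hardware order and print a letter for whichever rank's cpuset covers each PU, or '.' if none. The per-rank cpusets are fabricated for the example, and the package/core bracket markers are omitted.

```c
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    int nranks = 2;

    /* Fabricated bindings: rank 0 on PUs 0-3, rank 1 on PUs 4-7.  The real
     * code gathers each local rank's actual binding instead. */
    hwloc_bitmap_t binding[2];
    binding[0] = hwloc_bitmap_alloc();
    binding[1] = hwloc_bitmap_alloc();
    hwloc_bitmap_set_range(binding[0], 0, 3);
    hwloc_bitmap_set_range(binding[1], 4, 7);

    /* Walk PUs in hwloc's natural order and mark each one with the letter
     * of the rank bound to it, '.' if unbound. */
    for (int i = 0; i < npus; ++i) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
        char c = '.';
        for (int r = 0; r < nranks; ++r) {
            if (hwloc_bitmap_isset(binding[r], pu->os_index)) {
                c = (char)('a' + (r % 26));   /* wrap past 'z', like the PR */
                break;
            }
        }
        putchar(c);
    }
    putchar('\n');

    hwloc_bitmap_free(binding[0]);
    hwloc_bitmap_free(binding[1]);
    hwloc_topology_destroy(topo);
    return 0;
}
```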
I think it makes sense to have this option inside the ompi_display_comm
output because the ranks' bindings are a big factor in the communication
between them.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Force-pushed from 0e64a06 to 61822ff
bot:aws:retest
I like this addition - just some comments/questions.
```
    mylocalrank, nlocalranks, local_comm);
if (mylocalrank == 0) {
    affstring = malloc(strlen(prefix_string) + strlen(unprefixed_affstring) + 16);
    sprintf(affstring, "%s%s", prefix_string, unprefixed_affstring);
```
can you switch this to snprintf(), or perhaps even asprintf()? This would guard against overflows a bit better, unlikely as it may be.
(pushing another commit, intend to squash once confirmed it's what we want)
Sounds good, I'm switching it to snprintf. Actually, I just moved all of the sprintf calls to snprintf and all of the strcpy calls to strncpy.
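For reference, a minimal sketch of the two variants discussed (the prefix/body strings here are stand-ins, not the PR's variables):

```c
#define _GNU_SOURCE   /* for asprintf on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *prefix = "H0: ";           /* stand-in for prefix_string */
    const char *body   = "[<aaaa/....>]";  /* stand-in for unprefixed_affstring */

    /* snprintf: a miscalculated length truncates instead of overflowing */
    size_t len = strlen(prefix) + strlen(body) + 1;
    char *affstring = malloc(len);
    if (affstring != NULL) {
        snprintf(affstring, len, "%s%s", prefix, body);
    }

    /* asprintf: allocates exactly what is needed (GNU/BSD extension; Open MPI
     * also carries an opal_asprintf() portability wrapper) */
    char *affstring2 = NULL;
    if (asprintf(&affstring2, "%s%s", prefix, body) < 0) {
        affstring2 = NULL;
    }

    printf("%s\n%s\n", affstring ? affstring : "(oom)",
                       affstring2 ? affstring2 : "(oom)");
    free(affstring);
    free(affstring2);
    return 0;
}
```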
```
    printf("Affinity per host: (with ompi_display_comm_aff_columns %d)\n",
        mca_hook_comm_method_aff_columns);
}
host_leader_printstring(affstring, myleaderrank, nleaderranks, leader_comm,
```
affstring is only alloc'd, from what I see, when `if (mylocalrank == 0)` is true. Is this the case here, or should its usage be checked here as well?
Should be okay, I think, because all the non-leaders return early; that's why those checks disappear after a certain point.
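A schematic of that flow (names and strings are stand-ins, not the PR's code): the leader allocates, the non-leaders bail out, and everything after the early return can assume the allocation happened.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Schematic only: mirrors the structure described above. */
static void report_affinity(int mylocalrank, const char *prefix, const char *body)
{
    char *affstring = NULL;

    if (mylocalrank == 0) {                    /* only the local leader builds it */
        size_t len = strlen(prefix) + strlen(body) + 1;
        affstring = malloc(len);
        if (affstring != NULL) {
            snprintf(affstring, len, "%s%s", prefix, body);
        }
    }

    if (mylocalrank != 0) {
        return;                                /* non-leaders drop out here ... */
    }

    /* ... so everything below runs only on the leader, which is why the
     * later calls that take affstring don't repeat the rank check */
    printf("%s\n", affstring ? affstring : "(alloc failed)");
    free(affstring);
}

int main(void)
{
    report_affinity(0, "H0: ", "[<aaaa/....>]");   /* leader prints */
    report_affinity(1, "H0: ", "[<aaaa/....>]");   /* non-leader returns */
    return 0;
}
```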
```
// Each host leader fills in a "numhosts" sized array method[] of
// how it communicates with each peer.
for (i=0; i<nleaderranks; ++i) {
```
mymethod's size is based on numhosts in the malloc above on line 409, not nleaderranks. I assume that nleaderranks should be <= numhosts, but could we make this consistent?
(pushing another commit, intend to squash once confirmed it's what we want)
Yeah, they were equivalent, so I removed numhosts and switched them all to nleaderranks
```
if (myleaderrank == 0) {
    for (i=0; i<nleaderranks; ++i) {
        //printf("%s\n", allhoststrings[i]);
```
Remove this print?
Okay, I removed it
```
    (*affstring)[nextchar] = 0;
    some_cores_printed_under_containing_obj = 0;
}
if (obj->type == HWLOC_OBJ_CORE && some_cores_printed_under_containing_obj) {
```
some_cores_printed_under_containing_obj could be uninitialized here - though I think the root object's type should be HWLOC_OBJ_PACKAGE... I'm betting there is a compiler warning anyway, since it isn't that smart.
(pushing another commit, intend to squash once confirmed it's what we want)
I agree. I forget what's allowable as far as degenerate cases. I think the trees have to have a MACHINE at the top and PU at the bottom but I doubt packages have to exist. So initializing at the top of the function would be good for the degenerate cases.
```
                    int *position)
{
    int i, r;
    int nchildren_done[32], depth, some_cores_printed_under_containing_obj;
```
Probably should initialize some_cores_printed_under_containing_obj to 0.
Agreed
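A compilable sketch of the agreed fix (the function is a stand-in for the PR's recursive walk, not its actual code): zero the flag where the locals are declared, so a degenerate topology can never read it uninitialized.

```c
#include <hwloc.h>

/* Stand-in for the PR's recursive tree walk; only the initialization of
 * the flag is the point here. */
static void walk_sketch(hwloc_obj_t obj)
{
    int some_cores_printed_under_containing_obj = 0;   /* was uninitialized */

    for (unsigned i = 0; i < obj->arity; ++i) {
        if (obj->children[i]->type == HWLOC_OBJ_CORE) {
            some_cores_printed_under_containing_obj = 1;  /* a core under this obj */
        }
        walk_sketch(obj->children[i]);                    /* recurse */
    }

    if (some_cores_printed_under_containing_obj) {
        /* the real code would close the "[<...>]" group for this object here */
    }
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    walk_sketch(hwloc_get_root_obj(topo));
    hwloc_topology_destroy(topo);
    return 0;
}
```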
```
 * them. Without that check my PPC machines at least look like
 * "[<../..>][<../..>]<><><><><><>"
 */
if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
```
This line is not happy when compiled with --no-oshmem:

```
hook_comm_method_fns.c: In function 'sprint_bitmaps':
hook_comm_method_fns.c:1096:18: error: 'struct hwloc_obj' has no member named 'memory_arity'; did you mean 'memory'?
 if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
          ^~~~~~~~~~~~
          memory
hook_comm_method_fns.c:1096:64: error: 'struct hwloc_obj' has no member named 'memory_first_child'; did you mean 'first_child'?
 if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
                                                        ^~~~~~~~~~~~~~~~~~
                                                        first_child
hook_comm_method_fns.c:1122:22: error: 'struct hwloc_obj' has no member named 'memory_arity'; did you mean 'memory'?
 if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
          ^~~~~~~~~~~~
          memory
hook_comm_method_fns.c:1122:68: error: 'struct hwloc_obj' has no member named 'memory_first_child'; did you mean 'first_child'?
 if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
                                                        ^~~~~~~~~~~~~~~~~~
                                                        first_child
```
Also, have you double-checked those definitions with HWLOC v1.11? I'm not sure those memory fields are present back there - best to check.
That's a good point. If this is brought over to v5, that may need to get protected if it is not in v1.11.
(pushing another commit, intend to squash once confirmed it's what we want)
For this one I'd like to just go with "#if HWLOC_API_VERSION >= 0x20000" instead of trying to maintain a path that prints the numa level for the older hwloc. The numa level is kind of nice, but I don't feel like it's so important as to maintain two paths for it. So with the #ifdef I just added it will leave out the numa markers for older hwloc.
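Roughly the shape of that guard, sketched as a standalone program rather than the PR's actual diff: memory_arity / memory_first_child are hwloc 2.x fields, so the NUMA-marker branch is only compiled when HWLOC_API_VERSION says 2.0 or newer.

```c
#include <stdio.h>
#include <hwloc.h>

/* Sketch of the versioning approach: older hwloc simply never prints the
 * numa-level marker; newer hwloc checks the memory children as before. */
static void maybe_print_numa_marker(hwloc_obj_t obj)
{
#if HWLOC_API_VERSION >= 0x20000
    if (obj->memory_arity > 0 &&
        !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
        printf("<");   /* stand-in for the PR's numa-level marker */
    }
#else
    (void)obj;         /* hwloc 1.x: omit the numa markers entirely */
#endif
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    maybe_print_numa_marker(hwloc_get_root_obj(topo));
    printf("\n");
    hwloc_topology_destroy(topo);
    return 0;
}
```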
Any update on this PR? Austen had some suggested cleanup, but otherwise, this looks like a nice user feature to have.
@markalle this is nice to have for v5 if you have time to clean up and get it in. RC1 is going out September 23rd, so it would have to come in before then.
Just in the fwiw category: all the info required to generate that output is already present on the proc - there is no need to perform a collective to obtain it. A simple "PMIx_Get" would retrieve it. 🤷‍♂️
bot:aws:retest
@markalle I think this is still desirable for v5.0.0, if you can rework without the additional collective call as Ralph suggested.
I pushed a bunch of updates, mostly sprintf -> snprintf and strcpy -> strncpy, but also some hwloc versioning for the memory_arity field being part of the newer API. Rather than maintaining two paths, my current approach is that if hwloc is the old API it'll just omit the numa-level markers. I feel like those markers are nice to have, but not so important as to maintain two paths.

About getting the data via pmix as @rhc54 said, I'm interested. I don't think that's a "must have" feature for this to go in, but I'd be interested in updating later. What format would I be able to get a remote rank/host's affinity in if I made a PMIx_Get call for it? I'd want to end up with an hwloc obj representing a tree for the remote system.

(Currently my snprintf etc. updates are in a second commit; once confirmed it's what we want I intend to squash.)
There are different avenues you could pursue depending upon what you actually want to show. Every proc has access to the cpuset for every other proc in the job, so you could retrieve it with a PMIx_Get of the corresponding key. You can also get the locality expressed as a string the same way. If you want the actual topology tree, you have access to that as well.

Note: the proc locally has a pointer to its own node's topology in the PMIx library, so retrieving that one is free. If you are in a homogeneous system, then you don't need to retrieve any others (and if you did, you'd just get the same pointer handed to you anyway - you'd just eat the overhead of finding the proc's daemon), so this can be pretty fast, especially since the cpusets and locality strings are all in shared memory.
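To make that concrete, here is a hedged sketch (not code from this PR, and untested against it) of fetching a peer's cpuset string through PMIx_Get. It assumes PMIx has already been initialized by the MPI runtime and that the host environment actually provides the PMIX_CPUSET key, which is exactly the availability caveat discussed below.

```c
#include <string.h>
#include <pmix.h>

/* Hedged sketch: ask PMIx for the cpuset string of a peer rank in the same
 * namespace.  Returns a strdup'd string the caller must free, or NULL if
 * the key is not available in this environment. */
static char *get_peer_cpuset(const char *nspace, pmix_rank_t rank)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;
    char *cpuset = NULL;

    PMIX_PROC_CONSTRUCT(&peer);
    PMIX_LOAD_PROCID(&peer, nspace, rank);

    if (PMIX_SUCCESS == PMIx_Get(&peer, PMIX_CPUSET, NULL, 0, &val) &&
        NULL != val && PMIX_STRING == val->type && NULL != val->data.string) {
        cpuset = strdup(val->data.string);
    }
    if (NULL != val) {
        PMIX_VALUE_RELEASE(val);
    }
    return cpuset;
}
```

The locality string (PMIX_LOCALITY_STRING) and the full remote topology (PMIX_TOPOLOGY2, mentioned below) would presumably be retrieved the same way, subject to the same availability caveat.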
Post-review updates:
* sprintf -> snprintf
* strcpy -> strncpy
* some hwloc versioning around the numa-level printout
* added an opal_hwloc_base_get_topology() call to make sure opal_hwloc_topology is set up; otherwise it had been relying on the luck of the draw whether other MCAs had loaded it already
* switched an ompi_group_peer_lookup_existing() to ompi_group_peer_lookup(), since the former can return NULL if a proc entry isn't already there

Signed-off-by: Mark Allen <markalle@us.ibm.com>
Force-pushed from e3ebf1b to f02c903
Repushed with an added opal_hwloc_base_get_topology() to make sure opal_hwloc_topology is loaded (otherwise it had been relying on the luck of the draw whether other MCAs had loaded it already), and I switched an ompi_group_peer_lookup_existing() call to ompi_group_peer_lookup(). I think the current design should be okay for checkin, with loading PMIX_TOPOLOGY2 as a future upgrade. The current design is

I'm interested in the alternate design where rank 0 uses pmix to get all the topologies and cpusets and computes the strings itself, but I don't think that's a necessary change before having the feature go in. And even if I did upgrade to the pmix path, since you're saying it won't necessarily be there in all environments, I'd still leave this path in so that if the pmix data wasn't available it would fall back to what this PR is already doing. So I'd still like this feature to go in as it is, with pmix just being a future upgrade.

On the topic of adding that feature to load the topology tree from pmix, I did experiment with that, and so far my