Conversation

@markalle
Contributor

@markalle markalle commented Apr 22, 2021

The prrte --display bind option is a per-rank option, but this option
instead consolidates the binding reports on a per-host basis, and
uses a visual display based on the natural hardware ordering of the
hwloc tree.

Example output:

```
$ mpirun --bind-to core --host hostA:4,hostB:4 --mca ompi_display_comm 1 \
    --mca ompi_display_comm_aff_columns 72 ./x

Host 0 [hostA] ranks 0 - 3
Host 1 [hostB] ranks 4 - 7

Affinity per host: (with ompi_display_comm_aff_columns 72)
H0: [<aaaa/cccc/..../..../..../..../..../..../..../..../..../..../..../.
      ../..../..../..../..../..../..../..../....>][<bbbb/dddd/..../..../
      .../..../..../..../..../..../..../..../..../..../..../..../..../..
      ./..../..../..../....>] Lranks 0-3
H1: [<aaaa/cccc/..../..../..../..../..../..../..../..../..../..../..../.
      ../..../..../..../..../..../..../..../....>][<bbbb/dddd/..../..../
      .../..../..../..../..../..../..../..../..../..../..../..../..../..
      ./..../..../..../....>] Lranks 0-3
```

It tries to consolidate all the ranks on a host as shown above, but
if the ranks overlap it will start using multiple lines to display
a host (or if it runs out of letters it goes to another line and
starts over with "a" again).

I think it makes sense to have this option inside the ompi_display_comm
output because the ranks' bindings are a big factor in the communication
between them.

Signed-off-by: Mark Allen <markalle@us.ibm.com>

@awlauria
Contributor

bot:aws:retest

@markalle markalle force-pushed the comm_methods_show_affinity branch from 0e64a06 to 61822ff Compare April 26, 2021 16:51
@awlauria
Contributor

awlauria commented May 5, 2021

bot:aws:retest

Contributor

@awlauria awlauria left a comment


I like this addition - just some comments/questions.

```c
                       mylocalrank, nlocalranks, local_comm);
if (mylocalrank == 0) {
    affstring = malloc(strlen(prefix_string) + strlen(unprefixed_affstring) + 16);
    sprintf(affstring, "%s%s", prefix_string, unprefixed_affstring);
```
Contributor

Can you switch this to snprintf(), or perhaps even asprintf()? This would guard against overflows a bit better, unlikely as they may be.

Contributor Author

(pushing another commit, intend to squash once confirmed it's what we want)

Sounds good, I'm switching it to snprintf. Actually, I just moved all the sprintf calls to snprintf and all the strcpy calls to strncpy.

```c
    printf("Affinity per host: (with ompi_display_comm_aff_columns %d)\n",
           mca_hook_comm_method_aff_columns);
}
host_leader_printstring(affstring, myleaderrank, nleaderranks, leader_comm,
```
Contributor

From what I see, affstring is only alloc'd when `if (mylocalrank == 0)` is true. Is that guaranteed here, or should its usage be checked here as well?

Contributor Author

Should be okay, I think, because all the non-leaders return early; that's why those checks disappear after a certain point.


```c
// Each host leader fills in a "numhosts" sized array method[] of
// how it communicates with each peer.
for (i=0; i<nleaderranks; ++i) {
```
Contributor

mymethod's size is based on numhosts in the malloc above on line 409, not nleaderranks. I assume that nleaderranks should be <= numhosts, but could we make this consistent?

Contributor Author

(pushing another commit, intend to squash once confirmed it's what we want)

Yeah, they were equivalent, so I removed numhosts and switched everything to nleaderranks.


```c
if (myleaderrank == 0) {
    for (i=0; i<nleaderranks; ++i) {
        //printf("%s\n", allhoststrings[i]);
```
Contributor

Remove this print?

Contributor Author

Okay, I removed it.

```c
    (*affstring)[nextchar] = 0;
    some_cores_printed_under_containing_obj = 0;
}
if (obj->type == HWLOC_OBJ_CORE && some_cores_printed_under_containing_obj) {
```
Contributor
The reason will be displayed to describe this comment to others. Learn more.

some_cores_printed_under_containing_obj could be uninitialized here, though I think the root object's type should be HWLOC_OBJ_PACKAGE. I'm betting there's a compiler warning anyway, since the compiler isn't that smart.

Contributor Author
The reason will be displayed to describe this comment to others. Learn more.

(pushing another commit, intend to squash once confirmed it's what we want)

I agree. I forget what's allowable as far as degenerate cases go. I think the trees have to have a MACHINE at the top and a PU at the bottom, but I doubt packages have to exist. So initializing at the top of the function would be good for the degenerate cases.

```c
                     int *position)
{
    int i, r;
    int nchildren_done[32], depth, some_cores_printed_under_containing_obj;
```
Contributor

Probably should initialize some_cores_printed_under_containing_obj to 0.

Contributor Author

Agreed

```c
 * them. Without that check my PPC machines at least look like
 * "[<../..>][<../..>]<><><><><><>"
 */
if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
```
Contributor

This line is not happy when compiled with --no-oshmem:

```
hook_comm_method_fns.c: In function 'sprint_bitmaps':
hook_comm_method_fns.c:1096:18: error: 'struct hwloc_obj' has no member named 'memory_arity'; did you mean 'memory'?
         if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
                  ^~~~~~~~~~~~
                  memory
hook_comm_method_fns.c:1096:64: error: 'struct hwloc_obj' has no member named 'memory_first_child'; did you mean 'first_child'?
         if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
                                                                ^~~~~~~~~~~~~~~~~~
                                                                first_child
hook_comm_method_fns.c:1122:22: error: 'struct hwloc_obj' has no member named 'memory_arity'; did you mean 'memory'?
             if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
                      ^~~~~~~~~~~~
                      memory
hook_comm_method_fns.c:1122:68: error: 'struct hwloc_obj' has no member named 'memory_first_child'; did you mean 'first_child'?
             if (obj->memory_arity > 0 && !hwloc_bitmap_iszero(obj->memory_first_child->cpuset)) {
                                                                     ^~~~~~~~~~~~~~~~~~
                                                                     first_child
```

Contributor

Also, have you double-checked those definitions with HWLOC v1.11? I'm not sure those memory fields are present back there - best to check.

Contributor

@awlauria awlauria May 5, 2021

That's a good point. If this is brought over to v5, that may need to get protected if it is not in v1.11.

Contributor Author

(pushing another commit, intend to squash once confirmed it's what we want)

For this one I'd like to just go with "#if HWLOC_API_VERSION >= 0x20000" instead of trying to maintain a path that prints the numa level for the older hwloc. The numa level is kind of nice, but I don't feel like it's so important as to maintain two paths for it. So with the #ifdef I just added it will leave out the numa markers for older hwloc.

@jjhursey
Member

jjhursey commented Sep 2, 2021

Any update on this PR? Austen had some suggested cleanup, but otherwise, this looks like a nice user feature to have.

@awlauria
Contributor

awlauria commented Sep 2, 2021

@markalle this is nice to have for v5 if you have time to clean up and get it in. RC1 is going out September 23rd, so it would have to come in before then.

@rhc54
Contributor

rhc54 commented Sep 2, 2021

Just in the fwiw category: all the info required to generate that output is already present on the proc - there is no need to perform a collective to obtain it. A simple "PMIx_Get" would retrieve it. 🤷‍♂️

@awlauria
Contributor

bot:aws:retest

@gpaulsen
Member

@markalle I think this is still desirable for v5.0.0, if you can rework without the additional collective call as Ralph suggested.

@markalle
Contributor Author

I pushed a bunch of updates, mostly sprintf -> snprintf and strcpy -> strncpy, but also some hwloc versioning for the memory_arity field, which is part of the newer API. Rather than maintaining two paths, my current approach is that with the old hwloc API it just omits the numa-level markers. I feel like those markers are nice to have, but not so important as to justify maintaining two paths.

About getting the data via PMIx as @rhc54 suggested: I'm interested. I don't think that's a "must have" feature for this to go in, but I'd be interested in updating later. What format could I get a remote rank/host's affinity in if I made a PMIx_Get call for it? I'd want to end up with an hwloc obj representing a tree for the remote system.

(Currently my snprintf etc updates are in a second commit, once confirmed it's what we want I intend to squash)

@rhc54
Contributor

rhc54 commented Jan 13, 2022

There are different avenues you could pursue depending upon what you actually want to show. Every proc has access to the cpuset for every other proc in the job, so you could retrieve it using the PMIX_CPUSET key, converting it to an HWLOC bitmap with the hwloc_bitmap_list_sscanf function. Note that the returned string is hwloc:<bitmap>, indicating that the bitmap was generated by HWLOC as opposed to some other library.

You can also get the locality expressed as a string using the PMIX_LOCALITY_STRING key - this is the usual thing that looks something like NM1:SK2:L34:CR1:HT0-3. One could convert that to some other representation easily enough, I suppose.

If you want the actual topology tree, you have access to that as well. If you call PMIx_Get with the ID of a proc and the PMIX_TOPOLOGY2 key, you will get back a pmix_topology_t pointer for the node where that proc is executing. The struct contains a source field that just marks it as being an HWLOC tree, and a topology field that points to the root of the HWLOC tree. This would likely only be available if launched by PRRTE/mpirun - I doubt, for example, that Slurm would provide it. You'd also probably want to optimize a bit as pulling down the tree in our current implementation would be expensive, so you'd want to do all the procs on a given node before getting the next one.

Note: the proc locally has a pointer to its own node's topology in the PMIx library, so retrieving that one is free. If you are in a homogeneous system, then you don't need to retrieve any others (and if you did, you'd just get the same pointer handed to you anyway - you'd just eat the overhead of finding the proc's daemon) so this can be pretty fast, especially since the cpusets and locality strings are all in shared memory.

Post-review updates:
* sprintf -> snprintf
* strcpy -> strncpy
* some hwloc versioning around the numa-level printout
* added a opal_hwloc_base_get_topology() to make sure
  opal_hwloc_topology is set up, otherwise it had been
  relying on the luck of the draw whether other MCAs
  had loaded it already
* switched a ompi_group_peer_lookup_existing() to ompi_group_peer_lookup()
  since the former can return NULL if a proc entry isn't already there

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle markalle force-pushed the comm_methods_show_affinity branch from e3ebf1b to f02c903 Compare April 20, 2022 20:40
@markalle
Contributor Author

Repushed with an added opal_hwloc_base_get_topology() to make sure opal_hwloc_topology is loaded (otherwise it had been relying on the luck of the draw whether other MCAs had loaded it already), and I switched an ompi_group_peer_lookup_existing() call to ompi_group_peer_lookup().

I think the current design should be okay for checkin, with loading PMIX_TOPOLOGY2 as a future upgrade. The current design is

  • each host leader has opal_hwloc_topology for its own hwloc tree
  • each host leader produces its own affinity string for its host
  • gatherv() on leader_comm of the strings to leader#0 who prints them

I'm interested in the alternate design where rank 0 uses PMIx to get all the topologies and cpusets and computes the strings itself, but I don't think that change is necessary before the feature goes in. Even if I did upgrade to the PMIx path, since you're saying the data won't necessarily be there in all environments, I'd still leave this path in as a fallback for when the PMIx data isn't available. So I'd still like this feature to go in as-is, with PMIx as a future upgrade.


On the topic of adding the feature to load the topology tree from PMIx, though: I did experiment with that, and so far my

```c
rv = PMIx_Get(&proc, PMIX_TOPOLOGY2, NULL, 0, &val);
```

is just returning -46 == PMIX_ERR_NOT_FOUND. The proc came from a Get of PMIX_PROCID and looks okay. The runtime context is that this is happening at rank 0 at the bottom of MPI_Init. Are there other steps it needs to take to retrieve the PMIx data? The status of other PMIx data in my runs: R0 can see everybody's PMIX_HOSTNAME, and R0 only sees PMIX_CPUSET for its local peer ranks.
