Revamp the map-by NUMA support #1151
Conversation
@bgoglin Updated per your corrections. As noted, this covers the primary use-cases, but we'll need to do something about the "mindist" mapper and some of the other utilities that look at NUMA domains in support of that mapper. Lower priority, at least so far as I'm concerned.
bot:ibm:xl:retest
For each unique topology in the system, compute the max os_index of the CPU NUMAs by searching for the first instance of an overlapping NUMA. We assume that any overlap stems from a GPU NUMA, and that the os_index of such NUMAs starts at 255 and counts downward. Cache that cutoff and use it when computing number of NUMA objects and retrieving the Nth NUMA object. Need to extend this to the distance computations and a few other areas, but this covers the typical range of use-cases. Signed-off-by: Ralph Castain <rhc@pmix.org>
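The cutoff computation described in the commit message can be sketched as follows. This is a self-contained illustration, not the actual PRRTE code: the `numa_t` type and plain `uint64_t` bitmasks are hypothetical stand-ins for hwloc NUMA objects and their cpusets.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for an hwloc NUMA object: its os_index and
 * the set of CPUs it covers, represented here as a simple bitmask. */
typedef struct {
    unsigned os_index;
    uint64_t cpuset;
} numa_t;

/* Walk the NUMA objects in discovery order and return the number of
 * CPU NUMAs, i.e. the index of the first NUMA whose cpuset overlaps
 * the union of the cpusets seen so far.  Per the commit message, any
 * overlap is assumed to stem from a GPU (or HBM/NVDIMM) NUMA, so
 * everything from that point on is excluded. */
static size_t compute_numa_cutoff(const numa_t *numas, size_t n)
{
    uint64_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        if (numas[i].cpuset & seen)
            return i;          /* first overlapping NUMA => cutoff */
        seen |= numas[i].cpuset;
    }
    return n;                  /* no overlap: all NUMAs are CPU NUMAs */
}
```

For example, with two DRAM nodes covering disjoint CPU sets followed by two NVDIMM/GPU nodes that re-cover the same CPUs (os_index counting down from 255), the cutoff comes out as 2.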
FYI, this will hopefully work for non-GPU heterogeneous memory too, since I expect DRAM to be in the first NUMA nodes (so that the OS allocates there first), before HBM and/or NVDIMMs. At least it will work on KNL, and it should work on Xeon with DRAM+NVDIMMs (I'll try to test it next week).
I added some printfs to the code to verify that the cutoff is set to 2 when the machine has 2 DRAM nodes and 2 NVDIMM nodes; looks fine. And --map-by numa seems to make my processes alternate between both sockets as expected.
Fair point - I actually went one better and now cache the hwloc_obj_t pointers so looking up the nth NUMA object can be done much faster.
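The caching idea mentioned above could look roughly like this. Again a hedged sketch with hypothetical names, not the PRRTE implementation: `numa_obj_t` stands in for `hwloc_obj_t`, and the cache turns the "Nth NUMA object" lookup into an O(1) array access instead of a repeated topology traversal.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for hwloc_obj_t. */
typedef struct numa_obj { unsigned os_index; } numa_obj_t;

/* Per-topology cache: built once, alongside the cutoff, so that
 * retrieving the Nth CPU NUMA is a simple array lookup. */
typedef struct {
    numa_obj_t **numas;   /* pointers to the CPU NUMA objects */
    size_t cutoff;        /* number of CPU NUMAs (cached) */
} numa_cache_t;

static void numa_cache_init(numa_cache_t *cache,
                            numa_obj_t *all, size_t cutoff)
{
    cache->cutoff = cutoff;
    cache->numas = malloc(cutoff * sizeof(*cache->numas));
    for (size_t i = 0; i < cutoff; i++)
        cache->numas[i] = &all[i];
}

/* Retrieve the Nth CPU NUMA object, or NULL if out of range. */
static numa_obj_t *numa_cache_get(const numa_cache_t *cache, size_t n)
{
    return (n < cache->cutoff) ? cache->numas[n] : NULL;
}
```

Requests beyond the cutoff return NULL, which matches the intent of hiding the GPU/HBM NUMAs from the mapper.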
I usually feed in a synthetic topology (so I can test a variety of scenarios), tell PRRTE not to launch the procs, and then have it output the "devel map" showing me precisely where each proc is put. If I want to watch the mapping mechanics, I also turn up the mapper's verbosity.