
Revamp the map-by NUMA support #1151

Merged 1 commit into openpmix:master on Nov 13, 2021
Conversation

@rhc54 (Contributor) commented Nov 13, 2021

For each unique topology in the system, compute the max os_index
of the CPU NUMAs by searching for the first instance of an
overlapping NUMA. We assume that any overlap stems from a GPU
NUMA, and that the os_index of such NUMAs starts at 255 and counts
downward. Cache that cutoff and use it when computing number of
NUMA objects and retrieving the Nth NUMA object.

Need to extend this to the distance computations and a few other
areas, but this covers the typical range of use-cases.

Signed-off-by: Ralph Castain <rhc@pmix.org>

@rhc54 (Contributor, Author) commented Nov 13, 2021

@bgoglin Updated per your corrections. As noted, this covers the primary use-cases, but we'll need to do something about the "mindist" mapper and some of the other utilities that look at NUMA domains in support of that mapper. Lower priority, at least so far as I'm concerned.

@rhc54 (Contributor, Author) commented Nov 13, 2021

bot:ibm:xl:retest

@rhc54 rhc54 merged commit e84dc26 into openpmix:master Nov 13, 2021
@rhc54 rhc54 deleted the topic/numa2 branch November 13, 2021 18:18
@bgoglin (Contributor) commented Nov 13, 2021

FYI, this will hopefully work for non-GPU heterogeneous memory too since I expect DRAM to be in the first NUMA nodes (so that the OS allocates there first), before HBM and/or NVDIMMs. At least it will work on KNL and should work on Xeon with DRAM+NVDIMMs (I'll try to test it next week).

@bgoglin (Contributor) commented Nov 15, 2021

I added some printfs in the code to verify that the cutoff is set to 2 when the machine has 2 DRAM nodes and 2 NVDIMM nodes; it looks fine. And --map-by numa seems to make my processes alternate between both sockets as expected.
I couldn't find a command line that would clearly tell me that NUMA nodes P#0 L#0 and P#1 L#2 are used (DRAM) and not P#2 L#1 and P#3 L#3 (NVDIMM). How do you get debug output from this part of PRRTE?
By the way, there's hwloc_bitmap_dup() that could replace the alloc()+copy() in prte_hwloc_base_filter_cpus(). But you could actually just store pointers to the existing bitmaps in your numas array instead of duplicating all of them.

@rhc54 (Contributor, Author) commented Nov 15, 2021

By the way, there's hwloc_bitmap_dup() that could replace the alloc()+copy() in prte_hwloc_base_filter_cpus(). But you could actually just store pointers to the existing bitmaps in your numas array instead of duplicating all of them.

Fair point - I actually went one better and now cache the hwloc_obj_t pointers so looking up the nth NUMA object can be done much faster.

How do you get debug output from this part of PRRTE?

I usually feed in a synthetic topology (so I can test a variety of scenarios), tell PRRTE not to launch the procs, and then have it output the "devel map" showing me precisely where each proc is put. If I want to watch the mapping mechanics, --prtemca rmaps_base_verbose 5 does a pretty good job. So it all looks like:

prterun --prtemca rmaps_base_verbose 5 --map-by numa --display map-devel --do-not-launch --prtemca hwloc_use_topo_file <file.xml> hostname

and you'll get output something like the following:

=================================   JOB MAP   =================================
Data for JOB prterun-Ralphs-iMac-2-68461@1 offset 0 Total slots allocated 24
Mapper requested: NULL  Last mapper: ppr  Mapping policy: BYNUMA:NOOVERSUBSCRIBE  Ranking policy: NUMA
Binding policy: NUMA:IF-SUPPORTED  Cpu set: N/A  PPR: 2:numa  Cpus-per-rank: N/A  Cpu Type: CORE
Num new daemons: 0	New daemon starting vpid INVALID
Num nodes: 1

Data for node: Ralphs-iMac-2	State: 3	Flags: flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:MAPPED:SLOTS_GIVEN
                resolved from Ralphs-iMac-2.local
                resolved from 192.168.0.4
                resolved from 192.168.1.197
                resolved from Ralphs-iMac-2
        Daemon: [prterun-Ralphs-iMac-2-68461@0,0]	Daemon launched: True
            Num slots: 24	Slots in use: 8	Oversubscribed: FALSE
            Num slots allocated: 24	Max slots: 0
            Num procs: 8	Next node_rank: 8
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,0]
                Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:0-5]
        	Binding: package[0][core:0-5]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,1]
                Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:0-5]
        	Binding: package[0][core:0-5]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,2]
                Pid: 0	Local rank: 2	Node rank: 2	App rank: 2
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:6-11]
        	Binding: package[0][core:6-11]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,3]
                Pid: 0	Local rank: 3	Node rank: 3	App rank: 3
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:6-11]
        	Binding: package[0][core:6-11]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,4]
                Pid: 0	Local rank: 4	Node rank: 4	App rank: 4
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:12-17]
        	Binding: package[0][core:12-17]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,5]
                Pid: 0	Local rank: 5	Node rank: 5	App rank: 5
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:12-17]
        	Binding: package[0][core:12-17]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,6]
                Pid: 0	Local rank: 6	Node rank: 6	App rank: 6
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:18-23]
        	Binding: package[0][core:18-23]
        Data for proc: [prterun-Ralphs-iMac-2-68461@1,7]
                Pid: 0	Local rank: 7	Node rank: 7	App rank: 7
                State: INITIALIZED	App_context: 0
        	Mapped:  package[0][core:18-23]
        	Binding: package[0][core:18-23]

=============================================================
