
Sample code to show how hwloc can be used to print latency/bandwidth for initiators and targets #542

Closed
aghozzo opened this issue Aug 18, 2022 · 16 comments


aghozzo commented Aug 18, 2022

I couldn't find any sample C code that shows how to extract HMAT table attributes, or which APIs can be called directly from a hello-world C program.

Can someone please provide a pointer on how to run and test sample code that retrieves this information, or point me to existing examples?

Thanks

bgoglin (Contributor) commented Aug 19, 2022

Hello,
Bandwidth and latency are part of the "memory attributes" functionality. "lstopo --memattrs" will show bandwidth/latency on the command line if available (likely not available unless you have an Ice Lake platform).
The API is in hwloc/memattrs.h; most of the current documentation is in the comments there (and a little bit in https://www.open-mpi.org/projects/hwloc/doc/v2.8.0/a00366.php#topoattrs_memattrs).
There's an example in tests/hwloc/memattrs.c (but it's quite complicated because it both queries AND modifies memattrs).
I'll try to add a simple example under the doc directory.
Brice

bgoglin added a commit to bgoglin/hwloc that referenced this issue Aug 19, 2022
Closes open-mpi#542

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
bgoglin (Contributor) commented Aug 19, 2022

Would this example help? https://github.com/bgoglin/hwloc/blob/master/doc/examples/memory-attributes.c

It shows this on my laptop (no heterogeneous memory, no HMAT):

$ doc/examples/memory-attributes         
There are 1 NUMA nodes
Core L#0 cpuset = 0x00000005
Found 1 local NUMA nodes
NUMA node L#0 P#0 is local to core L#0
  bandwidth is unknown
  latency is unknown
Couldn't find best NUMA node for bandwidth to core L#0

And this on a machine with 2 kinds of memory and HMAT:

$ doc/examples/memory-attributes
There are 8 NUMA nodes
Core L#0 cpuset = 0x00000001
Found 2 local NUMA nodes
NUMA node L#0 P#0 is local to core L#0
  bandwidth = 62100 MiB/s
  latency = 191 ns
NUMA node L#1 P#4 is local to core L#0
  bandwidth = 69100 MiB/s
  latency = 227 ns
Best bandwidth NUMA node for core L#0 is L#1 P#4
Allocated buffer 0x560991f25b30
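
For reference, a hedged sketch of the core of such a query with the hwloc/memattrs.h API (not the actual memory-attributes.c source; assumes hwloc >= 2.3, error handling mostly omitted):

/* Hedged sketch: query bandwidth/latency of the NUMA nodes local to the
 * first core; not the actual memory-attributes.c example. */
#include <stdio.h>
#include <hwloc.h>
#include <hwloc/memattrs.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  /* use the cpuset of the first core as the initiator */
  hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
  struct hwloc_location initiator;
  initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
  initiator.location.cpuset = core->cpuset;

  /* list NUMA nodes local to that core; LARGER_LOCALITY also accepts nodes
   * whose locality is wider than the core (e.g. the whole package) */
  hwloc_obj_t nodes[16];
  unsigned i, nr = 16;
  hwloc_get_local_numanode_objs(topology, &initiator, &nr, nodes,
                                HWLOC_LOCAL_NUMANODE_FLAG_LARGER_LOCALITY);
  if (nr > 16)
    nr = 16; /* clamp in case more nodes were reported than the array holds */

  for(i = 0; i < nr; i++) {
    hwloc_uint64_t bw, lat;
    if (hwloc_memattr_get_value(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                                nodes[i], &initiator, 0, &bw) == 0)
      printf("NUMA node L#%u: bandwidth = %llu MiB/s\n",
             nodes[i]->logical_index, (unsigned long long) bw);
    if (hwloc_memattr_get_value(topology, HWLOC_MEMATTR_ID_LATENCY,
                                nodes[i], &initiator, 0, &lat) == 0)
      printf("NUMA node L#%u: latency = %llu ns\n",
             nodes[i]->logical_index, (unsigned long long) lat);
  }

  hwloc_topology_destroy(topology);
  return 0;
}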

aghozzo (Author) commented Aug 19, 2022

Hi Brice,

First of all, let me thank you for the quick reply and code. Excellent support, much appreciated.
The example is a great start. I will run a few test cases and check.

Please keep the issue open for a week or so so I can add feedback, or if there is a different area for Q&A discussion we can use that.

Again, thanks a bunch.

bgoglin (Contributor) commented Aug 22, 2022

This issue will be auto-closed by GitHub when I commit the new example to the repository, but feel free to ask other questions here; I'll be notified.

bgoglin added a commit that referenced this issue Aug 22, 2022
Closes #542

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
(cherry picked from commit 8d81da3)
aghozzo (Author) commented Aug 25, 2022

Hi

Does hwloc work in a QEMU environment set up with HMAT?

ls /sys/firmware/acpi/tables
APIC CEDT DSDT FACP FACS HMAT HPET MCFG SRAT WAET data dynamic

lstopo
Machine (5955MB total)
  Package L#0
    NUMANode L#0 (P#0 981MB)
    L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  Package L#1
    NUMANode L#1 (P#1 1952MB)
    L3 L#1 (16MB) + L2 L#1 (4096KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
  Group0 L#0
    NUMANode L#2 (P#2 3023MB)

When I run the sample:

There are 3 NUMA nodes
Core L#0 cpuset = 0x00000001
Found 1 local NUMA nodes
NUMA node L#0 P#0 (subtype (null)) is local to core L#0
  bandwidth is unknown
  latency is unknown
Couldn't find best NUMA node for bandwidth to core L#0

bgoglin (Contributor) commented Aug 25, 2022

Yes, I use QEMU quite intensively for simulating strange memory subsystems, but it doesn't populate HMAT with anything useful by default. There's an example of what I've been using in the "HMAT performance attributes" section of https://github.com/open-mpi/hwloc/wiki/Simulating-complex-memory-with-Qemu
As you can see, you have to manually define the bandwidth and/or latency of every (CPU initiator, memory target) pair, which is quite tedious. The QEMU patch cited at the end of that section was only applied upstream recently; you don't have it in your QEMU, hence you must pass initiator=0 or 1 after memdev=ram2.
If you don't want to set all this up, I can provide an XML topology that hwloc can load at runtime instead of reading the platform topology. Something like this:

$ HWLOC_XMLFILE=/path/to/hwloc/git/tests/hwloc/xml/64intel64-fakeKNL-SNC4-hybrid.xml doc/examples/memory-attributes
There are 8 NUMA nodes
Core L#0 cpuset = 0x00010001,0x00010001
Found 2 local NUMA nodes
NUMA node L#0 P#0 (subtype (null)) is local to core L#0
  bandwidth = 22500 MiB/s
  latency is unknown
NUMA node L#1 P#7 (subtype MCDRAM) is local to core L#0
  bandwidth = 90000 MiB/s
  latency is unknown
Best bandwidth NUMA node for core L#0 is L#1 P#7
Allocated buffer 0x7f761b67f010 on best node
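
For reference, the same kind of XML can also be selected programmatically instead of via the HWLOC_XMLFILE environment variable; a hedged sketch (the path below is only an example):

/* Hedged sketch: load a topology from an XML file programmatically
 * instead of exporting HWLOC_XMLFILE; the path is only an example. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  /* must be called between hwloc_topology_init() and hwloc_topology_load() */
  if (hwloc_topology_set_xml(topology,
        "tests/hwloc/xml/64intel64-fakeKNL-SNC4-hybrid.xml") < 0)
    perror("hwloc_topology_set_xml");
  hwloc_topology_load(topology);
  printf("%d NUMA nodes\n",
         hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE));
  hwloc_topology_destroy(topology);
  return 0;
}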

aghozzo (Author) commented Aug 25, 2022

Hi Brice,

The XML file idea is brilliant.
I was able to re-create the example you just provided (although I didn't get latency values like you did).

Just 2 more questions:
Is there a fake XML file that shows latency and bandwidth?
Is there any documentation/explanation of these XML files, as guidance on what they contain, instead of going through and reading each one?

bgoglin (Contributor) commented Aug 26, 2022

I don't have an XML file with latency in the repo, but adding latencies isn't very hard:

  • modify the XML manually: duplicate the Bandwidth memattr section at the end, rename it to Latency, and change the values.
  • or use hwloc-annotate to add individual latencies: something like hwloc-annotate input.xml output.xml -- numa:1 -- memattr Latency package:0 100.
    There's a long example of customizing an XML topology in https://www.open-mpi.org/projects/hwloc/doc/v2.8.0/a00373.php#faq_create_asymmetric.
    There's no documentation for the XML file contents themselves; the format was not meant to be manually modified (even though I do it very often).

aghozzo (Author) commented Aug 29, 2022

Hi,

Thanks for the reply, much appreciated.

I tried to modify the XML file:

  • I tried replacing/renaming the word "Bandwidth" with "Latency"; it didn't work.
  • I tried duplicating the last section and changing the values and the name to Latency; it didn't work.

When I changed the flags to "6", I started seeing the Latency values :) Thanks!

bgoglin (Contributor) commented Aug 29, 2022

Ah sorry, I forgot that you also need to change the "flags" after Bandwidth. For Latency it's 6 instead of 5 (because the best latency is the lowest value, while the best bandwidth is the highest).
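
For reference, those numbers appear to come straight from the flag bits declared in hwloc/memattrs.h; a short illustration (assuming the standard hwloc 2.x flag values):

#include <hwloc.h>
#include <hwloc/memattrs.h>

/* 5 = HWLOC_MEMATTR_FLAG_HIGHER_FIRST (1) | HWLOC_MEMATTR_FLAG_NEED_INITIATOR (4):
 *     suitable for Bandwidth, where the best value is the highest.
 * 6 = HWLOC_MEMATTR_FLAG_LOWER_FIRST (2)  | HWLOC_MEMATTR_FLAG_NEED_INITIATOR (4):
 *     suitable for Latency, where the best value is the lowest. */
static const unsigned long bandwidth_flags =
  HWLOC_MEMATTR_FLAG_HIGHER_FIRST | HWLOC_MEMATTR_FLAG_NEED_INITIATOR; /* 5 */
static const unsigned long latency_flags =
  HWLOC_MEMATTR_FLAG_LOWER_FIRST | HWLOC_MEMATTR_FLAG_NEED_INITIATOR;  /* 6 */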

aghozzo (Author) commented Oct 4, 2022

@bgoglin

Hi,

I have 2 questions:

  1. Is there a mechanism to show the initiators/targets matrix (for example bandwidth/latency or distance from any core to ALL NUMA nodes)?

hwloc_get_local_numanode_objs(topology, &initiator, &numa_nodes, nodes, HWLOC_LOCAL_NUMANODE_FLAG_ALL);
Even though I pass FLAG_ALL, I still only get attributes for local NUMA nodes, not the ones outside the package (if that is the right term to use).

I'd like to know the penalty, in terms of these attributes, for a CPU core accessing memory outside of its local socket/package (something like the NUMA node's local latency plus the latency to go off-socket to access the memory).

  2. How does the system deal with CPU-less NUMA nodes (memory-extension NUMA nodes)? Are those considered local to all CPU cores?

bgoglin (Contributor) commented Oct 5, 2022

@aghozzo For question 1, the mechanism already exists in hwloc, and in the ACPI HMAT table in hardware. Unfortunately the Linux kernel currently only exposes latency/bandwidth information for local CPUs (kernel developers were afraid of the matrices being too huge in memory on future machines :/). So hwloc can give you the information, but it cannot get it from Linux. If you give the info to hwloc (either through XML or through the memattrs API), it'll be able to return it to you later.

For question 2, it depends on what the ACPI tables expose. If they don't expose any locality information, the CPU-less node is attached to the root object at the top of the hwloc topology and appears local to all cores. If ACPI HMAT says the best initiator is CPU0, hwloc will attach the node to that CPU. In practice, HBM and NVDIMM have no local CPUs in hardware (because CPUs are rather local to DRAM), but modern platforms are supposed to make their local CPUs explicit in the HMAT (hence hwloc will expose them as local to some CPUs).
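
A hedged sketch of feeding a value in through the memattrs API so that later queries can return it (the core/node choice and the 42000 MiB/s number are made-up examples):

/* Hedged sketch: manually record a bandwidth value so that later
 * hwloc_memattr_get_value()/get_best_target() calls can return it.
 * The core/node indexes and the 42000 MiB/s value are arbitrary examples. */
#include <hwloc.h>
#include <hwloc/memattrs.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
  hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, 0);

  struct hwloc_location initiator;
  initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
  initiator.location.cpuset = core->cpuset;

  /* claim that core #0 sees 42000 MiB/s of bandwidth to NUMA node #0 */
  hwloc_memattr_set_value(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                          node, &initiator, 0, 42000);

  hwloc_topology_destroy(topology);
  return 0;
}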

aghozzo (Author) commented Oct 6, 2022

@bgoglin

As always, thanks for the quick reply, I really appreciate it.

  • "Unfortunately the Linux kernel currently only exposes latency/bandwidth information for local CPUs (kernel developers were afraid of the matrices being too huge in memory on future machine :/)"

--> is there a reference for this , I like to study more on this subject.
--> Distance seems to be calculated across all nodes in 2D matrix "numactl --hardware" , so the assumption here is distance/capacity is exposed for all but bandwidth/latency are not (only for local) ?

  • what is "HWLOC_MEMATTR_ID_LOCALITY" , i assumed its the normal distance but seems im wrong, im getting different number
    --> i get locality =16
    --> while distance:
    node distances:
    node 0 1
    0: 10 21
    1: 21 10

  • What is the flag/function to print distances through hwloc and get a result similar to "numactl --hardware"?

        node   0   1
          0:  10  21
          1:  21  10


bgoglin (Contributor) commented Oct 7, 2022

About exposing only local-CPU performance: I am trying to find the email in the Linux kernel archives but haven't found it yet. It dates from when Ross Zwisler posted the ACPI HMAT patches in 2017, but there are multiple versions of the patchset and many replies. See below for more details.

numactl --hardware is different: it's really a matrix of distances that comes from the ACPI SLIT table, which exposes relative latencies between cores of one NUMA node and memory of another node. ACPI HMAT is a modern table that improves on SLIT by defining "initiator" and "target" to better support CPU-less and memory-less nodes, adding bandwidth, etc. Both ACPI tables (when implemented by the platform) provide entire matrices with values between all pairs of nodes, but the difference is how the kernel exposes them. SLIT is exposed in /sys/devices/system/node/nodeX/distance, a single file per node with all distances to all other nodes. HMAT is exposed in a more complex manner because it defines the notion of best initiator, etc. Individual values are in /sys/devices/system/node/nodeX/accessY/initiators/{read,write}_{bandwidth,latency}. My understanding is that exposing all HMAT values in this hierarchy would create many sysfs subdirectories and files; that's why kernel developers didn't want to expose everything at first. I guess they should keep this hierarchy for the best initiator and just expose all raw values separately to avoid memory issues with sysfs files.

In practice ACPI SLIT is always implemented, but sometimes it's useless because it just says 10 for the local node and 20 for all remote nodes, without distinguishing close and far nodes. These values are exposed in hwloc "distances". ACPI HMAT is rather recent, basically only implemented since Ice Lake. Its values seem to be more accurately defined by vendors so far. Those values are exposed in hwloc memattrs. SLIT distances and HMAT latencies should be similar in theory (except that SLIT values are normalized to 10 for local access), but it's not guaranteed in practice. Also, SLIT contains more values since it exposes latencies between pairs of memory-less nodes (even though they don't make much sense, since we need "CPUs" to define that latency).

The LOCALITY attribute is defined in hwloc/memattrs.h as the number of PUs near a NUMA node, basically the weight of its cpuset. It's designed for platforms where different kinds of memory are attached at different levels, for instance one NVM node per machine, one DRAM node per socket, and one HBM node per half-socket (e.g. SNC). In this case the LOCALITY of NVM would be the number of PUs in the machine > LOCALITY of DRAM (PUs per socket) > LOCALITY of HBM (PUs per half-socket). It's not clear it'll be useful in practice, but it was easy to implement.

lstopo --distances -p would print the equivalent of numactl --hardware (but the ordering of nodes might be different since hwloc reorders nodes by logical index). It's implemented in https://github.com/open-mpi/hwloc/blob/master/utils/lstopo/lstopo-text.c#L218 which ends up calling https://github.com/open-mpi/hwloc/blob/master/utils/hwloc/misc.h#L382
To make things simpler in your specific case: call hwloc_distances_get_by_type() for NUMA nodes. In normal cases, you'll get a single distances structure in return (the ACPI SLIT latencies, identical to numactl --hardware). The distances structure contains an array of NUMA nodes and an array of latencies between them.
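
A hedged sketch of dumping that matrix (node ordering follows hwloc logical indexes, so it may differ from numactl):

/* Hedged sketch: print the NUMA distance matrix, similar to "numactl --hardware". */
#include <stdio.h>
#include <hwloc.h>
#include <hwloc/distances.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  struct hwloc_distances_s *dist;
  unsigned nr = 1, i, j;
  /* ask for one distances structure between NUMA nodes (usually ACPI SLIT) */
  if (hwloc_distances_get_by_type(topology, HWLOC_OBJ_NUMANODE,
                                  &nr, &dist, 0 /* any kind */, 0) < 0 || !nr) {
    printf("no NUMA distances found\n");
  } else {
    for(i = 0; i < dist->nbobjs; i++) {
      printf("node P#%u:", dist->objs[i]->os_index);
      for(j = 0; j < dist->nbobjs; j++)
        printf(" %4llu", (unsigned long long) dist->values[i * dist->nbobjs + j]);
      printf("\n");
    }
    hwloc_distances_release(topology, dist);
  }
  hwloc_topology_destroy(topology);
  return 0;
}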

aghozzo (Author) commented Oct 10, 2022

@bgoglin Thanks a lot for the detailed reply, much appreciated.

Does hwloc support allocation calls other than malloc (I mean calloc, realloc, memalign)? I use "hwloc_alloc_membind()" to do allocations on the best target from a specific initiator.

Is there an API for the other calls?

Or can I set the policy through "hwloc_set_membind" and then do normal malloc/calloc/realloc, so that allocations go to the best_node I set in "hwloc_set_membind"?

bgoglin (Contributor) commented Oct 14, 2022

Hello
hwloc does not have a fine-granularity allocator. We only allocate big chunks of pages and users are expected to manage smaller allocations inside those if needed. memkind and other libraries use jemalloc or ptmalloc inside such chunks to provide malloc-like APIs, but we have not decided to do the same in hwloc yet. The main reason is that many hwloc users are HPC runtimes, and many of those runtimes already have their own fine-granularity allocator.
Yes, setting the global policy with set_membind() might work in theory, but:

  • it's not very convenient if you have to switch from one policy to another between allocations
  • malloc() may reuse already allocated pages that were not entirely used or were freed recently. Those pages won't move according to set_membind() unless you request explicit migration, but this will migrate entire pages, including allocations that belong to other threads, etc.
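
A hedged sketch of the big-chunk approach: one hwloc_alloc_membind() buffer bound to the best-bandwidth node for a core, with smaller allocations carved out of it by the application (the 64 MiB size is an arbitrary example):

/* Hedged sketch: allocate one big buffer bound to the best-bandwidth NUMA node
 * for core #0, then carve smaller allocations out of it yourself. */
#include <stdio.h>
#include <hwloc.h>
#include <hwloc/memattrs.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
  struct hwloc_location initiator;
  initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
  initiator.location.cpuset = core->cpuset;

  hwloc_obj_t best = NULL;
  hwloc_uint64_t value;
  hwloc_memattr_get_best_target(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                                &initiator, 0, &best, &value);
  if (best) {
    size_t len = 64UL * 1024 * 1024; /* one big chunk, not a malloc() replacement */
    void *buf = hwloc_alloc_membind(topology, len, best->nodeset,
                                    HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
    if (buf) {
      printf("allocated %p on NUMA node L#%u\n", buf, best->logical_index);
      hwloc_free(topology, buf, len);
    }
  }
  hwloc_topology_destroy(topology);
  return 0;
}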
