
Sample code to show how hwloc can be used to print latency/bandwidth for initiators and targets #542

Closed
aghozzo opened this issue Aug 18, 2022 · 16 comments


aghozzo commented Aug 18, 2022

I couldn't find any sample C code that shows how to extract HMAT table attributes, or which APIs can be called directly from a hello-world C program.

Can someone please provide a pointer on how to run and test sample code that retrieves this information, or point me to existing examples?

Thanks

bgoglin (Contributor) commented Aug 19, 2022

Hello,
Bandwidth and latency are part of the "memory attributes" functionality. "lstopo --memattrs" will show bandwidth/latency on the command line if available (likely not available unless you have an Ice Lake platform).
The API is in hwloc/memattrs.h; most of the current documentation is in the comments there (and a little bit in https://www.open-mpi.org/projects/hwloc/doc/v2.8.0/a00366.php#topoattrs_memattrs).
There's an example in tests/hwloc/memattrs.c (but it's quite complicated because it both queries AND modifies memattrs).
I'll try to add a simple example under the doc directory.
Brice

bgoglin added a commit to bgoglin/hwloc that referenced this issue Aug 19, 2022
Closes open-mpi#542

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
bgoglin (Contributor) commented Aug 19, 2022

Would this example help? https://github.com/bgoglin/hwloc/blob/master/doc/examples/memory-attributes.c

It shows this on my laptop (no heterogeneous memory, no HMAT):

$ doc/examples/memory-attributes         
There are 1 NUMA nodes
Core L#0 cpuset = 0x00000005
Found 1 local NUMA nodes
NUMA node L#0 P#0 is local to core L#0
  bandwidth is unknown
  latency is unknown
Couldn't find best NUMA node for bandwidth to core L#0

And this on a machine with 2 kinds of memory and HMAT:

$ doc/examples/memory-attributes
There are 8 NUMA nodes
Core L#0 cpuset = 0x00000001
Found 2 local NUMA nodes
NUMA node L#0 P#0 is local to core L#0
  bandwidth = 62100 MiB/s
  latency = 191 ns
NUMA node L#1 P#4 is local to core L#0
  bandwidth = 69100 MiB/s
  latency = 227 ns
Best bandwidth NUMA node for core L#0 is L#1 P#4
Allocated buffer 0x560991f25b30
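
For reference, a hedged sketch of the core of such a query with the hwloc/memattrs.h API (not the actual memory-attributes.c source; assumes hwloc >= 2.3, error handling mostly omitted):

/* Hedged sketch: query bandwidth/latency of the NUMA nodes local to the
 * first core; not the actual memory-attributes.c example. */
#include <stdio.h>
#include <hwloc.h>
#include <hwloc/memattrs.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  /* use the cpuset of the first core as the initiator */
  hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
  struct hwloc_location initiator;
  initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
  initiator.location.cpuset = core->cpuset;

  /* list NUMA nodes local to that core; LARGER_LOCALITY also accepts nodes
   * whose locality is wider than the core (e.g. the whole package) */
  hwloc_obj_t nodes[16];
  unsigned i, nr = 16;
  hwloc_get_local_numanode_objs(topology, &initiator, &nr, nodes,
                                HWLOC_LOCAL_NUMANODE_FLAG_LARGER_LOCALITY);
  if (nr > 16)
    nr = 16; /* clamp in case more nodes were reported than the array holds */

  for(i = 0; i < nr; i++) {
    hwloc_uint64_t bw, lat;
    if (hwloc_memattr_get_value(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                                nodes[i], &initiator, 0, &bw) == 0)
      printf("NUMA node L#%u: bandwidth = %llu MiB/s\n",
             nodes[i]->logical_index, (unsigned long long) bw);
    if (hwloc_memattr_get_value(topology, HWLOC_MEMATTR_ID_LATENCY,
                                nodes[i], &initiator, 0, &lat) == 0)
      printf("NUMA node L#%u: latency = %llu ns\n",
             nodes[i]->logical_index, (unsigned long long) lat);
  }

  hwloc_topology_destroy(topology);
  return 0;
}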

aghozzo (Author) commented Aug 19, 2022

Hi Brice,

First of all, let me thank you for the quick reply and code. Excellent support, much appreciated.
The example is a great start. I will run a few test cases and check.

Please keep the issue open for a week or so so I can add feedback, or if there is a different area for Q&A discussion we can use that.

Again, thanks a bunch.

bgoglin (Contributor) commented Aug 22, 2022

This issue will be auto-closed by GitHub when I commit the new example to the repository, but feel free to ask other questions here; I'll be notified.

bgoglin added a commit that referenced this issue Aug 22, 2022
Closes #542

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
(cherry picked from commit 8d81da3)
aghozzo (Author) commented Aug 25, 2022

Hi

Does hwloc work in a QEMU environment set up with HMAT?

ls /sys/firmware/acpi/tables
APIC CEDT DSDT FACP FACS HMAT HPET MCFG SRAT WAET data dynamic

lstopo
Machine (5955MB total)
  Package L#0
    NUMANode L#0 (P#0 981MB)
    L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  Package L#1
    NUMANode L#1 (P#1 1952MB)
    L3 L#1 (16MB) + L2 L#1 (4096KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
  Group0 L#0
    NUMANode L#2 (P#2 3023MB)

When I run the sample:

There are 3 NUMA nodes
Core L#0 cpuset = 0x00000001
Found 1 local NUMA nodes
NUMA node L#0 P#0 (subtype (null)) is local to core L#0
  bandwidth is unknown
  latency is unknown
Couldn't find best NUMA node for bandwidth to core L#0

bgoglin (Contributor) commented Aug 25, 2022

Yes, I use QEMU quite intensively for simulating strange memory subsystems, but it doesn't populate HMAT with anything useful by default. There's an example of what I've been using in the "HMAT performance attributes" section of https://github.com/open-mpi/hwloc/wiki/Simulating-complex-memory-with-Qemu
As you can see, you have to manually define the bandwidth and/or latency of every (CPU initiator, memory target) pair, which is quite tedious. The QEMU patch cited at the end of that section was only applied upstream recently; you don't have it in your QEMU, hence you must pass initiator=0 or 1 after memdev=ram2.
If you don't want to set all this up, I can provide an XML topology that hwloc can load at runtime instead of reading the platform topology. Something like this:

$ HWLOC_XMLFILE=/path/to/hwloc/git/tests/hwloc/xml/64intel64-fakeKNL-SNC4-hybrid.xml doc/examples/memory-attributes
There are 8 NUMA nodes
Core L#0 cpuset = 0x00010001,0x00010001
Found 2 local NUMA nodes
NUMA node L#0 P#0 (subtype (null)) is local to core L#0
  bandwidth = 22500 MiB/s
  latency is unknown
NUMA node L#1 P#7 (subtype MCDRAM) is local to core L#0
  bandwidth = 90000 MiB/s
  latency is unknown
Best bandwidth NUMA node for core L#0 is L#1 P#7
Allocated buffer 0x7f761b67f010 on best node
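
For reference, the same kind of XML can also be selected programmatically instead of via the HWLOC_XMLFILE environment variable; a hedged sketch (the path below is only an example):

/* Hedged sketch: load a topology from an XML file programmatically
 * instead of exporting HWLOC_XMLFILE; the path is only an example. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  /* must be called between hwloc_topology_init() and hwloc_topology_load() */
  if (hwloc_topology_set_xml(topology,
        "tests/hwloc/xml/64intel64-fakeKNL-SNC4-hybrid.xml") < 0)
    perror("hwloc_topology_set_xml");
  hwloc_topology_load(topology);
  printf("%d NUMA nodes\n",
         hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE));
  hwloc_topology_destroy(topology);
  return 0;
}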

aghozzo (Author) commented Aug 25, 2022

Hi Brice,

The XML file idea is brilliant.
I was able to re-create the example you just provided (although I didn't get latency values like you did).

Just 2 more questions:
Is there a fake XML file that shows latency and bandwidth?
Is there any documentation/explanation of these XML files, as guidance on what they contain, instead of going through and reading each one?

bgoglin (Contributor) commented Aug 26, 2022

I don't have an XML file with latency in the repo, but adding latencies isn't very hard:

  • modify the XML manually: duplicate the Bandwidth memattr section at the end, rename it to Latency, and change the values.
  • or use hwloc-annotate to add individual latencies: something like hwloc-annotate input.xml output.xml -- numa:1 -- memattr Latency package:0 100.
    There's a long example of customizing an XML topology in https://www.open-mpi.org/projects/hwloc/doc/v2.8.0/a00373.php#faq_create_asymmetric.
    There's no documentation for the XML file contents themselves; the format was not meant to be manually modified (even though I do it very often).

aghozzo (Author) commented Aug 29, 2022

Hi,

Thanks for the reply, much appreciated.

I tried to modify the XML file:

  • I tried replacing/renaming the word "Bandwidth" with "Latency"; it didn't work.
  • I tried duplicating the last section and changing the values and the name to Latency; it didn't work.

When I changed the flags to "6", I started seeing the Latency values :) Thanks!

bgoglin (Contributor) commented Aug 29, 2022

Ah sorry, I forgot that you also need to change the "flags" after Bandwidth. For Latency it's 6 instead of 5 (because the best latency is the lowest value, while the best bandwidth is the highest).
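
For reference, those numbers appear to come straight from the flag bits declared in hwloc/memattrs.h; a short illustration (assuming the standard hwloc 2.x flag values):

#include <hwloc.h>
#include <hwloc/memattrs.h>

/* 5 = HWLOC_MEMATTR_FLAG_HIGHER_FIRST (1) | HWLOC_MEMATTR_FLAG_NEED_INITIATOR (4):
 *     suitable for Bandwidth, where the best value is the highest.
 * 6 = HWLOC_MEMATTR_FLAG_LOWER_FIRST (2)  | HWLOC_MEMATTR_FLAG_NEED_INITIATOR (4):
 *     suitable for Latency, where the best value is the lowest. */
static const unsigned long bandwidth_flags =
  HWLOC_MEMATTR_FLAG_HIGHER_FIRST | HWLOC_MEMATTR_FLAG_NEED_INITIATOR; /* 5 */
static const unsigned long latency_flags =
  HWLOC_MEMATTR_FLAG_LOWER_FIRST | HWLOC_MEMATTR_FLAG_NEED_INITIATOR;  /* 6 */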

aghozzo (Author) commented Oct 4, 2022

@bgoglin

Hi,

I have 2 questions:

  1. Is there a mechanism to show the initiators/targets matrix (for example bandwidth/latency or distance from any core to ALL NUMA nodes)?

hwloc_get_local_numanode_objs(topology, &initiator, &numa_nodes, nodes, HWLOC_LOCAL_NUMANODE_FLAG_ALL);
Even though I pass FLAG_ALL, I still only get attributes for local NUMA nodes, not the ones outside the package (if that is the right term to use).

I'd like to know the penalty, in terms of these attributes, for a CPU core accessing memory outside of its local socket/package (something like the NUMA node's local latency plus the latency to go off-socket to access the memory).

  2. How does the system deal with CPU-less NUMA nodes (memory-extension NUMA nodes)? Are those considered local to all CPU cores?

bgoglin (Contributor) commented Oct 5, 2022

@aghozzo For question 1, the mechanism already exists in hwloc, and in the ACPI HMAT table in hardware. Unfortunately the Linux kernel currently only exposes latency/bandwidth information for local CPUs (kernel developers were afraid of the matrices being too huge in memory on future machines :/). So hwloc can give you the information, but it cannot get it from Linux. If you give the info to hwloc (either through XML or through the memattrs API), it'll be able to return it to you later.

For question 2, it depends on what the ACPI tables expose. If they don't expose any locality information, the CPU-less node is attached to the root object at the top of the hwloc topology and appears local to all cores. If ACPI HMAT says the best initiator is CPU0, hwloc will attach the node to that CPU. In practice, HBM and NVDIMM have no local CPUs in hardware (because CPUs are rather local to DRAM), but modern platforms are supposed to make their local CPUs explicit in the HMAT (hence hwloc will expose them as local to some CPUs).
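
A hedged sketch of feeding a value in through the memattrs API so that later queries can return it (the core/node choice and the 42000 MiB/s number are made-up examples):

/* Hedged sketch: manually record a bandwidth value so that later
 * hwloc_memattr_get_value()/get_best_target() calls can return it.
 * The core/node indexes and the 42000 MiB/s value are arbitrary examples. */
#include <hwloc.h>
#include <hwloc/memattrs.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
  hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, 0);

  struct hwloc_location initiator;
  initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
  initiator.location.cpuset = core->cpuset;

  /* claim that core #0 sees 42000 MiB/s of bandwidth to NUMA node #0 */
  hwloc_memattr_set_value(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                          node, &initiator, 0, 42000);

  hwloc_topology_destroy(topology);
  return 0;
}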

aghozzo (Author) commented Oct 6, 2022

@bgoglin

As always, thanks for the quick reply, I really appreciate it.

  • "Unfortunately the Linux kernel currently only exposes latency/bandwidth information for local CPUs (kernel developers were afraid of the matrices being too huge in memory on future machine :/)"

--> is there a reference for this , I like to study more on this subject.
--> Distance seems to be calculated across all nodes in 2D matrix "numactl --hardware" , so the assumption here is distance/capacity is exposed for all but bandwidth/latency are not (only for local) ?

  • what is "HWLOC_MEMATTR_ID_LOCALITY" , i assumed its the normal distance but seems im wrong, im getting different number
    --> i get locality =16
    --> while distance:
    node distances:
    node 0 1
    0: 10 21
    1: 21 10

  • What is the flag/function to print distances through hwloc and get a result similar to "numactl --hardware"?

        node   0   1
          0:  10  21
          1:  21  10


bgoglin (Contributor) commented Oct 7, 2022

About exposing only local-CPU performance: I am trying to find the email in the Linux kernel archives but haven't found it yet. It dates from when Ross Zwisler posted the ACPI HMAT patches in 2017, but there are multiple versions of the patchset and many replies. See below for more details.

numactl --hardware is different: it's really a matrix of distances that comes from the ACPI SLIT table, which exposes relative latencies between cores of one NUMA node and memory of another node. ACPI HMAT is a modern table that improves on SLIT by defining "initiator" and "target" to better support CPU-less and memory-less nodes, adding bandwidth, etc. Both ACPI tables (when implemented by the platform) provide entire matrices with values between all pairs of nodes, but the difference is how the kernel exposes them. SLIT is exposed in /sys/devices/system/node/nodeX/distance, a single file per node with all distances to all other nodes. HMAT is exposed in a more complex manner because it defines the notion of best initiator, etc. Individual values are in /sys/devices/system/node/nodeX/accessY/initiators/{read,write}_{bandwidth,latency}. My understanding is that exposing all HMAT values in this hierarchy would create many sysfs subdirectories and files; that's why kernel developers didn't want to expose everything at first. I guess they should keep this hierarchy for the best initiator and just expose all raw values separately to avoid memory issues with sysfs files.

In practice ACPI SLIT is always implemented, but sometimes it's useless because it just says 10 for the local node and 20 for all remote nodes, without distinguishing close and far nodes. These values are exposed in hwloc "distances". ACPI HMAT is rather recent, basically only implemented since Ice Lake. Its values seem to be more accurately defined by vendors so far. Those values are exposed in hwloc memattrs. SLIT distances and HMAT latencies should be similar in theory (except that SLIT values are normalized to 10 for local access), but it's not guaranteed in practice. Also, SLIT contains more values since it exposes latencies between pairs of memory-less nodes (even though they don't make much sense, since we need "CPUs" to define that latency).

The LOCALITY attribute is defined in hwloc/memattrs.h as the number of PUs near a NUMA node, basically the weight of its cpuset. It's designed for platforms where different kinds of memory are attached at different levels, for instance one NVM node per machine, one DRAM node per socket, and one HBM node per half-socket (e.g. SNC). In this case the LOCALITY of NVM would be the number of PUs in the machine > LOCALITY of DRAM (PUs per socket) > LOCALITY of HBM (PUs per half-socket). It's not clear it'll be useful in practice, but it was easy to implement.

lstopo --distances -p would print the equivalent of numactl --hardware (but the ordering of nodes might be different since hwloc reorders nodes by logical index). It's implemented in https://github.com/open-mpi/hwloc/blob/master/utils/lstopo/lstopo-text.c#L218 which ends up calling https://github.com/open-mpi/hwloc/blob/master/utils/hwloc/misc.h#L382
To make things simpler in your specific case: call hwloc_distances_get_by_type() for NUMA nodes. In normal cases, you'll get a single distances structure in return (the ACPI SLIT latencies, identical to numactl --hardware). The distances structure contains an array of NUMA nodes and an array of latencies between them.
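
A hedged sketch of dumping that matrix (node ordering follows hwloc logical indexes, so it may differ from numactl):

/* Hedged sketch: print the NUMA distance matrix, similar to "numactl --hardware". */
#include <stdio.h>
#include <hwloc.h>
#include <hwloc/distances.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  struct hwloc_distances_s *dist;
  unsigned nr = 1, i, j;
  /* ask for one distances structure between NUMA nodes (usually ACPI SLIT) */
  if (hwloc_distances_get_by_type(topology, HWLOC_OBJ_NUMANODE,
                                  &nr, &dist, 0 /* any kind */, 0) < 0 || !nr) {
    printf("no NUMA distances found\n");
  } else {
    for(i = 0; i < dist->nbobjs; i++) {
      printf("node P#%u:", dist->objs[i]->os_index);
      for(j = 0; j < dist->nbobjs; j++)
        printf(" %4llu", (unsigned long long) dist->values[i * dist->nbobjs + j]);
      printf("\n");
    }
    hwloc_distances_release(topology, dist);
  }
  hwloc_topology_destroy(topology);
  return 0;
}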

aghozzo (Author) commented Oct 10, 2022

@bgoglin Thanks a lot for the detailed reply, much appreciated.

Does hwloc support allocation calls other than malloc (I mean calloc, realloc, memalign)? I use "hwloc_alloc_membind()" to do allocations on the best target from a specific initiator.

Is there an API for the other calls?

Or can I set the policy through "hwloc_set_membind" and then do normal malloc/calloc/realloc, so that allocations go to the best_node I set in "hwloc_set_membind"?

bgoglin (Contributor) commented Oct 14, 2022

Hello
hwloc does not have a fine-granularity allocator. We only allocate big chunks of pages and users are expected to manage smaller allocations inside those if needed. memkind and other libraries use jemalloc or ptmalloc inside such chunks to provide malloc-like APIs, but we have not decided to do the same in hwloc yet. The main reason is that many hwloc users are HPC runtimes, and many of those runtimes already have their own fine-granularity allocator.
Yes, setting the global policy with set_membind() might work in theory, but:

  • it's not very convenient if you have to switch from one policy to another between allocations
  • malloc() may reuse already allocated pages that were not entirely used or were freed recently. Those pages won't move according to set_membind() unless you request explicit migration, but this will migrate entire pages, including allocations that belong to other threads, etc.
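
A hedged sketch of the big-chunk approach: one hwloc_alloc_membind() buffer bound to the best-bandwidth node for a core, with smaller allocations carved out of it by the application (the 64 MiB size is an arbitrary example):

/* Hedged sketch: allocate one big buffer bound to the best-bandwidth NUMA node
 * for core #0, then carve smaller allocations out of it yourself. */
#include <stdio.h>
#include <hwloc.h>
#include <hwloc/memattrs.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
  struct hwloc_location initiator;
  initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
  initiator.location.cpuset = core->cpuset;

  hwloc_obj_t best = NULL;
  hwloc_uint64_t value;
  hwloc_memattr_get_best_target(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                                &initiator, 0, &best, &value);
  if (best) {
    size_t len = 64UL * 1024 * 1024; /* one big chunk, not a malloc() replacement */
    void *buf = hwloc_alloc_membind(topology, len, best->nodeset,
                                    HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
    if (buf) {
      printf("allocated %p on NUMA node L#%u\n", buf, best->logical_index);
      hwloc_free(topology, buf, len);
    }
  }
  hwloc_topology_destroy(topology);
  return 0;
}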
