Region created on socket #0 reports numa_node #1 #235

Closed
tanabarr opened this issue Mar 9, 2023 · 15 comments

tanabarr commented Mar 9, 2023

After creating PMem regions with ipmctl, the numa_node that ndctl reports for a region doesn't match the socket ID that ipmctl reports for it: the ISetID that ipmctl shows for RegionID 0x0001 (SocketID 0x0000) matches the iset_id of ndctl's region0, yet that region reports numa_node 1.

$ sudo ipmctl show -o nvmxml -region
<?xml version="1.0"?>
 <RegionList>
  <Region>
   <SocketID>0x0000</SocketID>
   <PersistentMemoryType>AppDirect</PersistentMemoryType>
   <Capacity>1008.000 GiB</Capacity>
   <FreeCapacity>1008.000 GiB</FreeCapacity>
   <HealthState>Healthy</HealthState>
   <DimmID>0x0001, 0x0011, 0x0101, 0x0111, 0x0201, 0x0211, 0x0301, 0x0311</DimmID>
   <RegionID>0x0001</RegionID>
   <ISetID>0x04a32120b4fe1110</ISetID>
  </Region>
  <Region>
   <SocketID>0x0001</SocketID>
   <PersistentMemoryType>AppDirect</PersistentMemoryType>
   <Capacity>1008.000 GiB</Capacity>
   <FreeCapacity>1008.000 GiB</FreeCapacity>
   <HealthState>Healthy</HealthState>
   <DimmID>0x1001, 0x1011, 0x1101, 0x1111, 0x1201, 0x1211, 0x1301, 0x1311</DimmID>
   <RegionID>0x0002</RegionID>
   <ISetID>0x3a7b2120bb081110</ISetID>
  </Region>
 </RegionList>
$ sudo ndctl list -Rv
[
  {
    "dev":"region1",
    "size":1082331758592,
    "align":16777216,
    "available_size":1082331758592,
    "max_available_extent":1082331758592,
    "type":"pmem",
    "numa_node":0,
    "target_node":3,
    "iset_id":4213998300795769104,
    "persistence_domain":"memory_controller"
  },
  {
    "dev":"region0",
    "size":1082331758592,
    "align":16777216,
    "available_size":1082331758592,
    "max_available_extent":1082331758592,
    "type":"pmem",
    "numa_node":1,
    "target_node":2,
    "iset_id":334147221714768144,
    "persistence_domain":"memory_controller"
  }
]

I am confused as to why the numa_node doesn't match the socket ID; can someone help me understand, please?

OS: Rocky Linux 8.6

Kernel: $ uname -a
Linux 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Thu Oct 27 15:18:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Optane + IceLake platform
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
stepping : 6
microcode : 0xd000389


tanabarr commented Mar 9, 2023

The numa_node of the created block device doesn't match the socket ID reported by ipmctl:

[tanabarr@wolf-226 daos]$ cat /sys/class/block/pmem1/device/numa_node
0
[tanabarr@wolf-226 daos]$ sudo ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity     | FreeCapacity | HealthState
==================================================================================================
 0x0000   | 0x04a32120b4fe1110 | AppDirect            | 1008.000 GiB | 1008.000 GiB | Pending
 0x0001   | 0x3a7b2120bb081110 | AppDirect            | 1008.000 GiB | 0.000 GiB    | Healthy
[tanabarr@wolf-226 daos]$ ls -lah /dev/pmem*
brw-rw---- 1 root disk 259, 16 Mar  9 15:43 /dev/pmem1
brw-rw---- 1 root disk 259, 17 Mar  9 15:43 /dev/pmem1.1
brw-rw---- 1 root disk 259, 18 Mar  9 15:43 /dev/pmem1.2
brw-rw---- 1 root disk 259, 19 Mar  9 15:44 /dev/pmem1.3


tanabarr commented Mar 9, 2023

More detail:

[tanabarr@wolf-226 daos]$ ls -lah /dev/pmem*
brw-rw---- 1 root disk 259, 20 Mar  9 16:03 /dev/pmem0
brw-rw---- 1 root disk 259, 21 Mar  9 16:04 /dev/pmem0.1
brw-rw---- 1 root disk 259, 22 Mar  9 16:04 /dev/pmem0.2
brw-rw---- 1 root disk 259, 23 Mar  9 16:05 /dev/pmem0.3
brw-rw---- 1 root disk 259, 16 Mar  9 15:43 /dev/pmem1
brw-rw---- 1 root disk 259, 17 Mar  9 15:43 /dev/pmem1.1
brw-rw---- 1 root disk 259, 18 Mar  9 15:43 /dev/pmem1.2
brw-rw---- 1 root disk 259, 19 Mar  9 15:44 /dev/pmem1.3
[tanabarr@wolf-226 daos]$ cat /sys/class/block/pmem1*/device/numa_node
0
0
0
0
[tanabarr@wolf-226 daos]$ cat /sys/class/block/pmem0*/device/numa_node
1
1
1
1
[tanabarr@wolf-226 daos]$ sudo ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity     | FreeCapacity | HealthState
==================================================================================================
 0x0000   | 0x04a32120b4fe1110 | AppDirect            | 1008.000 GiB | 0.000 GiB    | Healthy
 0x0001   | 0x3a7b2120bb081110 | AppDirect            | 1008.000 GiB | 0.000 GiB    | Healthy

My expectation was that a region's SocketID (the region being uniquely identified by its ISetID) should equal its NUMA node ID.

[tanabarr@wolf-226 daos]$ sudo ndctl list -R
[
  {
    "dev":"region1",
    "size":1082331758592,
    "align":16777216,
    "available_size":0,
    "max_available_extent":0,
    "type":"pmem",
    "iset_id":4213998300795769104,
    "persistence_domain":"memory_controller"
  },
  {
    "dev":"region0",
    "size":1082331758592,
    "align":16777216,
    "available_size":0,
    "max_available_extent":0,
    "type":"pmem",
    "iset_id":334147221714768144,
    "persistence_domain":"memory_controller"
  }
]

region0 iset_id (334147221714768144 == 0x4A32120B4FE1110) matches with ipmctl region on socket 0
region1 iset_id (4213998300795769104 == 0x3A7B2120BB081110) matches with ipmctl region on socket 1
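For reference, the decimal iset_id that ndctl prints can be converted to the hex ISetID format that ipmctl uses with a shell one-liner, e.g.:

$ printf '0x%016x\n' 334147221714768144
0x04a32120b4fe1110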

This doesn't correlate as namespaces on region0 are reportedly on numa_node 1:

[tanabarr@wolf-226 daos]$ sudo ndctl list -Rv -r 0
{
  "regions":[
    {
      "dev":"region0",
      "size":1082331758592,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "numa_node":1,
      "target_node":2,
      "iset_id":334147221714768144,
      "persistence_domain":"memory_controller",
      "namespaces":[
        {
          "dev":"namespace0.2",
          "mode":"fsdax",
          "map":"dev",
          "size":266352984064,
          "uuid":"8676b101-3035-4e07-9ccc-a1a4dcab915a",
          "raw_uuid":"987ac22c-20e8-41eb-90d3-1a9d9e4bd0a5",
          "sector_size":512,
          "align":2097152,
          "blockdev":"pmem0.2",
          "numa_node":1,
          "target_node":2
        },
        ...


tanabarr commented Mar 9, 2023

[tanabarr@wolf-226 daos]$ hwloc-ls
Machine (251GB total)
  Package L#0
    NUMANode L#0 (P#0 125GB)
...
    Block(NVDIMM) "pmem1.2"
    Block(NVDIMM) "pmem1"
    Block(NVDIMM) "pmem1.3"
    Block(NVDIMM) "pmem1.1"
  Package L#1
    NUMANode L#1 (P#1 126GB)
...
    Block(NVDIMM) "pmem0.1"
    Block(NVDIMM) "pmem0.2"
    Block(NVDIMM) "pmem0.3"
    Block(NVDIMM) "pmem0"

tanabarr changed the title from "numa_node inconsistent on region" to "region created on socket #0 has numa_node #1" on Mar 9, 2023
tanabarr changed the title from "region created on socket #0 has numa_node #1" to "Region created on socket #0 reports numa_node #1" on Mar 9, 2023
@StevenPontsler

This sounds like it might be the same as issue intel/ipmctl#156, which was closed as a defect in ndctl.

What version of ndctl are you using? Please verify that the issue has been fixed in that version.

If you are using a version of ndctl that has the issue fixed:
What version of ipmctl are you using? (ipmctl version)
And please provide a dump of the tables. (ipmctl show -system [PCAT|NFIT|PMTT])


tanabarr commented Mar 9, 2023

[tanabarr@wolf-226 daos]$ ndctl --version
71.1
[tanabarr@wolf-226 daos]$ sudo ipmctl version
Intel(R) Optane(TM) Persistent Memory Command Line Interface Version 03.00.00.0468
[tanabarr@wolf-226 daos]$ sudo ipmctl show -system NFIT| grep ProximityDomain
      ProximityDomain: 0x3
      ProximityDomain: 0x5

Thanks for the reply @StevenPontsler. I don't think it's intel/ipmctl#156, as I've been following that ticket and have worked on related issues with @sscargal and @nolanhergert. BIOS info below:

$ dmidecode -t baseboard

# dmidecode 3.3                                                                                                                                                                                   
Getting SMBIOS data from sysfs.
SMBIOS 3.4 present.
# SMBIOS implementations newer than version 3.3.0 are not
# fully supported by this version of dmidecode.

Handle 0x00C8, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: HPE
        Product Name: ProLiant DL380 Gen10 Plus
        Version: Not Specified
        Serial Number: PZCRC0CRHG00IE
        Asset Tag:
        Features:
                Board is a hosting board
                Board is removable
                Board is replaceable
        Location In Chassis: Not Specified
        Chassis Handle: 0x0000
        Type: Motherboard
        Contained Object Handles: 0

Handle 0x00CB, DMI type 41, 11 bytes
Onboard Device
        Reference Designation: Embedded SATA Controller #1
        Type: SATA Controller
        Status: Enabled
        Type Instance: 1
        Bus Address: 0000:00:17.0

$ dmidecode -t bios

Handle 0x0002, DMI type 0, 26 bytes
BIOS Information
        Vendor: HPE
        Version: U46
        Release Date: 02/02/2023
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 64 MB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                5.25"/360 kB floppy services are supported (int 13h)
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 1.72
        Firmware Revision: 2.63


tanabarr commented Mar 9, 2023

One interesting thing: if I create just one region with ipmctl, the SocketID reported by ipmctl matches the ndctl region's numa_node:

[tanabarr@wolf-226 ~]$ sudo ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity     | FreeCapacity | HealthState
==================================================================================================
 0x0000   | 0x04a32120b4fe1110 | AppDirect            | 1008.000 GiB | 1008.000 GiB | Healthy
[tanabarr@wolf-226 ~]$ sudo ndctl list -Rv
[
  {
    "dev":"region0",
    "size":1082331758592,
    "align":16777216,
    "available_size":1082331758592,
    "max_available_extent":1082331758592,
    "type":"pmem",
    "numa_node":0,
    "target_node":2,
    "iset_id":334147221714768144,
    "persistence_domain":"memory_controller"
  }
]

and when I create a second region on --socket 1, the numa_node field switches as mentioned previously.

@tanabarr

Tested on Alma Linux 8.7 and saw the same issue:

[tanabarr@wolf-226 daos]$ sudo ndctl list -Rv
[
  {
    "dev":"region0",
    "size":1082331758592,
    "align":16777216,
    "available_size":1082331758592,
    "max_available_extent":1082331758592,
    "type":"pmem",
    "numa_node":0,
    "target_node":2,
    "iset_id":4213998300795769104,
    "persistence_domain":"memory_controller"
  }
]
[tanabarr@wolf-226 daos]$ sudo ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity     | FreeCapacity | HealthState
==================================================================================================
 0x0001   | 0x3a7b2120bb081110 | AppDirect            | 1008.000 GiB | 1008.000 GiB | Healthy
[tanabarr@wolf-226 daos]$


djbw commented Mar 15, 2023

A couple of things:

  1. SocketID != Linux NUMA node
  2. Linux determines NUMA nodes by translating ACPI proximity domains; the method that ipmctl uses to determine SocketID is non-architectural, i.e. Linux can't reliably use that method and must depend on the ACPI information.

The ndctl tool only reports the result of the kernel's translation of the ACPI proximity domain, and that logic is the following in the ACPI NFIT driver:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/acpi/nfit/core.c#n2624

...so ultimately it is up to whatever values your BIOS is putting in the ACPI NFIT table.
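A quick way to cross-check this (a sketch, assuming the standard libnvdimm sysfs layout and that your kernel exposes these attributes) is to compare the node the kernel assigned to each region with the proximity domains the BIOS advertises in the NFIT:

# Kernel's view of each pmem region (what ndctl reports):
$ cat /sys/bus/nd/devices/region*/numa_node
$ cat /sys/bus/nd/devices/region*/target_node
# Proximity domains the BIOS put in the NFIT (what the kernel translates):
$ sudo ipmctl show -system NFIT | grep -i ProximityDomain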

@tanabarr

Thanks a lot for the response. It was the incorrect documentation in ipmctl that was throwing me off, so this ticket looks like a duplicate of intel/ipmctl#89. From the perspective of an application (DAOS) that is attempting to automate creation of PMem namespaces on a specific NUMA node, how should I correlate the SocketID in ipmctl with a NUMA node ID?
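One possible approach (a sketch, not something settled in this thread, and it assumes the platform's ACPI tables are correct, which turned out not to be the case here) is to join ipmctl's ISetID with ndctl's iset_id (using the decimal-to-hex conversion shown earlier), take the numa_node from ndctl/the kernel rather than from SocketID, and then read the CPUs local to that node from sysfs:

# Region -> numa_node/iset_id as the kernel sees them:
$ sudo ndctl list -Rv | grep -E '"dev"|"numa_node"|"iset_id"'
# CPUs the kernel considers local to a given node (node 1 used purely as an example):
$ cat /sys/devices/system/node/node1/cpulist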

@tanabarr

So, given that ipmctl doesn't directly show the NUMA node ID mapped to the socket ID, should the application derive this from the proximity domain itself? The lstopo output looks reasonable/unsurprising (attached).
wolf-226
wolf-226_ipmctl_show_-system_2numa.log


djbw commented Mar 16, 2023

If the goal is to map the closest collection of CPUs to the given memory-only NUMA node, then that data comes from the ACPI SLIT table. The kernel is reporting that the CPU proximity domain closest to the PMEM proximity domain '2' is proximity domain '0'. Note that the kernel will also report '0' if the SLIT says that CPU node '1' is of equal distance to '2'. So you would need to dump the SLIT to debug further why it says that '0' is the closest initiator.
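Before decoding the raw SLIT, one can also look at the kernel's derived view of the node distances (a sketch; numactl is assumed to be installed for the second command):

# Per-node distance vectors the kernel derived from the SLIT:
$ cat /sys/devices/system/node/node*/distance
# Summary view, including CPU-less (memory-only) nodes:
$ numactl -H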

@tanabarr

Thanks for that info. Not that I was expecting it to change, but for other reasons I updated to ndctl v76.1 on Leap 15.4 and the issue remains. I will try to find time to dig into the ACPI SLIT table as per https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html#system-locality-information-table-slit . I was previously looking into the SRAT table.


djbw commented Mar 17, 2023

I was previously looking into the SRAT table.

SRAT lists the proximity domains of CPUs and address ranges, and then SLIT gives the relative "distance" between those domains. Typically socket-local DDR and CPUs share the same proximity domain, but any other memory type will have its own proximity domain. Then it is up to SLIT to say how far that memory-only proximity domain is from a given CPU proximity domain.
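For reference, a sketch of how to dump and decode both tables, assuming the ACPICA tools (acpidump, acpixtract, iasl) are installed:

$ sudo acpidump -n SRAT -o srat.out && acpixtract -a srat.out && iasl -d srat.dat   # writes srat.dsl
$ sudo acpidump -n SLIT -o slit.out && acpixtract -a slit.out && iasl -d slit.dat   # writes slit.dsl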

@tanabarr

This seems to have been identified as a platform bug, from Kevan Rehm @ HPE:

This is way more information than you want, but the problem is that the Proximity Domains reported by the BIOS as part of the SLIT table were incorrect; the values for CPU0, CPU1, PMEM-socket0 and PMEM-socket1 are respectively 0, 1, 3, 5. The kernel code that is responsible for mapping proximity domains to numa nodes can't handle a list with missing numbers in the middle. In these DL-3xx servers, values 2 and 4 are missing, and so the mapping to numa node is incorrect. The BIOS coming out in May changed the Proximity Domains to 0, 1, 2, 3 with no gaps, and then the kernel numa mapping does the right thing. They sent me a special BIOS which I tried, and the fix works. Note below that you can also change a BIOS option to use Shared NUMA Domains instead of Isolated NUMA Domains, in which case the proximity domains become just 0 (for CPU0 and PMEM0) and 1 (for CPU1 and PMEM1), so the problem goes away. You can use that setting if you don't want to wait for the new BIOS.

The kernel setting I am using is “Isolated NUMA Domains”.  
 
As an aside, I ran a test using “Shared NUMA Domains”, and this makes the problem disappear, so I think I can use this as a workaround.
 
In looking at this more deeply, I believe the problem is actually in kernel code, depending upon what the rules are for proximity domains. For the DL-380, the four proximity domains are 0 (CPU in socket 0), 1 (CPU in socket 1), 3 (PMEM in socket 0), and 5 (PMEM in socket 1). Is it valid for proximity domains to have non-contiguous numbers, i.e. is it valid that there are no proximity domains 2 or 4? Because this is the root of the problem.
 
Here is the SLIT table.  If the columns are proximity domains 0, 1, 3, 5, then the values look correct.
[024h 0036   8]                   Localities : 0000000000000004
[02Ch 0044   4]                 Locality   0 : 0A 14 11 1C
[030h 0048   4]                 Locality   1 : 14 0A 1C 11
[034h 0052   4]                 Locality   2 : 11 1C 0A 1C
[038h 0056   4]                 Locality   3 : 1C 11 1C 0A
 
Kernel routine __acpi_map_pxm_to_node() maps pxm 0 to numa 0, pxm 1 to numa 1, pxm 3 to numa 2, and pxm 5 to numa 3.   (I will include some trace code that I added to the kernel that shows this.)
 
Kernel routine acpi_numa_slit_init() is responsible for setting the distances between each of the four numa nodes. The code consists of a double for-loop with each loop going from 0 to 3. The code treats the loop variables i and j as if they are proximity domains and calls pxm_to_node() for each value, but that only works if each numa node number matches its proximity domain number, and that is not the case here. For example, the value of pxm_to_node(2) is undefined, because there is no proximity domain 2. Proximity domain 3 is actually the PMEM on socket 0, but the code will use the distance value for the PMEM on socket 1. And the code never even sets the distances for proximity domain 5, because that is beyond the end of the for-loop values 0-3.
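(Not part of the quoted analysis: just a sketch of how one might verify the fix/workaround, i.e. that after switching to "Shared NUMA Domains" or installing the updated BIOS the proximity domains have no gaps and the region-to-node mapping lines up.)

# Proximity domains advertised for the PMem ranges should now be contiguous:
$ sudo ipmctl show -system NFIT | grep -i ProximityDomain
# And each region's NUMA node should match the socket its DIMMs sit on:
$ cat /sys/bus/nd/devices/region*/numa_node
$ sudo ndctl list -Rv | grep -E '"dev"|"numa_node"|"iset_id"'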

@tanabarr

@djbw, thanks a lot for your help. I think we can close this, as it seems to be an isolated incident and a fix and a workaround have been identified.
