Add support for NUMA params in BareMetalHost Schema #64

Closed
wants to merge 3 commits

Conversation

digambar15
Contributor

  • Some parameters were missing from the existing baremetalhost_crd.yaml schema.
  • The missing parameters described in this spec will be useful for performance tuning and future workloads.
  • NUMA parameters will be useful for compute-intensive and NFV applications.
  • Boot parameters are required because they expose the current boot mode along with the PXE interface details, which can then be used to set the boot mode.

@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 23, 2019
@digambar15
Contributor Author

/assign @dhellmann @zaneb @russellb


# Proposal to enhance Schema for Metal3
Member

I can imagine that over time we will have a lot of schema enhancement requests. Could you give this proposal a more specific title (and rename the file to match)?

Contributor Author

Sure @dhellmann

Parameters for `NUMA` topology:

For better performance, selection of CPUs based on the `NUMA` topology
becomes necessary. In case of nodes with `DPDK` aware NICs, the CPUs for
Member

This information seems useful for scheduling workloads, but I don't understand how it would be used during provisioning. Can you elaborate on that?

Contributor Author

For provisioning, this is not useful. We are adding this for future workload scenarios where users can look at the CR and plan their workloads accordingly.

Member

When you say plan workloads, do you mean that the kubernetes scheduler would somehow use the information? Or that humans might look at it and do something?

Member

https://kcsna2019.sched.com/event/Vv0W/present-and-future-of-hardware-topology-awareness-in-kubelet-connor-doyle is relevant for the scheduler using NUMA topology. I wouldn't want to reproduce data in the BareMetalHost that is going to be available somewhere else, like the node.

description: Numa node id
size_kb:
type: string
description: memory in kb
Member

Elsewhere we track RAM using Mebibytes as an integer. Is there any particular reason to switch to a string holding KB here?

Contributor Author
@digambar15 digambar15 Jan 7, 2020

I think it should be MiB; I will correct it.

type: integer
description: Numa node id
thread_siblings:
items:
Member

I don't know what thread siblings are. Is this a list of integers? What do the integers represent?

Contributor Author

A list of integers. The integers represent the CPU IDs.

Member

It seems like this information could be expressed more compactly with a list of lists of integers.

cpu_thread_siblings:
   - [1, 17]
   - [10, 26]
   - ...

How does a consumer typically use the data? Is there any reason not to favor the more compact representation?

Contributor Author

We can present the NUMA node thread siblings as a list of lists for the respective nodes. It is possible.
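
For illustration, a minimal sketch (the helper name and the [][]int representation are assumptions based on the compact form suggested above) of how a consumer could look up the siblings of a given CPU:

// Sketch only: find the hyperthread siblings of cpuID in the compact
// list-of-lists form, e.g. [][]int{{1, 17}, {10, 26}, ...}.
func siblingsOf(threadSiblings [][]int, cpuID int) []int {
    for _, group := range threadSiblings {
        for _, id := range group {
            if id == cpuID {
                return group
            }
        }
    }
    return nil
}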

numa_topology:
properties:
numa_node_0:
Member

This implies that nodes have names. Is that right, or are the names made up?

Are the nodes ordered? Should numa_topology be a list of node objects instead of a mapping?

Contributor Author

NUMA node 0 means NUMA socket 0. It's not a separate hardware node.

Member

OK, it sounds like NUMA nodes do not really have names, then, and we should make numa_topology an ordered list instead of a mapping.

Contributor Author

Yes, I agree

properties:
nics:
properties:
numa_node:
Member

Is the numa_node value for a NIC ever going to be different from the name as defined on line 85?

We list NICs separately. Does it make sense to add a field to that existing structure to hold the NUMA node ID instead of repeating the NIC name here in this new structure?

Contributor Author

This association defines which NIC is associated with which NUMA node.
Recording the NUMA association in the existing NIC structure is a good idea. That way, the user can identify the NUMA-to-NIC association immediately by looking at the NIC.

Member

I guess it depends on how users are most likely to want to use the data. Are they going to start by asking which NUMA node is associated with a NIC, or are they going to start by looking for all of the resources in the system associated with a given NUMA node?

If they will start with a NIC, we should put the NUMA node ID in the existing structure. If they will start with the NUMA node ID, we should add the NIC data here. We should only need to list the name here, though, because the type and other information is already in the other structure.

Contributor Author

It depends on users' needs. It would be good to have all the NUMA information associated with the respective CPU, RAM, and NIC so that it is helpful for the user.

Member

OK, I think I've talked myself into adding all of the NUMA related information into a new structure, so let's work on defining that.
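
As a rough sketch only (placeholder names, not the final schema; a concrete proposal appears later in this thread), the consolidated per-node structure could group the NUMA-related data like this, with the node ID implied by list position:

// Placeholder sketch, not the final schema.
type NUMANode struct {
    CPUIDs       []int    // logical CPU IDs local to this node
    NICs         []string // names of NICs attached to this node
    RAMMebibytes int      // memory local to this node
}

// In HardwareDetails: NumaNodes []NUMANode
// Index 0 describes NUMA node 0, index 1 describes NUMA node 1, and so on,
// so no per-node name or repeated node ID is needed.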

type: array
ram:
properties:
numa_node:
Member

Since we only track RAM as a total value elsewhere, it probably makes sense to keep this section to list the breakdown. But the same question applies as above: is the numa_node ID value here ever going to be different from the node ID from line 85?

Contributor Author
@digambar15 digambar15 Jan 7, 2020

RAM details per NUMA node (socket) will look like this:
"ram": [{
"numa_node": 0,
"size_kb": 99289532
}, {
"numa_node": 1,
"size_kb": 100663296
}]

We can move the above details under RAM, which shows how much RAM each NUMA node has. Currently it's in KB, but we can convert it to MiB.
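
For reference, assuming size_kb is actually in KiB (1 MiB = 1024 KiB), the sample values above convert to 99289532 / 1024 ≈ 96962 MiB for node 0 and 100663296 / 1024 = 98304 MiB for node 1.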

Member

OK, my point about the node ID is that the value is being repeated. It is in the key in numa_topology that points to this node and then it is part of the properties for the ram details. We should only need it in one place. That means the ram value here can just be the size in MiB.

Contributor Author

ok.


* current_boot_mode: `uefi`

UEFI is installed when the device is manufactured. It is the first program that runs when the device is turned on. UEFI replaces BIOS.
Member

We've had several discussions about how best to represent the boot mode for hosts. I'm not sure we've come to a conclusion, yet. We might make more progress if we separate the proposal for boot mode from the proposal to collect NUMA topology details.

Contributor Author

Okay, I will create a separate proposal for boot mode.

* current_boot_mode: `uefi`

UEFI is installed when the device is manufactured. It is the first program that runs when the device is turned on. UEFI replaces BIOS.
* pxe_interface: `ff:ff:ff:ff:ff:ff`
Member

How is this different from the existing bootMACAddress field?

Contributor Author

It's the same, but it makes sense to keep it under the boot section because that way the user knows which PXE NIC is associated with the boot mode.

Member

The bootMACAddress is the MAC we expect to be used for PXE.

Contributor Author

Anyway, I am creating a separate PR for boot and will address these things there.


1. Introduce new section called `boot` under `hardware` in schema `metal3.io_baremetalhosts_crd.yaml`.
2. Update the existing sections for `cpu` which will include `NUMA` related parameters in schema.
3. Define structure for `boot` parameters in `baremetalhosts_types.go`.
4. Define structure for `NUMA` parameters and call it in `cpu` structure in `baremetalhost_types.go`.
Member

Does the gophercloud library already return the NUMA data?

Contributor Author

No, we will have to add a structure for NUMA to the gophercloud library. We will have to talk to the gophercloud community about that, or create a PR/issue for it. I welcome your help on that.

Contributor Author

No, we need to add a NUMA structure to the gophercloud library to support this feature. I welcome your help on that.

Member

That gophercloud task should be on the work list here.

Contributor Author

Yes, will add


NUMA details in gophercloud merged successfully:
gophercloud/gophercloud#1842

@zaneb
Member

zaneb commented Jan 8, 2020

I have one crazy idea about this - see what you think.
It might be nice if the data structures for each NUMA node were in some sense equivalent to the one for the entire system, even if that means duplicating a lot of the data. e.g. something like:

type NodeDetails struct {
    CPU CPU
    NIC []NIC
    RAMMebibytes int
}

type HardwareDetails struct {
    SystemVendor HardwareSystemVendor
    Firmware     Firmware
    NodeDetails
    NumaNodes    []NodeDetails
    Storage      []Storage
    Hostname     string
}

(This is related to my other crazy idea that k8s should avoid going down the OpenStack path of doing complex scheduling that takes into account NUMA and instead just run multiple kubelets on the same machine, one for each NUMA node, separated by cgroups. Note that I haven't shared this idea with anyone yet, so it may be legitimately crazy.)
What's missing from the current CPU/NIC/RAM data that would prevent us from doing this?

  • The list of CPU IDs is certainly one. For the whole machine we currently only provide a count, since you can assume the IDs are contiguous.
  • The list of hyperthread siblings. This is not specific to NUMA though, so it might be a valuable addition anyway. If formatted as a list of sibling lists as Doug suggested then that could eliminate the need to separately list the CPU IDs (since they'd each occur in the thread sibling list).
  • Anything else?
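
A minimal usage sketch, assuming the struct layout above compiles as proposed: because NodeDetails is embedded, its fields are promoted onto HardwareDetails, so the whole machine and each NUMA node can be read with the same field names.

// Sketch only, reusing the NodeDetails/HardwareDetails types proposed above
// (imports "fmt").
func printTopology(hw HardwareDetails) {
    // Whole-machine view through the promoted fields of the embedded NodeDetails.
    fmt.Printf("machine: %d MiB RAM, %d NICs\n", hw.RAMMebibytes, len(hw.NIC))

    // Per-NUMA-node view uses exactly the same field names.
    for i, node := range hw.NumaNodes {
        fmt.Printf("numa node %d: %d MiB RAM, %d NICs\n", i, node.RAMMebibytes, len(node.NIC))
    }
}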

@dhellmann
Member

I have one crazy idea about this - see what you think.

I like it.

@digambar15
Contributor Author

digambar15 commented Jan 10, 2020

I have one crazy idea about this - see what you think.
It might be nice if the data structures for each NUMA node were in some sense equivalent to the one for the entire system, even if that means duplicating a lot of the data.

Looks good to me. It certainly adds value and avoids repetition.
I will incorporate these changes in the next update.

File renamed to Add-support-for-NUMA-params-in-BareMetalHost-Schema
Member
@zaneb zaneb left a comment

Some minor comments inline, but this makes sense to me.

}

type NumaNodes struct {
NumaNodeId int
Member

I'm not a NUMA expert... is it possible that the IDs here would be different than the list indices, or could we just eliminate this struct and provide a list of NodeDetails?

Contributor Author

The IDs here are NUMA node 0 and NUMA node 1, so we will have only 0 and 1 here.

Member
@zaneb zaneb Jan 21, 2020

In that case it seems like we could just use the index into the list, i.e. do

NumaNodes    []NodeDetails

above, and just get rid of this NumaNodes struct.

VLANs []VLAN `json:"vlans,omitempty"`
VLANID VLANID `json:"vlanId"`
PXE bool `json:"pxe"`
NumaNic []string `json:"name"`
Member

This is not needed any more, right?

ClockMegahertz ClockSpeed `json:"clockMegahertz"`
Flags []string `json:"flags"`
Count int `json:"count"`
threadsiblings [][]int `json:"thread_siblings"`
Member

Not important right now, but this will need to be capitalised.

Contributor Author

Will correct it.


## Summary

Enhancing the `BareMetalHost` schema by adding useful parameters that we received
Member

Could mention NUMA specifically here now.

Contributor Author

Will update it in next commit.

### Goals

To compare the introspection data fetched from `Ironic-Inspector` through
`Gopher-cloud client` library with `BareMetalHost` schema and add useful parameters
Member

Was this copied from another PR? I don't think the goal here is really to compare, it's to make the NUMA topology of the hardware visible.

Contributor Author

Yes, will correct it.

Member
@zaneb zaneb left a comment

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2020
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: digambar15, zaneb
To complete the pull request process, please assign dhellmann
You can assign the PR to them by writing /assign @dhellmann in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@digambar15 digambar15 changed the title Enhance Metal3 schema Add support for NUMA params in BareMetalHost Schema Jan 22, 2020
@sharajat

sharajat commented Feb 12, 2020

In the above NUMA structure we found unnecessary repetition of data in the case of NIC and CPU, as the structure remains the same for the data we fetch from getcpudetails() and the CPU data (thread siblings) from getnumadetails(). So it would be better if we keep the NUMA topology in a separate structure inside CPU.
Below is the proposed sample structure of CPU with NUMA:

cpu:
{
	Arch: x86_64 
	Model: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60 GHz 
	ClockMegahertz: 3700 
	Flags: [3 eopt xtopology xtpr] 
	Count: 48 
	NumaTopology: [{
		NumaNodeID: 0 
		Numadetails: {
			Nics: [p2p1 p2p2 p1p2 p1p1 em4 em2 em1 em3] 
			RAM: 99289532 
			ThreadSiblings: [[20 44][8 32][22 46][10 34][0 24][18 42][6 30][12 36][4 28][14 38][2 26][16 40]]
		}
	}, 
	{
		NumaNodeID: 1 
		Numadetails: {
			Nics: [] 
			RAM: 100663296 
                        ThreadSiblings: [[3 27][23 47][9 33][15 39][7 31][21 45][5 29][19 43][11 35][17 41][1 25][13 37]]
		}
	}]
}
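
For clarity, roughly the same layout rendered as Go types (a sketch only; the names follow the sample output above and are not final, and the RAM units are still under discussion):

type NumaDetails struct {
    Nics           []string // names of NICs local to this node
    RAM            int      // memory local to this node
    ThreadSiblings [][]int  // hyperthread sibling groups of this node's CPUs
}

type NumaTopology struct {
    NumaNodeID  int
    NumaDetails NumaDetails
}

type CPU struct {
    Arch           string
    Model          string
    ClockMegahertz int
    Flags          []string
    Count          int
    NumaTopology   []NumaTopology
}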

@pramchan

/assign @dhellmann

@zaneb
Member

zaneb commented Feb 13, 2020

In the above NUMA structure we found unnecessary repetition of data in the case of NIC and CPU

I'm not sure if you read my original comment where I proposed this structure, but it explicitly stated this at the beginning (emphasis added):

It might be nice if the data structures for each NUMA node were in some sense equivalent to the one for the entire system, even if that means duplicating a lot of the data.

So if you disagree (which is perfectly fine - I did call it a crazy idea) then perhaps you could state how the downsides of duplicating the data outweigh the upsides of being able to look at a NUMA node as if it were a whole host.

One issue with your proposal is that it only provides information about thread siblings on hosts with NUMA.

@arkadykanevsky

For
ThreadSiblings: [[3 27][23 47][9 33][15 39][7 31][21 45][5 29][19 43][11 35][17 41][1 25][13 37]]
I assume there is a built-in assumption that every core up to Count: 48
is listed under some NUMA node?
Let's document it.
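
A small sketch of the check being asked for here, assuming the [][]int sibling representation from earlier in the thread: every CPU ID from 0 to Count-1 should appear exactly once across the sibling lists of all NUMA nodes.

// Sketch only: perNodeSiblings[n] holds the sibling groups of NUMA node n.
// Returns true if the groups together cover each CPU ID in [0, count) exactly once.
func coversAllCPUs(perNodeSiblings [][][]int, count int) bool {
    seen := make(map[int]int)
    for _, nodeSiblings := range perNodeSiblings {
        for _, group := range nodeSiblings {
            for _, id := range group {
                seen[id]++
            }
        }
    }
    for id := 0; id < count; id++ {
        if seen[id] != 1 {
            return false
        }
    }
    return len(seen) == count
}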


@pramchan

For NUMA:
NUMA applies only to multiprocessor hosts; otherwise the question of non-uniform memory access (faster access to local memory than to the non-local memory of another processor) does not arise. Thus there should be no logical need for NUMA parameters on a single-processor host.

@sharajat

So if you disagree (which is perfectly fine - I did call it a crazy idea) then perhaps you could state how the downsides of duplicating the data outweigh the upsides of being able to look at a NUMA node as if it were a whole host.

We are testing your approach on real hardware; we will get back to you once we have the results.

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2020
@maelk maelk added this to BMO v1alpha1 in Roadmap Jul 1, 2020
@maelk maelk moved this from BMO v1alpha1 to Feature request in Roadmap Jul 1, 2020
@maelk maelk moved this from Feature request to BMO v1alpha1 in Roadmap Jul 1, 2020
@maelk maelk removed this from BMO v1alpha1 in Roadmap Jul 1, 2020
@metal3-io-bot
Contributor

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot
Contributor

@metal3-io-bot: Closed this PR.

In response to this:

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
lgtm: Indicates that a PR is ready to be merged.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.