Add support for NUMA params in BareMetalHost Schema #64

Closed
wants to merge 3 commits

Conversation

digambar15
Contributor

  • Some parameters were missing from the existing baremetalhost_crd.yaml schema.
  • The missing parameters described in this spec will be useful for performance tuning and future workloads.
  • NUMA parameters will be useful for compute-intensive and NFV applications.
  • Boot parameters are required because they expose the current boot mode along with the PXE interface details, which can then be used to set the boot mode.

@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 23, 2019
@digambar15
Contributor Author

/assign @dhellmann @zaneb @russellb


# Proposal to enhance Schema for Metal3
Member

I can imagine that over time we will have a lot of schema enhancement requests. Could you give this proposal a more specific title (and rename the file to match)?

Contributor Author

Sure @dhellmann

Parameters for `NUMA` topology:

For better performance, selection of CPUs based on the `NUMA` topology
becomes necessary. In case of nodes with `DPDK` aware NICs, the CPUs for
Member

This information seems useful for scheduling workloads, but I don't understand how it would be used during provisioning. Can you elaborate on that?

Contributor Author

For provisioning, this is not useful. We are adding this for future workload scenarios where users can look at the CR and plan their workloads accordingly.

Member

When you say plan workloads, do you mean that the kubernetes scheduler would somehow use the information? Or that humans might look at it and do something?

Member

https://kcsna2019.sched.com/event/Vv0W/present-and-future-of-hardware-topology-awareness-in-kubelet-connor-doyle is relevant for the scheduler using NUMA topology. I wouldn't want to reproduce data in the BareMetalHost that is going to be available somewhere else, like the node.

description: Numa node id
size_kb:
type: string
description: memory in kb
Member

Elsewhere we track RAM using Mebibytes as an integer. Is there any particular reason to switch to a string holding KB here?

Contributor Author
@digambar15 digambar15 Jan 7, 2020

I think it should be MiB; I will correct it.

type: integer
description: Numa node id
thread_siblings:
items:
Member

I don't know what thread siblings are. Is this a list of integers? What do the integers represent?

Contributor Author

A list of integers. The integers represent the CPU IDs.

Member

It seems like this information could be expressed more compactly with a list of lists of integers.

cpu_thread_siblings:
   - [1, 17]
   - [10, 26]
   - ...

How does a consumer typically use the data? Is there any reason not to favor the more compact representation?

Contributor Author

We can present the NUMA node thread siblings as a list of lists for the respective nodes. It is possible.
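
For illustration, a minimal sketch (the helper name and the [][]int representation are assumptions based on the compact form suggested above) of how a consumer could look up the siblings of a given CPU:

// Sketch only: find the hyperthread siblings of cpuID in the compact
// list-of-lists form, e.g. [][]int{{1, 17}, {10, 26}, ...}.
func siblingsOf(threadSiblings [][]int, cpuID int) []int {
    for _, group := range threadSiblings {
        for _, id := range group {
            if id == cpuID {
                return group
            }
        }
    }
    return nil
}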

numa_topology:
properties:
numa_node_0:
Member

This implies that nodes have names. Is that right, or are the names made up?

Are the nodes ordered? Should numa_topology be a list of node objects instead of a mapping?

Contributor Author

NUMA node 0 means NUMA socket 0. It's not a separate hardware node.

Member

OK, it sounds like NUMA nodes do not really have names, then, and we should make numa_topology an ordered list instead of a mapping.

Contributor Author

Yes, I agree

properties:
nics:
properties:
numa_node:
Member

Is the numa_node value for a NIC ever going to be different from the name as defined on line 85?

We list NICs separately. Does it make sense to add a field to that existing structure to hold the NUMA node ID instead of repeating the NIC name here in this new structure?

Contributor Author

This association defines which NIC is associated with which NUMA node.
Recording the NUMA association in the existing NIC structure is a good idea. That way, the user can identify the NUMA-to-NIC association immediately by looking at the NIC.

Member

I guess it depends on how users are most likely to want to use the data. Are they going to start by asking which NUMA node is associated with a NIC, or are they going to start by looking for all of the resources in the system associated with a given NUMA node?

If they will start with a NIC, we should put the NUMA node ID in the existing structure. If they will start with the NUMA node ID, we should add the NIC data here. We should only need to list the name here, though, because the type and other information is already in the other structure.

Contributor Author

It depends on users' needs. It would be good to have all the NUMA information associated with the respective CPU, RAM, and NIC so that it is helpful for the user.

Member

OK, I think I've talked myself into adding all of the NUMA related information into a new structure, so let's work on defining that.
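
As a rough sketch only (placeholder names, not the final schema; a concrete proposal appears later in this thread), the consolidated per-node structure could group the NUMA-related data like this, with the node ID implied by list position:

// Placeholder sketch, not the final schema.
type NUMANode struct {
    CPUIDs       []int    // logical CPU IDs local to this node
    NICs         []string // names of NICs attached to this node
    RAMMebibytes int      // memory local to this node
}

// In HardwareDetails: NumaNodes []NUMANode
// Index 0 describes NUMA node 0, index 1 describes NUMA node 1, and so on,
// so no per-node name or repeated node ID is needed.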

type: array
ram:
properties:
numa_node:
Member

Since we only track RAM as a total value elsewhere, it probably makes sense to keep this section to list the breakdown. But the same question applies as above: is the numa_node ID value here ever going to be different from the node ID from line 85?

Contributor Author
@digambar15 digambar15 Jan 7, 2020

RAM details per NUMA node (socket) will look like this:
"ram": [{
"numa_node": 0,
"size_kb": 99289532
}, {
"numa_node": 1,
"size_kb": 100663296
}]

We can move the above details under RAM, which shows how much RAM each NUMA node has. Currently it's in KB, but we can convert it to MiB.
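
For reference, assuming size_kb is actually in KiB (1 MiB = 1024 KiB), the sample values above convert to 99289532 / 1024 ≈ 96962 MiB for node 0 and 100663296 / 1024 = 98304 MiB for node 1.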

Member

OK, my point about the node ID is that the value is being repeated. It is in the key in numa_topology that points to this node and then it is part of the properties for the ram details. We should only need it in one place. That means the ram value here can just be the size in MiB.

Contributor Author

ok.


* current_boot_mode: `uefi`

UEFI is installed when the device is manufactured. It is the first program that runs when the device is turned on. UEFI replaces BIOS.
Member

We've had several discussions about how best to represent the boot mode for hosts. I'm not sure we've come to a conclusion, yet. We might make more progress if we separate the proposal for boot mode from the proposal to collect NUMA topology details.

Contributor Author

Okay, I will create a separate proposal for boot mode.

* current_boot_mode: `uefi`

UEFI is installed when the device is manufactured. It is the first program that runs when the device is turned on. UEFI replaces BIOS.
* pxe_interface: `ff:ff:ff:ff:ff:ff`
Member

How is this different from the existing bootMACAddress field?

Contributor Author

It's the same, but it makes sense to keep it under the boot section because that way the user knows which PXE NIC is associated with the boot mode.

Member

The bootMACAddress is the MAC we expect to be used for PXE.

Contributor Author

Anyway, I am creating a separate PR for boot and will address these things there.


1. Introduce new section called `boot` under `hardware` in schema `metal3.io_baremetalhosts_crd.yaml`.
2. Update the existing sections for `cpu` which will include `NUMA` related parameters in schema.
3. Define structure for `boot` parameters in `baremetalhosts_types.go`.
4. Define structure for `NUMA` parameters and call it in `cpu` structure in `baremetalhost_types.go`.
Member

Does the gophercloud library already return the NUMA data?

Contributor Author

No, we will have to add a structure for NUMA to the gophercloud library. We will have to talk to the gophercloud community about that, or create a PR/issue for it. I welcome your help on that.

Contributor Author

No, we need to add a NUMA structure to the gophercloud library to support this feature. I welcome your help on that.

Member

That gophercloud task should be on the work list here.

Contributor Author

Yes, will add


NUMA details in gophercloud merged successfully:
gophercloud/gophercloud#1842

@zaneb
Member

zaneb commented Jan 8, 2020

I have one crazy idea about this - see what you think.
It might be nice if the data structures for each NUMA node were in some sense equivalent to the one for the entire system, even if that means duplicating a lot of the data. e.g. something like:

type NodeDetails struct {
    CPU CPU
    NIC []NIC
    RAMMebibytes int
}

type HardwareDetails struct {
    SystemVendor HardwareSystemVendor
    Firmware     Firmware
    NodeDetails
    NumaNodes    []NodeDetails
    Storage      []Storage
    Hostname     string
}

(This is related to my other crazy idea that k8s should avoid going down the OpenStack path of doing complex scheduling that takes into account NUMA and instead just run multiple kubelets on the same machine, one for each NUMA node, separated by cgroups. Note that I haven't shared this idea with anyone yet, so it may be legitimately crazy.)
What's missing from the current CPU/NIC/RAM data that would prevent us from doing this?

  • The list of CPU IDs is certainly one. For the whole machine we currently only provide a count, since you can assume the IDs are contiguous.
  • The list of hyperthread siblings. This is not specific to NUMA though, so it might be a valuable addition anyway. If formatted as a list of sibling lists as Doug suggested then that could eliminate the need to separately list the CPU IDs (since they'd each occur in the thread sibling list).
  • Anything else?
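
A minimal usage sketch, assuming the struct layout above compiles as proposed: because NodeDetails is embedded, its fields are promoted onto HardwareDetails, so the whole machine and each NUMA node can be read with the same field names.

// Sketch only, reusing the NodeDetails/HardwareDetails types proposed above
// (imports "fmt").
func printTopology(hw HardwareDetails) {
    // Whole-machine view through the promoted fields of the embedded NodeDetails.
    fmt.Printf("machine: %d MiB RAM, %d NICs\n", hw.RAMMebibytes, len(hw.NIC))

    // Per-NUMA-node view uses exactly the same field names.
    for i, node := range hw.NumaNodes {
        fmt.Printf("numa node %d: %d MiB RAM, %d NICs\n", i, node.RAMMebibytes, len(node.NIC))
    }
}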

@dhellmann
Member

I have one crazy idea about this - see what you think.

I like it.

@digambar15
Contributor Author

digambar15 commented Jan 10, 2020

I have one crazy idea about this - see what you think.
It might be nice if the data structures for each NUMA node were in some sense equivalent to the one for the entire system, even if that means duplicating a lot of the data.

Looks good to me. It certainly adds value and avoids repetition.
I will incorporate these changes in the next update.

File renamed to Add-support-for-NUMA-params-in-BareMetalHost-Schema
Member
@zaneb zaneb left a comment

Some minor comments inline, but this makes sense to me.

}

type NumaNodes struct {
NumaNodeId int
Member

I'm not a NUMA expert... is it possible that the IDs here would be different than the list indices, or could we just eliminate this struct and provide a list of NodeDetails?

Contributor Author

The IDs here are NUMA node 0 and NUMA node 1, so we will have only 0 and 1 here.

Member
@zaneb zaneb Jan 21, 2020

In that case it seems like we could just use the index into the list, i.e. do

NumaNodes    []NodeDetails

above, and just get rid of this NumaNodes struct.

VLANs []VLAN `json:"vlans,omitempty"`
VLANID VLANID `json:"vlanId"`
PXE bool `json:"pxe"`
NumaNic []string `json:"name"`
Member

This is not needed any more, right?

ClockMegahertz ClockSpeed `json:"clockMegahertz"`
Flags []string `json:"flags"`
Count int `json:"count"`
threadsiblings [][]int `json:"thread_siblings"`
Member

Not important right now, but this will need to be capitalised.

Contributor Author

Will correct it.


## Summary

Enhancing the `BareMetalHost` schema by adding useful parameters that we received
Member

Could mention NUMA specifically here now.

Contributor Author

Will update it in next commit.

### Goals

To compare the introspection data fetched from `Ironic-Inspector` through
`Gopher-cloud client` library with `BareMetalHost` schema and add useful parameters
Member

Was this copied from another PR? I don't think the goal here is really to compare, it's to make the NUMA topology of the hardware visible.

Contributor Author

Yes, will correct it.

Member
@zaneb zaneb left a comment

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2020
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: digambar15, zaneb
To complete the pull request process, please assign dhellmann
You can assign the PR to them by writing /assign @dhellmann in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@digambar15 digambar15 changed the title Enhance Metal3 schema Add support for NUMA params in BareMetalHost Schema Jan 22, 2020
@sharajat

sharajat commented Feb 12, 2020

In the above NUMA structure we found unnecessary repetition of data in the case of NIC and CPU, as the structure remains the same for the data we fetch from getcpudetails() and the CPU data (thread siblings) from getnumadetails(). So it would be better if we keep the NUMA topology in a separate structure inside CPU.
Below is the proposed sample structure of CPU with NUMA:

cpu:
{
	Arch: x86_64 
	Model: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60 GHz 
	ClockMegahertz: 3700 
	Flags: [3 eopt xtopology xtpr] 
	Count: 48 
	NumaTopology: [{
		NumaNodeID: 0 
		Numadetails: {
			Nics: [p2p1 p2p2 p1p2 p1p1 em4 em2 em1 em3] 
			RAM: 99289532 
			ThreadSiblings: [[20 44][8 32][22 46][10 34][0 24][18 42][6 30][12 36][4 28][14 38][2 26][16 40]]
		}
	}, 
	{
		NumaNodeID: 1 
		Numadetails: {
			Nics: [] 
			RAM: 100663296 
                        ThreadSiblings: [[3 27][23 47][9 33][15 39][7 31][21 45][5 29][19 43][11 35][17 41][1 25][13 37]]
		}
	}]
}
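
For clarity, roughly the same layout rendered as Go types (a sketch only; the names follow the sample output above and are not final, and the RAM units are still under discussion):

type NumaDetails struct {
    Nics           []string // names of NICs local to this node
    RAM            int      // memory local to this node
    ThreadSiblings [][]int  // hyperthread sibling groups of this node's CPUs
}

type NumaTopology struct {
    NumaNodeID  int
    NumaDetails NumaDetails
}

type CPU struct {
    Arch           string
    Model          string
    ClockMegahertz int
    Flags          []string
    Count          int
    NumaTopology   []NumaTopology
}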

@pramchan

/assign @dhellmann

@zaneb
Member

zaneb commented Feb 13, 2020

In the above NUMA structure we found unnecessary repetition of data in the case of NIC and CPU

I'm not sure if you read my original comment where I proposed this structure, but it explicitly stated this at the beginning (emphasis added):

It might be nice if the data structures for each NUMA node were in some sense equivalent to the one for the entire system, even if that means duplicating a lot of the data.

So if you disagree (which is perfectly fine - I did call it a crazy idea) then perhaps you could state how the downsides of duplicating the data outweigh the upsides of being able to look at a NUMA node as if it were a whole host.

One issue with your proposal is that it only provides information about thread siblings on hosts with NUMA.

@arkadykanevsky

For
ThreadSiblings: [[3 27][23 47][9 33][15 39][7 31][21 45][5 29][19 43][11 35][17 41][1 25][13 37]]
I assume there is a built-in assumption that every core up to Count: 48
is listed under some NUMA node?
Let's document it.
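
A small sketch of the check being asked for here, assuming the [][]int sibling representation from earlier in the thread: every CPU ID from 0 to Count-1 should appear exactly once across the sibling lists of all NUMA nodes.

// Sketch only: perNodeSiblings[n] holds the sibling groups of NUMA node n.
// Returns true if the groups together cover each CPU ID in [0, count) exactly once.
func coversAllCPUs(perNodeSiblings [][][]int, count int) bool {
    seen := make(map[int]int)
    for _, nodeSiblings := range perNodeSiblings {
        for _, group := range nodeSiblings {
            for _, id := range group {
                seen[id]++
            }
        }
    }
    for id := 0; id < count; id++ {
        if seen[id] != 1 {
            return false
        }
    }
    return len(seen) == count
}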


@pramchan

For NUMA:
NUMA applies only to multiprocessor hosts; otherwise the question of non-uniform memory access (faster access to local memory than to the non-local memory of another processor) does not arise. Thus there should be no logical need for NUMA parameters on a single-processor host.

@sharajat

So if you disagree (which is perfectly fine - I did call it a crazy idea) then perhaps you could state how the downsides of duplicating the data outweigh the upsides of being able to look at a NUMA node as if it were a whole host.

We are testing your approach on real hardware; we will get back to you once we have the results.

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2020
@maelk maelk added this to BMO v1alpha1 in Roadmap Jul 1, 2020
@maelk maelk moved this from BMO v1alpha1 to Feature request in Roadmap Jul 1, 2020
@maelk maelk moved this from Feature request to BMO v1alpha1 in Roadmap Jul 1, 2020
@maelk maelk removed this from BMO v1alpha1 in Roadmap Jul 1, 2020
@metal3-io-bot
Contributor

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot
Contributor

@metal3-io-bot: Closed this PR.

In response to this:

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
lgtm: Indicates that a PR is ready to be merged.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.