Allow numa topology passthrough VMIs with cpupinning #5846
Conversation
+1 on the approach.
From what I can tell, this sets us up to work naturally once topology aware scheduling is available.
If someone wanted to use numa passthrough today and wanted to get a consistent-looking topology, is it practical to leverage the topology manager to do this? I seem to remember the topology manager would help by co-locating the Pod's cpus on a single NUMA node, but I don't remember any scenario where it would help actually spread cpus evenly across NUMA nodes.
I guess what I'm asking is, what's the practical scenario where numa passthrough makes sense given the tools we have in k8s today?
I wish that were the case. :) Even if we had a topology-aware scheduler, the CPU manager would still need to honor the requested topology and make the correct assignment. This is not the case today.
The topology manager requires the workload to request a certain device that provides a topology hint, according to which the CPU manager will try to allocate the physical CPUs. Unfortunately, it cannot accommodate arbitrary topology requests, i.e. 2 sockets, 2 cores, 1 thread. Even with this approach, we cannot always get a "consistent" topology. On a very fragmented node the guest may end up getting CPUs from various physical sockets. For example, a guest requesting 4 CPUs, which lands on a node with 4 sockets, may pin 2 vCPUs to physical socket 0, 1 vCPU to socket 1, and 1 to socket 2.
I think this current approach is great and the best we can do under these conditions. I'm sure it will mostly fit "full system" guests that end up pinning all of their vCPUs to most of the physical CPUs on the host.
Right now I see two use-cases:
Yes. Let me also add @fromanirh, who may have more thoughts on the current PR approach and the future of all of this. Once this PR is in, I hope that I can start promoting our use-cases in k8s with the help of Francesco's team.
Ack. It'll take me some time to review, though.
@davidvossel @vladikr should be ready for review. Note that one commit (1c75bcc) moves a few functions in the e2e tests into a fresh package, which touches quite a few files, but those are basically just import path changes.
/retest |
If the memory is divisible by the requested hugepage size, it may still not be equally distributable between the nodes. To overcome this, first split the requested memory, if necessary, into smaller equally sized chunks, and then assign the remaining pages one by one to the NUMA nodes. Signed-off-by: Roman Mohr <rmohr@redhat.com>
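As a rough illustration of that distribution scheme, here is a minimal Go sketch (a hypothetical helper, not the code from this commit):

```go
// distributePages spreads totalPages hugepages across numaNodes as evenly as
// possible: each node gets the same base share, and the remainder is handed
// out one page at a time. Hypothetical sketch, not this commit's actual code.
func distributePages(totalPages, numaNodes int) []int {
	pages := make([]int, numaNodes)
	base, rest := totalPages/numaNodes, totalPages%numaNodes
	for i := range pages {
		pages[i] = base
		if i < rest {
			pages[i]++ // one extra page per node until the remainder is used up
		}
	}
	return pages
}
```

For example, 1Gi of 2Mi hugepages (512 pages) spread over 3 NUMA nodes yields 171, 171 and 170 pages.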
Signed-off-by: Roman Mohr <rmohr@redhat.com>
Signed-off-by: Roman Mohr <rmohr@redhat.com>
Prepare e2e tests for a `checks` and a `skip` package. For that, some common functions have to be extracted from `tests/utils.go` into `tests/util/util.go`. Signed-off-by: Roman Mohr <rmohr@redhat.com>
This can be used for all tests which require cpu pinning, including the follow-up NUMA tests. Signed-off-by: Roman Mohr <rmohr@redhat.com>
Signed-off-by: Roman Mohr <rmohr@redhat.com>
ps does not like this signal, and it seems that under some circumstances this signal is forwarded from runc. The origin seems to be a change in golang regarding non-voluntary preemption of goroutines. Signed-off-by: Roman Mohr <rmohr@redhat.com>
Drop the boolean which was used to indicate NUMA passthrough and use the new NUMA struct instead. The first NUMA mapping strategy gets its own struct, to allow accumulating sub-settings for this strategy in a dedicated space, without overlapping with other strategies' settings. Signed-off-by: Roman Mohr <rmohr@redhat.com>
Make hugepages a requirement for numa usage. Signed-off-by: Roman Mohr <rmohr@redhat.com>
Set allocation mode for memory to `immediate` in the domain xml. Signed-off-by: Roman Mohr <rmohr@redhat.com>
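For reference, the `immediate` allocation mode in libvirt domain XML looks roughly like this (a sketch with surrounding elements elided):

```xml
<memoryBacking>
  <hugepages/>
  <allocation mode="immediate"/>
</memoryBacking>
```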
/retest
Signed-off-by: Roman Mohr <rmohr@redhat.com>
Looks great to me. Thanks @rmohr!
/lgtm
/retest
/retest
@rmohr: The following tests failed, say `/retest` to rerun all failed tests:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
First introduced by kubevirt#5846 way back in v0.43.0, the NUMA feature gate is now deprecated, with the feature state graduated to GA and as such always enabled. Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
What this PR does / why we need it:
This PR adds a struct `spec.cpu.numa.guestMappingPassthrough` to the API which can be used in combination with `spec.cpu.dedicatedCpuPinning` and `spec.memory.hugepages`. The `NUMA` feature gate must be enabled. Later on, additional mapping strategies may be added. The strategy gets its own struct so that sub-settings for this strategy can be bound in a common place. `virt-launcher` will then get the host topology from the nodelabeller and use this information to form an efficient virtual NUMA topology based on the cpu-manager-assigned CPUs. It is not the exact host NUMA topology, but a virtual topology aligned with the physical one.

An example VMI which requests cpu pinning, NUMA passthrough and hugepages may look like this:
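A minimal sketch of such a manifest, assuming the fields named above (the VMI name, core count and memory sizes are illustrative; note the full path in the manifest is `spec.domain.cpu`, which the description above shortens to `spec.cpu`):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: numa-passthrough-example   # illustrative name
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPinning: true     # required in combination with the strategy
      numa:
        guestMappingPassthrough: {} # the new mapping strategy struct
    memory:
      hugepages:
        pageSize: 2Mi               # hugepages are a hard requirement
    devices: {}
    resources:
      requests:
        memory: 1Gi
```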
Use cases
Known limitations
`virt-handler` rejects a VM which does not request enough hugepages for at least one hugepage to be assigned to every NUMA node (a sketch of this check follows).
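A minimal sketch of that rule, using a hypothetical helper (not the actual virt-handler validation code):

```go
// enoughHugepagesPerNode returns true if the requested memory contains at
// least one hugepage per virtual NUMA node. Hypothetical illustration of the
// limitation described above, not the actual virt-handler code.
func enoughHugepagesPerNode(memoryBytes, pageSizeBytes, numaNodes int64) bool {
	if pageSizeBytes <= 0 || numaNodes <= 0 {
		return false
	}
	return memoryBytes/pageSizeBytes >= numaNodes
}
```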
New information flows in the code

- `virt-handler` collects NUMA topology information via the nodelabeller on startup
- the collected topology is available to `virt-handler` on controller startup
- the topology is passed via `VirtualMachineOptions` in `SyncVirtualMachine` to virt-launcher and the domain xml converter

This means that, apart from API validation, the additions only affect our node components.
*** Possible future sub-settings for this mapping strategy ***
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Release note: