
Allow numa topology passthrough VMIs with cpupinning #5846

Merged
merged 16 commits into kubevirt:main
Jun 30, 2021

Conversation

rmohr
Member

@rmohr rmohr commented Jun 16, 2021

What this PR does / why we need it:

This PR adds a struct spec.domain.cpu.numa.guestMappingPassthrough to the API which can be used in combination with spec.domain.cpu.dedicatedCpuPlacement and spec.domain.memory.hugepages. The NUMA feature gate must be enabled. Additional mapping strategies may be added later on. The strategy gets its own struct so that sub-settings for this strategy can be grouped in one common place. virt-launcher then receives the host topology from the nodelabeller and uses this information to form an efficient virtual NUMA topology based on the CPUs assigned by the CPU manager. The result is not the exact host NUMA topology, but a NUMA topology with the following properties:

  • each guest NUMA node has at most as many CPUs as its host counterpart
  • all CPUs of a guest NUMA node are pinned to CPUs on the same host NUMA node (see the sketch after this list)
  • hugepages are distributed as equally as possible between the nodes; they cannot always be distributed completely equally, since not every memory request is divisible by the NUMA node count
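
A minimal, self-contained Go sketch of this grouping (illustrative only, not the PR's actual code; buildGuestCells and the example topology are assumptions):

package main

import (
    "fmt"
    "sort"
)

// buildGuestCells groups the CPU-manager-assigned host CPUs by their host
// NUMA node, so that every guest NUMA cell maps onto exactly one host node
// and never contains more CPUs than that host node provides.
func buildGuestCells(assignedCPUs []int, cpuToHostNode map[int]int) map[int][]int {
    cells := map[int][]int{}
    for _, cpu := range assignedCPUs {
        node := cpuToHostNode[cpu]
        cells[node] = append(cells[node], cpu)
    }
    for _, cpus := range cells {
        sort.Ints(cpus)
    }
    return cells
}

func main() {
    // Hypothetical example: 4 assigned CPUs spanning two host NUMA nodes.
    assigned := []int{2, 3, 10, 11}
    hostTopology := map[int]int{2: 0, 3: 0, 10: 1, 11: 1}
    fmt.Println(buildGuestCells(assigned, hostTopology)) // map[0:[2 3] 1:[10 11]]
}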

An example VMI which requests cpu pinning, numa passthrough and hugepages may look like this

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: numavm
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}
    devices:
      disks:
      - disk:
          bus: virtio
        name: containerdisk
      - disk:
          bus: virtio
        name: cloudinitdisk
    resources:
      requests:
        memory: 64Mi
    memory:
      hugepages:
        pageSize: 2Mi
  volumes:
  - containerDisk:
      image: registry:5000/kubevirt/cirros-container-disk-demo:devel
    name: containerdisk
  - cloudInitNoCloud:
      userData: |
        #!/bin/sh
        echo 'printed from cloud-init userdata'
    name: cloudinitdisk

Use cases

  • The single-workload use case: a single workload runs on the node and requests pinning for all CPUs. If we have a few homogeneous nodes, here we can migrate and keep the NUMA pinning performant on the target host.
  • The multi-workload use case: when a VM requests CPU pinning today, it gets whichever CPUs are available, and these can easily span several host NUMA nodes. With this feature we can re-create the parts of the NUMA topology that the assigned CPUs belong to and this way ensure that the guest OS has a correct view of CPU and memory proximity, which also improves performance. Here we cannot migrate and keep the NUMA pinning performant on the target host.

Known limitations

  1. Migrations do not yet work properly with it. This also requires fixing the existing CPU pinning issues.
  2. virt-handler may reject a VM which does not request enough hugepages for at least one hugepage to be assigned to every NUMA node (see the sketch after this list).
  3. Depending on the hugepage consumption of other workloads, or on the assigned CPUs, it can happen that not enough hugepages are available for each node. The VM will then not start.
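
A minimal sketch of the check behind limitation 2 (validateHugepagesPerNode is a hypothetical name, not virt-handler's actual code):

package main

import "fmt"

// validateHugepagesPerNode mirrors the rejection described above: each
// guest NUMA node must receive at least one hugepage.
func validateHugepagesPerNode(totalPages, numaNodes int) error {
    if totalPages < numaNodes {
        return fmt.Errorf("%d hugepages cannot cover %d NUMA nodes", totalPages, numaNodes)
    }
    return nil
}

func main() {
    fmt.Println(validateHugepagesPerNode(1, 2)) // rejected: 1 hugepage, 2 nodes
}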

New information flows in the code

  1. virt-handler collects NUMA topology information via the nodelabeller on startup
  2. the information is passed to the VM controller in virt-handler on controller startup
  3. when a VMI which requests NUMA passthrough is started, the topology information is passed via gRPC in the VirtualMachineOptions of SyncVirtualMachine to virt-launcher and the domain XML converter
  4. the converter takes the node topology info, the VMI, and the CPU pinning information to construct the NUMA topology

This means that, apart from API validation, the additions only affect our node components.
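
To illustrate step 3 above, here is a rough Go sketch of the shape of the data handed over; the type and field names are illustrative assumptions, not the exact proto-generated definitions:

package main

import "fmt"

// Illustrative stand-ins for the gRPC message types; the real definitions
// are generated from virt-launcher's command-server proto.
type NumaCell struct {
    ID   int
    Cpus []int
}

type Topology struct {
    NumaCells []NumaCell
}

type VirtualMachineOptions struct {
    // Topology carries the host NUMA layout that virt-handler collected
    // via the nodelabeller down to virt-launcher's domain XML converter.
    Topology *Topology
}

func main() {
    // virt-handler side: attach the cached host topology to the options
    // sent along with SyncVirtualMachine.
    opts := &VirtualMachineOptions{
        Topology: &Topology{NumaCells: []NumaCell{
            {ID: 0, Cpus: []int{0, 1, 2, 3}},
            {ID: 1, Cpus: []int{4, 5, 6, 7}},
        }},
    }
    fmt.Println(len(opts.Topology.NumaCells), "host NUMA cells passed to the converter")
}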

Possible future sub-settings for this mapping strategy

  • Add booleans and/or include/exclude lists for mapping PCI devices to NUMA nodes
  • Advanced memory settings (allow memory overlap between host NUMA nodes, or disable the restrictions altogether)

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  • I rearranged some helper functions to be able to create Skip checks in a fresh, isolated package. Some of the helper functions are quite broadly used; however, the code changes regarding that are just import changes.
  • The e2e tests can run on our current kubevirtci setup. This should be sufficient for an initial merge. Once we have "Switch to booting via kernel and initrd from outside the image" (kubevirtci#593) I can introduce an extra CI lane with more thrilling topologies.

Release note:

Add "spec.cpu.numaTopologyPassthrough" which allows emulating a host-alligned virtual numa topology for high performance

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/XXL labels Jun 16, 2021
@kubevirt-bot kubevirt-bot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/build-change Categorizes PRs as related to changing build files of virt-* components labels Jun 16, 2021
@rmohr
Member Author

rmohr commented Jun 16, 2021

/cc @vladikr
/cc @fabiand

FYI

Member

@davidvossel davidvossel left a comment


+1 on the approach.

From what I can tell, this sets us up to work naturally once topology aware scheduling is available.

If someone wanted to use numa passthrough today and wanted to get a consistent-looking topology, is it practical to leverage the topology manager to do this? I seem to remember the topology manager would help by co-locating the Pod's CPUs on a single NUMA node, but I don't remember any scenario where it would help actually spread CPUs evenly across NUMA nodes.

I guess what I'm asking is, what's the practical scenario where numa passthrough makes sense given the tools we have in k8s today?

@vladikr
Member

vladikr commented Jun 16, 2021

From what I can tell, this sets us up to work naturally once topology aware scheduling is available.

I wish that were the case. :) Even if we had a topology-aware scheduler, the CPU manager would still need to honor the requested topology and make the correct assignment. This is not the case today.

If someone wanted to use numa passthrough today and wanted to get a consistent-looking topology, is it practical to leverage the topology manager to do this? I seem to remember the topology manager would help by co-locating the Pod's CPUs on a single NUMA node, but I don't remember any scenario where it would help actually spread CPUs evenly across NUMA nodes.

The topology manager requires the workload to request a certain device that would provide a topology hint, according to which the CPU manager will try to allocate the physical CPUs. Unfortunately, it cannot accommodate arbitrary topology requests, e.g. 2 sockets, 2 cores, 1 thread.

Even with this approach, we cannot always get a "consistent" topology. On a very fragmented node the guest may end up getting CPUs from various physical sockets. For example, a guest requesting 4 CPUs, which lands on a node with 4 sockets, may pin 2 vCPUs to physical socket 0, 1 vCPU to socket 1, and 1 to socket 2.
It may not cause any issues for the guest OS, but it won't be consistent or symmetric.

I guess what I'm asking is, what's the practical scenario where numa passthrough makes sense given the tools we have in k8s today?

I think this current approach is great and the best we can do under these conditions. I'm sure it will mostly fit "full system" guests that end up pinning all of their vCPUs to most of the physical CPUs on the host.

@rmohr
Member Author

rmohr commented Jun 17, 2021

I guess what I'm asking is, what's the practical scenario where numa passthrough makes sense given the tools we have in k8s today?

Right now I see two use-cases:

  • The single-workload use case: a single workload runs on the node and requests pinning for all CPUs. If we have a few homogeneous nodes, here we can migrate and keep the NUMA pinning performant on the target host.
  • The multi-workload use case: when a VM requests CPU pinning today, it gets whichever CPUs are available, and these can easily span several host NUMA nodes. With this feature we can re-create the parts of the NUMA topology that the assigned CPUs belong to and this way ensure that the guest OS has a correct view of CPU and memory proximity, which also improves performance. Here we cannot migrate and keep the NUMA pinning performant on the target host.

@rmohr
Member Author

rmohr commented Jun 17, 2021

The topology manager requires the workload to request a certain device that would provide a topology hint, according to which the CPU manager will try to allocate the physical CPUs. Unfortunately, it cannot accommodate arbitrary topology requests, e.g. 2 sockets, 2 cores, 1 thread.

Even with this approach, we cannot always get a "consistent" topology. On a very fragmented node the guest may end up getting CPUs from various physical sockets. For example, a guest requesting 4 CPUs, which lands on a node with 4 sockets, may pin 2 vCPUs to physical socket 0, 1 vCPU to socket 1, and 1 to socket 2.
It may not cause any issues for the guest OS, but it won't be consistent or symmetric.

Yes. Let me also add @fromanirh, who may have more thoughts on the current PR approach and the future of all of this. Once this PR is in, I hope that I can start promoting our use cases in k8s with the help of Francesco's team.

@ffromani
Contributor

ack. It'll take me some time to review, though.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2021
@rmohr rmohr changed the title [WIP] Allow numa topology passthrough VMIs with cpupinning Allow numa topology passthrough VMIs with cpupinning Jun 17, 2021
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2021
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2021
@rmohr
Member Author

rmohr commented Jun 17, 2021

@davidvossel @vladikr should be ready for review. Note that one commit (1c75bcc) moves a few functions in the e2e tests into a fresh package, which touches quite a few files, but those are basically just import path changes.

@kubevirt-bot kubevirt-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jun 17, 2021
@rmohr
Member Author

rmohr commented Jun 17, 2021

/retest

@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jun 17, 2021
If the memory is divisible by the requested hugepage size, it may still
not be equally distributable between the nodes. To overcome this, first
split the requested memory, if necessary, into smaller, equally sized
chunks and then assign the remaining pages one by one to the NUMA nodes.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
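
A minimal, runnable sketch of the distribution scheme this commit message describes (distributeHugepages is a hypothetical helper, not the PR's exact implementation):

package main

import "fmt"

// distributeHugepages splits a total hugepage count across NUMA nodes as
// evenly as possible: every node gets totalPages/nodes pages, and the
// remaining totalPages%nodes pages are handed out one by one.
func distributeHugepages(totalPages, nodes int) []int {
    pages := make([]int, nodes)
    for i := range pages {
        pages[i] = totalPages / nodes
    }
    for i := 0; i < totalPages%nodes; i++ {
        pages[i]++
    }
    return pages
}

func main() {
    // Hypothetical example: 64Mi of 2Mi hugepages (32 pages) on 3 NUMA nodes.
    fmt.Println(distributeHugepages(32, 3)) // [11 11 10]
}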
Signed-off-by: Roman Mohr <rmohr@redhat.com>
Signed-off-by: Roman Mohr <rmohr@redhat.com>
Prepare e2e tests for a checks and skip package. For that some common
functions have to be extracted from `tests/utils.go` into
`tests/util/util.go`.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
This can be used for all tests which require CPU pinning, including the
follow-up NUMA tests.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
Signed-off-by: Roman Mohr <rmohr@redhat.com>
ps does not like this signal and it seems that under some circumstances
this signal is forwarded from runc. The origin seems to be a change in
golang introducing non-cooperative preemption of goroutines.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
Drop the boolean which was used to indicate NUMA passthrough and use the
new NUMA struct instead. The first numa mapping strategy gets its own
struct, to allow accumulating sub-settings for this strategy in an
exclusive space for it, without overlapping with other strategy
settings.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
Make hugepages a requirement for numa usage.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2021
Set allocation mode for memory to `immediate` in the domain xml.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
@rmohr
Member Author

rmohr commented Jun 24, 2021

/retest

Signed-off-by: Roman Mohr <rmohr@redhat.com>
@rmohr
Member Author

rmohr commented Jun 28, 2021

@vladikr added the NUMA featuregate. Looking forward to another review. Thanks!

Member

@vladikr vladikr left a comment


Looks great to me. Thanks @rmohr!
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2021
@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@rmohr
Member Author

rmohr commented Jun 29, 2021

/retest

@kubevirt-bot
Contributor

kubevirt-bot commented Jun 29, 2021

@rmohr: The following tests failed, say /retest to rerun all failed tests:

Test name                              | Commit  | Details | Rerun command
pull-kubevirt-e2e-k8s-1.17-rook-ceph   | 5782d52 | link    | /test pull-kubevirt-e2e-k8s-1.17-rook-ceph
pull-kubevirt-e2e-k8s-1.20-sig-compute | 8acb137 | link    | /test pull-kubevirt-e2e-k8s-1.20-sig-compute

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kubevirt-bot kubevirt-bot merged commit bf4b936 into kubevirt:main Jun 30, 2021
lyarwood added a commit to lyarwood/kubevirt that referenced this pull request Jul 16, 2024
First introduced by kubevirt#5846 way back in v0.43.0, the NUMA feature
gate is now deprecated with the feature state graduated to GA and as such
always enabled.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
lyarwood added a commit to lyarwood/kubevirt that referenced this pull request Jul 16, 2024
First introduced by kubevirt#5846 way back in v0.43.0, the NUMA feature gate is
now deprecated with the feature state graduated to GA and as such always
enabled.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
lyarwood added a commit to lyarwood/kubevirt that referenced this pull request Sep 3, 2024
First introduced by kubevirt#5846 way back in v0.43.0, the NUMA feature gate is
now deprecated with the feature state graduated to GA and as such always
enabled.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>
lyarwood added a commit to lyarwood/kubevirt that referenced this pull request Sep 3, 2024
First introduced by kubevirt#5846 way back in v0.43.0, the NUMA feature gate is
now deprecated with the feature state graduated to GA and as such always
enabled.

Signed-off-by: Lee Yarwood <lyarwood@redhat.com>