
Add guest-to-request memory headroom ratio #9322

Conversation

iholder101
Contributor

@iholder101 iholder101 commented Feb 27, 2023

What this PR does / why we need it:
Kubevirt computes [1] an estimate of the total memory overhead needed for the infra components inside virt-launcher that are responsible for running the guest (e.g. libvirtd, qemu, etc.).

Not only does this overhead currently suffer from known issues and inaccurate calculations that need to be fixed - the calculation is in essence an educated guess / estimation, not an accurate computation. The reason is that even if careful profiling were to take place (which is a very difficult task, since the environments on which we would profile bias the results), there are still many components we cannot control, e.g. kernel drivers, kernel configuration, inner QEMU buffer allocations, etc.

To solve this problem, we need to both keep improving the overhead estimations and provide a way for the cluster admin to explicitly add some overhead. This is useful when a special configuration is used, or when the cluster admin chooses to reduce the risk of workloads being OOM-killed during node pressure in exchange for workloads consuming more memory (and therefore fewer VMIs being schedulable on a node).

[1] https://github.com/kubevirt/kubevirt/blob/v0.59.0-alpha.2/pkg/virt-controller/services/renderresources.go#L272

Which issue(s) this PR fixes:
Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=2165618
https://bugzilla.redhat.com/show_bug.cgi?id=2164593

Special notes for your reviewer:
For a reference, you can look at the temporary solution implemented in HCO: kubevirt/hyperconverged-cluster-operator#2206

Release note:

Add guest-to-request memory headroom ratio.
This can be enabled by setting `kubevirt.spec.configuration.additionalGuestMemoryOverheadRatio = "1.234"`
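
For illustration, enabling this could look roughly like the following KubeVirt CR excerpt (a minimal sketch; the CR name and namespace are assumptions and may differ in your deployment):

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt        # assumed CR name
  namespace: kubevirt   # assumed namespace
spec:
  configuration:
    # Multiplies only the virt-infra overhead that KubeVirt adds on top of the
    # guest memory; it does not change the guest-visible memory itself.
    additionalGuestMemoryOverheadRatio: "1.234"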

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Feb 27, 2023
@kubevirt-bot kubevirt-bot added size/L kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Feb 27, 2023
@iholder101
Contributor Author

/cc @acardace

@iholder101 iholder101 force-pushed the feature/guest-to-request-memory-headroom-ratio branch from bd64a18 to 4b28a47 on February 27, 2023 09:12
@iholder101
Contributor Author

/cc @vladikr @fabiand

@@ -16853,6 +16853,10 @@
"description": "EvictionStrategy defines at the cluster level if the VirtualMachineInstance should be migrated instead of shut-off in case of a node drain. If the VirtualMachineInstance specific field is set it overrides the cluster level one.",
"type": "string"
},
"guestToRequestMemoryHeadroomRatio": {
"description": "GuestToRequestMemoryHeadroomRatio can be used to set a ratio between VM's memory guest and the memory allocated to the compute container. A higher ratio means that the VMs would be less compromised by node pressures, but would mean that fewer VMs could be scheduled to a node. If not set, the default is 1.",
Member

Please add a note on whether the internal overhead calculations are still taken into account or ignored.

Contributor Author

Will do.
FYI: they are.

Contributor Author

done

@iholder101 iholder101 force-pushed the feature/guest-to-request-memory-headroom-ratio branch 2 times, most recently from 3bd2539 to ed8b899 on February 27, 2023 11:24
@vladikr vladikr left a comment (Member)

Thanks for this PR @iholder101

I think we do need such an additional tunable for the admin to increase the VM calculated overhead in the cluster.

That said, I would expect the additional calculated "headroom" to simply be part of the GetMemoryOverhead function. That way the overhead calculation will stay consistent, without the need to separately update the requests and limits.
When the requests/limits are calculated and virt-controller adds the overhead to them, it could also add the additional admin-provided overhead at the same step.
This would also work for the case where the guest overhead should be overcommitted, using OvercommitGuestOverhead.

I'm also not sure about the guestToRequestMemoryHeadroomRatio name.
When I think about the main purpose of this tunable, I think that it is to allow the admin to provide an additional overhead - knowing that our overhead calculation is "best effort".
Would it make sense to rename it to something like additinalGuestOverhead or increaseFuestOverheadRatio?

func GetMemoryOverhead(vmi *v1.VirtualMachineInstance, cpuArch string) *resource.Quantity {

@iholder101 iholder101 force-pushed the feature/guest-to-request-memory-headroom-ratio branch from ed8b899 to eda7381 on March 1, 2023 13:00
@iholder101
Contributor Author

iholder101 commented Mar 1, 2023

Hey @vladikr, thanks for your review!

Thanks for this PR @iholder101

I think we do need such an additional tunable for the admin to increase the VM calculated overhead in the cluster.

That said, I would expect the additional calculated "headroom" to simply be part of the GetMemoryOverhead function. That way the overhead calculation will stay consistent, without the need to separately update the requests and limits. When the requests/limits are calculated and virt-controller adds the overhead to them, it could also add the additional admin-provided overhead at the same step. This would also work for the case where the guest overhead should be overcommitted, using OvercommitGuestOverhead.

That's a very good point! Thank you!
I've made the required changes :)

I just want to note that now this field controls the overhead itself only, not the ratio between the guest memory and requests. IOW, the previous implementation did requests = (<guest-memory> + <overhead>)*<ratio> while the current implementation does requests = <guest-memory> + <overhead>*<ratio>. I actually think it's much better this way.

I'm also not sure about the guestToRequestMemoryHeadroomRatio name. When I think about the main purpose of this tunable, I think that it is to allow the admin to provide an additional overhead - knowing that our overhead calculation is "best effort". Would it make sense to rename it to something like additinalGuestOverhead or increaseFuestOverheadRatio?

func GetMemoryOverhead(vmi *v1.VirtualMachineInstance, cpuArch string) *resource.Quantity {

We can discuss the name.
The reason I don't like names like you suggested is that the headroom never actually reaches the guest itself. IMO something like additinalGuestOverhead implies that the overhead affects the guest somehow, which is somewhat misleading.

WDYT about something like additinalVirtInfraOverhead?

@iholder101
Contributor Author

Unit tests pass locally
/test pull-kubevirt-unit-test

@@ -1304,7 +1304,7 @@ func checkForKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
}

func (t *templateService) VMIResourcePredicates(vmi *v1.VirtualMachineInstance, networkToResourceMap map[string]string) VMIResourcePredicates {
-	memoryOverhead := GetMemoryOverhead(vmi, t.clusterConfig.GetClusterCPUArch())
+	memoryOverhead := GetMemoryOverhead(vmi, t.clusterConfig.GetClusterCPUArch(), t.clusterConfig.GetConfig().GuestToRequestMemoryHeadroomRatio)
Member

I wish there was an easy way to simply get the config from within the GetMemoryOverhead function...

Contributor Author

Yeah, it was a bit painful to write...
I'm not sure how we can improve this in a straightforward way, as this function is called from different containers. We could always fetch the Kubevirt CR with an API call, but that would be costly in terms of performance.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2023
@vladikr
Member

vladikr commented Mar 1, 2023

Hey @vladikr, thanks for your review!

Thanks for this PR @iholder101
I think we do need such an additional tunable for the admin to increase the VM calculated overhead in the cluster.
That said, I would expect the additional calculated "headroom" to simply be part of the GetMemoryOverhead function. That way the overhead calculation will stay consistent, without the need to separately update the requests and limits. When the requests/limits are calculated and virt-controller adds the overhead to them, it could also add the additional admin-provided overhead at the same step. This would also work for the case where the guest overhead should be overcommitted, using OvercommitGuestOverhead.

That's a very good point! Thank you! I've made the required changes :)

👍

I just want to note that now this field controls the overhead itself only, not the ratio between the guest memory and requests. IOW, the previous implementation did requests = (<guest-memory> + <overhead>)*<ratio> while the current implementation does requests = <guest-memory> + <overhead>*<ratio>. I actually think it's much better this way.

I'm also not sure about the guestToRequestMemoryHeadroomRatio name. When I think about the main purpose of this tunable, I think that it is to allow the admin to provide an additional overhead - knowing that our overhead calculation is "best effort". Would it make sense to rename it to something like additinalGuestOverhead or increaseFuestOverheadRatio?

func GetMemoryOverhead(vmi *v1.VirtualMachineInstance, cpuArch string) *resource.Quantity {

We can discuss the name. The reason I don't like names like you suggested is that the headroom never actually reaches the guest itself. IMO something like additinalGuestOverhead implies that the overhead affects the guest somehow, which is somewhat misleading.

WDYT about something like additinalVirtInfraOverhead?

I understand your point. I hadn't actually read it that way so far. To me, "Guest Overhead" on the Virtual Machine Instance reads as an overhead caused by the guest...
My main concern with names is consistency. We already use the term "Guest Overhead" in many places, including other tunables such as OvercommitGuestOverhead [1].
If you prefer not to use this term, then we should find something similar that is not confusing.

Does it make sense?

[1] https://kubevirt.io/user-guide/operations/node_overcommit/#overcommit-the-guest-overhead

@iholder101 iholder101 force-pushed the feature/guest-to-request-memory-headroom-ratio branch 2 times, most recently from df95da1 to 950213e on March 2, 2023 11:40
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2023
@iholder101
Contributor Author

@vladikr changed the name to AdditinalGuestOverheadRatio
PTAL

@@ -16825,6 +16825,10 @@
"description": "KubeVirtConfiguration holds all kubevirt configurations",
"type": "object",
"properties": {
"additinalGuestOverheadRatio": {
Member

Please be more specific with the name.
I think this only applies to the memory calculation, thus
additinalGuestMemoryOverheadRatio could be an option.

Contributor Author

@vladikr and I discussed the name in comments above.
At least to me, it feels a bit misleading that we mention "guest memory", as the guest memory itself is not affected in any way. @vladikr claimed that this is the naming we chose for many other fields, so I guess it's a topic for further refactoring.

I suggested the name additinalVirtInfraOverhead above, as it specifically refers to the virt infra overhead - not to the guest memory.

Please share your thoughts :)

Member

I'd definitely include memory - if it is memory specific.

Contributor Author

Changed to additinalGuestMemoryOverheadRatio.

AdditionalGuestMemoryOverheadRatio can be used to set a
ratio between VM's memory guest and the memory allocated
to the compute container. A higher ratio means that the
VMs would be less compromised by node pressures, but
would mean that fewer VMs could be scheduled to a node.
If not set, the default is 1.

Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
This is better both in terms of performance and
in terms of safety.

Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
@iholder101 iholder101 force-pushed the feature/guest-to-request-memory-headroom-ratio branch from 950213e to 38743f1 on March 2, 2023 13:21
@vladikr
Member

vladikr commented Mar 2, 2023

Thanks!

/approve

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2023
@kubevirt-bot
Contributor

kubevirt-bot commented Mar 2, 2023

@iholder101: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubevirt-fossa | 38743f1 | link | false | /test pull-kubevirt-fossa |
| pull-kubevirt-check-tests-for-flakes | 38743f1 | link | false | /test pull-kubevirt-check-tests-for-flakes |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@iholder101
Contributor Author

/test pull-kubevirt-manifests
/test pull-kubevirt-e2e-k8s-1.26-sig-storage-cgroupsv2

@acardace
Member

acardace commented Mar 2, 2023

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 2, 2023
@acardace
Member

acardace commented Mar 2, 2023

/cherrypick release-0.59

@kubevirt-bot
Contributor

@acardace: once the present PR merges, I will cherry-pick it on top of release-0.59 in a new PR and assign it to you.

In response to this:

/cherrypick release-0.59

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit 40049bc into kubevirt:main Mar 2, 2023
@kubevirt-bot
Contributor

@acardace: #9322 failed to apply on top of branch "release-0.59":

Applying: Add AdditionalGuestMemoryOverheadRatio to Kubevirt CR
Applying: Add guest-to-request memory headroom to compute containers
Using index info to reconstruct a base tree...
M	pkg/virt-controller/services/renderresources.go
M	pkg/virt-handler/vm.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/virt-handler/vm.go
Auto-merging pkg/virt-controller/services/renderresources.go
CONFLICT (content): Merge conflict in pkg/virt-controller/services/renderresources.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Add guest-to-request memory headroom to compute containers
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-0.59

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tghfly

tghfly commented Jun 5, 2023

How do I validate the parameter? I modified the CR and set "additionalGuestMemoryOverheadRatio: 1.3", but it still encounters OOM. Can you provide an example YAML file?

@iholder101
Contributor Author

How do I validate the parameter? I modified the CR and set "additionalGuestMemoryOverheadRatio: 1.3", but it still encounters OOM. Can you provide an example YAML file?

Modifying kubevirt.spec.configuration.additionalGuestMemoryOverheadRatio is exactly what needs to be done.

Perhaps 1.3 is not enough for your case?

Some guiding questions:

  • How much memory is allocated for the VM (i.e. from the guest perspective)?
  • How much memory is allocated to the compute container in virt-launcher Pod?
  • When the container was OOMed - how much memory did it consume?

You can always try setting a higher value and see if it solves it for your case, but I'd recommend answering the questions above to ensure you get the right values.

In addition, keep in mind that the multiplier (e.g. 1.3) does not multiply the memory that's configured for the VM, but multiplies the virt-infra memory that's allocated automatically by Kubevirt. For example, let's say that you define a VM with 2GB of memory, and let's say that Kubevirt automatically adds 200M for virt infra. In this case the total memory of the compute container is 2.2G. If you set additionalGuestMemoryOverheadRatio to be 1.3, that applies only to the virt infra, so now the container's memory would be: 2GB + 1.3*200M = 2GB + 260M = 2.26G.

@tghfly

tghfly commented Jun 6, 2023

vm.yaml

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm01
  namespace: default
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/size: small
        kubevirt.io/domain: vm01
      annotations:
        ovn.kubernetes.io/allow_live_migration: "true"
        ovn.kubernetes.io/logical_switch: provider
        ovn.kubernetes.io/ip_address: 192.168.1.2
    spec:
      nodeSelector:
        kubernetes.io/hostname: k8s-01
      dnsConfig:
        nameservers:
        - 192.168.1.250
      dnsPolicy: "None"
      architecture: amd64
      domain:     
        cpu:
          cores: 4
          model: host-passthrough
        #memory:
        #  guest: 8000Mi
        devices:
          disks:
            - name: root-disk
              disk:
                bus: virtio
              cache: writeback
          interfaces:
            - name: default
              bridge: {}
              model: virtio
          rng: {}
        machine:
          type: q35
        resources:
          limits:
            cpu: 4 
            memory: 8Gi
          requests:
            cpu: 4
            memory: 8Gi
      networks:
        - name: default
          pod: {}
      volumes:
        - name: root-disk
          persistentVolumeClaim:
            claimName: pvc-win10-bootdisk
  • The VM has been allocated 8 GiB of memory (as seen in the Windows 10 guest).
  • The virt-launcher pod has been allocated 8.36 GiB of memory.
    virt-launcher pod:
 Limits:
      cpu:                            4
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      memory:                         8974342024
    Requests:
      cpu:                            4
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      ephemeral-storage:              50M
      memory:                         8974342024
  • When OOM occurs, the memory usage is around 8.36 GiB (per watch -n 1 kubectl top pod).
    In addition, where can I check how much memory is allocated for the virt infra (e.g. the 200M in your example)? How is it calculated?

@iholder101
Contributor Author

iholder101 commented Jun 7, 2023

@tghfly I see you set memory limits == requests. Can you explain why you set memory limits? Memory limits are dangerous, and I wouldn't define them unless there's a good reason to do so, since if you exceed the limit the Pod is going to be OOM-killed.

BTW, another tip: it's recommended to use spec.domain.memory.guest, and remove spec.domain.resources entirely. In other words: determine the guest's memory and let Kubevirt determine the exact resources of the pod for you.
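
For example, a minimal sketch of what that could look like (the value is illustrative):

spec:
  template:
    spec:
      domain:
        memory:
          guest: 8Gi   # guest-visible memory; KubeVirt computes the pod's requests (guest + infra overhead) for you
        # note: no spec.domain.resources section here - the requests are derived automatically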

After you remove the limits, I'd try to see if the virt-launcher pod consumes more memory than it requests. In most cases the answer would be no, but if the answer is yes, try setting additionalGuestMemoryOverheadRatio to raise the requested amount.

Hope this helps. Good luck!

@tghfly

tghfly commented Jun 7, 2023

@iholder101 Thanks, we use "limits == requests" to ensure Guaranteed QoS. And in NUMA binding and CPU pinning scenarios, it is also necessary to set "limits == requests".

@iholder101
Contributor Author

@iholder101 Thanks, we use "limits == requests" to ensure Guaranteed QoS. And in NUMA binding and CPU pinning scenarios, it is also necessary to set "limits == requests".

Fair enough. You're right, a Guaranteed QoS is necessary sometimes.
But as I've said, having limits is dangerous and you have to stand by the values you define.

You can either set different spec.domain.memory.guest and spec.domain.resources.requests.memory values to manually define the virt overhead, or play with additionalGuestMemoryOverheadRatio until you have enough memory to not be OOM-killed.
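
As a rough sketch of the first option (the values are only illustrative; the gap between the guest memory and the requests/limits is the manually defined headroom):

spec:
  template:
    spec:
      domain:
        memory:
          guest: 8Gi          # what the guest sees
        resources:
          requests:
            memory: 9Gi       # guest memory + manually chosen headroom
          limits:
            memory: 9Gi       # limits == requests => Guaranteed QoS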

@tghfly

tghfly commented Jun 7, 2023

Yes, we are currently planning to do so. Thank you again for your patient explanation.

@iholder101
Contributor Author

Yes, we are currently planning to do so. Thank you again for your patient explanation.

Sure thing! Good luck!

@xpivarc
Member

xpivarc commented Sep 13, 2023

@iholder101 Thanks, we use "limits == requests" to ensure Guaranteed QoS. And in NUMA binding and CPU pinning scenarios, it is also necessary to set "limits == requests".

@tghfly I think it would be better if you used DedicatedCPUPlacement, as this will ensure the other containers are in Guaranteed QoS as well, and the limits/requests will be set for you.
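
For illustration, a rough sketch of what that could look like (values are hypothetical; please check the documentation for the exact fields in your KubeVirt version):

spec:
  template:
    spec:
      domain:
        cpu:
          cores: 4
          dedicatedCpuPlacement: true   # per the above, requests/limits are then set for you
        memory:
          guest: 8Gi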
