Suppressible VMs: Support VMs that are memory-backed by a file #9636

Closed
wants to merge 10 commits

Conversation

iholder101
Contributor

@iholder101 iholder101 commented Apr 20, 2023

What this PR does / why we need it:
When a node is under heavy memory pressure, it's useful to provide a way for the OS to free some memory and avoid workloads being killed. Swap is a great tool to achieve exactly that: during node pressure, pages can be swapped out to free some memory.

Unfortunately, swap is still alpha in Kubernetes. While there is an effort to move swap to Beta in Kubernetes [1][2][3], it might take some time until the feature lands and becomes stable enough for production use.

Another approach, which is very similar to swap, is to back the VM's memory with a file. Since, from the OS's point of view, this is just another regular file, the contents of this file could be swapped to disk during node pressure. Similarly to swap, some of the VM's memory would be flushed to the backing storage to free memory.

To enable this feature, a VMI needs to be defined as follows:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: my-vmi
spec:
  domain:
    memory:
      backed:
        file: {}

Implementation details
Behind the scenes, libvirt's MemoryBacking configuration is used. For more info: https://libvirt.org/formatdomain.html#memory-backing
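
For reference, this roughly corresponds to a libvirt domain XML fragment like the following (a sketch based on the linked documentation; the exact XML KubeVirt generates may differ, and the directory holding the backing file is configured on the host side, e.g. via memory_backing_dir in qemu.conf):

<memoryBacking>
  <!-- back guest RAM with a file instead of anonymous memory -->
  <source type="file"/>
</memoryBacking>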

In addition, the VMI's total virtual memory is added to its ephemeral-storage requests at the virt-launcher level. This ensures that the kubelet is aware of this amount of ephemeral storage, which decreases the probability of the VMI being killed during disk pressure.
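
As an illustration (hypothetical numbers, not the exact accounting virt-launcher performs), the compute container of a VMI with 4Gi of memory would end up with requests along these lines:

resources:
  requests:
    memory: 4Gi
    # grown by the VMI's total virtual memory, since the libvirt
    # memory backing file lives on the container's ephemeral storage
    ephemeral-storage: 4Gi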

In addition, file memory backing is not supported for high-performance VMIs (e.g. VMIs with dedicated CPUs, real-time VMIs, etc.). A validating webhook is in charge of denying such VMIs.

Furthermore, all of the disks must use cache mode "none", otherwise the VMI is rejected. The reason is that if caching is turned on, it's possible that the workload would fill up the node's file cache and therefore be accounted as using more memory (file cache + virtual memory). Since there's no clarity at the moment regarding which memory would be reclaimed and the general behavior during pressure, it's better to start off conservatively until we're sure how things work behind the scenes.
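
Concretely, every disk has to look roughly like this (a sketch; cache is the existing KubeVirt disk field, and "none" is the only accepted value when file memory backing is enabled):

spec:
  domain:
    devices:
      disks:
      - name: rootdisk
        cache: none
        disk:
          bus: virtio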

Release note:

Suppressible VMs: Support VMs that are memory-backed by a file

@kubevirt-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Apr 20, 2023
@iholder101
Contributor Author

/test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 20, 2023
@iholder101
Contributor Author

/retest-required

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@iholder101 iholder101 force-pushed the feature/FileMemoryBackedVM branch 2 times, most recently from d7ecc58 to 46dc1cc Compare April 25, 2023 14:34
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@iholder101 iholder101 force-pushed the feature/FileMemoryBackedVM branch 2 times, most recently from 15d08c6 to ac29232 Compare April 25, 2023 14:44
@vladikr
Member

vladikr commented Apr 25, 2023

/test pull-kubevirt-e2e-k8s-1.26-sig-compute

@vladikr
Member

vladikr commented Apr 25, 2023

/test pull-kubevirt-e2e-k8s-1.25-sig-compute-migrations

Member

We should test migration with file backed mem.

Contributor Author

Done

@vladikr
Member

vladikr commented Apr 25, 2023

I think we need to prevent file-backed memory with any high-performance features. Simply reject it in the webhook. (IIRC we had IsHighPerformanceVMI somewhere)

@iholder101 iholder101 force-pushed the feature/FileMemoryBackedVM branch 3 times, most recently from c5ab9a5 to 00841f4 Compare April 26, 2023 10:31
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2023
@iholder101 iholder101 marked this pull request as ready for review April 26, 2023 14:10
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2023
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@iholder101 iholder101 force-pushed the feature/FileMemoryBackedVM branch 2 times, most recently from 2db1211 to b9b671f Compare April 27, 2023 14:35
Signed-off-by: Itamar Holder <iholder@redhat.com>
When a VMI is configured with file memory backing,
the requests for ephemeral-storage would include
the VMI's virtual memory,
which is where libvirt dumps the virtual memory file.

Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
In addition, reject if not all of the VMI's disks
are defined with CacheMode "none"

Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
@iholder101 iholder101 marked this pull request as ready for review April 28, 2023 06:03
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2023
@iholder101
Contributor Author

I think we need to prevent file-backed memory with any high-performance features. Simply reject it in the webhook. (IIRC we had IsHighPerformanceVMI somewhere)

Done.
Didn't find such a function, so I've introduced one.
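
For reference, a minimal sketch of what such a helper could look like (not necessarily the exact code added in this PR; the package name is hypothetical, the fields are from the KubeVirt v1 API):

package admitters

import v1 "kubevirt.io/api/core/v1"

// isHighPerformanceVMI treats dedicated CPU placement and realtime
// configuration as "high performance", which file memory backing rejects.
func isHighPerformanceVMI(spec *v1.VirtualMachineInstanceSpec) bool {
	cpu := spec.Domain.CPU
	if cpu == nil {
		return false
	}
	return cpu.DedicatedCPUPlacement || cpu.Realtime != nil
}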

@iholder101
Contributor Author

Also: Could this be extended to have support in instanceTypes as well?

Definitely!
I do think it can wait for follow-up PRs, though.

@iholder101
Contributor Author

/cc @xpivarc

@kubevirt-bot
Contributor

kubevirt-bot commented Apr 28, 2023

@iholder101: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-fossa 5539ec4 link false /test pull-kubevirt-fossa
pull-kubevirt-code-lint 5539ec4 link false /test pull-kubevirt-code-lint
pull-kubevirt-e2e-arm64 5539ec4 link false /test pull-kubevirt-e2e-arm64


Member

@xpivarc xpivarc left a comment

Looks good. I would just ask to move the memfd change out, as that is a pretty isolated change and not related to this one.

I believe we should also block the following cases:

  1. Post-copy migration (you cannot use userfaultfd here)
  2. VFIO use cases: general PCI passthrough, mediated devices, and GPUs.

Would be interesting to also try memory dump.

@@ -384,6 +387,15 @@ type Machine struct {
Type string `json:"type"`
}

type Backed struct {
// File backs the VM's memory by a file. Using this configuration allows the node to
Member

nit: File backs the VM's memory, and the node's cache is used to offset the performance hit of using slower storage. The node is able to reclaim the VM's memory, as it is just cache (disregarding dirty pages) and it is always stored on the disk. Note: for now, ephemeral storage is used to back this file.

@@ -1381,9 +1381,7 @@ func Convert_v1_VirtualMachineInstance_To_api_Domain(vmi *v1.VirtualMachineInsta
domain.Spec.MemoryBacking = &api.MemoryBacking{
HugePages: &api.HugePages{},
}
if val := vmi.Annotations[v1.MemfdMemoryBackend]; val != "false" {
Member

Please remove this deprecation from this PR as it does not have relation to the content.

Contributor Author

as it does not have relation to the content.

I understand your point, but I'm not sure I entirely agree in this specific context.
Currently, memory backing is used in certain scenarios (e.g. when huge pages or virtiofs are used) and is always backed by memfd. When this code was merged, almost 3 years ago, there was a concern regarding memfd, therefore the annotation was introduced.

As written in the PR itself here:

it makes sense [to add this annotation] if there are any concerns. We can always remove the check once it becomes more stable.

After ~2.5 years I think it's perfectly fine to remove it.

In addition, even today, the annotation is effectively broken, as it only affects huge pages and not virtiofs usage, as can be seen here.

So I definitely think it should be deprecated, and my opinion is that this PR is a good opportunity for a bit of refactoring. Since we're messing with memory backing here, I think it is relevant, but if you insist we can deprecate it in a follow-up PR.

Memory: uint64(vcpu.GetVirtualMemory(vmi).Value() / int64(1024)),
Unit: "KiB",
if util.IsFileMemoryBackedVmi(vmi) {
if domain.Spec.MemoryBacking == nil {
Member

d27deba same as previous commit.

spec:
domain:
devices: {}
memory:
Member

I think we should also showcase that we want to request higher guest memory here in order to actually get some benefit from this feature.

Contributor Author

Good idea!
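
For illustration, the showcased manifest could request more guest-visible memory than the pod requests from the node (numbers are hypothetical), so that the node actually has something to reclaim through the backing file:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: my-vmi
spec:
  domain:
    memory:
      guest: 8Gi
      backed:
        file: {}
    resources:
      requests:
        memory: 2Gi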

@@ -3991,6 +3993,35 @@ var _ = Describe("Template", func() {

})

Context("with file backed memory", func() {
Member

Please squash 3224ed8

return
}

getField := func() string {
Member

You can just reuse the field variable.

Contributor Author

Maybe it's over-thinking, but the rationale was to avoid duplicating field.Child("domain", "memory", "backed", "file").String() in both if branches. I could just assign it to a variable once and re-use it, but that's unnecessary if no error occurred, so deferring it slightly benefits performance.

But again, maybe it's over-thinking.

})
}

for _, disk := range spec.Domain.Devices.Disks {
Member

we could default to CacheNone in the converter if there is no cache set. The user then doesn't need to explicitly opt out from Cache. WDYT?

Member

I think it's a good idea

Contributor Author

Agree, sounds great to me
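
A rough sketch of that defaulting, assuming it would be called only for file-memory-backed VMIs (the package and function names are hypothetical; Cache and v1.CacheNone are existing KubeVirt API types):

package converter

import v1 "kubevirt.io/api/core/v1"

// defaultDiskCacheForFileBacking sets cache mode "none" on any disk that does
// not declare one, so users of file memory backing don't have to opt out of
// caching explicitly. Callers would invoke this only for file-backed VMIs.
func defaultDiskCacheForFileBacking(vmi *v1.VirtualMachineInstance) {
	disks := vmi.Spec.Domain.Devices.Disks
	for i := range disks {
		if disks[i].Cache == "" {
			disks[i].Cache = v1.CacheNone
		}
	}
}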

@fabiand fabiand changed the title Suppressable VMs: Support VMs that are memory-backed by a file Suppressible VMs: Support VMs that are memory-backed by a file May 2, 2023
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2023
@kubevirt-bot
Contributor

@iholder101: PR needs rebase.


@berrange
Contributor

Another approach, which is very similar to swap, is to back the VM's memory with a file. Since, from the OS's point of view, this is just another regular file, the contents of this file could be swapped to disk during node pressure. Similarly to swap, some of the VM's memory would be flushed to the backing storage to free memory.

I don't think that description reflects how QEMU's memory-backend-file will behave, and I don't believe the actual behaviour will be similar to 'swap' in several key ways.

When using file backed guest RAM there are two primary factors affecting the behaviour

  • the filesystem type (physical FS backed by storage, eg ext4, xfs, btrfs, vs virtual FS backed by RAM, eg tmpfs, hugetlbfs)
  • the access mode (shared vs private in libvirt terminology, or share=on|off in QEMU's terminology)

The memory-backend-file is really intended for use with virtual filesystems, either as a means to get a guest backed by huge pages, or as a way to get non-anonymous RAM shared with external backend processes (eg vhost-user device backends). The ability to use it with real files is more of a side-effect of the implementation. This change, however, is proposing backing the guest RAM with a file on a physical FS, so I'll ignore the virtual FS behaviour henceforth, focusing just on behaviour with a physical FS.

When the access mode is private (share=off for QEMU), QEMU uses mmap() with the MAP_PRIVATE flag. Initially, reads will be satisfied from data in the backing file (should be all zeroes assuming a freshly created backing file with no data), and any writes will trigger copy-on-write for the RAM mappings. IOW, writes in guest RAM will NEVER propagate to the underlying backing file, not even when the host is under memory pressure. Functionally, assuming a freshly created backing file with all-zeroes, this should be indistinguishable from the VM using normal anonymous allocated RAM. IOW this mode doesn't help address the stated problem of not having swap available when the host has memory pressure AFAICT.

When the access mode is shared (share=on for QEMU), QEMU uses mmap() with the MAP_SHARED flag. Again reads are satisfied from the data in the backing file, but writes will also propagate through to the backing file. It should have a benefit when the host has memory pressure because the writes can always be discarded from host RAM once written to backing file. The key difference from swap though, is that these writes are being done CONTINUOUSLY, not merely when the host is under memory pressure, so this is inherently going to perform worse than swap (probably very much worse) for all the time the VM is running. The performance of the VMs is going to be dependent on the combination of performance of the storage holding the memory backing file, and how RAM write-heavy the guest workload is. Most workloads will be fairly RAM write-heavy; after all, any interesting application is going to be doing memory writes continuously.

Using rotating media is likely a non-starter, as the performance of that is going to be way too poor to cope. SSDs become more viable from a random access performance POV, but have significant questions about longevity. The SSD warranties specify how many times the entire drive capacity can be (over-)written per day (DWPD). The SSD write workload implied by continuously flushing out all guest RAM writes looks enormous compared to the write workload from a typical application's disk I/O needs. IOW, having guest RAM backed by files on SSD looks likely to burn through SSD write cycle life very quickly indeed, ultimately resulting in guest RAM reads returning bad data and dead storage hardware.

When the access mode is not set in libvirt guest XML, the VM should get QEMU's default, which is share=off. I don't see this patch setting the access mode. Thus I believe this change is currently not achieving anything. Guest RAM writes will never propagate to the underlying storage at all and behaviour should be indistinguishable from normal anonymous allocated RAM. If the patch were changed, however, to request shared access mode, then guest VM performance is going to degrade significantly from the continuous host I/O triggered, and host storage lifetime is going to be significantly degraded too.
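
For reference, the distinction boils down to whether the generated domain XML carries an explicit access element (libvirt syntax shown only to illustrate the mapping to QEMU's share=on|off; without it, the private mapping described above is what you get):

<memoryBacking>
  <source type="file"/>
  <!-- maps to QEMU's share=on; omit it and QEMU defaults to share=off -->
  <access mode="shared"/>
</memoryBacking>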

I can understand that Kubevirt is between a rock & a hard place WRT kubernetes support for the use of swap. AFAICT though, this proposal does not provide an alternative that has behaviour similar to swap.

@fabiand
Member

fabiand commented May 11, 2023

Dan, many thanks for your review and comments.

The key difference from swap though, is that these writes are being done CONTINUOUSLY, not merely when the host is under memory pressure, so this is inherently going to perform worse than swap (probably very much worse) for all the time the VM is running.

Yes, this is what we expect.

The performance of the VMs is going to be dependent on the combination of performance of the storage holding the memory backing file, and how RAM write-heavy the guest workload is. Most workloads will be fairly RAM write-heavy; after all, any interesting application is going to be doing memory writes continuously.

Yes, we are expecting and advising to only use this method with VMs under low utilization. Ideally we can even recommend to move to something else if the utilization becomes too high.

this proposal does not provide an alternative that has behaviour similar to swap.

Yes. And the work on enabling swap continues, with elevated attention, for these reasons.

@berrange
Contributor

Dan, many thanks for your review and comments.

The key difference from swap though, is that these writes are being done CONTINUOUSLY, not merely when the host is under memory pressure, so this is inherently going to perform worse than swap (probably very much worse) for all the time the VM is running.

Yes, this is what we expect.

Have you benchmarked to see exactly how much worse ?

I took QEMU's RAM stress test program that does simple XOR across all of RAM, forever in a loop, and compared its speed with a KVM guest backed by anonymous memory, vs file on SSD/NVMe. On my server with traditional SSDs the performance degradation was x50, on my laptop with NVME SSD it was still degraded by x35. IOW, a test that took 1 second to dirty 12 GB of RAM, now took 35-50 seconds.

NB, this was all consumer grade SSD hardware though, so moderate data transfer rates compared to state of the art.

The performance of the VMs is going to be dependent on the combination of performance of the storage holding the memory backing file, and how RAM write-heavy the guest workload is. Most workloads will be fairly RAM write-heavy; after all, any interesting application is going to be doing memory writes continuously.

Yes, we are expecting and advising to only use this method with VMs under low utilization. Ideally we can even recommend to move to something else if the utilization becomes too high.

Even for VMs with "light" workloads that is an incredibly large hit, such that something that previously consumed 10% CPU might now consume 100% CPU. It is like rolling back to CPUs that are a decade older or worse when combined with consumer grade SSDs.

having guest RAM backed by files on SSD looks likely to burn through SSD write cycle life very quickly indeed, ultimately resulting in guest RAM reads returning bad data and dead storage hardware

I explored some possible figures for this

For the purposes of their warranties, SSD / NVMe vendors quote either a TBW (total bytes written) figure, or a DWPD (drive writes per day). They're not going to design the disks to start suffering wear related failures at the warrantied limit, as they'd get too many returns from early failure. There will be some headroom designed in, such that the large majority of disks sold will exceed the warrantied lifetime. Still, this is the only official lifespan data we have access to, so here goes....

My dev server's consumer grade SSD at 240 GB with 80 TB TBW figure means the entire contents can be re-written 80 * 1024 / 240 == 341 times. To put it another way, I can over-write the whole disk volume once per day, for a bit less than 1 year before the warranty is invalidated. Its quoted write performance though is 350 MB/s, which means it can theoretically re-write the whole disk in ~722 seconds, or 12 minutes. In one day it can re-write the whole disk as much as 120 times.

IOW, if we use this SSD as guest RAM and the guest is busy enough to max out 350 MB/s write performance, it could exhaust the warrantied lifetime of the drive in 3 days. Very bad, even if there is large designed in headroom for lifespan.

Let's say the guests are "low utilization" though and only cause 35 MB/s of writes on the SSD. Now the warrantied lifetime will be extended to 30 days. Still terrible.

Definitely don't want to be using file-as-guest-RAM on this particular consumer grade SSD!

Data centers might have (but not guaranteed) up-specced their servers with enterprise grade SSDs, with larger capacities and longer warrantied lifetime.

Let's pick an example enterprise drive targeted at "mixed workloads" (as opposed to mostly read workloads).

https://www.kingston.com/unitedkingdom/en/ssd/dc1500m-data-center-ssd

The 2 TB model is warrantied for 3362 TBW, which is 1681 full disk overwrites, so much improved on the consumer grade. Its write performance is much better though, at 1700 MB/s, so can theoretically re-write the whole drive in 1233 seconds, or 20 minutes. In 1 day it can rewrite the whole drive 72 times.

Thus if guests are busy enough to max the 1700 MB/s write speed, the warrantied lifetime could be exhausted in 23 days.

For a fairer comparison, if we assume guests are merely busy at the same 350 MB/s as my previous consumer SSD example, the enterprise disk would last 110 days, compared to the consumer disk at 3 days. As we can see, the ratio of overall disk capacity vs MB/s write rate makes a big impact.

If we assume guests are "low utilization" at 35 MB/s, the enterprise disk warrantied life would extend to 1100 days (~3 years)
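
The arithmetic behind these figures is simply warrantied lifetime = TBW / sustained write rate. A quick back-of-the-envelope check (a standalone sketch, not code from this PR; decimal units with 1 TB = 1e6 MB, so the results land close to the figures above, give or take rounding):

package main

import "fmt"

// daysUntilTBW returns how many days a drive's warrantied TBW lasts
// at a given sustained write rate.
func daysUntilTBW(tbwTB, mbPerSec float64) float64 {
	return tbwTB * 1e6 / mbPerSec / 86400
}

func main() {
	fmt.Printf("consumer 80 TBW @ 350 MB/s: ~%.0f days\n", daysUntilTBW(80, 350))         // ~3 days
	fmt.Printf("consumer 80 TBW @ 35 MB/s: ~%.0f days\n", daysUntilTBW(80, 35))           // ~26 days (quoted above as ~30)
	fmt.Printf("enterprise 3362 TBW @ 1700 MB/s: ~%.0f days\n", daysUntilTBW(3362, 1700)) // ~23 days
	fmt.Printf("enterprise 3362 TBW @ 35 MB/s: ~%.0f days\n", daysUntilTBW(3362, 35))     // ~1112 days (~3 years)
}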

NB this assumes the disk is otherwise entirely empty and thus only used for guest RAM. If the disk is storing other data, then the portion of the disk being re-written will be a fraction of its overall capacity and thus is liable to decrease the time until wear failures.

How do all the transfer figures compare to RAM though? The best DDR4 RAM could reach 35200 MB/s IIUC. That is x20 faster than the enterprise grade SSD. IOW, our "low utilization" example above was way too optimistic. Even a guest which is running at 10% of its theoretical potential for DDR4 RAM writes might be able to max out the enterprise SSD transfer rate.

So we're potentially back to the 23 days of warrantied lifetime, even for low utilization guests, if we assume the entire SSD capacity is allocated to guest RAM.

Unless I've majorly screwed up my analysis somewhere here, I feel like the consideration of high vs low utilization guests isn't the important factor in the lifetime of the SSD drive. It looks too easy for guests to max out the storage transfer rate of even enterprise SSDs, even if they're not maxing out their virtual CPU resources.

The predominant factor determining the working lifetime of the drive will be the ratio between the cumulative RAM of all guests on the host vs the capacity of the SSD/NVMe storage. The storage capacity would need to exceed the total guest RAM by a decent multiple (perhaps x10) to adequately reduce the risk of early SSD failures due to rewrite wear. It would certainly need to be enterprise quality storage too.

Overall, I'd feel pretty wary of promoting all this as a solution to users, both in terms of the likely guest workload performance impact, and its implications for the physical hardware lifespan of the underlying storage used for RAM.

@xpivarc
Member

xpivarc commented May 19, 2023

@berrange Thank you for the math! ;)
I think we can safely say there were more worries over this than benefits. To add to it, one of the many other worries was that we would saturate the disk, which is also used by container overlays, CRI, images, and the kubelet, leading to unstable nodes and causing evictions.

For all these reasons we need to come up with a better backend than the ephemeral storage that Kubernetes provides out of the box.
/close

@kubevirt-bot
Contributor

@xpivarc: Closed this PR.


