Suppressible VMs: Support VMs that are memory-backed by a file #9636
Conversation
tests/migration_test.go (Outdated)
We should test migration with file backed mem.
Done
I think we need to prevent file-backed memory with any high-performance features. Simply reject it in the webhook. (IIRC we had IsHighPerformanceVMI somewhere)
Signed-off-by: Itamar Holder <iholder@redhat.com>
When VMI is configured with file memory backing: the requests for ephemeral-memory would include the VMI's virtual memory where libvirt dumps the virtual memory file. Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
In addition, reject if not all of the VMI's disks are defined with CacheMode "none" Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
Done.
Definitely!
/cc @xpivarc
Looks good. I would just ask to move out the memfd change, as that is pretty isolated and not related to this.
I believe we should also block the following cases:
- Post-copy migration (you cannot use userfaultfd here)
- VFIO uses: general PCI pass-through, mediated devices and GPUs
Would be interesting to also try memory dump.
@@ -384,6 +387,15 @@ type Machine struct {
	Type string `json:"type"`
}

type Backed struct {
	// File backs the VM's memory by a file. Using this configuration allows the node to
nit: File backs the VM's memory, and a node cache is used to offset the performance hit of using slower storage. The node is able to reclaim the VM's memory, as it is just cache (disregarding dirty pages) and it is always stored on disk. Note: for now, ephemeral storage is used to back this file.
@@ -1381,9 +1381,7 @@ func Convert_v1_VirtualMachineInstance_To_api_Domain(vmi *v1.VirtualMachineInstance
	domain.Spec.MemoryBacking = &api.MemoryBacking{
		HugePages: &api.HugePages{},
	}
	if val := vmi.Annotations[v1.MemfdMemoryBackend]; val != "false" {
Please remove this deprecation from this PR as it does not have relation to the content.
> as it does not have relation to the content.

I understand your point, but I'm not sure I entirely agree in this specific context.
Currently, memory backing is used in certain scenarios (e.g. when huge pages or virtiofs are used) and is always backed by memfd. When this code was merged, almost 3 years ago, there was a concern regarding memfd, therefore the annotation was introduced.
As written in the PR itself here:
> it makes sense [to add this annotation] if there are any concerns. We can always remove the check once it becomes more stable.

After ~2.5 years I think it's perfectly fine to remove it.
In addition, even today, the annotation is effectively broken, as it only affects huge pages and not virtiofs usage, as can be seen here.
So I definitely think it should be deprecated, and my opinion is that this PR is a good opportunity for a bit of refactoring. Since we're messing with memory backing here, I think it is relevant, but if you insist we can deprecate it in a follow-up PR.
	Memory: uint64(vcpu.GetVirtualMemory(vmi).Value() / int64(1024)),
	Unit:   "KiB",
	if util.IsFileMemoryBackedVmi(vmi) {
		if domain.Spec.MemoryBacking == nil {
d27deba same as previous commit.
spec:
  domain:
    devices: {}
    memory:
I think we should also showcase that we want to request higher guest memory here in order to actually get some benefit from this feature.
Good idea!
@@ -3991,6 +3993,35 @@ var _ = Describe("Template", func() {

})

Context("with file backed memory", func() {
Please squash 3224ed8
	return
}

getField := func() string {
You can just reuse the field variable.
Maybe it's over-thinking, but the rationale was to avoid duplicating field.Child("domain", "memory", "backed", "file").String() in both if branches. I could just assign it to a variable once and reuse it, but that's unnecessary if no error occurred, so this benefits performance.
But again, maybe it's over-thinking.
	})
}

for _, disk := range spec.Domain.Devices.Disks {
We could default to CacheNone in the converter if no cache is set. The user then doesn't need to explicitly opt out of Cache. WDYT?
I think it's a good idea
Agree, sounds great to me
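The defaulting agreed on above could look roughly like this. This is a sketch using simplified stand-in types (Disk, CacheNone are placeholders here); the real converter would operate on KubeVirt's API structs:

```go
package main

import "fmt"

// Disk is a simplified stand-in for KubeVirt's disk API type.
type Disk struct {
	Name  string
	Cache string // "" means the user did not set a cache mode
}

const CacheNone = "none"

// defaultCacheForFileBackedMemory fills in CacheNone for every disk that has
// no explicit cache mode, so users don't have to opt out of caching manually.
func defaultCacheForFileBackedMemory(disks []Disk) {
	for i := range disks {
		if disks[i].Cache == "" {
			disks[i].Cache = CacheNone
		}
	}
}

func main() {
	disks := []Disk{{Name: "rootdisk"}, {Name: "cloudinit", Cache: "writethrough"}}
	defaultCacheForFileBackedMemory(disks)
	fmt.Println(disks[0].Cache, disks[1].Cache) // none writethrough
}
```

Only unset values are defaulted, so a user who explicitly requests another cache mode still hits the validation webhook rather than being silently overridden.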
@iholder101: PR needs rebase.
I don't think that description reflects how QEMU's file-backed guest RAM actually behaves. When using file-backed guest RAM there are two primary factors affecting the behaviour: the access mode of the mapping (shared vs. private) and the kind of storage backing the file.
When the access mode is shared, dirty guest pages are written back to the backing file, so the kernel can reclaim the guest's RAM under pressure, at the cost of continuously flushing guest memory writes out to storage.
When the access mode is private, dirty pages are copy-on-write and are never written back to the file; they behave like anonymous memory and cannot be reclaimed without swap.
Using rotating media is likely a non-starter, as the performance of that is going to be way too poor to cope. SSDs become more viable from a random access performance POV, but have significant questions about longevity. The SSD warranties specify how many times the entire drive capacity can be (over-)written per day (DWPD). The SSD write workload implied by continuously flushing out all guest RAM writes looks enormous compared to the write workload from a typical application's disk I/O needs. IOW, having guest RAM backed by files on SSD looks likely to burn through SSD write cycle life very quickly indeed, ultimately resulting in guest RAM reads returning bad data and dead storage hardware.
When the access mode is not set in the libvirt guest XML, the VM should get QEMU's default, which is private. I can understand that KubeVirt is between a rock & a hard place WRT Kubernetes support for the use of swap. AFAICT though, this proposal does not provide an alternative that has behaviour similar to swap.
Dan, many thanks for your review and comments.
Yes, this is what we expect.
Yes, we are expecting and advising to only use this method with VMs under low utilization. Ideally we can even recommend moving to something else if the utilization becomes too high.
Yes. And the work on enabling swap continues, with elevated attention for these reasons.
Have you benchmarked to see exactly how much worse? I took QEMU's RAM stress test program that does a simple XOR across all of RAM, forever in a loop, and compared its speed in a KVM guest backed by anonymous memory vs. a file on SSD/NVMe. On my server with traditional SSDs the performance degradation was x50; on my laptop with NVMe SSD it was still degraded by x35. IOW, a test that took 1 second to dirty 12 GB of RAM now took 35-50 seconds. NB, this was all consumer grade SSD hardware though, so moderate data transfer rates compared to state of the art.
Even for VMs with "light" workloads that is an incredibly large hit, such that something that previously consumed 10% CPU might now consume 100% CPU. It is like rolling back to CPUs that are a decade older or worse when combined with consumer grade SSDs.
I explored some possible figures for this. For the purposes of their warranties, SSD / NVMe vendors quote either a TBW (total bytes written) figure or a DWPD (drive writes per day). They're not going to design the disks to start suffering wear related failures at the warrantied limit, as they'd get too many returns from early failure. There will be some headroom designed in, such that the large majority of disks sold will exceed the warrantied lifetime. Still, this is the only official lifespan data we have access to, so here goes... My dev server's consumer grade SSD at 240 GB with an 80 TB TBW figure means the entire contents can be re-written 80 * 1024 / 240 == 341 times. To put it another way, I can over-write the whole disk volume once per day for a bit less than 1 year before the warranty is invalidated. Its quoted write performance though is 350 MB/s, which means it can theoretically re-write the whole disk in ~722 seconds, or 12 minutes. In one day it can re-write the whole disk as much as 120 times. IOW, if we use this SSD as guest RAM and the guest is busy enough to max out the 350 MB/s write performance, it could exhaust the warrantied lifetime of the drive in 3 days. Very bad, even if there is a large designed-in headroom for lifespan. Let's say the guests are "low utilization" though and only cause 35 MB/s writes on the SSD. Now the warrantied lifetime will be extended to 30 days. Still terrible. Definitely don't want to be using file-as-guest-RAM on this particular consumer grade SSD! Data centers might have (but not guaranteed) up-specced their servers with enterprise grade SSDs, with larger capacities and longer warrantied lifetimes. Let's pick an example enterprise drive targeted at "mixed workloads" (as opposed to mostly-read workloads): https://www.kingston.com/unitedkingdom/en/ssd/dc1500m-data-center-ssd The 2 TB model is warrantied for 3362 TBW, which is 1681 full disk overwrites, so much improved on the consumer grade.
Its write performance is much better though, at 1700 MB/s, so it can theoretically re-write the whole drive in 1233 seconds, or ~20 minutes. In 1 day it can rewrite the whole drive 72 times. Thus if guests are busy enough to max the 1700 MB/s write speed, the warrantied lifetime could be exhausted in 23 days. For a fairer comparison, if we assume guests are merely busy at the same 350 MB/s as my previous consumer SSD example, the enterprise disk would last 110 days, compared to the consumer disk at 3 days. The ratio of overall disk capacity to write rate makes a big impact, as we see. If we assume guests are "low utilization" at 35 MB/s, the enterprise disk's warrantied life would extend to 1100 days (~3 years). NB this assumes the disk is otherwise entirely empty and thus only used for guest RAM. If the disk is storing other data, then the portion of the disk being re-written will be a fraction of its overall capacity and thus is liable to decrease the time until wear failures. How do all the transfer figures compare to RAM though? The best DDR4 RAM can reach 35200 MB/s IIUC. That is x20 faster than the enterprise grade SSD. IOW, our "low utilization" example above was way too optimistic. Even a guest which is running at 10% of its theoretical potential for DDR4 RAM writes might be able to max out the enterprise SSD transfer rate. So we're potentially back to the 23 days of warrantied lifetime, even for low utilization guests, if we assume the entire SSD capacity is allocated to guest RAM. Unless I've majorly screwed up my analysis somewhere here, I feel like the consideration of high vs low utilization guests isn't the important factor in the lifetime of the SSD drive. It looks too easy for guests to max out the storage transfer rate of even enterprise SSDs, even if they're not maxing out their virtual CPU resources.
The predominant factor determining the working lifetime of the drive will be the ratio between the cumulative RAM of all guests on the host and the capacity of the SSD/NVMe storage. The storage capacity would need to exceed the total guest RAM by a decent multiple (perhaps x10) to adequately reduce the risk of early SSD failures due to rewrite wear. It would certainly need to be enterprise quality storage too. Overall, I'd feel pretty wary of promoting all this as a solution to users, both in terms of the likely guest workload performance impact and its implications for the physical hardware lifespan of the underlying storage used for RAM.
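The back-of-the-envelope figures above can be reproduced with a small helper. Note the comment mixes decimal and binary units, so results differ slightly from the quoted round numbers; this sketch uses decimal units (1 TB = 1e6 MB), as vendors do:

```go
package main

import "fmt"

// lifetimeDays estimates how many days a drive's warrantied TBW budget lasts
// when written to at a sustained rate (decimal units: 1 TB = 1e6 MB).
func lifetimeDays(tbwTB, writeMBps float64) float64 {
	tbPerDay := writeMBps * 86400 / 1e6 // MB/s -> TB written per day
	return tbwTB / tbPerDay
}

func main() {
	fmt.Printf("consumer 80 TBW @ 350 MB/s: %.1f days\n", lifetimeDays(80, 350))       // ~3 days
	fmt.Printf("enterprise 3362 TBW @ 1700 MB/s: %.1f days\n", lifetimeDays(3362, 1700)) // ~23 days
	fmt.Printf("enterprise 3362 TBW @ 35 MB/s: %.0f days\n", lifetimeDays(3362, 35))    // ~3 years
}
```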
@berrange Thank you for the math! ;) For all these reasons we need to come up with a better backend than ephemeral storage, which is what we get out of the box from Kubernetes.
@xpivarc: Closed this PR.
What this PR does / why we need it:
When a node is under heavy memory pressure, it's useful to provide a way for the OS to free some memory to avoid workloads being killed. Swap is a great tool to achieve exactly that: during node pressure, pages can be swapped out to free some memory.
Unfortunately, swap is still alpha in Kubernetes. While there is an effort to move swap into Beta in Kubernetes [1][2][3], it might take some time until the feature kicks in and gets stable enough for production use.
Another approach, which is very similar to swap, is to back the VM's memory by a file. Since from the OS's perspective this is just another regular file, its contents can be swapped to disk during node pressure. Similarly to swap, some of the VM's memory would be flushed to the backing storage to free memory.
To enable this feature, a VMI needs to be defined as follows:
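The example manifest was lost from this page. Based on the field path cited in the review (domain.memory.backed.file), the proposed configuration presumably looked something like the following sketch; note this API was proposed in this (closed) PR and is not a released KubeVirt field:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-file-backed        # illustrative name
spec:
  domain:
    devices: {}
    memory:
      backed:
        file: {}               # back guest RAM by a file instead of anonymous memory
    resources:
      requests:
        memory: 2Gi
```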
Implementation details
Behind the scenes, libvirt's MemoryBacking configuration is used. For more info: https://libvirt.org/formatdomain.html#memory-backing
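Per the libvirt documentation linked above, the generated domain XML would contain a memoryBacking element along these lines (illustrative; the exact child elements depend on what the converter emits):

```xml
<memoryBacking>
  <source type="file"/>
</memoryBacking>
```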
In addition, the VMI's total virtual memory is added to its ephemeral-storage requests at the virt-launcher level. This ensures that the kubelet is aware of this amount of ephemeral storage, which decreases the probability of the VMI being killed during disk pressure.
Furthermore, file memory backing is not supported for high-performance VMIs (e.g. VMIs with dedicated CPUs, real-time VMIs, etc.). A validating webhook is in charge of denying such VMIs.
Finally, all of the disks must use cache mode "none", otherwise the VMI is rejected. The reason is that if caching is turned on, the workload could fill up the node's file cache and therefore be accounted as using more memory (file cache + virtual memory). Since there's no clarity at the moment regarding which memory would be reclaimed, and the general behavior during pressures, it's better to start off conservative until we're sure how things work behind the scenes.
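The disk-cache rule could be sketched as a webhook check like the following. This uses a simplified stand-in Disk type for illustration; the actual PR validates KubeVirt's API structs:

```go
package main

import "fmt"

// Disk is a simplified stand-in for KubeVirt's disk API type.
type Disk struct {
	Name  string
	Cache string // e.g. "none", "writethrough"
}

// validateDiskCaches mirrors the rule described above: when the VMI's memory
// is file-backed, every disk must use cache mode "none".
func validateDiskCaches(disks []Disk) error {
	for _, d := range disks {
		if d.Cache != "none" {
			return fmt.Errorf("disk %q: cache mode must be \"none\" when memory is file-backed", d.Name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateDiskCaches([]Disk{{Name: "rootdisk", Cache: "none"}}))         // <nil>
	fmt.Println(validateDiskCaches([]Disk{{Name: "rootdisk", Cache: "writethrough"}})) // non-nil error
}
```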
Release note: